Newsgroups: comp.parallel,comp.parallel.mpi,comp.parallel.pvm
From: gale@wind.hpc.pko.dec.com (Israel Gale)
Subject: Digital HPF/f90--performance numbers
Organization: Digital Equipment Corporation
Date: 03 May 1996 10:44:28 -0400
Message-ID: <4mkr07$oj@usenet.srv.cis.pitt.edu>

Here is some performance data on programs compiled with Digital's
HPF/Fortran 90 compiler.  These programs were submitted to us by other
parties; they were not written by Digital.  I didn't want to waste
bandwidth by posting lengthy source codes; I will gladly send them to
anyone who is interested (write to gale@hpc.pko.dec.com). 

You will probably notice that many of the speed-ups are super-linear
(greater than the number of processors).  We attribute this mainly to
cache effects: N processors have N times the cache of one processor, so
when the problem size is held fixed while processors are added, the
additional cache contributes (sometimes significantly) to the speedup.
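
Concretely, "speedup" here is just the one-processor time divided by the
p-processor time, S(p) = T(1)/T(p).  A minimal Python sketch, using the
SMP timings transcribed from the table in section 1 below, that checks
which configurations are super-linear:

```python
# Speedup relative to a single processor: S(p) = T(1) / T(p).
# Timings (seconds) transcribed from the SMP table in section 1.
t_smp = {1: 171.88, 2: 71.13, 3: 32.81, 4: 19.42, 5: 15.29, 6: 12.92}

speedup = {p: t_smp[1] / t for p, t in t_smp.items()}

# A configuration is super-linear when S(p) exceeds the processor count p.
superlinear = [p for p in sorted(speedup) if speedup[p] > p]
print(superlinear)   # -> [2, 3, 4, 5, 6]
```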


1) Conjugate-gradient solver for an M.I.T. parallel Navier-Stokes ocean model
----------------------------------------------------------------------------
This code was submitted to us as an entry in an HPF contest we
sponsored together with the Pittsburgh Supercomputing Center.  We ran
the code on both SMP and workstation cluster hardware:

                                       +-----------------------------------+
  +-----------------------------------+| Workstation Cluster (DEC 3000/700 |
  |        SMP (AlphaServer 8400,     || 225MHz EV45 CPU, 2MB B-cache,     |
  |      350MHz EV5 CPU, 4MB cache)   || GIGAswitch/FDDI crossbar switch)  |
  +------------+------------+---------++------------+------------+---------+
  | Processors | Time(secs) | Speedup || Processors | Time(secs) | Speedup |
  +------------+------------+---------++------------+------------+---------+
  |     1      |   171.88   |   1.0   ||     1      |   433.52   |    1.0  |
  |     2      |    71.13   |   2.4   ||     2      |   192.21   |    2.2  |
  |     3      |    32.81   |   5.2   ||     3      |   133.92   |    3.2  |
  |     4      |    19.42   |   8.8   ||     4      |    95.72   |    4.5  |
  |     5      |    15.29   |  11.2   ||     5      |    62.06   |    7.0  |
  |     6      |    12.92   |  13.3   ||     6      |    54.90   |    7.9  |
  +------------+------------+---------+|     7      |    48.66   |    8.9  |
                                       |     8      |    38.34   |   11.3  |
                                       +------------+------------+---------+
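
For those unfamiliar with the method, the conjugate-gradient iteration at
the heart of this entry is compact.  Here is a textbook dense-matrix
version in Python/NumPy (illustrative only -- the actual ocean code
applies its operator matrix-free over a distributed grid):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Textbook conjugate-gradient solve of A x = b, A symmetric
    positive definite.  Dense for clarity; names are illustrative."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # converged
            break
        p = r + (rs_new / rs) * p  # new A-conjugate direction
        rs = rs_new
    return x
```

In the parallel setting, each iteration needs only a few global reductions
(the dot products) plus nearest-neighbor exchange for the operator, which
is why the method distributes well over both SMPs and clusters.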



2) Digital HPF versus PVM on red-black relaxation
-------------------------------------------------
We compared HPF and PVM versions of a red-black 3-D finite-difference
solver.  Both are based on codes made available as part of the GENESIS
suite of distributed-memory benchmarks.  The HPF version performs
somewhat better than the PVM version.  These timings were done on a
cluster of DEC 3000/900 workstations (275 MHz EV45 CPU, 2 MB B-cache,
GIGAswitch/FDDI crossbar switch).

  +-----------------------------------+  +-----------------------------------+
  |                PVM                |  |       Digital HPF/Fortran 90      |
  +------------+------------+---------+  +------------+------------+---------+
  | Processors | Time(secs) | Speedup |  | Processors | Time(secs) | Speedup |
  +------------+------------+---------+  +------------+------------+---------+
  |    1       |    510     |   1.0   |  |    1       |    510     |   1.0   |
  |            |            |         |  |            |            |         |
  |    2       |    299     |   1.7   |  |    2       |    274     |   1.9   |
  |            |            |         |  |            |            |         |
  |    4       |    144     |   3.5   |  |    4       |    131     |   3.9   |
  |            |            |         |  |            |            |         |
  |    8       |     76     |   6.7   |  |    8       |     67     |   7.6   |
  +------------+------------+---------+  +------------+------------+---------+
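
For readers unfamiliar with the scheme: red-black relaxation colors the
grid points like a checkerboard, so every neighbor of a "red" point is
"black" and all points of one color can be updated at once -- the
property that makes the method easy to parallelize.  A minimal 2-D
sketch in Python/NumPy (the benchmark itself is a 3-D Fortran code;
names here are illustrative):

```python
import numpy as np

def redblack_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for -laplace(u) = f on a 2-D
    grid with spacing h and fixed (Dirichlet) boundary values."""
    i, j = np.indices(u.shape)
    interior = np.zeros(u.shape, dtype=bool)
    interior[1:-1, 1:-1] = True
    for color in (0, 1):                      # 0 = "red", 1 = "black"
        mask = interior & ((i + j) % 2 == color)
        # Every neighbor of a red point is black (and vice versa), so
        # the whole color can be updated simultaneously.
        avg = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                      np.roll(u, 1, 1) + np.roll(u, -1, 1) + h * h * f)
        u[mask] = avg[mask]
    return u
```

Repeated sweeps drive the residual of the discrete Poisson equation
toward zero; the 3-D GENESIS kernel does the same thing with one more
grid index.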



3) An atmospheric flow problem example ("Shallow water", 2D finite-difference)
------------------------------------------------------------------------------

  +-----------------------------------+  +-----------------------------------+
  |        SMP (AlphaServer 8400,     |  | Workstation Cluster (DEC 3000/900 |
  |      300MHz EV5 CPU, 4MB cache)   |  | 275MHz EV45 CPU, 2MB B-cache,     |
  +------------+------------+---------+  | GIGAswitch/FDDI crossbar switch)  |
  | Processors | Time(secs) | Speedup |  +------------+------------+---------+
  +------------+------------+---------+  | Processors | Time(secs) | Speedup |
  |     1      |     15     |   1.0   |  +------------+------------+---------+
  |     2      |    8.4     |   1.8   |  |     1      |     23     |   1.0   |
  |     3      |    4.3     |   3.5   |  |     2      |     14     |   1.6   |
  |     4      |    3.1     |   4.8   |  |     4      |    6.8     |   3.4   |
  |     8      |    1.6     |   9.4   |  |     8      |    2.7     |   8.5   |
  +------------+------------+---------+  +------------+------------+---------+

Interestingly, we also did single-processor comparisons between a
standard Fortran 77 implementation of Shallow water and the Fortran 90
array-syntax implementation used above for the multiprocessor tests.  It
turns out that the Fortran 90 array-syntax code executes somewhat faster
(by 12%) on a Digital AlphaServer 8400 system than the Fortran 77
loop-based code.  A similar comparison on an AlphaStation 3000/900 shows
the two implementations to have nearly identical (within 1%) execution
times.
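
The loop-versus-array-syntax distinction carries over to other
array languages.  As a hedged illustration (a toy 2-D averaging stencil
in Python/NumPy, not the actual Shallow-water kernel), the two styles
compute identical results; only the form the compiler or library sees
differs:

```python
import numpy as np

def update_loops(u):
    # "Fortran 77 style": explicit loops over the interior points.
    unew = u.copy()
    n, m = u.shape
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            unew[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] +
                                 u[i, j-1] + u[i, j+1])
    return unew

def update_array(u):
    # "Fortran 90 style": whole-array (slice) syntax, no explicit loops,
    # which hands the compiler/runtime an entire array operation at once.
    unew = u.copy()
    unew[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                               u[1:-1, :-2] + u[1:-1, 2:])
    return unew
```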


-Israel Gale
 gale@hpc.pko.dec.com

-Jonathan Harris
 jharris@hpc.pko.dec.com

