Newsgroups: comp.parallel
From: marc@efn.org (Marc Baber)
Subject: Re: APR xHPF 2.1 Released; NAS Parallel Benchmark Results
Organization: UTexas Mail-to-News Gateway
Date: 5 Sep 1995 14:42:35 GMT
Message-ID: <42hnkr$7hc@usenet.srv.cis.pitt.edu>

APR NAS BENCHMARK RESULTS -- ADDENDUM AND CORRECTIONS
August 11, 1995
=========================================================================

On July 5th, we posted NAS Parallel Benchmark results for APR's xHPF
compilation system for several platforms.  This posting responds to
questions raised by that report.

First, our apologies to Cray Research for omitting a couple of compiler
optimization switches on the C90 sequential timings.  These were our
"control" timings: the code was not parallelized and was not compiled
with xHPF.  Using these two cf77 switches, we realized approximately
2x-5x improvements over what was previously reported.  The updated
performance tables appear below.  (Thanks to Charles Grassl of CRI
(cmg@cray.com) for bringing this to our attention.)

Second, yes, the numbers were for Class A results (the smaller problem
size), so, as David Coster (dpc@ipp-garching.mpg.de) rightly pointed
out, our wall-clock timings are still 3.95x-13.9x the highly hand-tuned
vendor timings for the BT, EP and SP benchmarks (and up to 32.61x for
the FT benchmark).  While xHPF 2.1 can't compete with vendors'
hand-parallelized and hand-tuned codes, we do claim xHPF has a clear
price-performance advantage for portable, compile-and-run codes (see
the new tables below).

We at APR explain the difference between our timings and the vendor
timings as follows:

    1. The vendor timings are based on highly hand-tuned codes that
    extensively use libraries wherever permitted and aren't
    portable.  (On this point there was some on-line debate with
    Patrick F. McGehearty (patrick@convex.COM) and Jeff Mohr
    (mohr_j@access2.digex.net) generally concurring and Robert Gale
    (gale@wind.hpc.pko.dec.com) dissenting.) APR's xHPF numbers
    were obtained by running the same portable Fortran (HPF77) code
    on all the measured platforms.  We would like to see the vendor's
    codes for these benchmarks, if they're ever published.

    2. APR's xHPF is a source-to-source parallelizer and produces
    an F77 SPMD program with message passing as its output.  The
    parallel timings are completely dependent upon the
    sophistication of the vendors' F77 compilers for per-processor
    performance.  Some vendors' F77 compilers are far better than
    others at delivering the full performance of the processor.
    In particular, APR depends on the F77 compilers' handling of
    linearized array subscripts.  Some F77 compilers handle
    linearized subscripts as easily as regular subscripts, and some
    don't.

    3. And, yes, xHPF still has plenty of room for improvement, and
    will continue to improve over time.
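To make point 2 above concrete, here is a minimal sketch of what
"linearized" subscripting means.  (Python is used purely for
illustration; the variable names and values are ours, not APR's, and the
generated code xHPF actually emits is F77.)

```python
# A 2-D Fortran array A(n, m) can be addressed two ways:
#   "regular"    -- a genuine two-subscript reference, A(i, j)
#   "linearized" -- one flat buffer indexed column-major, as in the
#                   F77 that a source-to-source parallelizer emits.
# How well a back-end F77 compiler optimizes the linearized form is
# exactly the per-vendor difference the posting describes.

n, m = 4, 3                       # array extents, chosen arbitrarily

# Regular subscripting: a true 2-D structure.
regular = [[10 * j + i for j in range(m)] for i in range(n)]

# Linearized subscripting: one flat buffer, column-major like Fortran.
flat = [0] * (n * m)
for j in range(m):
    for i in range(n):
        flat[j * n + i] = 10 * j + i   # A(i+1, j+1) lives at j*n + i

# Both addressing schemes reach the same element.
assert all(regular[i][j] == flat[j * n + i]
           for i in range(n) for j in range(m))
```

A compiler that recognizes `j * n + i` as a strided walk over the buffer
can vectorize or pipeline it as well as a two-subscript reference; one
that doesn't pays the per-processor penalty noted above.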

Q:  Given the differences in performance between xHPF and vendor
    timings, why would APR publish these results?

A:  Because we figure that the market for portable compile-and-run
    speed is much larger than the market for non-portable hand-tuned
    speed.

    Just as cheap memory has created a climate where very few care about or
    discuss memory efficiency, cheap microprocessors are creating a climate
    where very few will be interested in processor efficiency.  What people
    WILL still be interested in is compile-and-run portability and price-
    performance.  And in that arena, xHPF shines.

    Even with the 2x improvement in C90 sequential performance (from better
    compiler switches), APR's xHPF still delivers 2x-10x better
    price-performance on the T3D and IBM-SP2 than cf77 delivers on the Cray
    C-90.  (see the SP, BT and EMBAR benchmarks below).  In other words,
    most people have already come to expect better price-performance from
    parallel systems if they're willing to pay the price of re-programming
    their applications by hand.  The point APR is making is that better
    price-performance can be achieved on parallel architectures without
    reprogramming.  The tables below show the $$/Megaflop *** numbers
    for each program and hardware configuration.  
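For readers who want to check the derived columns, the arithmetic is
just two divisions.  Here is a sketch (in Python; the variable names are
ours) using the Cray C90 row of the SP table below:

```python
# Deriving the Ratio, Mflops, and $$/MFlop columns from the raw inputs,
# checked against the Cray C90 row of the SP table
# (102000 M-Ops, $1.90625M list price, 1430 s xHPF, 174.5 s vendor).

m_ops = 102000.0        # total operation count for SP, in millions of ops
mega_dollars = 1.90625  # system list price, in millions of dollars
xhpf_seconds = 1430.0   # APR xHPF wall-clock time
vendor_seconds = 174.5  # vendor hand-tuned wall-clock time

ratio = xhpf_seconds / vendor_seconds              # APR/Vendor Ratio column
mflops = m_ops / xhpf_seconds                      # sustained Mflops under xHPF
dollars_per_mflop = mega_dollars * 1e6 / mflops    # $$/MFlop column

print(round(ratio, 2), round(mflops, 2), round(dollars_per_mflop, 2))
# prints: 8.19 71.33 26724.88  -- matching the table row
```

The same two divisions reproduce every derived entry in the tables.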


SP                102000 M-Ops

                                  APR xHPF    Vendor APR/Vendor APR xHPF APR xHPF
Platform          Mega$  N Procs  Time-Sec  Time-Sec    Ratio    Mflops  $$/MFlop

Cray C90         1.90625        1     1430    174.5     8.19    71.33 26724.88

Cray T3D            0.45       16     2368   202.11   11.716    43.07 10447.06
                     0.9       32     1353    104.1   12.997    75.39 11938.24
                     1.8       64      728    53.26    13.67   140.11 12847.06

IBM SP2-Wide       1.485       16      576     83.2     6.92   177.08  8385.88
                    2.97       32      320     48.7     6.57   318.75  9317.65 
                    5.94       64      192     30.1     6.38   531.25 11181.18



BT                    181300 M-Ops

                                  APR xHPF    Vendor APR/Vendor APR xHPF APR xHPF
Platform          Mega$  N Procs  Time-Sec  Time-Sec    Ratio    Mflops  $$/MFlop

Cray C90         1.90625        1     4975    276.8    17.97   36.442 52308.85

Cray T3D            0.45       16     1958   230.41    8.498   92.594  4859.90
                     0.9       32     1044   115.53    9.037  173.659  5182.57
                     1.8       64      551    59.01    9.337  329.038  5470.49

IBM SP2-Wide       1.485       16      446    112.9    3.950    406.5  3653.12 
                    2.97       32      245     61.8    3.964      740  4013.51
                    5.94       64      164     34.7    4.726   1105.5  5373.19



EMBAR                  26680 M-Ops

                                  APR xHPF    Vendor APR/Vendor APR xHPF APR xHPF
Platform          Mega$  N Procs  Time-Sec  Time-Sec    Ratio    Mflops  $$/MFlop

Cray C90         1.90625        1   278.84    36.62     7.61     95.7 19922.74

Cray T3D            0.45       16      100    22.74     4.40    266.8  1686.66 
                     0.9       32       50    11.37     4.40    533.6  1686.66 
                     1.8       64       25     5.68     4.40   1067.2  1686.66 

IBM SP2-Wide       1.485       16       79     9.95     7.94    337.7  4397.11
                    2.97       32       40     4.98     8.03      667  4452.77
                    5.94       64       23     2.49     9.24     1160  5120.69


*** System prices are from "NAS Parallel Benchmarks Results 3-95,"
Report NAS-95-011, April 1995, by Subhash Saini and David H. Bailey.

Note: Paragon results omitted because I did not find list price data
in the above report for the Paragon.

           ___   _____    _____
__________/==|__|==__=\__|==__=\     Applied Parallel Research, Inc.
_________/===|__|=|__\=\_|=|__\=\    1723 Professional Drive
________/=/|=|__|=|__/=/_|=|__/=/    Sacramento, CA 95825
_______/=/_|=|__|==___/__|====_/_______________________________________________
______/=___==|__|=|______|=|\=\________________________________________________
_____/=/___|=|__|=|______|=|_\=\_______________________________________________
    /_/    |_|  |_|      |_|  \_\    
Voice:     (916)481-9891             E-mail:    support@apri.com
FAX:       (916)481-7924             APR Web Page: http://www.infomall.org/apri/
-------------------------------------------------------------------------------

