Newsgroups: comp.parallel
From: rcarter@best.com (Russell Carter)
Subject: Re: Crisis in HPC (Was:Workshop on problems in HPC - London ...)
Organization: Best Internet Communications, Inc. (info@best.com)
Date: Thu, 3 Aug 1995 03:17:39 GMT
Message-ID: <3vpf4j$psn@shell2.best.com>

In article <3v6782$ktd@brahms.udel.edu>,
John D McCalpin <mccalpin@brahms.udel.edu> wrote:
>In article <48@mint.ukc.ac.uk>,  <P.H.Welch@ukc.ac.uk> wrote:
>>
>>          Crisis in High Performance Computing - A Workshop
>>          -------------------------------------------------
>>
>>State-of-the-art high performance computers are turning in what some
>>observers consider woefully low performance figures for many user
>>applications.  [...]
>>Efficiency levels for ``real'' HPC applications are reported (e.g.
>>by the NAS parallel benchmarks) ranging around 20-30% (for some 16-node
>>systems) to 10-20% (for 1024-node massively parallel super-computers).
>>Are low efficiencies the result of bad engineering at the application
>>level (which can be remedied by education) or bad engineering at the
>>architecture level (which can be remedied by <what>)?  
>
>Most applications that I have seen have poor performance levels
>as noted above.  These fall into two categories:
>
>(1) Limited by scalability/latency effects
>
>(2) Limited by poor single cpu performance
>
>The latter case is very commonly due to memory bandwidth limitations
>on the RISC processors used to build scalable parallel machines.
>This is very clearly documented for a wide variety of systems,
>including some of the big parallel machines, in my report on
>the STREAM benchmark:
>
>	http://perelandra.cms.udel.edu/~mccalpin/hpc/stream
>
>The only "cure" is to switch to algorithms with much better cache
>re-use, but which maintain fairly large granularity.  In fluid
>dynamics, this means dropping finite difference and low order finite
>volume schemes and switching to high order finite element or spectral
>element schemes.  In my area of ocean modelling, we see >80 MFLOPS per
>SP2 node for a spectral finite element code, while our finite
>difference competitors are seeing 8-10 MFLOPS per node on T3D or CM-5
>machines.

Ah yes, but different algorithms (which these most surely are) require
differing numbers of flops to reach a solution, and achieve different
"typical" performance rates, even on a Cray C-90.  So comparing raw
MFLOP rates across them is a trifle disingenuous, wouldn't you say?

What difference does it make if your algorithm goes like a hurricane
in the functional units but takes longer to reach the solution than something
more sophisticated that sort of breezes along, getting to the desired
point sometimes 2-4x faster?

MFLOP rates really aren't the bottom line: it's time-to-solution.

(examples abound, even in the seemingly homogeneous "field" of
 finite difference computations.  For three quite standard approaches,
 with radically different mem/flop-rate trade-offs, see any NAS
 Parallel Benchmark Report, and look at the SP, BT, and LU statistics.)

>
>I should note that the next generation of cpus will have significantly
>enhanced bandwidth, but in the context of the huge increases in cpu
>performance, the overall balance is not going to be significantly
>improved --- you can still expect to see 10% efficiencies, but it will
>be 10% of 400-600 MFLOPS instead of 10% of 100-150 MFLOPS.

If the code is DRAM-bandwidth limited now, it will still be so with
these newer CPUs, and the performance increases will be modest.
Not so for cache-based algorithms, such as the Level-3 BLAS.

Best regards,
Russell
http://www.geli.com

>--
>John D. McCalpin		 mccalpin@perelandra.cms.udel.edu
>Assistant Professor, College of Marine Studies, Univ. of Delaware 
>	http://perelandra.cms.udel.edu/~mccalpin