Newsgroups: comp.parallel
From: mccalpin@brahms.udel.edu (John D McCalpin)
Subject: Crisis in HPC (Was:Workshop on problems in HPC - London ...)
Summary: per-node memory bandwidth is inadequate
Organization: College of Marine Studies, U. Del.
Date: Wed, 2 Aug 1995 00:07:34 GMT
Message-ID: <3v6782$ktd@brahms.udel.edu>

In article <48@mint.ukc.ac.uk>,  <P.H.Welch@ukc.ac.uk> wrote:
>
>          Crisis in High Performance Computing - A Workshop
>          -------------------------------------------------
>
>State-of-the-art high performance computers are turning in what some
>observers consider woefully low performance figures for many user
>applications.  [...]
>Efficiency levels for ``real'' HPC applications are reported (e.g.
>by the NAS parallel benchmarks) ranging around 20-30% (for some 16-node
>systems) to 10-20% (for 1024-node massively parallel super-computers).
>Are low efficiencies the result of bad engineering at the application
>level (which can be remedied by education) or bad engineering at the
>architecture level (which can be remedied by <what>)?  

Most applications that I have seen show performance levels as poor
as those noted above.  They fall into two categories:

(1) Limited by scalability/latency effects

(2) Limited by poor single cpu performance

The latter case is very commonly due to memory bandwidth limitations
on the RISC processors used to build scalable parallel machines.
This is very clearly documented for a wide variety of systems,
including some of the big parallel machines, in my report on
the STREAM benchmark:

	http://perelandra.cms.udel.edu/~mccalpin/hpc/stream

The only "cure" is to switch to algorithms with much better cache
re-use that still maintain fairly large granularity.  In fluid
dynamics, this means dropping finite difference and low order finite
volume schemes and switching to high order finite element or spectral
element schemes.  In my area of ocean modelling, we see >80 MFLOPS per
SP2 node for a spectral finite element code, while our finite
difference competitors are seeing 8-10 MFLOPS per node on T3D or CM-5
machines.

I should note that the next generation of cpus will have significantly
enhanced bandwidth, but relative to the much larger increases in peak
cpu performance, the overall balance is not going to improve much ---
you can still expect to see 10% efficiencies, but it will be 10% of
400-600 MFLOPS instead of 10% of 100-150 MFLOPS.

(For any readers in New Mexico, I will be speaking on this topic
at Los Alamos National Lab on Friday July 28 at 9am in the 
Center for Nonlinear Science Conference room.)
-- 
John D. McCalpin		 mccalpin@perelandra.cms.udel.edu
Assistant Professor, College of Marine Studies, Univ. of Delaware 
	http://perelandra.cms.udel.edu/~mccalpin