Newsgroups: comp.parallel
From: P.H.Welch@ukc.ac.uk
Subject: Is parallel efficiency relevant?
Organization: University of Kent at Canterbury, UK.
Date: 5 Sep 1995 14:44:06 GMT
Message-ID: <42hnnm$7ip@usenet.srv.cis.pitt.edu>

[This may interest readers of this newsgroup.  For those who can get
to it, earlier parts of this discussion may be found from Article 24
onwards in uk.org.epsrc.hpc.discussion ... ]


In Article 31 of uk.org.epsrc.hpc.discussion, cliff@liverpool.ac.uk
(Dr C. Addison) writes:


> ...................................  I suspect that those people who took
> the time and trouble to rewrite their code in occam from Fortran / C and then
> added message passing for transputers probably are not getting as poor a 
> parallel performance on the T3D etc. as many others are.


I think the people who had most success with transputers/occam were those
fortunate enough to have had a problem for which previous Fortran/C code
did *not* exist.  You needed to think about your problem in terms of
communicating processes from the start - i.e. the computational and
communication aspects of the code were developed together as part of
an integrated parallel design.  Translating from serial Fortran/C and
then adding message-passing means that you are very late in the design
cycle before starting to think parallel.  Nevertheless, the discipline
imposed by transputers/occam probably had great benefit in simplifying
things ... and such benefits should remain when taking those designs on
to machines like the T3D ... i.e. I think Cliff is right.

What worries me is that lessons like think-parallel-from-the-start are
being forgotten.  I suspect that many PVM/MPI-using applications still
write/inherit serial code describing computations and then add the
parallel distribution and message-passing afterwards.  If the latter
is not designed with the former, it becomes a low-level aspect of the
overall design ... and hence is obscure, hard to reason about,
perceived to be "difficult" and may be inefficient.

We then get told that we (the application programmers) should avoid
thinking parallel altogether and rely on very clever compilers (assisted,
perhaps, by data-parallel annotations and loops - e.g. HPF) to generate
the necessary message-passing automatically.  Maybe that will (in time)
work with an acceptable efficiency, but I do get depressed about it ...

I *like* thinking parallel - I find it very natural.  I don't want to
confine the logic in my algorithm to serial invocations of serial or
low-level data-parallel operations.  I want to be able to use parallel
logic at *any* level of design.  Of course, the parallelism I want to
express in my design is that determined by my application and *not*
the parallelism determined by my MPP hardware!  If I can do that, my
logic will be (much) simpler than the equivalent logic I will have
to come up with if I am constrained to use only serial code.  If the
logic is simpler, it will be faster to develop, easier to maintain,
I'll have more confidence in its correctness (especially if the
language used to express it has a clean semantics and is backed up
by a simple and rich mathematical model ... occam/CSP?) *and* there
should be a better chance of efficient execution.

But how do we reconcile the application-oriented parallelism (I nearly
said object-oriented parallelism ... but that's maybe not so different?)
in the software with the usually very different parallelism in the
hardware on which we happen (today) to be trying to execute???  That
seems to me like an interesting "grand challenge" for HPC software
and hardware architects to solve.  Otherwise, we may be in danger
of hearing (in a couple of years?) the excited announcement of the
procurement of the fabled TeraFLOP/s machine ... only to find we can
only get 1% from it ...


> HOWEVER, recall that on transputers, one took a considerable performance hit
> if you stayed with Fortran and C, to which you then needed to add the usual
> hit from using a distributed memory system.


With the early C compilers for transputers, that was the case.  Now,
the INMOS C and occam compilers share the same code-generator and produce
very similarly performing code (although I think occam is still a bit
lighter in its message-passing latency ... which can be significant).
I believe other vendors' C compilers are also very good.  I don't
know about Fortran on transputers.


> What is the baseline for general comments of a crisis? Poor single processor
> performance? Poor (virtual) shared memory performance? Poor message
> passing performance or what?


We have three types of resource here: processor, memory and communications.
They need to be in balance.  Poor performance on *any* one prevents
obtaining benefit from the others and is a cause for concern.


> What sort of change has occurred in what is defined as HPC in the last 
> 5 years? In 1990, what sort of memory and expected performance rate could
> a large number of applications expect from an HPC system? What can be
> expected today, even taking into account that expected performance and peak
> performance can vary by an order of magnitude? My guess is that a lot more
> useful science can be done on today's systems than could be done in 1990.


When/if we get the 1% TeraFLOP/s machine, the same will be said in its
defence compared with the machines of 1995 ...


> I am also willing to speculate that the bottleneck that restricts performance
> has shifted from raw floating point performance to memory latency and
> bandwidth.


Certainly, the bottleneck is not floating-point performance ... around 85%
of it seems to go unused!


> .........  In other words, things are a lot more complicated when trying to
> assess performance etc. and wild claims about poor efficiencies on single
> processors that are based on comparing achieved vs. peak performance are
> comparing apples and oranges. The real question is what is the best
> performance one can expect from an application given its memory access
> and computational characteristics?


The memory access and computational characteristics of the applications
haven't changed over the past eight years ... only the technology has
progressed (regressed?).

Where do we draw the line on efficiency?  We seem to be getting around
15% of a 40 GFLOP/s machine today.  What about a next generation - do
we have any assurance we will get 15% of a 400 GFLOP/s machine tomorrow?
What if we only get 5% - will that be acceptable?  After all, it would
still get us about treble the throughput of our current machine.  Or
would we *then* be entitled to complain and ask for our 15% back?  If so,
why can't we ask *now* for our 50% (plus) back from the good-old-days?
I don't see that it's comparing apples and oranges to ask ...


> .................................  If the sustained performance is a lot less
> than this figure, then something is wrong. If it is not, but performance is
> poor, then one is pushed towards thinking in terms of new algorithms that
> optimise locality of reference


I'd like to avoid having to consider such low-level details of the hardware
when developing algorithms ... caches that we have to design our programs
around sound a worse nightmare than vector-processors ...


> .............................. or just accepting that your application is
> limited by some other architectural characteristic than its floating point
> performance. 


If the majority of applications are like that, then why have machines with
such fast floating-point units?  Why not build cheaper machines with modest
floating-point units that the memory and communication latencies and
bandwidths can sustain?  Can't we get the same throughput for our
applications at much lower cost?  I don't know ... 


> Just a few comments. Really this is better as a discussion over beer than
> over the internet!


I'll be searching out a pub immediately following the workshop on the 11th!

Peter Welch.

