Newsgroups: comp.parallel
From: Zhiwei Xu <zxu@monalisa.usc.edu>
Subject: Is SP2 the best supercomputer? (Long)
Organization: University of Southern California, Los Angeles, CA
Date: 16 Apr 1995 22:49:02 -0700
Message-ID: <3n36u8$js6@usenet.srv.cis.pitt.edu>

The subject line is not meant to be a flame baiter. All the standard
disclaimers apply.

Last October, when I was attending an SP2 workshop at Maui
High-Performance Computing Center (MHPCC), I heard one knowledgeable
guy state that "SP2 is the best supercomputer".

Now, after using SP2 for more than 7 months, sometimes intensively, I
would like to share my experience and opinions with you. This is not
meant to be for or against IBM or any supercomputer companies. Rather,
I think we users should let the vendors know what we need, what exists
already, and what is missing.  We users should exchange experience so
that repeated mistakes can be avoided. 

I hope this post will stimulate discussions on

	What important features are already there, but should be pointed
		out so that other users can take advantage of them

	What important features are missing

	What are the most frequent mistakes a user can make

I'll summarize the responses.

PART I. Good Things about SP2

First of all, I like SP2. The detailed reasons are listed below.  (But
let me hasten to add that I would not go so far as to say that "SP2 is
the best supercomputer". For one thing, I happen to think the Cray T3D
has a very nice architecture, with many features (e.g., eureka,
prefetch-buffer, the swap hardware) yet to be fully exploited by the
software and users. There are features missing in SP2 but available in
other systems, too. Besides, my experience in other systems is limited.)

1. The NOW (Network of Workstations) concept

I use the term NOW for lack of a better alternative (not the Berkeley NOW). 
This concept contains the following attributes, among other things:

(1)	Each node is a full-fledged workstation, with its own disk and a
	complete multi-user, multi-tasking OS.
(2)	Multiple tasks, either from the same application or from different
	users, can be executed on the same node at the same time.
(3)	An application can be invoked from a host to run on multiple hosts,
	where a host can be either an SP2 node at Maui or a RS/6000 workstation
	in my office (or anywhere on the net).
(4)	The hosts are connected, either through ethernet or IBM's
	High-Performance Switch (HPS).
(5)	All messages during an execution can be automatically routed to the
	invoking host, with a tag showing the originating host.

The main benefit is to facilitate program development. I can first develop a
parallel code on my PowerPC 601 workstation. Because of the limited memory, the
parallel code might be a scaled-down one. Then I can test exactly the same code
on SP2 in interactive mode and extend it to a full-scale code (by simply
changing a few constant values in a .h file) to test on multiple nodes. The
resource is almost always available, due to the multi-user feature. My parallel
code executes slowly, but I am mainly concerned about correctness at this
stage. After the code is fully debugged, I can run the production code
in batch-mode through the Loadleveler, or even in dedicated mode.

A few months ago, I talked to a guy at a computing center that was deciding
whether to buy an SP2. He did not like this NOW concept, because it is too
wasteful: letting the same, complete AIX occupy each node's disk and duplicate
itself in each node's memory does not seem to make sense. They then purchased a system
from another company. I think it might be wasteful, but it surely makes the
user's life easier. This NOW concept presents a familiar workstation-like
environment. I don't have to worry about getting and releasing resources.
I can use the familiar Unix functionality. Furthermore, AIX has been around
for years and gone through many rounds of testing and revision. In other words,
it is a mature, time-tested product, and thus reliable (see Item 2 below).

BTW, I recently heard that the aforementioned computing center bought an SP2.

Another hint that the IBM people tried to make the system more general-purpose
is the way they designed their Message-Passing Library (MPL), which, in my
opinion, is not optimized for the HPS. Instead, they tried to implement it to
have good overall performance on any interconnect. Contact Howard Ho
(ho@almaden.ibm.com) for their IEEE TPDS paper.

2. Reliability
The Maui SP2 is reliable. Most of the time, we have no problem with the SP2
system when fewer than 128 nodes are used. All the errors we at first thought
strange were eventually traced back to bugs in our own code. Our longest
continuous use of SP2 was three days, and the system behaved well during that
time.

However, when more than 128 nodes (e.g., 256 nodes) were used, we experienced
various weird behaviors; e.g., our tasks were randomly killed with no reason
given. Fortunately, all these strange behaviors were transient. We just ran
the same code again and again (no need to change anything), and eventually
the system computed the correct result.


3. Good single node performance

One SP2 node has a peak speed of 266 MFLOPS.
Our application achieved a sustained ~100 MFLOPS on one SP2 node. The slowest
component got 5 MFLOPS, the fastest got 200 MFLOPS. 

These results were achieved using plain C code, without assembly, libraries
(e.g., ESSL), or even modifications to the original C source. The application
was compiled with
cc -O3 -qarch=pwr2 *.c -o executable

The original C code of our application was not written for the current
generation of superscalar, cached processors, such as the POWER2 on SP2.

One issue is how to represent an array of complex numbers. Two popular
methods are shown below:

/****** Method (1) Array of Structure *********/
typedef struct { double real,image; } COMPLEX;
COMPLEX data[N1][N2][N3] ;

/****** Method (2) Separate Arrays *********/
double data_real[N1][N2][N3] ;
double data_image[N1][N2][N3] ;

Our experience is that Method (1) should always be used, because it has a much
smaller cache miss ratio. In our application, the slowest code (5 MFLOPS) used
Method (2), for which we computed the cache miss ratio to be as high as 33%.
After changing to Method (1), the speed improved to 20 MFLOPS, and the cache
miss ratio dropped to 3%.

4. Easy and useful MPL

Programming message passing machines is difficult. But MPL makes it easier
by supporting both point-to-point and collective communications. The
documentation of MPL is concise but clear, and the library is easy to learn.

Our application used the following communications routines, in decreasing
order of importance (to us):

barrier
index (total exchange, or all-to-all personalized communication)
reduction
shift
broadcast
blocking send/receive
scatter
gather

Another nice feature of the MPL is that we never need to worry about
how large the message is. We don't have to allocate buffers. We used
point-to-point messages as long as 16 MB and indexed 100 MB without any
trouble.

One lesson we learned is that one should always try to use collective
communication. Avoid using a sequence of send/receive calls to simulate a more
powerful primitive: it is not only error-prone, but also much slower
(2 to 30 times slower in our experience).

Another technique we found useful is to first develop a *synchronous* parallel
code; that is, insert barriers between the stages of your code. This
makes semantic and performance debugging easier. After the code is fully
debugged, remove the barriers.

5. Good User Support (This actually should be Item 1)

I used the SP2 at MHPCC, where the user support group is first-rate. They
always promptly answer your questions. If they don't have the information, they
will research it for you and eventually give you a concrete answer.

I highly recommend the SP2 workshop at MHPCC, which is essentially free
(they charge only $75; of course, you have to pay for your own travel and stay).
Even the experienced can learn quite a few things. Besides, you get to see
the beautiful Maui :-) The workshop material is at www.mhpcc.edu/mhpcc.html.
It has a lot of useful stuff. Check it out even if you are not using SP2.

You can also get help from the IBM people who designed SP2. For instance,
I got help on the HPS from Craig Stunkel, MPL from Howard Ho, MPI from
Hubertus Franke, and single-node performance from R.C. Agarwal. These
people always promptly answer your email and give you concrete pointers.


PART II. Things I would like to see available in SP2 (and other supercomputers)

1. User Support Center

I think IBM (and every other supercomputer company) should set up a center to
answer email questions from users, or even better, from the public, who are
potential users. They should offer this service free, because it can
attract more people to use (and to buy) their systems. This center should
be set up following the MHPCC model.

For some time, we have been trying to find out how to use the POWER2 Performance
Monitor, which allows one to see things like the cache miss ratio. (Any tips?)
If there were an SP2 user support center at IBM, this question could be
answered right away. For now we have to rely on the MHPCC people, who are
still researching it. The companies are in the best position to
answer users' questions.

2. Overhead Formulae

It could save users a lot of time and headache if the supercomputer companies
documented and published overhead data for their task creation, communication,
and synchronization operations. This is especially important at parallel code
*design* time. Ideally, these data should be expressed as closed-form formulae,
following Roger Hockney's t = t_0 + message_length / r_infinity model.

Since such info was missing for SP2, we had to test and derive these formulae
ourselves.

We have measured pingpong, bcast, index, scatter/gather, shift, barrier, reduce,
and prefix, using up to 16 MB messages on up to 256 dedicated nodes.
We then curve-fitted the timing data and obtained some closed-form expressions:

MPL Operation   Time in microseconds

pingpong/2      46+0.035m               m is message length in bytes

pingpong/2      39+0.028m               (IBM's own formula; it is more
  (from IBM)                            accurate for large messages, >16KB)

barrier         94logN+10               N is the number of nodes

reduce          20logN+23

prefix          60logN-25

bcast           (16logN+10) + (0.025logN)m

gather/scatter  (17logN+15)+(0.025N-0.02)m

index           80logN + (0.03N^1.29)m

shift           (6logN+60) + (0.003logN+0.04)m

When we derived these expressions, we had no idea what forms the expressions
should take, so we had to try various forms. This could be done much more
accurately and efficiently by the vendors, because they are in the best position
to run controlled experiments, and they know what forms the expressions should take.

Note that the above list is a partial one. In my opinion, the companies
should provide overhead expressions for all operations in MPI and PVM (e.g.,
grouping, task creation, inquiry, besides communication), and
overhead for locks and semaphores in shared memory machines.

3. Atomicity and Mutual Exclusion

Most current message-passing systems (Express from Parasoft being an
exception) do not support things like atomic transactions and critical
sections (regions).

Atomicity and mutual exclusion are needed by user applications, regardless of
whether the application is implemented on a shared-memory or a message-passing
system.

I don't know how to implement atomicity and mutual exclusion on SP2 using
PVM, MPI, or MPL, except through schemes such as Lamport's distributed
mutual exclusion algorithm, which is time-consuming and error-prone, and
probably not efficient on SP2. (Does anyone have a better idea?) In our
application, we had to change the algorithm to avoid the atomic operation.
Fortunately, this worked for the application parameter ranges we require.
But we surely need such a functionality.

4. Eureka

This refers to the capability for one task to asynchronously notify other
tasks that something has happened. It is available in the Cray T3D
architecture (called eureka) and in Express (called exhandle). This feature
is useful in searching, and also in our application.

5. Dynamic Tasking

This refers to the capability for applications to create, terminate, and
migrate tasks at run time. This is useful for fault tolerance, among other
things. We can do it now using PVM on SP2, but it would be nice if
MPL and MPI had such a feature. There is an MPI proposal for creating tasks.

----------------------------------------------------------------------------

Thanks for reading this long message.

Any response, pointers, comments, even flames, are welcome.

Zhiwei Xu	zxu@aloha.usc.edu

