Newsgroups: comp.parallel.mpi
From: nevin@tbag.osc.edu (Nicholas Nevin)
Subject: LAM 6.0 performance
Organization: Ohio Supercomputer Center
Date: 15 Mar 1996 13:51:51 -0500
Message-ID: <vn2g2ba42bc.fsf@tbag.osc.edu>


Previous versions of LAM, an MPI implementation for clusters, carried a
performance handicap because all communication was routed through LAM's
per-node daemon, much like PVM in its default mode.  Any library-only
implementation, most notably MPICH, was significantly faster because
communication runs directly between application processes, using the
underlying network services (TCP/IP on clusters).  Our initial motivation
for implementing with a daemon was to supply superior debugging
capabilities and a faster test cycle (inspect, recompile, rerun).

LAM 6.0, the new release, removes this performance handicap: it is now
also capable of direct process-to-process MPI communication, bypassing
the daemon.  The choice between enhanced debugging and peak performance
is made on the mpirun command line; no source code changes are required.

% mpirun myapp

traditional daemon mode, with maximum runtime inspection of processes
and messages

% mpirun -c2c myapp

"client-to-client" or "go-fast" mode, bypassing the daemon and getting the
best performance from the underlying system

To back up this advertisement, we studied the performance of LAM 6.0
and MPICH 1.0.12 on our DEC AXP cluster, connected via an FDDI network
(no giga-switch).  The results are presented in Ohio Supercomputer Center
Technical Report OSC-TR-1996-4, available from http://www.osc.edu/lam.html
or ftp://ftp.osc.edu/pub/lam.

Here is a brief summary of the results.

underlying system:	  8 DEC AXP 3000/300 workstations, OSF/1 V3.2, FDDI
MPICH device:		  ch_p4
MPICH compilation:	  -O, -nodevdebug
LAM compilation:	  -O
LAM mpirun flags:	  -c2c (hi-perf), -O (homogeneous network)

MPICH's ch_p4 device sets up TCP/IP connections on a demand-driven basis.
To remove this effect, all measurements were started after a priming
communication on all necessary connections.  MPI_BYTE was the only
datatype used.  The point-to-point measurements were done on 2 nodes.
The collective measurements were done on 8 nodes.
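For the curious, a ping-pong measurement of this kind has the following
general shape.  This is a hypothetical sketch, not the harness used for
the tech report (the repetition count and buffer size here are arbitrary);
it shows the priming exchange before timing, MPI_BYTE as the datatype,
and a 2-process round-trip loop:

```c
/* Sketch of a 2-process ping-pong timing loop (NOT the report's
 * actual harness).  Run with 2 processes, e.g. via mpirun. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, nbytes = 1024;          /* arbitrary message size */
    char *buf;
    double t0, mean;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(nbytes ? nbytes : 1);

    /* Priming exchange: open the connection before timing, so that
     * demand-driven TCP setup (as in ch_p4) is excluded. */
    if (rank == 0) {
        MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
    } else {
        MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }

    /* Timed ping-pong, MPI_BYTE only. */
    t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    mean = (MPI_Wtime() - t0) / REPS;
    if (rank == 0)
        printf("%d bytes: mean round trip %.6f s\n", nbytes, mean);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

The same source runs unchanged under either LAM mode; only the mpirun
flags differ.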

LAM 6.0 compared favourably with MPICH 1.0.12 on all measurements.
For instance, on the ping-pong tests:

BYTES   LAM mean (secs)  MPICH mean (secs)
0       0.001534         0.002295
1024    0.002166         0.002814
4096    0.003532         0.004154
8192    0.005507         0.007021
16384   0.010508         0.010693

and on the alltoall tests:

BYTES   LAM mean (secs)  MPICH mean (secs)
0       0.005947         0.016650
1024    0.008646         0.020332
4096    0.019356         0.032949
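The alltoall test exercises MPI_Alltoall, in which each of the 8
processes sends a distinct buffer segment to every other process.  A
minimal sketch of such a timed call (hypothetical, not the report's
harness; the buffer size here is arbitrary):

```c
/* Sketch of a timed MPI_Alltoall across all processes (NOT the
 * report's actual harness).  Each rank sends nbytes to every rank. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, nbytes = 1024;  /* arbitrary per-pair size */
    char *sendbuf, *recvbuf;
    double t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = malloc((size_t)nbytes * nprocs);
    recvbuf = malloc((size_t)nbytes * nprocs);

    /* Priming call: establish all pairwise connections first. */
    MPI_Alltoall(sendbuf, nbytes, MPI_BYTE,
                 recvbuf, nbytes, MPI_BYTE, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);      /* start everyone together */
    t0 = MPI_Wtime();
    MPI_Alltoall(sendbuf, nbytes, MPI_BYTE,
                 recvbuf, nbytes, MPI_BYTE, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d bytes per pair: %.6f s\n", nbytes, MPI_Wtime() - t0);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```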

Again, the full data from all the tests, as well as more detail on the
test parameters, are available in the tech report.  The point is not
to claim that LAM has superior performance to MPICH on common platforms.
Both systems are under continued development and the most recent release
will probably be the fastest.  The point is that both implementations are
optimizing for high performance along the same path, the direct and efficient
exploitation of the underlying system.

-=-
Nick Nevin				nevin@osc.edu
Ohio Supercomputer Center		http://www.osc.edu/lam.html

