Newsgroups: comp.parallel.mpi
From: jeff@is.s.u-tokyo.ac.jp (Jeff McAffer)
Subject: Re: Broadcasting and ordering
Organization: University of Tokyo / Object Technology International
Date: 29 Apr 1995 06:20:30 GMT
Message-ID: <3nslre$sp4@isnews.is.s.u-tokyo.ac.jp>

In article <3npb9f$i5i@NNTP.MsState.Edu> tony@aurora.cs.msstate.edu (Tony Skjellum) writes:

 >MPI2 committee is considering this, but not very favorably.  I am working
 >on a proposal for the collective extensions chapter that will define 
 >the semantics simply as follows:
 >
 >a) MPI_MCAST must have the same properties as a sequential send loop
 >b) Regular send must be used to receive it 
 >c) Special multicast hardware can be used to implement it, provided a) and b)
 >are satisfied.
 >
 >MCAST has terrible properties:
 >	  -- progress rule violation
 >	  -- deadlock potential
 >	  -- order violation imminent
 >
 >and has been criticized and discussed extensively by forum.  It is also argued
 >that its true performance characteristics are illusory, because it assumes
 >that processes are lined up, as they might be for an MPI_BCAST.  Earlier
 >argumentation on this has appeared on this newsgroup.

Sorry if this is a repeat of previous threads.  Is there a digest
somewhere?  Does anyone have the thread(s) around?  I'd like to see
what was said.

I'm not sure about your list of bad properties.  Perhaps you are
proposing an MCast with some sort of blocking semantics?  I'm not sure
that is the only useful form.  My systems don't use blocking
sends at all.  The semantics (and actual behaviour) of a non-blocking
MCast and a loop of non-blocking p2p sends should be identical.  If
you have hardware support, great!  If not, it is essentially a macro.
By expressing your high-level intent (to send the same message to a
group of receivers) you allow the implementation to do the best it
can.
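To make the equivalence concrete, here is a toy sketch of the semantics I
mean -- plain Python, not MPI, and every name (isend, mcast, the inbox
model) is made up for illustration.  The point is only that a non-blocking
multicast is observationally the same as a loop of non-blocking p2p sends:

```python
from queue import Queue

# One inbox per rank; isend() only enqueues, so it never blocks the sender.
inboxes = {rank: Queue() for rank in range(4)}

def isend(msg, dest):
    """Non-blocking point-to-point send: deposit the message and return."""
    inboxes[dest].put(msg)

def mcast(msg, dests):
    """Non-blocking multicast: semantically just a loop of isend()s.
    Multicast hardware could replace the loop without changing behaviour."""
    for dest in dests:
        isend(msg, dest)

mcast("update", [1, 2, 3])
received = [inboxes[r].get() for r in (1, 2, 3)]
print(received)  # every receiver gets the same message
```

An implementation with hardware support can collapse the loop into one
operation; a portable one can keep the loop.  Either way the caller has
expressed the high-level intent.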

 >I will also propose a variation:
 >    MPI_MCAST_SUBGROUP(buffer, count, type, tag, group, comm)
 >which I expect will not be accepted.  It will broadcast to a subgroup of comm's group.
 >This would be a nicety, but is very different semantically from other MPI functions.
 >The syntactic sugar argument might win, and force us to stay only with MCAST.  Clearly,
 >MPI_MCAST_SUBGROUP subsumes MPI_MCAST as a special case, but not necessarily without
 >more checking.

Yes, actually this is the most interesting form IMHO.  As I see it,
MCAST is just a MCAST_SUBGROUP with the comm group elements for the
subgroup arg.  Of course, knowing that you are sending to the entire
comm group (e.g., MCAST) may enable optimizations on some platforms.

What form would your group arg take?  Would this be just some array of
ids or are you thinking of some new MPI structure?  I would go for the
former.  If you are using a new structure, how is this different from
just constructing a new comm group and using MCAST?  This is fine for
relatively static applications with known comm patterns.  You can
create all the comm groups and just pick what you need.  But for
anything more dynamic it would be nice to have lightweight subgroups.
To put things in a bit of perspective, here is an example of why I'd
like this behaviour.

I have a distributed Smalltalk system (very dynamic symbolic
object-oriented computations) in which objects can be selectively
replicated on different nodes.  When one copy of an object is updated,
we have to coordinate with all the others to maintain consistency.
This coordination involves (a series of) message sends between objects
in just a subset of the nodes in the system (only those nodes which
have a replica).  This set of nodes is dynamic as objects are replicated
on a demand basis determined at runtime and impossible to predict.  It
is vitally important to performance that the coordination be as fast
as possible.  IMHO this is a perfect example of the requirement for
MCAST_SUBGROUP with lightweight subgroups.  Note that BCAST or
blocking semantics are not suitable here because we only want to lock
the object being changed, not the whole application on each node.
This locking is taken care of by the application itself and is
outside the scope of MPI.
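The contrast I'm drawing between precomputed comm groups and lightweight
subgroups can be sketched like this -- again toy Python, not MPI, and the
Comm class and both function names are hypothetical:

```python
from queue import Queue

# One inbox per node; isend() enqueues and returns, i.e. never blocks.
inboxes = {rank: Queue() for rank in range(8)}

def isend(msg, dest):
    inboxes[dest].put(msg)

class Comm:
    """Heavyweight route: a communicator-like group built up front.
    Fine when the communication pattern is known in advance."""
    def __init__(self, ranks):
        self.ranks = tuple(ranks)   # imagine group-construction cost here

def mcast(msg, comm):
    for dest in comm.ranks:
        isend(msg, dest)

def mcast_subgroup(msg, ranks):
    """Lightweight route: the subgroup is just an array of ids, so a
    replica set discovered at runtime needs no setup step at all."""
    for dest in ranks:
        isend(msg, dest)

replica_set = [2, 5, 7]             # nodes holding a replica, found at runtime
mcast_subgroup("sync", replica_set)
delivered = [inboxes[r].get() for r in replica_set]
print(delivered)
```

With the heavyweight route I would have to construct a Comm for every
replica set that ever arises; with the lightweight one the list I already
have in hand is the subgroup.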

Anyway, I hear that the folks at ANU have implemented MCAST in their
MPI for the AP1000.  I'm not sure of the exact semantics (e.g.,
blocking etc) but from what I see they meet the spec you give above
(a, b and c) and they do use hardware support.  Luckily, I'm using an
AP1000, so I'll look into that.  I would like to see it on my
other platforms as well.

Jeff

--
ato de, |m        -- Why do women always leave the toilet seat down?

