Newsgroups: comp.parallel,comp.parallel.pvm
From: tony@aurora.cs.msstate.edu (Tony Skjellum)
Subject: Re: MPI limitations? (Multicast)
Organization: Mississippi State University
Date: Thu, 7 Jul 1994 12:44:39 GMT
Message-ID: <CsKM2G.39t@dcs.ed.ac.uk>

par@sdiv.cray.com (Peter Rigsbee) writes:

>In article <Cs2642.2vv@dcs.ed.ac.uk>, jim@meiko.com (James Cownie) writes:

>> Multicast (a send to many processes, which is received with a normal receive) 
>> is a non-trivial addition to MPI, and was NOT omitted as the result of an oversight,
>> but because of the implementation difficulties it would pose, and the
>> additional cost it would add to every point-to-point message operation.
>> 
>> Consider the following requirements :-
>> 1) Multicast should be more efficient than the linear performance the user
>>    can obtain by simply doing non-blocking sends in a loop followed
>>    by a waitall. (In other words we expect the system to use a broadcast
>>    tree and perform the operation logarithmically).

>I disagree with this.  

>I don't think the semantics of multicast require that it be implemented 
>with a broadcast tree.  While there are performance benefits of doing so,
>there are also (as you point out later) serious usability problems.  But 
>rather than leading to the conclusion that multicast shouldn't be added, 
>perhaps these problems should simply lead to the conclusion that the 
>broadcast tree is an inappropriate implementation.

>Why implement multicast with less than log performance?
>	- It can still be more efficient.  The user only makes one call
>	  to MPI instead of 'n' sends and a 'wait'.  Not only does this
>	  reduce the number of calls, but it also means that MPI need only
>	  check some parameters once instead of 'n' times.  Further, even
>	  if MPI implements it as a loop over the list of tasks, it can
>	  do so at a lower-level than is available to the user, saving more
>	  overhead.  These savings can add up...
>	- The user code is simpler to understand and easier to maintain.
>	  One call instead of a loop of sends and a wait.
>	- Conceptually, the user sees this as a single transmission of data
>	  to a number of tasks.  The single call matches the mental model.
>	- It makes it easier to port codes that already use multicasts
>	  (such as from PVM).

>Clearly, when possible, broadcast would be preferable to this slower 
>multicast.  But for others, the close coupling inherent in broadcasts
>may not be feasible.

>The Cray T3D PVM implementation of pvm_mcast is exactly this lower-level
>loop over the list of tasks.  There is a measurable performance benefit
>over the user loop of pvm_sends.  It's not a huge benefit, but combined
>with the ease-of-use issues, multicast seems to have some benefit.

>I don't think multicast ought to be dismissed simply because one possible
>implementation strategy won't work.

>	- Peter Rigsbee

MPI is the product of input from many sources, with significant vendor
input, Cray's included.  So Cray could have pushed for this feature
(Peter was their representative).  We can consider it again for MPI-2
as well, if it is proposed.

It is clear that there are incremental advantages to pushing sets of
closely related operations down into the system, rather than having the
user provide the loop.  These are exactly the advantages Peter notes
above, and they are reasonable!

However, this is probably not the only case where one might want to
request multiple actions on a single datatype with a single call!
Persistent communication objects in MPI try to provide this higher
performance for users in a more general setting, so the idea was not
missed in MPI, just not packaged into a
guaranteed-to-be-linear-complexity multisend command.  Using persistent
objects, the user could create such a multisend and expect it to be
much more efficient than a simple loop around a send!  So I claim one
can reasonably argue that multisend is layerable on top of MPI with
little if any loss of performance, except when the destinations of the
multisend change often, in which case the persistent-object approach
would be unviable.  If the destinations of the multisend are the group
of a given communicator, and that communicator is used a number of
times, we have a mechanism to provide the call.
[So the argument of convenience remains, but the argument of performance
is largely mitigated by the existence of persistent objects.]

However, it is still more interesting to discuss whether scalable
broadcasts can be developed that don't require the root to be known
a priori, and that don't have to use O(P) sends from the root for P
processes in the group, once P exceeds perhaps 5 - 10 destinations.

Notice that broadcasts can be done reliably in MPI because the
participants are known, so one can guarantee that back-to-back
broadcasts with the same or different roots on the same communicator do
not get mixed up [what I have called back-masking at various times].
This feature permits a broadcast tree, but requires 1) knowing the
participants, and 2) knowing the root.  Under these circumstances, an
implementation that uses point-to-point transmissions to effect the
broadcast can avoid any wildcard receives.  It also permits a linear
algorithm when the number of recipients is small and the tree is
unjustified.  Thus, the protocol allows a poly-algorithm in the number
of destinations.
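To illustrate the tree side of such a poly-algorithm, here is a
plain-C sketch (no MPI calls) of how each process can compute its
parent and children in a broadcast tree from purely local information:
its rank, the root, and the group size.  The binomial layout used here
is one common choice, not something MPI mandates:

```c
/* Compute this process's place in a binomial broadcast tree.
   Fills children[] with the absolute ranks this process must forward
   to and returns their number; *parent is set to the rank to receive
   from (-1 for the root). */
int bcast_tree(int rank, int root, int size, int *parent, int *children)
{
    int rel = (rank - root + size) % size;   /* rename so root is 0 */
    int mask = 1, nchild = 0;

    *parent = -1;
    /* The lowest set bit of rel identifies the parent. */
    while (mask < size) {
        if (rel & mask) {
            *parent = (rel - mask + root) % size;
            break;
        }
        mask <<= 1;
    }
    /* Children sit at rel+mask for each mask below the parent bit
       (for the root, all bits), as long as they are in range. */
    for (mask >>= 1; mask > 0; mask >>= 1)
        if (rel + mask < size)
            children[nchild++] = (rel + mask + root) % size;
    return nchild;
}
```

A real implementation would receive from *parent and then send to each
entry of children[]; below some small group size it could skip the tree
and loop linearly over the destinations, which is the poly-algorithm
point above.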

What are alternatives when the root is not known?  

1) An extremely efficient allreduce over just the root information is
feasible; if the broadcast is long, its cost hides that of the
allreduce [also called "combine" in other systems].  The allreduce
could be followed by a standard broadcast.

2) If a subset of the processes knows the root, one could write an
algorithm that is better than 1) but not as cheap as the standard
broadcast in which everyone knows the root.  One possible option would
be to transmit the data to an agreed-on root (e.g., process rank 0),
provided that process 0 knows the rank of the real root.  In that case,
the cost of the broadcast would be one send/receive pair plus the
original broadcast tree.

3) Support multicast messages with a broadcast tree in a communicator,
where the tag is used to distinguish different broadcasts.  All
processes in the communicator participate in all broadcasts, with
unknown roots.  The user promises to maintain the consistency of the
broadcasts by assigning semantic meanings to the tags of each
broadcast.
   
4) Utilize a protocol where the recipients' ranks are packaged into the
message itself, so that even the group of recipients need not be known
by any process other than the root.  Use an updating table of how to
broadcast the message.  This adds O(P) data to the message itself, so
the root still has to do O(P) work, but usually with a small
coefficient compared to the cost of the broadcast itself, even for
moderate-sized broadcasts.  With the table approach, a receiving
process can determine whether it should do further sends, and to whom.
The receiving processes have to use wildcards on source, but not on tag
(there could be a broadcast tag).  So the user could ask for broadcasts
by name in this case, rather than by root (with the name encapsulated
in the tag).  As long as the user provides a consistent meaning for the
tags, there is no problem with back-masking here either.  This would
allow tree broadcasts to subsets of groups in a single communicator.
[One would work in a duplicate of the user's communicator, to keep
such messages apart from user sends/receives.]
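The table approach in 4) can be sketched without any MPI at all: pack
the recipient list ahead of the payload, and let each receiver locate
itself in the list to decide whom to forward to.  Everything below
(the wire layout, the helper names, the binary tree over list
positions) is my own illustration, not a protocol MPI defines:

```c
#include <string.h>

/* Hypothetical wire layout, in ints:
   [nrecip][rank_0 .. rank_{n-1}][payload...]
   The root sends the whole message to the recipient at position 0;
   the recipient at position i forwards to positions 2i+1 and 2i+2,
   so only the root ever needs to know the full list. */

/* Build the message: recipient table followed by payload.
   Returns the total length in ints. */
int pack_mcast(int *msg, const int *recips, int nrecip,
               const int *payload, int plen)
{
    msg[0] = nrecip;
    memcpy(msg + 1, recips, nrecip * sizeof(int));
    memcpy(msg + 1 + nrecip, payload, plen * sizeof(int));
    return 1 + nrecip + plen;
}

/* On receipt: find my position in the table and report the ranks I
   must forward to.  Returns the number of forward targets (0, 1,
   or 2 with this binary-tree rule). */
int mcast_targets(const int *msg, int myrank, int *targets)
{
    int n = msg[0], me = -1, nt = 0, i;
    for (i = 0; i < n; i++)
        if (msg[1 + i] == myrank) { me = i; break; }
    if (me < 0) return 0;                 /* not a recipient */
    if (2 * me + 1 < n) targets[nt++] = msg[1 + 2 * me + 1];
    if (2 * me + 2 < n) targets[nt++] = msg[1 + 2 * me + 2];
    return nt;
}
```

The receive side would post its receive with a wildcard source but a
fixed broadcast tag, exactly as described above, then forward the
unmodified message to each target.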

5) If active messages are available (or remote memory access, as on the
T3D), one can always ensure that a communicator has a group-wide shared
variable to order broadcasts without roots, but this becomes a
shared-memory race issue, with locks, etc.

-------------

To some extent, it would be nice to have incomplete (nonblocking)
broadcasts and features of this nature (MPI_Ibcast, MPI_Iallreduce).
We elected to handle the situation by saying that parallel threads
would do this, rather than including these calls directly in MPI.  They
are tricky to implement in reasonable ways.

-Tony

--
	.	.	.	.	.	.	.	.      .
"There is no lifeguard at the gene pool." - C. H. Baldwin
            -             -                       -
Anthony Skjellum, MSU/ERC, (601)325-8435; FAX: 325-8997; tony@cs.msstate.edu



