Newsgroups: comp.parallel.mpi
From: tony@aurora.cs.msstate.edu (Tony Skjellum)
Subject: Proposal (First version) MPI MCAST
Summary: MPI Needs a multicast, here is our first-release proposal
Keywords: MPI, multicast, proposal for extension
Organization: Mississippi State University
Date: 26 Feb 95 19:55:08 GMT
Message-ID: <tony.793828508@aurora.cs.msstate.edu>


Multicast proposal #1
(MPI Forum 1.5 Proposal)
Anthony Skjellum
Nathan Doss
Mississippi State University
February 26, 1995

Synopsis
--------
A multicast that acts as a multi-send has been requested by
several users.  We propose a new chapter of the standard on
semi-collective operations, to include the MPI_Mcast() variants
described below.

Discussion
----------
MPI-F's rationale for not including this feature in the original
standard was briefly as follows:
	* all receives would have to test an extra envelope bit
	  to see if the message was "regular" or "multicast"
	* the cost of implementing this feature by the user, with
	   persistent objects, would not be prohibitive
	* point-to-point performance was not to be compromised
	  unduly for collective performance (mantra of MPI)
	* what would the recipients posit for rank and tag for
	  this operation?

Extensive net discussion has ensued over the last year.  Furthermore, L.
Kale of UIUC has raised this issue at at least two meetings
(Supercomputing '94, and the SIAM Seventh Conf. on Parallel Proc. for
Scientific Computing), and called it the most serious omission in
MPI.  Others have complained as well.

It is consistently argued that, for certain systems, it is exactly this
tradeoff that users want.  So, how do we do this, and still
avoid impacting send and receive?  We ask that the MPI Forum
reconsider the need for MPI_Mcast, given real user and developer
demand.  This constructive proposal seeks to do so with minimal
impact on communication that does not make use of this "semi-collective
operation."  We seek options that lead to high-performance
implementations.

Possible implementation features proposed
I) Accept that all communicators bear the overhead of mcasts
   a)  use sender's true rank and a special tag; ordering guarantees
	can be overridden by rank = MPI_SRC_ANY
   b)  user manages all tags; no ordering guarantees
OR
II) MPI_Mcast should only impact communicators that utilize it.
   That is, sends and/or receives might be slower, to support the
   overhead needed for this mcast to work.
   a) A "protocol" synchronization might be required to turn this
    feature on/off for a communicator. 
   AND/OR
   b) A version of communicator constructors that has a flag field for
      special features could be added.  The existing constructors would
      assume inheritance (as appropriate) of the mcast capability.
      Later, other flags could be specified that help with
      performance-relevant issues.  For instance, we already know that
      we want an MPI_Comm_dup() with NO INHERITANCE OF ATTRIBUTES, and
      now we have to resort to MPI_Comm_split() [or even
      MPI_Comm_create] to get it.
   AND
   c) Send/receive semantics are optimized with special tags and
        srcs, as noted above in Scenario Ia.
   OR
   d) Same as Ib.

We favor II.  Furthermore, we consider options between a), b), and c)
below. 

*) In either I) or II):
   The MPI_Mcast must be implementable with an algorithm appropriate to
   the number of recipients.  For many participants, spanning trees must
   be possible.  For few participants, linear sends must be possible.  In
   either event, the behavior must be consistently described to the user,
   and in no event will the context concept be violated.  The recipients
   may be:
   a) the entire group of the communicator
   b) a subset of these.

   We will propose two major call variants.  The second subsumes the
   first by including a group to specify recipients.  This is
   analogous to the use of MPI_Comm_create, where a communicator and a
   subset group appear in the construction of a new communicator.  It
   will be less convenient to use if one intends to broadcast to the
   whole group, since a separate group object must be created and
   maintained.
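The two regimes above (spanning trees for many participants, linear sends for few) can be sketched as send schedules.  The following is a self-contained illustrative sketch in C; the helper names and the choice of a binomial tree are our assumptions, not part of the proposal or of MPI.

```c
#include <assert.h>

/* Illustrative sketch (not real MPI): two send schedules an MPI_Mcast
   implementation might select by group size.  Helper names and the
   binomial-tree layout are assumptions for illustration only. */

/* Linear schedule: the root sends one message to each other member. */
int linear_targets(int root, int nranks, int *targets) {
    int n = 0;
    for (int r = 0; r < nranks; r++)
        if (r != root) targets[n++] = r;
    return n;  /* nranks - 1 sends, all performed by the root */
}

/* Binomial spanning tree rooted at relative rank 0: rank r forwards to
   r + 2^k for each power of two below r's lowest set bit (the root
   forwards to all powers of two), giving O(log nranks) depth. */
int binomial_children(int relrank, int nranks, int *children) {
    int n = 0;
    int lowbit = (relrank == 0) ? nranks : (relrank & -relrank);
    for (int mask = 1; mask < lowbit && relrank + mask < nranks; mask <<= 1)
        children[n++] = relrank + mask;
    return n;
}
```

Either schedule keeps every message inside the mcast's own communicator, so the context concept is not violated by the choice of algorithm.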

*) A natural extension to intercommunicators must be possible
*) Deadlock does not arise for non-erroneous programs -- loosely,
   Mcast is not allowed to be a synchronization.  Conceptually, it
   should behave a lot like a loop with multiple sends.

Syntax /Semantics
-----------------
Scenario #I
*) The overhead on all receives to provide this feature is accepted,
   presumably by supporting a bit in the message envelope that tells
   the recipient processor to store and forward, if a logarithmic
   algorithm is used.

   MPI_Mcast(buffer, count, datatype, [tag1,] comm, ierr) [sender only]
   MPI_Recv(buffer, count, datatype, rank, tag2, comm, status, ierr)
   		[Recipients only]

   MPI_Mcast_group(buffer, count, datatype, [tag1], comm, grp, ierr)
		[sender only, grp is a subset of the group of comm]

*) All intra- and intercommunicators support this operation with no
   change to other MPI calls.  comm is either an intra- or inter-comm
   in the above calls.

What are the typical values of tag1, tag2 and rank?

sub-scenario #a
tag1 = argument not used (like true collective ops)
tag2 = MPI_MCAST_TAG
rank = sender's true rank

Implementations are responsible for using tag2 and rank as source
identification for the mcast.  Two mcasts from the same root
are received in the order sent, providing a partial-ordering
extension to mcast.

If rank = MPI_SRC_ANY, then the ordering guarantees are
relaxed, as in sub-scenario #b.

The use of MPI_MCAST_TAG allows implementations to post receives
that know they have to do store-and-forward broadcast or other
additional work.  Furthermore, by using MPI_Irecv in that way,
one could hope for higher performance from the MPI_Mcast.  In
combination with persistent objects, other optimizations might be
possible.

If a message is sent to a process from the root before the
MPI_Mcast, does the recipient get it before or after the
MPI_Mcast data?  For this sub-scenario, where the recipient is
mcast-aware, the answer should be that the partial ordering is
maintained, except when rank = MPI_SRC_ANY is used.

sub-scenario #b (recipient only vaguely mcast aware)
tag1 = user-specified
tag2 = user-specified
rank = MPI_SRC_ANY

The implementation promises nothing about ordering.  The user manages
the tags used for multicasts, and must ensure that, within
a communicator, no two multicasts with the same tag are active
at the same time.  Otherwise, unpredictable results occur.

No guarantees of ordering are provided.  The user must use
appropriate tags (and possibly multiple communicators) to get
appropriate behavior.


For the operation MPI_Mcast_group(), the message is sent to all
members listed in the group.  MPI_Mcast() does not send a copy
of the message to the sender, so the calls have legitimately
different applicability.  Furthermore, the MPI_Mcast_group()
call allows substructure to be imposed by an application on a
communicator without creating more communication contexts.  Since
groups may be more in flux than communicators, this offers
a lighter-weight mechanism for describing broadcast boundaries.

We favor sub-scenario #a.

Scenario #II

IIa.  Only communicators that have the semi-collective attribute set
      may use the semi-collective operations (currently MCASTs).

      That is, a program is erroneous if it violates this condition.

      MPI_Comm_set_semi_collective(comm, ierr)
      MPI_Comm_unset_semi_collective(comm, ierr)
      These calls are synchronizations over all the members of the group
      of comm.
    
IIb.  Same rule as for IIa.

      MPI_Comm_dup_selective(comm, newcomm, flags, ierr)
      MPI_Comm_split_selective(comm, color, key, newcomm, flags, ierr)
      Two OR-able flags are suggested:

	MPI_COMM_SEMI_COLLECTIVE
	MPI_COMM_NO_ATTRIBUTES  (see keys and user attributes)

IIa and IIb are not mutually exclusive.
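The two OR-able flags of IIb could combine as bits in a single flags word.  Only the flag names come from this proposal; the bit values and the helper below are invented for illustration.

```c
#include <assert.h>

/* Sketch of the suggested flags word for MPI_Comm_dup_selective /
   MPI_Comm_split_selective (IIb).  Bit assignments are invented. */
enum {
    MPI_COMM_SEMI_COLLECTIVE = 1 << 0,  /* new comm may use MCAST */
    MPI_COMM_NO_ATTRIBUTES   = 1 << 1   /* dup without attribute
                                           inheritance */
};

/* A constructor would test each requested feature independently. */
int wants_semi_collective(int flags) {
    return (flags & MPI_COMM_SEMI_COLLECTIVE) != 0;
}
```

Since IIa and IIb are not mutually exclusive, such a flags word could coexist with the set/unset calls of IIa.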

IIc&d.  Semantic choices are the same as for Scenario Ia&b.  Hopefully,
     implementations can avoid overhead for communicators that do
     not demand semi-collective capability.


Summary
-------
This concludes the first version of our proposal.  We are asking
for the establishment of two variants of MCAST in MPI, with 
choices on whether or not communicators must activate this feature.
The choices posed are aimed at reducing overhead for communicators
that do not exploit MCASTs, and should be judged on their ability
to do so.

We invite further discussion, and hope to finalize this proposal
for the MPI meeting in March, so MPI 1.5 can have this feature
(assuming it is approved :-)).

--
Anthony Skjellum, Asst. Professor, MSU/CS/ERC, Ph: (601)325-8435; FAX: (601)325-8997.
Mosaic: http://www.erc.msstate.edu/~tony; e-mail: tony@cs.msstate.edu
Maxim:  "There is no lifeguard at the gene pool." - C. H. Baldwin