Newsgroups: comp.parallel.mpi
From: gdburns@osc.edu (Greg Burns)
Subject: Re: MPI Processes
Organization: Ohio Supercomputer Center
Date: 12 Feb 1996 21:53:47 -0500
Message-ID: <4foufr$jtu@tbag.osc.edu>

In article <Pine.SUN.3.91.960212203406.3904Y-100000@barney.cs.utk.edu> Graham E Fagg <fagg@barney.cs.utk.edu> writes:
>
>Firstly is MPIL_Spawn callable from within an already running
>application, i.e. one with its own context already defined?
>Therefore the children will have a *different* COMM_WORLD?

Yes.

>Doesn't that mean that under the original definition (as in MPI-1) of an
>inter-communicator, that the processes cannot then communicate as they
>don't share COMM_WORLD (that was split into separate groups)?

These issues are all handled as per the current working proposal
for MPI-2.  Briefly, each call to MPIL_Spawn() (will become MPI_Spawn()
when there is a std) creates an independent world communicator whose
members are the child processes.  The children also get a "parent"
inter-communicator, returned by MPIL_Comm_parent().  The parent
process gets the same inter-communicator returned from MPIL_Spawn().
Point-to-point communication is thus facilitated and intra-communicators
can be constructed at will.
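
To make that concrete, here is a pseudocode-style C sketch of the two
sides of a spawn.  Caveat: the argument lists of MPIL_Spawn() and
MPIL_Comm_parent() below are assumptions for illustration only --
consult the LAM manual pages for the real signatures.

	/* Parent side: spawn children and talk to them over the
	 * inter-communicator returned by the spawn call.
	 * MPIL_Spawn() signature is hypothetical. */
	MPI_Comm children;                    /* inter-communicator */
	MPIL_Spawn("child_prog", &children);
	MPI_Send(&work, 1, MPI_INT, 0, TAG, children);

	/* Child side: the children share their own, independent
	 * MPI_COMM_WORLD and retrieve the link back to the parent.
	 * MPIL_Comm_parent() signature is hypothetical. */
	MPI_Comm parent;
	MPI_Init(&argc, &argv);
	MPIL_Comm_parent(&parent);
	MPI_Recv(&work, 1, MPI_INT, 0, TAG, parent, &status);

From there, MPI_Intercomm_merge() can turn the parent/child
inter-communicator into an intra-communicator if collective
operations across both groups are wanted.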

>And finally, in the example given on page 52 of the draft
>MPI_Primer, for a fault tolerant system, it was stated that if a child died
>then the inter-communicator would be freed. Even if you spawn a new
>process it would have a different COMM_WORLD to the original lost process,
>so how would the other processes from the original group communicate with
>it (i.e. collective operations)? Would they have to use
>inter-communicators (instead of intra-communicators) or would they recall
>MPI_Init to somehow reform the group?

I will have to reread that section carefully before removing the draft
label.  We did something slightly different.  We simply invalidate
any communicator containing a process that has died.  Any request
in progress or any future usage generates an error condition that
can be handled.  A typical tactic is to detect the error in the application
and free the communicator.  Thus, we do not recover broken communicators.
A simplistic strategy for a fault tolerant MPI program (more precisely, an
MPI program expecting fault tolerant behaviour when running on LAM)
is to rely upon communicators of size 2.  When one process dies,
the survivor discards the communicator.  In our master/slave example,
the master has a separate communicator for each slave.
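
In sketch form (again, the MPIL_Spawn() argument list is an assumption;
MPI_Errhandler_set() and MPI_ERRORS_RETURN are standard MPI-1):

	#define NSLAVES 4
	MPI_Comm slave[NSLAVES];
	...
	for (i = 0; i < NSLAVES; ++i) {
	    MPIL_Spawn("slave_prog", &slave[i]);  /* hypothetical sig */
	    /* Switch from the fatal default so a broken communicator
	     * raises a catchable error instead of aborting the job. */
	    MPI_Errhandler_set(slave[i], MPI_ERRORS_RETURN);
	}
	...
	/* On an error, assume the slave died: discard the invalidated
	 * communicator and respawn a fresh slave with a new one. */
	if (MPI_Send(&task, 1, MPI_INT, 0, TAG, slave[i]) != MPI_SUCCESS) {
	    MPI_Comm_free(&slave[i]);
	    MPIL_Spawn("slave_prog", &slave[i]);
	}

Because each communicator spans only the master and one slave, a death
invalidates exactly one communicator and the rest of the computation
is untouched.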

-=-
Greg Burns				gdburns@tbag.osc.edu
Ohio Supercomputer Center		http://www.osc.edu/lam.html

