Newsgroups: comp.parallel.mpi
From: Marcus Marr <marr@dcs.ed.ac.uk>
Subject: Message Passing Deadlocks
Organization: Computer Systems Group, The University of Edinburgh
Date: Tue, 13 Dec 1994 11:28:16 GMT
Message-ID: <MARR.94Dec13112817@balnagowan.dcs.ed.ac.uk>


I am interested in knowing how the readers of this newsgroup have
interpreted the MPI standard with regard to the interference of
communication in different contexts/communicators, and the progress
requirement.

I have an application built on a pipeline structure where one stage
`floods' the next stage with messages using MPI_Send().  This second
stage consumes the messages in cooperation with several other
processes.  The cooperation is performed using a separate
communicator.  (A simple example program is appended to this message).

So far, I have tried the program on five different implementations of
MPI:
        ok        - Chimp-MPI (Edinburgh) on a network of Suns
        deadlocks - LAM-MPI (Ohio Supercomputer Center) on Suns
        ok        - MPICH (Argonne) on a single Sun
        ok        - ANU-MPI on a Fujitsu AP1000
        ok        - Native Cray T3D-MPI (in Edinburgh)

The LAM implementation deadlocks when the buffer of the middle process
is flooded and the process cannot cooperate with the other processes.
Increasing the size of the buffer does not help unless the buffer can
contain the entire flood of messages from the first process.

One way to look at it is that the messages in one context are
preventing the messages in the second context from being sent and
received, so there is interference between the communicators; however,
it has been suggested that the problem is really one of progress.
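One workaround I have considered, though not yet tried on all of the
implementations above, is to throttle the first stage with synchronous
sends, so the producer can never run more than one message ahead of
the consumer.  A sketch only (a drop-in replacement in rank 0's loop
body):

```c
/* Sketch: throttled producer for rank 0's loop body.  MPI_Ssend does
   not complete until the matching receive has started, so at most one
   un-received block is ever in flight and the middle process's system
   buffer cannot be flooded. */
MPI_Ssend(large_buffer, LARGE_BUFFER_SIZE, MPI_INT,
          0, i, remote_comm);
```

Of course, this trades away the overlap that the flood of standard
sends was buying the pipeline.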

In the section on `Resource Limitations', the standard states (section
3.5, p31) that ``a standard send operation that cannot complete
because of lack of buffer space will merely block, waiting for buffer
space to become available or for a matching receive to be posted.''
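For what it's worth, the standard's only guaranteed escape from
implementation buffering is buffering the user attaches explicitly.
A sketch of what rank 0 might do instead (the buffer sizing is my own
guess, and this merely moves the flood into memory rank 0 owns; it
does not address the progress question at all):

```c
/* Sketch: buffered producer.  With sufficient attached buffer space
   the standard guarantees each MPI_Bsend completes locally, so rank 0
   never blocks waiting on the receiver. */
int bsize = LOOPS * (LARGE_BUFFER_SIZE * (int)sizeof(int)
                     + MPI_BSEND_OVERHEAD);
void *bbuf = malloc(bsize);

MPI_Buffer_attach(bbuf, bsize);
for (i = 0; i < LOOPS; i++)
  MPI_Bsend(large_buffer, LARGE_BUFFER_SIZE, MPI_INT, 0, i, remote_comm);
MPI_Buffer_detach(&bbuf, &bsize);   /* waits for all buffered sends */
free(bbuf);
```

Attaching tens of megabytes is of course impractical here, which is
rather the point: the standard only promises completion against
buffer space the user provides.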

Is my program expecting too much from the implementation?
How do other implementations cope?

Hope you can help,

Marcus.

-------- cut here --------
/*
   Deadlock Test : Needs at least 3 processes running
   Marcus Marr 5/12/94
*/

#define LARGE_BUFFER_SIZE   100000
#define SMALL_BUFFER_SIZE   10
#define LOOPS               100

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, i, *large_buffer, *small_buffer;
  MPI_Status status;
  MPI_Comm sub_comm, remote_comm;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  
  large_buffer=(int*)calloc(LARGE_BUFFER_SIZE, sizeof(int));
  small_buffer=(int*)calloc(SMALL_BUFFER_SIZE, sizeof(int));
  
  /* Split COMM_WORLD into two groups -- {0} and {1..n-1} -- then
     connect them with an intercommunicator for the pipeline stage. */
  if (rank == 0) {
    MPI_Comm_split(MPI_COMM_WORLD, 0, rank, &sub_comm);
    MPI_Intercomm_create(sub_comm, 0, MPI_COMM_WORLD, 1, 0, &remote_comm);
  }
  else {
    MPI_Comm_split(MPI_COMM_WORLD, 1, rank, &sub_comm);
    MPI_Intercomm_create(sub_comm, 0, MPI_COMM_WORLD, 0, 0, &remote_comm);
  }

  for (i = 0; i < LOOPS; i++) {
    switch (rank) {
    case 0:
      /* producer: flood the middle process with large messages */
      printf("[0]: sending block %d of %d ints\n", i, LARGE_BUFFER_SIZE);
      MPI_Send(large_buffer, LARGE_BUFFER_SIZE, MPI_INT,
               0, i, remote_comm);
      break;
    case 1:
      /* middle process: consume the flood, then cooperate */
      printf("[1]: receiving block %d of %d ints\n", i, LARGE_BUFFER_SIZE);
      MPI_Recv(large_buffer, LARGE_BUFFER_SIZE, MPI_INT,
               0, i, remote_comm, &status);
      /* compute 'summary' table and broadcast it */
      printf("[1]: broadcasting summary block %d of %d ints\n",
             i, SMALL_BUFFER_SIZE);
      MPI_Bcast(small_buffer, SMALL_BUFFER_SIZE, MPI_INT, 0, sub_comm);
      /* work on the full raw data */
      break;
    default:
      /* cooperating processes: receive the summary broadcast */
      printf("[%d]: receiving summary block %d of %d ints\n",
             rank, i, SMALL_BUFFER_SIZE);
      MPI_Bcast(small_buffer, SMALL_BUFFER_SIZE, MPI_INT, 0, sub_comm);
      /* work on the 'summary' table */
      break;
    }
  }

  MPI_Barrier(MPI_COMM_WORLD);
  printf("[%d] finished ok\n", rank);

  MPI_Comm_free(&remote_comm);
  MPI_Comm_free(&sub_comm);
  free(small_buffer);
  free(large_buffer);
  MPI_Finalize();
}


