Newsgroups: comp.parallel.mpi
From: Matt Beare <M.Beare@uea.ac.uk>
Subject: MPI_Barrier and MPI_Bcast hangups
Organization: University of East Anglia
Date: Mon, 09 Dec 1996 17:40:54 +0000
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <32AC4F26.41C6@uea.ac.uk>

I posted a similar question last week and have still had no joy in
getting MPI_Barrier or MPI_Bcast to work on 4 or more workstations, when
using Fortran.

Using the following simple program as an example:

      program simple
      include 'mpif.h'
      integer  myid, nprocs, ierr, n

C     Start MPI and find this process's rank and the process count.
      call MPI_Init (ierr)
      call MPI_Comm_Rank (MPI_Comm_World, myid, ierr)
      call MPI_Comm_Size (MPI_Comm_World, nprocs, ierr)
      print *, 'Process ', myid, '    of ', nprocs

C     Rank 0 sets n; the broadcast should copy it to every other rank.
      if (myid .eq. 0)  n = 10
      call MPI_Bcast (n, 1, MPI_Integer, 0, MPI_Comm_World, ierr)
      print *, 'Process ', myid, '    received ', n

      call MPI_Finalize (ierr)
      stop
      end

When run on various numbers of workstations I get the following results:

% mpirun -np 2 simple
 Process            0    of            2
 Process            1    of            2
 Process            0    received           10
 Process            1    received           10

% mpirun -np 3 simple
 Process            0    of            3
 Process            1    of            3
 Process            2    of            3
 Process            0    received           10
 Process            1    received           10
 Process            2    received           10

% mpirun -np 4 simple
 Process            0    of            4
 Process            1    of            4
 Process            2    of            4
 Process            3    of            4
 Process            0    received           10
 Process            1    received           10
Timeout in waiting for processes to exit.  This may be due to a
defective rsh program (Some versions of Kerberos rsh have been observed
to have this problem).
This is not a problem with P4 or MPICH but a problem with the operating
environment.  For many applications, this problem will only slow down
process termination.
p3_11965: (60.089844) Trying to receive a message when there are no
connections; Bailing out
rm_l_135922496_11973: (60.089844) received incorrect handshake message
type=1
rm_l_135922496_11973:  p4_error: slave_listener_msg: broken handshake: 1

% mpirun -np 6 simple
 Process            0    of            6
 Process            1    of            6
 Process            2    of            6
 Process            3    of            6
 Process            4    of            6
 Process            5    of            6
 Process            0    received           10
 Process            4    received           10
Timeout ... [error as above]


This communication problem seems to occur only with Fortran programs; the
same program written in C works as expected on the same workstations.

If anyone has any idea as to what I may need to do to get Fortran to
behave correctly, I would be very grateful.

Thanks,
	Matt.

---------------------------------------------------------------------------
Matthew Beare              |  Email: M.Beare@uea.ac.uk
School of Mathematics      |  Tel: +44 (0)1603 592990
University of East Anglia  |  Fax: +44 (0)1603 259515
Norwich  NR4 7TJ  England  |  WWW: http://www.mth.uea.ac.uk/people/mib.html
---------------------------------------------------------------------------

