Newsgroups: comp.parallel.mpi
From: spike@trinity.llnl.gov (Richard J. Procassini)
Subject: MPICH and CRI PVM Performance on the T3D.
Organization: Lawrence Livermore National Laboratory
Date: 13 Oct 1995 21:25:45 GMT
Message-ID: <SPIKE.95Oct13142546@trinity.llnl.gov>

To any MPICH gurus out there,

	I've got some timing results from the Cray T3D MPP that seem
to make no sense whatsoever.  I am comparing the performance of the
non-blocking, ready-mode (and thus potentially unbuffered) sends and
non-blocking receives available within MPI against the buffered,
blocking sends and non-blocking receives in CRI's PVM.  Consider the
following code fragment, which is used to perform a boundary exchange
of data in an explicit, unstructured-mesh code:

****************************************************************

c     Set the floating-point word length.
      lenflt = 8

#ifdef MPI
c     Phase 1: the Receive (Post) step.
      do 10 i = 1, mynabors
        nabor  = mcontro(1,i)
        lstart = mcontro(2,i)
        numsnp = mcontro(3,i)

        msglen  = lenflt*numsnp
        msgsrc  = nabor
        msgtype = 1073741824

c       Post a non-blocking receive for the partial masses.
        call mpi_irecv(rbuf(lstart), msglen, MPI_BYTE, msgsrc, msgtype,
     +                 MPI_COMM_WORLD, msgreqr(i), ierror)
 10   continue
#endif

c     Synchronize the processors.
#ifdef MPI
      call mpi_barrier(MPI_COMM_WORLD, ierror)
#endif

#ifdef PVM
      call pvmfbarrier(PVMALL, -1, ierror)
#endif

c     Turn on the timer.
      timeon = irtc()

c     Phases 2 and 3: the Gather and Send steps.
      do 30 i = 1, mynabors
        nabor  = mcontro(1,i)
        lstart = mcontro(2,i)
        numsnp = mcontro(3,i)

c       Phase 2: the Gather step.
        do 20 j = 1, numsnp
          index1       = lstart + j - 1
          index2       = mpsendl(index1)
          sbuf(index1) = xms(index2)
 20     continue

c       Phase 3: the Send step.
        msglen  = lenflt*numsnp
        msgdest = nabor
        msgtype = 1073741824

#ifdef MPI
c       Send the partial masses via a non-blocking, ready-mode send.
        call mpi_irsend(sbuf(lstart), msglen, MPI_BYTE, msgdest,
     +                  msgtype, MPI_COMM_WORLD, msgreqs(i), ierror)
#endif

#ifdef PVM
c       Send the partial masses via a blocking, buffered send.
        call pvmfpsend(msgdest, msgtype, sbuf(lstart), msglen, BYTE1,
     +                 ierror)
#endif
 30   continue

c     Turn off the timer and convert clock ticks (150 MHz) to seconds.
      timeoff = irtc()
      timediff = (timeoff - timeon)*6.6666e-9

c     Check for arrival of the messages containing the partial masses.
      nrecd = 0
      do while (nrecd .ne. mynabors)
        do 40 i = 1, mynabors

#ifdef MPI
          msgcomp  = .false.
          if (msgreqr(i) .ne. MPI_REQUEST_NULL) then
c           A non-blocking test for receipt of the partial masses.
            call mpi_test(msgreqr(i), msgcomp, msgstat, ierror)
          endif
          if (msgcomp) then
            nrecd = nrecd + 1
          endif
#endif

#ifdef PVM
          nabor  = mcontro(1,i)
          lstart = mcontro(2,i)
          numsnp = mcontro(3,i)

          msglen  = lenflt*numsnp
          msgsrc  = nabor
          msgtype = 1073741824

c         Post a non-blocking receive for the partial masses.
          call pvmfnrecv(msgsrc, msgtype, msgreqr(i))
          if (msgreqr(i) .gt. 0) then
            call pvmfunpack(BYTE1, rbuf(lstart), msglen, 1, ierror)
            nrecd = nrecd + 1
            msgreqr(i) = 0
          endif
#endif
 40     continue
      enddo

****************************************************************
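For reference, the Phase 2 gather is nothing more than an indexed copy
into a contiguous send buffer.  Here is a minimal standalone sketch of
the same operation (in C with 0-based indices; the array names mirror
the fragment, but the routine itself is hypothetical):

```c
/* Indexed gather, as in the "20" loop of the fragment:
 * sbuf(index1) = xms(mpsendl(index1)), but with 0-based indices. */
void gather_partials(double *sbuf, const double *xms,
                     const int *mpsendl, int lstart, int numsnp)
{
    for (int j = 0; j < numsnp; j++) {
        int index1 = lstart + j;
        sbuf[index1] = xms[mpsendl[index1]];
    }
}
```

The copy is needed because a given neighbor's entries are scattered
through the work array, while both "mpi_irsend" and "pvmfpsend"
transmit a single contiguous region.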

	The critical loop here is the "30" loop, which runs over the
number of neighboring processors.  The data are "gathered" from the
work arrays into the send buffer in loop "20", and then sent to the
neighboring processor via either:
	(1) the ready-mode, non-blocking "mpi_irsend" call in MPI, or
	(2) the buffered, blocking "pvmfpsend" call in PVM.

	Note that the "mpi_irecv" receives have already been posted in
the "10" loop above, and all the processors are synchronized between
the "10" and "30" loops, so the precondition for a ready-mode send (a
matching receive already posted) is satisfied.  The "mpi_irsend" call
SHOULD therefore be a very lightweight, non-blocking DMA transfer from
sender to receiver, while the "pvmfpsend" call blocks and also pays
the added cost of a memory-to-memory copy (user-to-system buffer
packing).
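As an aside on the timing itself: the 6.6666e-9 factor in the fragment
is just the period of the T3D's 150 MHz clock, so "timediff" is the
wall-clock time of the "30" loop in seconds.  A trivial standalone
restatement (in C; the function name is made up):

```c
/* Convert T3D irtc() real-time-clock ticks to seconds.
 * One tick = 1/(150 MHz) = 6.6666e-9 s, as in the fragment. */
double t3d_ticks_to_seconds(long long ticks)
{
    return (double)ticks * 6.6666e-9;
}
```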

	So the question is: why is the time required to complete the
"30" loop ("timediff"), for the same problem and decomposition, 50%
larger for the MPI implementation than it is for the PVM
implementation?  I am at a total loss to explain this result (as well
as being somewhat stunned)!  Anyone (especially the ANL/MSU MPICH
developers, if you're out there) care to comment or hazard a guess as
to what the $#@%&* is going on here?  Thanks in advance.  Ciao...
--

				Dr. Richard Procassini
				Methods Development Group
				Mechanical Engineering Department
				Lawrence Livermore National Laboratory
				Mail Stop L-122
				P.O. Box 808
				Livermore, CA  94551

				(510)424-4095

				spike@llnl.gov

