Newsgroups: comp.parallel.pvm
From: jjc@iastate.edu (James J Coyle)
Subject: Re: PVM hangups
Keywords: hang
Organization: Iowa State University, Ames, Iowa USA
Date: 10 Jan 1995 22:32:59 GMT
Message-ID: <3ev1ur$10r@news.iastate.edu>


In article <3ehp09$mvb@airgun.wg.waii.com>, denham@wg.waii.com (Scott Denham) writes:
|> 
|>      We are a new PVM user, using PVM 3.3.3.  In our application,
|> a master task running on one node of an SP-2 spawns multiple slave
|> tasks, each running on a separate node of the SP-2 (one of the
|> slaves runs on the same node as the master).  The master sends
|> data to each slave via pvmfinitsend/pvmfsend, then waits for a
|> message from any slave indicating that its work (which may take
|> hours or days) is done.  The master then sends a message requesting
|> that the slave send back its results, which are contained in hundreds
|> of messages, each with a unique message tag.  The slave then executes
|> 
|>       DO I=1,N
|>          CALL PVMFINITSEND(PvmDataInPlace)
|>          CALL PVMFSEND(...MSGTAG(I)...)
|>       ENDDO
|> 
|> while the master simultaneously executes
|> 
|>       DO I=1,N
|>          CALL PVMFRECV(...MSGTAG(I)...)
|>       ENDDO
|> 
|> When all the data from one slave has been received, the master waits
|> for a message from the next slave whose work is complete, which may
|> take hours.
|> 
|>      In some cases, when running on dedicated or lightly loaded nodes,
|> the application completes successfully.  When running in a normal
|> production job mix, however, the application "hangs" during execution
|> of the loops shown above when data is being sent back from one of the
|> slave tasks.  In a typical case, all the data from one slave task is
|> sent and received successfully, then the hang occurs during the data
|> transmission for the next slave task.  The slave appears hung in 
|> send J (for example, J=65), while the master is hung in receive K
|> where K.LT.J (for example, K=32).  There are no error indications
|> from any of the PVM calls preceding the ones that hang.
|> 
|>      The PVM User's Guide indicates that this type of thing might
|> be caused by a memory shortage.  If so, would PVM or the application
|> be the likely culprit?  Is there a way to increase the memory
|> available to PVM?  Do you have any suggestions on how to confirm
|> whether there is a memory problem, or on the best method to debug
|> this type of problem? 
|> 
|>      Thanks.
|> 
|>           Stan Goldberg (stan.goldberg@wg.waii.com)
|>           Scott Denham (scott.denham@waii.com)
|>           Western Geophysical Co.
|>           Houston, TX
|> 

Suggestion:

ORIGINAL CODING:

Slave code:
      DO I=1,N
         CALL PVMFINITSEND(PvmDataInPlace)
         CALL PVMFSEND(...MSGTAG(I)...)
      ENDDO

while the master simultaneously executes

      DO I=1,N
         CALL PVMFRECV(...MSGTAG(I)...)
      ENDDO

Try instead:

Slave code:
      integer  send_more_tag, master_id
      DO I=1,N
         CALL PVMFINITSEND(PvmDataInPlace)
*           pack data
         CALL PVMFSEND(master_id,...MSGTAG(I)...)
         CALL PVMFRECV(master_id,...send_more_tag...)
      ENDDO

while the master simultaneously executes

      DO I=1,N
         CALL PVMFRECV(slave_id,...MSGTAG(I)...)
*            unpack data
         CALL PVMFSEND(slave_id,...send_more_tag...)
      ENDDO

   It is possible that the system running the master program is being 
flooded with messages. The new code avoids that problem, since the slave 
waits till the master asks for another message before sending any more.
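
   The effect of the handshake can be sketched outside PVM.  This small
Python simulation (illustrative only -- the deque stands in for the messages
queued at the master's end of the link) shows that the unthrottled loop lets
the whole burst pile up, while the request/ack version keeps at most one
message in flight:

```python
from collections import deque

def run(n, handshake):
    """Simulate a slave sending n messages to the master; return the
    peak number of messages buffered at the master's end."""
    buffer = deque()          # messages in flight / queued at master
    peak = 0
    for i in range(n):
        buffer.append(i)      # slave: PVMFSEND of message i
        peak = max(peak, len(buffer))
        if handshake:
            buffer.popleft()  # master: PVMFRECV, then sends send_more_tag;
                              # slave's PVMFRECV blocks until that ack
    if not handshake:
        while buffer:         # master drains only after the whole burst
            buffer.popleft()
    return peak

print(run(100, handshake=False))  # 100 -- the entire burst can queue up
print(run(100, handshake=True))   # 1 -- never more than one in flight
```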
   In this new coding, there is more communication latency, but since 
you indicate that each slave works for hours or days between message 
bursts, I suspect that the new code will run nearly as fast as the original.
   If not, simply communicate more messages between requests for more data.

e.g., strip-mine the loop to a depth of 5, and request more data after each
    strip rather than after every message.
      
        DO II=1,N,5
          DO I=II,min(II+4,N)
            CALL PVMFINITSEND(PvmDataInPlace)
*             pack data
            CALL PVMFSEND(master_id,...MSGTAG(I)...)
          ENDDO
          CALL PVMFRECV(master_id,...send_more_tag...)
        ENDDO

while the master simultaneously executes

        DO II=1,N,5
          DO I=II,min(II+4,N)
            CALL PVMFRECV(slave_id,...MSGTAG(I)...)
*             unpack data
          ENDDO
          CALL PVMFSEND(slave_id,...send_more_tag...)
        ENDDO
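
   As a sanity check on the strip-mined loop bounds, here is a small Python
sketch (Python used just for illustration) confirming that an outer loop
over II = 1..N with step 5 and an inner limit of min(II+4,N) visits every
message index exactly once, even when N is not a multiple of 5.  Note that
an outer bound of N-4 would skip the final partial strip in that case:

```python
def strip_indices(n, depth=5):
    """Indices visited by:  DO II=1,N,depth / DO I=II,min(II+depth-1,N)."""
    visited = []
    for ii in range(1, n + 1, depth):          # outer loop over strips
        for i in range(ii, min(ii + depth - 1, n) + 1):
            visited.append(i)                  # one message per index
    return visited

# Covers 1..N exactly once, including when N is not a multiple of depth.
for n in (10, 12, 1, 7):
    assert strip_indices(n) == list(range(1, n + 1))
print("ok")
```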

    I hope this solves your problem. 

    As for getting more debugging information, I usually watch memory usage
on the various processors with the ps command.  That's the best I have come
up with.
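
   A rough sketch of that kind of monitoring, assuming a modern POSIX ps
with the -o format option (the ps options on a 1995-era AIX node differed,
so adjust there); rss_kb is a name I made up for this example:

```python
import os
import subprocess

def rss_kb(pid):
    """Resident-set size of a process in KB, read from POSIX `ps -o rss=`.
    In practice pass the pid of the pvmd or the application task."""
    out = subprocess.check_output(["ps", "-o", "rss=", "-p", str(pid)])
    return int(out.split()[0])

# Example: sample this process's own memory footprint.
print(rss_kb(os.getpid()))
```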

Jim Coyle
Research Computing Group
Comp. Center
Iowa State Univ.                                        
-- 
                                       James Coyle, PhD
				       Research Computing Group
				       235 Durham Center
				       Iowa State Univ.
				       email: jjc@iastate.edu
				       phone: (515)-294-2099

