Newsgroups: comp.parallel.pvm
From: papadopo@cs.utk.edu (Philip Papadopoulos)
Subject: Re: Barrier time problems in Fortran
Organization: CS Department, University of Tennessee, Knoxville
Date: 24 Jan 1995 08:55:35 -0500
Message-ID: <3g30snINNhn@duncan.cs.utk.edu>

In article <3g2p46$sl9@sirio.cineca.it> destri@doncarlos.eng.unipr.it(Giulio Destri) writes:
>Hello Everybody,
>
>We are trying to port a very intensive Fortran program from
>the Cray T3D (Cray PVM) to a workstation cluster (Public Domain
>PVM version 3.3.6), operating in a SunOs 4.1.2 and 4.1.3 environment.
>The problem we have found is the incorrect working of the PVM
>barrier calls, when the execution times of the single processes
>(tasks) are high.
>Operating on different hosts (e.g. SPARC-20 vs SPARC-2), with
>different workload, and the necessity to access critical resources
>(e.g. buses) force identical tasks to have different execution
>times. The waiting times at a barrier can be very high for the
>first processes arriving at this point of execution.

The barrier is working exactly as it should. All processes are held
at the barrier until ALL processes have reached it.

>The stop in the processing phase seems to have no explanations from the
>point of view of the Operating System. 

Actually, the processes that have reached the barrier are waiting for
a "release" message from the group server. In the network version of PVM,
the processes are waiting on a socket in a "select" call. Since they
are merely waiting for an event and are not "busy waiting", the OS
deschedules the process (and may swap it out) until the message comes through.
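
To make the "waiting on select" point concrete, here is a minimal sketch in
plain POSIX C. It is not PVM's actual code: wait_for_release() and
demo_barrier_wait() are names I made up, and a pipe stands in for the socket
to the group server. A child process plays the slow task and sends one
"release" byte after a delay; the parent sleeps inside select() the whole
time and uses no CPU while it waits.

```c
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: block until fd is readable or timeout_sec expires.
 * Returns 1 if data is ready, 0 on timeout, -1 on error. */
int wait_for_release(int fd, int timeout_sec)
{
    fd_set readfds;
    struct timeval tv;
    int n;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    /* The kernel puts us to sleep here; there is no polling loop. */
    n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}

/* Hypothetical demo: the child stands in for the slowest task and sends
 * the "release" byte after delay_sec seconds. Returns what the waiting
 * parent observed (same codes as wait_for_release). */
int demo_barrier_wait(unsigned delay_sec, int timeout_sec)
{
    int fds[2];
    pid_t pid;
    int result;

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: the late arrival */
        char release = 'R';
        close(fds[0]);
        sleep(delay_sec);
        if (write(fds[1], &release, 1) != 1)
            _exit(1);
        _exit(0);
    }

    close(fds[1]);                  /* parent: early arrival, now waits */
    result = wait_for_release(fds[0], timeout_sec);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return result;
}
```

Calling demo_barrier_wait(1, 5) should return 1 after about one second. If
you watch the parent with "ps" or "top" during that second, it shows up as
sleeping, which is the same state your early tasks are in at the barrier.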

>So the stopped projects are "swapped out" from the memory active area.
>When the lowest processes finally arrive at barrier point, all the
>processes can be released from block.
>The last arrived processes can continue their running but the others
>can not be recalled in active memory.

This is not a correct assessment. The other processes are awakened by
the incoming barrier release message and become active again as soon as
their operating systems schedule them for execution.

>In this way at the next barrier only some tasks can arrive, and the
>system enters in a permanent blocking wait state.

I would check whether, after the first barrier, some of your processes
are hung waiting for messages. It is quite possible that you are suffering
from a race in your messaging pattern that shows up when processes run at
different speeds but disappears on a balanced machine (like the T3D).
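
I have not seen your code, so the following is only a guess at the kind of
race I mean. One common case is a receiver that accepts messages from any
sender (in PVM, a receive with tid = -1) and assumes they arrive in task
order. The sketch below is plain C with no PVM calls; struct msg,
collect_by_arrival() and collect_by_source() are hypothetical names used
only to illustrate the idea.

```c
#include <stddef.h>

/* Hypothetical record of one received message: who sent it, and the data. */
struct msg {
    int src;        /* index of the sending task, 0..n-1 */
    double value;   /* payload */
};

/* Fragile: assumes the k-th message to arrive came from task k.
 * This only holds while all tasks run at the same speed. */
void collect_by_arrival(const struct msg *arrived, size_t n, double *out)
{
    size_t k;
    for (k = 0; k < n; k++)
        out[k] = arrived[k].value;
}

/* Robust: files each message under its actual sender, so the arrival
 * order does not matter. Returns 0 on success, -1 on a bad sender index. */
int collect_by_source(const struct msg *arrived, size_t n, double *out)
{
    size_t k;
    for (k = 0; k < n; k++) {
        if (arrived[k].src < 0 || (size_t)arrived[k].src >= n)
            return -1;
        out[arrived[k].src] = arrived[k].value;
    }
    return 0;
}
```

If the messages happen to arrive in the order task 0, 1, 2, both routines
give the same answer. If task 2 is on the fast host and its message arrives
first, collect_by_arrival() silently puts task 2's data in slot 0, while
collect_by_source() still gets it right. A similar mix-up with message tags
can leave a task waiting forever for a message that was already consumed
elsewhere.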

Try out XPVM -- it will trace all the messages (or only some of the messages)
that are being generated in your code. Some important events that you will
see are:
   1. processes waiting on a receive
   2. processes that have left the virtual machine
   3. hosts that have left the virtual machine
>
>We would like to know:
>a. Is our diagnosis correct?
>b. Are there some variables to be set during compilation or execution
>   phase of a similar PVM program?
>c. Have we to write a different construct to enhance the simple barrier
>   call?

Cheers,
Phil

Philip Papadopoulos 
Research Staff Member
Mathematical Sciences Section
Oak Ridge National Laboratory
Oak Ridge, TN 37831-6367


