Newsgroups: comp.parallel.pvm
From: wos@prism.uvsq.fr (Stephane WOILLEZ)
Subject: Re: Experience with pvm+shm?
Organization: Laboratoire PRiSM, Universite de Versailles - St Quentin, FRANCE
Date: 13 Apr 1995 14:08:17 GMT
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Message-ID: <3mjb8h$86a@soleil.uvsq.fr>

In article <bhunt.796421542@brians.umd.edu>, bhunt@brians.umd.edu (Brian R. Hunt) writes:
|> johan@pa.twi.tudelft.nl (Johan Meijdam) writes:
|> >Johan Meijdam wrote:
|> >> At the moment, I am working on a PVM project in which I use the UNIX/C modules
|> >> sys/shm.h and sys/sem.h as well. According to the PVM reference manual PVM also
|> >> uses those as well. My problem is that when I use asynchronous communication,
|> >> often (but not always) some of the first messages from one PVM to another
|> >> disappear. My PVM version is 3.3.5, my OS is HP-UX.
|> 
|> >> If you have any experience with PVM applications using those shared memory
|> >> functions, and know what problems they might cause, I would appreciate your
|> >> input.
|> 
|> >I forgot to mention, my machine is a HP 9000/735 and the version of HPUX is
|> >9.05. To PVM, this is a HPPA architecture.
|> 
|> I have a similar (I think) problem using DEC 2100 4/275 "Sables",
|> OSF/1 V3.0B, PVM 3.3.6, architecture ALPHAMP.
|> 
|> I am using a simple master/slave test program, based on master1.c and
|> slave1.c from $PVM_ROOT/examples but with a pool of tasks as in hitc.f
|> from the same directory.  I find that when the tasks are very short
|> (up to 1/4 of a second, with 3 slaves running, in the case I have in
|> mind), one or more of the slaves will often go into a deep sleep early
|> on, its message never having been received by the master.  The other
|> slaves complete the remaining tasks, but the master is left waiting
|> for that one task it never heard back about.  I presume the problem is
|> due to 2 slaves sending messages at nearly the same time and the
|> pvm_recv(-1, msgtype) call in the master only seeing one of the
|> messages.  The slave which doesn't terminate can generally only be
|> killed with "kill -9".
|> 
|> Now I've yet to have a problem when the tasks take more than a second,
|> even with larger numbers of processors, and I realize longer tasks are
|> preferable from an efficiency standpoint, but I'm worried that it's
|> only a matter of time before I have a problem in this case.
|> 
|> I am not doing anything directly with shared memory, and I don't
|> really know whether this is a shared memory problem, though I thought
|> so initially; I primarily have been running the master and slaves on
|> one (4-processor) machine, but have reproduced the problem (or some
|> other problem with similar symptoms) with the master and slaves
|> running on separate machines.
|> 
|> -- 
|> Brian R. Hunt
|> bhunt@ipst.umd.edu

I have the same problem as you. I use PVM 3.3.7 on Sun SPARC workstations,
architecture SUN4. My program is split into several nodes. Each node is made
of 2 processes, one for the computation and the other one for PVM I/O.
Communication between the 2 processes within a node is done with shared memory
and 2 signals. During the computation, each node exchanges some of its data
with the other nodes. It seems that, at the beginning of the computation, one
PVM message is simply lost. I check every PVM call and no error is returned,
which (normally) means that everything is fine. The problem is that one of my
messages is lost, and that deadlocks the whole algorithm. Every communicating
process owns its personal tag. If what you say is correct, the solution may lie
in never calling pvm_recv with -1 in the tid field, which implies that we have
to develop an algorithm that checks every task's queue using pvm_nrecv.
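Such a master loop might look like the sketch below. It polls each slave's queue
explicitly with pvm_nrecv instead of blocking on pvm_recv(-1, msgtype). The tag
RESULT_TAG, the tids[] array, and the single-int result format are illustrative
assumptions, not taken from either program:

```c
/* Sketch: collect one result from each of nslaves tasks by polling
 * every tid explicitly, instead of pvm_recv(-1, msgtag).  Assumes
 * tids[] was filled by pvm_spawn() and each result message carries
 * one packed int.  Illustrative only, not tested against PVM 3.3.
 */
#include <stdio.h>
#include <unistd.h>
#include "pvm3.h"

#define RESULT_TAG 42              /* assumed message tag */

void collect_results(int *tids, int nslaves)
{
    int remaining = nslaves;

    while (remaining > 0) {
        int i, got_one = 0;

        for (i = 0; i < nslaves; i++) {
            if (tids[i] < 0)       /* already collected */
                continue;

            /* non-blocking check of this particular slave's queue */
            int bufid = pvm_nrecv(tids[i], RESULT_TAG);
            if (bufid > 0) {
                int result;
                pvm_upkint(&result, 1, 1);
                printf("slave t%x -> %d\n", tids[i], result);
                tids[i] = -1;      /* mark this slave as done */
                remaining--;
                got_one = 1;
            } else if (bufid < 0) {
                pvm_perror("pvm_nrecv");
            }
        }

        if (!got_one)
            /* nothing arrived on this pass; back off instead of
               spinning (pvm_trecv with a timeout is an alternative) */
            sleep(1);
    }
}
```

If a slave's message really is lost rather than just reordered, this loop will
of course still wait forever for that tid, but at least you can see exactly
which slave never answered.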

I must also say that sometimes the PVM daemon of one of my computation nodes
generates an error like "empty message field" or "incomplete message".

The problem with this deadlock in my program is that I am not sure whether it
comes from my code or from PVM :-)

If somebody knows something about this problem, or holds the solution, please
post it or mail it to me. Any comment is also welcome.

  Thanks,

    Stephane.

--------------------------------------------------------------------------
E-mail :
            Stephane.Woillez@prism.uvsq.fr
Web :
            http://www.prism.uvsq.fr/public/wos
--------------------------------------------------------------------------

