Newsgroups: comp.parallel.pvm
From: bhunt@brians.umd.edu (Brian R. Hunt)
Subject: Re: Experience with pvm+shm?
Organization: Project GLUE, University of Maryland, College Park, MD
Date: 28 Mar 95 20:12:22 GMT
Message-ID: <bhunt.796421542@brians.umd.edu>

johan@pa.twi.tudelft.nl (Johan Meijdam) writes:
>Johan Meijdam wrote:
>> At the moment, I am working on a PVM project in which I also use the UNIX/C
>> modules sys/shm.h and sys/sem.h. According to the PVM reference manual, PVM
>> uses those as well. My problem is that when I use asynchronous communication,
>> often (but not always) some of the first messages from one PVM to another
>> disappear. My PVM version is 3.3.5, my OS is HP-UX.

>> If you have any experience with PVM applications using those shared memory
>> functions, and know what problems they might cause, I would appreciate your
>> input.

>I forgot to mention, my machine is an HP 9000/735 and the version of HP-UX is
>9.05. To PVM, this is the HPPA architecture.

I think I have a similar problem on DEC 2100 4/275 "Sable" machines
running OSF/1 V3.0B with PVM 3.3.6 (architecture ALPHAMP).

I am using a simple master/slave test program, based on master1.c and
slave1.c from $PVM_ROOT/examples but with a pool of tasks as in hitc.f
from the same directory.  I find that when the tasks are very short
(up to 1/4 of a second, with 3 slaves running, in the case I have in
mind), one or more of the slaves will often go into a deep sleep early
on, its message never having been received by the master.  The other
slaves complete the remaining tasks, but the master is left waiting
for the one task it never heard back about.  I presume the problem is
due to two slaves sending messages at nearly the same time, with the
pvm_recv(-1, msgtype) call in the master seeing only one of them, but
I have not confirmed this.  The slave that doesn't terminate can
generally only be killed with "kill -9".
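
One way to pin down which reply goes missing is to have the master keep
a small table of tasks that were sent but not yet answered.  The sketch
below shows only that bookkeeping; it contains no PVM calls and is not
taken from master1.c or hitc.f.  The names pool_t, pool_sent,
pool_reply and pool_report are made up for illustration, and the fixed
MAX_TASKS limit is an assumption.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_TASKS 64     /* assumed upper bound, for illustration only */
#define NO_SLAVE  (-1)   /* marks a task with no reply outstanding */

/* Hypothetical master-side table: owner[t] is the index of the slave
   currently working on task t, or NO_SLAVE if nothing is pending. */
typedef struct {
    int owner[MAX_TASKS];
    int ntasks;
} pool_t;

/* Start with no task outstanding; ntasks is clamped to 0..MAX_TASKS. */
void pool_init(pool_t *p, int ntasks)
{
    int t;
    assert(p != NULL);
    if (ntasks < 0) ntasks = 0;
    if (ntasks > MAX_TASKS) ntasks = MAX_TASKS;
    p->ntasks = ntasks;
    for (t = 0; t < MAX_TASKS; t++)
        p->owner[t] = NO_SLAVE;
}

/* Record that a task was handed to a slave.  Returns 0, or -1 if
   either index is out of range. */
int pool_sent(pool_t *p, int task, int slave)
{
    if (task < 0 || task >= p->ntasks || slave < 0)
        return -1;
    p->owner[task] = slave;
    return 0;
}

/* Record that the reply for a task arrived.  Returns 0, or -1 if no
   reply was expected for it (stray or duplicate message). */
int pool_reply(pool_t *p, int task)
{
    if (task < 0 || task >= p->ntasks || p->owner[task] == NO_SLAVE)
        return -1;
    p->owner[task] = NO_SLAVE;
    return 0;
}

/* Count the tasks still awaiting a reply.  If out is not NULL, also
   print one line per such task naming the slave that holds it. */
int pool_report(const pool_t *p, FILE *out)
{
    int t, n = 0;
    for (t = 0; t < p->ntasks; t++) {
        if (p->owner[t] != NO_SLAVE) {
            n++;
            if (out != NULL)
                fprintf(out, "task %d: no reply from slave %d\n",
                        t, p->owner[t]);
        }
    }
    return n;
}
```

In a real master, pool_sent would be called next to each pvm_send and
pool_reply after each pvm_recv; calling pool_report(&pool, stderr) when
the master appears stuck would then name the task and slave in
question.  This only identifies the missing reply; it does not explain
or fix the loss.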

Now I've yet to have a problem when the tasks take more than a second,
even with larger numbers of processors, and I realize longer tasks are
preferable from an efficiency standpoint, but I'm worried that it's
only a matter of time before the problem shows up with longer tasks too.

I am not doing anything directly with shared memory, and I don't
really know whether this is a shared memory problem, though I thought
so initially.  I have mostly been running the master and slaves on
one (4-processor) machine, but I have also reproduced the problem (or
some other problem with similar symptoms) with the master and slaves
running on separate machines.

-- 
Brian R. Hunt
bhunt@ipst.umd.edu

