Newsgroups: comp.parallel.pvm
From: rff@inf.rl.ac.uk (Ronald Fowler)
Subject: Re: Unexpected problems with pvm_spawn and
Organization: Rutherford Appleton Laboratory
Date: 27 Jul 1995 16:21:41 GMT
Message-ID: <3v8eel$1uek@unixfe.rl.ac.uk>

In article 1HM@cerc.wvu.edu, ericd@backus (Eric Dye) writes:
>Gordon Hogenson (ghogenso@u.washington.edu) wrote:
>: I'm having some trouble with a new installation of PVM 3.3.7.  I have
>: it installed on two machines, a SUN4 and an SGI5. Specifically,
>: the SGI is an IRIX 5.2 Indigo^2 and the SUN is SunOS 4.1.4. My program
>: (see below) is run (started from the shell prompt) on the SUN4, 
>: and spawns 1 process on the SGI5.
>
>: The first problem is that pvm_spawn always returns 0 on the first call,
>: (with -7 returned in the tids array), but on the second call it
>: returns 1 as expected and the 'tid' of the spawned task is correct.
>
>: Furthermore, the program calls pvm_joingroup().  The parent process
>: gets a return value of 0, as expected.  But for some unknown reason,
>: the spawned process on the SGI gets a return value of 11, whereas I
>: would have expected it to be 1.  The documentation says that it
>: counts upward and that the return value is the first unused id.  Why
>: '11' then?  Is the return value supposed to be an arbitrary number
>: or is this a bug?
>
>: Here's the program (same program on both machines):
>

[ code deleted]

>: Calling pvm_spawn again (as in the code) always solves this problem.
>
>: The remote output however shows that pvm_joingroup is returning 11, not 1,
>: the next consecutive number available.
>
>: [t80040000] [t8001b] me = 11 mytid = 524315
>: [t80040000] [t8001b] tids[0] = 0; tids[1] = 0
>: [t80040000] [t8001b] Waiting for everyone to start up.
>: [t80040000] [t8001b] pvm_barrier returned 0
>
>: The above was found on the local machine in the /tmp/pvml.XXX file.
>
>: Expected output:
>: [t80040000] [t8001b] me = 1 mytid = XXXXXXX
>
>: Any suggestions?  Other programs such as spmd.c provided with PVM,
>: work fine.  "hello"/"hello other" work but usually not on the first
>: try.  I.e., invoking "hello" once fails, twice it works.
>
>: Other possibly useful information:
>
>: % pvm
>: pvmd already running.
>: pvm> conf
>: 2 hosts, 1 data format
>:                     HOST     DTID     ARCH   SPEED
>:                      t13    40000     SUN4    1000
>:                 t13graph    80000     SGI5    1000
>: pvm> 
>
>
>I am fairly new to PVM but I think I am having problems similar to
>Gordon's.  I am running PVM on a SUN4 and SUNMP.  When running the
>example programs several of them worked every other time.  I also
>tried a quicksort.c program which I found on one of the PVM pages.
>It also works every other time.  What I have found is that if
>I only use the SUNMP as my virtual machine then the programs work
>fine.  If the SUN4 is the only machine then they never work (give
>wrong answers on some programs and sometimes lock-up on others).  In most 
>cases I use the same code for both architectures.  How can I find out 
>what's going wrong when running on the SUN4?  Also I am having trouble
>printing inside a spawned task, what is the easiest or best way to
>do this?  Thanks.
>
>Eric Dye
>Morgantown, WV
>ericd@cs.wvu.edu

Just for contrast: I have pvm3.3.6, and for me SGI5 and SUN4 work
fine, and SUNMP works for tasks that stay on SUNMP, but communication
from SUNMP to SGI5 or SUN4 usually fails, often just hanging in the
first send/receive. I tried pvm3.3.7 on the SUNMP only (keeping 3.3.6
on the SUN4), and that seems to cure it, at least for my simple
round-trip message passing loop SUNMP->SUN4 and back, though the
"testall" example code still fails when it tries to delete a host.
But should I risk upgrading the SUN4 and SGI5 to 3.3.7?

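As I expect you know, the "working every other time" behaviour is most
likely due to the odd way pvm does its task placement: as far as I can
tell, with the default spawn flag it hands successive spawns to the
hosts in turn, so if the task can only be started on one of the two
hosts, every other pvm_spawn fails. The -7 in the tids array is
PvmNoFile, i.e. the pvmd on the chosen host could not find the
executable.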
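
A toy model of that alternation, assuming placement really is
round-robin over the hosts and that the executable is found on only one
of them (t13graph here, which is my guess from the task id in the quoted
log). It is plain Python and makes no PVM calls; make_spawner and
run_trials are names made up for this sketch, and -7 stands for
PvmNoFile.

```python
# Toy model only: this does not call PVM. It assumes the default
# placement hands successive spawn requests to the hosts in turn
# (round-robin) and that the executable exists on only some hosts.

PVM_NO_FILE = -7  # PvmNoFile: executable not found on the chosen host


def make_spawner(hosts, hosts_with_binary):
    """Return a spawn() function that picks hosts round-robin."""
    state = {"next": 0, "tid": 0}

    def spawn():
        # Pick the next host in turn, regardless of earlier failures.
        host = hosts[state["next"] % len(hosts)]
        state["next"] += 1
        if host not in hosts_with_binary:
            return host, PVM_NO_FILE
        state["tid"] += 1  # stand-in for a real task id
        return host, state["tid"]

    return spawn


def run_trials(hosts, hosts_with_binary, n):
    """Call spawn() n times and collect (host, result) pairs."""
    spawn = make_spawner(hosts, hosts_with_binary)
    return [spawn() for _ in range(n)]


for host, result in run_trials(["t13", "t13graph"], {"t13graph"}, 4):
    status = "ok" if result > 0 else "failed (%d)" % result
    print("%-9s %s" % (host, status))
```

In this model the first, third, ... spawns land on t13 and fail, while
the second, fourth, ... land on t13graph and succeed, which is the
"fails once, works the next time" pattern. If that is what is going on,
installing the executable for every architecture in the virtual
machine should make the failures go away.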

Ron Fowler





