Newsgroups: comp.parallel.pvm
From: ghogenso@u.washington.edu (Gordon Hogenson)
Subject: Unexpected problems with pvm_spawn and pvm_joingroup
Organization: University of Washington
Date: 20 Jul 1995 21:41:54 GMT
Message-ID: <3umij2$4ab@nntp5.u.washington.edu>

I'm having some trouble with a new installation of PVM 3.3.7.  I have
it installed on two machines, a SUN4 and an SGI5.  Specifically,
the SGI is an Indigo^2 running IRIX 5.2 and the SUN runs SunOS 4.1.4.
My program (see below) is started from the shell prompt on the SUN4
and spawns one process on the SGI5.

The first problem is that pvm_spawn always returns 0 on the first call
(with -7 returned in the tids array), but on the second call it
returns 1 as expected and the tid of the spawned task is correct.
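
Since the first call fails and a later one succeeds, an open-ended
retry loop would spin forever if the executable really were missing.
A minimal sketch of a bounded retry follows.  It deliberately does not
call PVM: spawn_attempt_fn, spawn_with_retry and fail_first_attempt are
hypothetical names, and the callback only stands in for one pvm_spawn
call that returns the number of tasks started.

```c
#include <stddef.h>

/* Hypothetical callback: one spawn attempt, returns tasks started. */
typedef int (*spawn_attempt_fn)(void *ctx);

/* Call attempt() until it starts at least one task or max_tries
 * attempts have been made.  Returns the result of the last attempt
 * (0 if max_tries <= 0) and stores the attempt count in *tries_used. */
int spawn_with_retry(spawn_attempt_fn attempt, void *ctx,
                     int max_tries, int *tries_used)
{
    int numt = 0;
    int tries = 0;

    while (tries < max_tries) {
        tries++;
        numt = attempt(ctx);
        if (numt > 0)
            break;
    }
    if (tries_used != NULL)
        *tries_used = tries;
    return numt;
}

/* Stand-in that mimics the behaviour described above: the first
 * call reports 0 tasks started, every later call reports 1. */
int fail_first_attempt(void *ctx)
{
    int *calls = (int *)ctx;

    (*calls)++;
    return (*calls == 1) ? 0 : 1;
}
```

In the real program the callback would wrap the pvm_spawn call, and the
caller would report the per-task error code once the retries run out.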

Furthermore, the program calls pvm_joingroup().  The parent process
gets a return value of 0, as expected.  But for some unknown reason,
the spawned process on the SGI gets a return value of 11, whereas I
would have expected it to be 1.  The documentation says that it
counts upward and that the return value is the first unused id.  Why
'11' then?  Is the return value supposed to be an arbitrary number
or is this a bug?

Here's the program (same program on both machines):


#include <stdio.h>
#include "pvm3.h"

int tids[2];

int main()
{ 
  int i,j;
  int info;
  int mytid;
  int me;

  mytid = pvm_mytid();
  
  /* Join a group and if I am the first instance */
  /* i.e., me = 0, spawn more copies of myself */
  
  me = pvm_joingroup("foo");
  printf("me = %d mytid = %d\n", me, mytid);
  if (me >= 0 && me < 2)  /* tids[] has 2 slots; me may be out of range */
    tids[me] = mytid;
  if (me == 0)
    {
      int numt = 0;
      while (! numt)  /* keep trying until pvm_spawn succeeds */
        {
          numt = pvm_spawn("tst2", (char**)0, 0, "", 1, &tids[1]);
          printf("pvm_spawn returned %d\n", numt);
          if (numt == 0)
            {
              printf("tids array tids[1] is %d\n", tids[1]);
            }
        }
    }

  printf("tids[0] = %d", tids[0]);
  printf("; tids[1] = %d\n", tids[1]);

  /* Wait for everyone to start up before proceeding */
  
  printf("Waiting for everyone to start up.\n");
  info = pvm_barrier("foo", 2);
  printf("pvm_barrier returned %d\n", info);
  /*-----------------------------------------------------------*/

  return 0;
}

The local output is:

me = 0 mytid = 262204
pvm_spawn returned 0
tids array tids[1] is -7
pvm_spawn returned 1
tids[0] = 262204; tids[1] = 524316
Waiting for everyone to start up.
pvm_barrier returned 0

The -7 is unexplained; that error code (PvmNoFile) translates as
'Specified executable cannot be found'.
Calling pvm_spawn again (as in the code) always solves this problem.
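
For reference, a small sketch of how the per-task codes that pvm_spawn
leaves in the tids array can be turned into names.  The numeric values
are assumptions copied from the constants in pvm3.h as I understand
them for PVM 3.3, so check them against the installed header; err_entry
and spawn_err_name are hypothetical names, not part of PVM.

```c
#include <stddef.h>

struct err_entry {
    int code;
    const char *name;
};

/* Assumed values of the pvm_spawn error constants from pvm3.h. */
static const struct err_entry spawn_errs[] = {
    {  -2, "PvmBadParam" },
    {  -6, "PvmNoHost"   },
    {  -7, "PvmNoFile"   },  /* executable not found */
    { -10, "PvmNoMem"    },
    { -14, "PvmSysErr"   },
    { -27, "PvmOutOfRes" },
};

/* Return the symbolic name for a code, or "unknown" if not listed. */
const char *spawn_err_name(int code)
{
    size_t i;

    for (i = 0; i < sizeof spawn_errs / sizeof spawn_errs[0]; i++)
        if (spawn_errs[i].code == code)
            return spawn_errs[i].name;
    return "unknown";
}
```

With this table, the -7 stored in tids[1] maps to PvmNoFile.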

The remote output, however, shows that pvm_joingroup is returning 11,
not 1, the next consecutive number available.

[t80040000] [t8001b] me = 11 mytid = 524315
[t80040000] [t8001b] tids[0] = 0; tids[1] = 0
[t80040000] [t8001b] Waiting for everyone to start up.
[t80040000] [t8001b] pvm_barrier returned 0

The above was found on the local machine in the /tmp/pvml.XXX file.
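
The [t8001b] prefix in that log is the task id in hex: 0x8001b is
524315, the mytid printed on the same line.  A minimal sketch of
splitting a tid into its host and local parts follows.  It assumes the
PVM 3 layout of a 12-bit host field above an 18-bit local field; the
mask values and function names here are my own and should be checked
against the PVM sources.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumed PVM 3 tid layout: bits 18-29 host, bits 0-17 local. */
#define TID_HOST_MASK  0x3ffc0000
#define TID_LOCAL_MASK 0x0003ffff

/* Host part of a tid; should match the DTID shown by "conf". */
int tid_host_part(int tid)
{
    return tid & TID_HOST_MASK;
}

/* Local part of a tid; identifies the task on that host. */
int tid_local_part(int tid)
{
    return tid & TID_LOCAL_MASK;
}

/* Write the tid in the "t<hex>" form used in the pvml log. */
int format_tid(int tid, char *buf, size_t len)
{
    return snprintf(buf, len, "t%x", (unsigned int)tid);
}
```

Under that assumption, 262204 splits into host 0x40000 and local 0x3c,
and 524315 into host 0x80000 and local 0x1b, which agrees with the DTID
column of the conf output.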

Expected output:
[t80040000] [t8001b] me = 1 mytid = XXXXXXX

Any suggestions?  Other programs provided with PVM, such as spmd.c,
work fine.  "hello"/"hello_other" work, but usually not on the first
try; i.e., invoking "hello" once fails, and invoking it again works.

Other possibly useful information:

% pvm
pvmd already running.
pvm> conf
2 hosts, 1 data format
                    HOST     DTID     ARCH   SPEED
                     t13    40000     SUN4    1000
                t13graph    80000     SGI5    1000
pvm> 

Gordon.
-- 
---------------------------------------------------------------
Gordon J. Hogenson                       work: (505) 667-9471
ghogenso@u.washington.edu                home: (505) 661-6753
---------------------------------------------------------------

