Newsgroups: comp.parallel.mpi
From: jareed@gamera.syr.edu (Judith Ann Reed)
Subject: SUMMARY (problem persists): SP2 specific problem, can't alloc. nodes
Organization: Syracuse University, Syracuse
Date: 3 Nov 1995 15:41:39 GMT
Message-ID: <47dd7j$ep0@newstand.syr.edu>

Thank you to the people on the sp-discussion list and on the 
comp.parallel.mpi newsgroup who replied to my questions. 

I described a problem we are having where mpich 1.0.11 on AIX 3.2.5, pssp 1.2
won't run over the switch (in user space), though it did several days ago. 
The info included below was useful, but the problem persists, as I will 
detail below.
	* Users probably must customize mpirun and mpirun.ch_mpl, or
	  use poe directly, to get their environment variables
	  set for individual preferences.
	* The error we are getting -
		"0031-124 Couldn't allocate nodes for parallel execution"
	  means that the nodes the JM picked already have their
	  switch adapters allocated.
	* MP_EUILIB should be set to "us" to use user space, "ip" to use IP.
	* MP_INFOLEVEL can be adjusted from 1-4(+?) to provide debugging
	  output.
	* MP_RMPOOL can be set to select a specific pool.
	* Look for other jobs on each node that might be using the switch -
	  search for poe, pmd, etc. and kill them if they are running. (none)
	* Try restarting the switch. (did that)
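For anyone trying the same thing, here is a minimal sketch of how the variables above might be set when invoking poe directly instead of going through mpirun/mpirun.ch_mpl. The binary name "hello" and the process count are placeholders, and the poe line itself is commented out since it needs an actual SP node with POE installed:

```shell
#!/bin/sh
# Sketch only: set the MP_* variables discussed above, then invoke poe
# directly. "hello" is a placeholder for your MPI binary.
MP_EUILIB=us       # "us" = user space over the switch; "ip" = run over IP
MP_INFOLEVEL=4     # 1-4: higher values give more debugging output
MP_RMPOOL=0        # select a specific Resource Manager pool
export MP_EUILIB MP_INFOLEVEL MP_RMPOOL

# Actual run (needs an SP node with POE installed):
#   poe hello -procs 4
echo "would run: poe hello -procs 4 (MP_EUILIB=$MP_EUILIB MP_INFOLEVEL=$MP_INFOLEVEL MP_RMPOOL=$MP_RMPOOL)"
```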
Our problem persists:
* Concern: In mpirun.ch_mpl, there is a note:
	"This only works on SPx running release 2 software and with
	 the high-performance switch" 
  It is not clear whether this note refers to mpich-1.0.11 as a whole, or
  only to the segment of the script the message appears in. Anyone know?
* Concern: A search of the 12 nodes shows that *two* of them have active "jmd"
  processes - one on the first node in the server list, one on the second
  node in the server list - stopping the RM and restarting it causes this
  behavior to recur. Do others see this on their systems? Is it the norm?
* Concern: "ipcs" shows shared memory segments allocated on most of the 12
  nodes by myself and one other user, though neither of us has processes
  or jobs active on any of them. Our reading on "ipcs" and "ipcrm"
  indicates these segments are still allocated - could this be at
  the root of the problem? I tried "ipcrm", but it just set the shared
  memory segment key to 0 and did not release the segment.
* Info: MP_EUILIB=ip does work, MP_EUILIB=us doesn't work.
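In case it helps anyone cleaning up the same mess, here is a sketch of one way to pick the suspect segments out of ipcs-style output and generate the matching ipcrm commands. The column positions are assumed from the AIX output appended below (T ID KEY MODE OWNER GROUP), and it only prints the commands rather than running them, so they can be inspected first:

```shell
# Sketch: read ipcs-style shared-memory lines on stdin and print an
# ipcrm command for each segment owned by the given user.
# Assumed AIX 3.2.5 column order (see the appended ipcs output):
#   T  ID  KEY  MODE  OWNER  GROUP
stale_shm() {
    awk -v u="$1" '$1 == "m" && $5 == u { print "ipcrm -m " $2 }'
}

# Example usage (prints commands only; review before running any):
#   ipcs | stale_shm judith
```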

I'll append output from ipcs, and from mpirun with debug turned on. Thanks
again for the info that's been sent; we anxiously await further insights!
---------------------------------------------------------------------
ipcs output (only info we thought relevant is attached, for brevity)
---------------------------------------------------------------------
***merlin1
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970 (??????????????)
T     ID     KEY        MODE       OWNER    GROUP
Shared Memory:
m      1   00000000 D-rw-------    alvin   system
***merlin2
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970
T     ID     KEY        MODE       OWNER    GROUP
Shared Memory:
m      1   00000000 D-rw-------    alvin   system
***merlin3
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970
T     ID     KEY        MODE       OWNER    GROUP
Shared Memory:
m      1 0x444d4131 --rw-------   judith   system
***merlin4
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970
T     ID     KEY        MODE       OWNER    GROUP
Shared Memory:
m      1 0x444d4131 --rw-------   judith   system
***merlin5
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970
T     ID     KEY        MODE       OWNER    GROUP
Shared Memory:
m      1 0x444d4131 --rw-------    alvin   system
***merlin6
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970
T     ID     KEY        MODE       OWNER    GROUP
Shared Memory:
m      1 0x444d4131 --rw-------    alvin   system
***merlin7
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970
T     ID     KEY        MODE       OWNER    GROUP
Shared Memory:
m      1 0x444d4131 --rw-------    alvin   system
***merlin8
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970
T     ID     KEY        MODE       OWNER    GROUP
Shared Memory:
m      1 0x444d4131 --rw-------    alvin   system
***merlin9
[ no active shared memory segments except root ]
***merlin10
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970
T     ID     KEY        MODE       OWNER    GROUP
Shared Memory:
m      1 0x444d4131 --rw-------   judith   system
***merlin11
IPC status from /dev/mem as of Sun Jan 25 08:02:56 EST 1970
T     ID     KEY        MODE       OWNER    GROUP
[ no active shared memory segments except root ]
***merlin12
[ no active shared memory segments except root ]
---------------------------------------------------
debug mpirun output
---------------------------------------------------
merlin1 {judith} 13: ./mpirun -np 4 hello
MP_EUILIB=us
MP_HOSTFILE=
MP_INFOLEVEL=4
MP_PROCS=4
MP_PULSE=0
MP_RMPOOL=0
INFO: DEBUG_LEVEL changed from 0 to 2
D1<L2>: mp_euilib = us
D1<L2>: node allocation strategy = 1
INFO: 0031-690  Connected to Resource Manager  
D1<L2>: Using css0 as euidevice for User Space job
D1<L2>: Forcing dedicated adapter for User Space job, task 0
D1<L2>: Forcing dedicated adapter for User Space job, task 1
D1<L2>: Forcing dedicated adapter for User Space job, task 2
D1<L2>: Forcing dedicated adapter for User Space job, task 3
D1<L2>: Elapsed time for call to jm_allocate: 26 seconds
ERROR: 0031-124  Couldn't allocate nodes for parallel execution.  
                     Exiting ...
ERROR: 0031-603  Resource Manager allocation for task: 0, 
 node: merlin1.npac.syr.edu, rc = JM_PARTIONCREATIONFAILURE
ERROR: 0031-603  Resource Manager allocation for task: 1, 
 node: merlin2.npac.syr.edu, rc = JM_PARTIONCREATIONFAILURE
ERROR: 0031-603  Resource Manager allocation for task: 2, 
 node: merlin3.npac.syr.edu, rc = JM_PARTIONCREATIONFAILURE
ERROR: 0031-603  Resource Manager allocation for task: 3, 
 node: merlin4.npac.syr.edu, rc = JM_PARTIONCREATIONFAILURE
ERROR: 0031-635  Non-zero status -1 returned from pm_mgr_init
D2<L2>: In pm_exit... About to call pm_remote_shutdown
D2<L2>: Elapsed time for pm_remote_shutdown: 0 seconds
D2<L2>: In pm_exit... About to call jm_disconnect
D2<L2>: Elapsed time for jm_disconnect: 0 seconds
D2<L2>: In pm_exit... Calling exit with status = -1 at 
               Fri Nov  3 10:25:27 1995


Judith Reed
judith@npac.syr.edu
systems@npac.syr.edu


-- 
 Judith Reed - sysmgr - Northeast Parallel Architecture Center
 jareed@syr.edu
 judith@npac.syr.edu
 "Old enough to be amazed at the technologies I encounter daily"

