Newsgroups: comp.parallel.pvm
From: Donald Krieger <don@neuronet.pitt.edu>
Subject: Runaway pvmd's
Organization: University of Pittsburgh
Date: 22 Dec 1994 20:30:26 GMT
Message-ID: <3dcnl2$nfe@usenet.srv.cis.pitt.edu>

Hi,

	We have a PVM application composed of a PVM-based daemon
and a C shell script which adds PVM hosts as computers appear on our
network and then spawns the daemon.  The daemon starts with a chunk
of code which detects if there is currently another daemon running
on the same host.  If so, the daemon which was most recently spawned
dies.  This code uses pvm_tidtohost() together with the group
functions - all the daemons are enrolled in a single group.  Thus
when all is running properly we see a single pvmd and a single
daemon on each host.
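
The startup check described above can be sketched roughly as follows.
This is an illustration in Python, not the daemon's actual code.  It
assumes each daemon can collect (instance number, host id) pairs for
all group members, e.g. via pvm_gettid() and pvm_tidtohost(), and that
a lower group instance number means the daemon joined earlier.  The
name should_exit is hypothetical.

```python
def should_exit(my_inst, members):
    """Return True if an older daemon already runs on this daemon's host.

    members: list of (instance_number, host_id) pairs, one per group
    member, including this daemon itself.  In a real daemon these would
    come from the PVM group and task calls; here they are plain data.
    """
    host_of = dict(members)
    my_host = host_of[my_inst]
    # Assumption: a lower instance number means the daemon joined earlier,
    # so the newcomer is the one that gives way.
    return any(host == my_host and inst < my_inst for inst, host in members)


# Instances 0 and 2 share a host id; only the later one (2) should exit.
members = [(0, 0x40000), (1, 0x80000), (2, 0x40000)]
print([inst for inst, _ in members if should_exit(inst, members)])
```

Note that a check of this kind only sees daemons registered with the
same pvmd; a daemon attached to a second pvmd on the same host would
not appear in the list.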
	On several occasions we have seen more than one host, each
running multiple copies of the pvmd and the daemon.  On each host
where this is happening, one of the pvmd's is getting nearly 100%
of the CPU.  When we halt pvm using the console, at least one and often
both of the pvmd's and their daemons remain and must be explicitly
killed.
	We are running pvm-3.3.4 on a network of 30 to 40
HP RISC workstations (HPPA).  Our installation is compiled with
the standard configuration allowing one pvmd/host.  An additional
bit of information - when there are runaway pvmd's on a node, any
application we start on that node blocks indefinitely when it
attempts to connect to PVM with pvm_mytid().
	We have been able to reduce the frequency with which this
very troublesome fault occurs by modifying our script in the following
way.  When the script detects that a machine has come on the network
(with a successful ping) it waits for at least 60 seconds before
attempting to add that host.  We reasoned that this allows time for
the other networking daemons to come up as the machine is booting,
e.g. the one that supports rsh.
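
One possible variation on the fixed delay, sketched here in Python
rather than in the C shell script itself, is to poll the rsh service
port until it accepts a connection.  This is only a sketch under
assumptions: the function name wait_for_port is hypothetical, port 514
is the standard TCP port for rsh, and an accepted connection is taken
to mean the service is up, which may not hold on every system.

```python
import socket
import time


def wait_for_port(host, port=514, timeout=60.0, interval=2.0):
    """Poll host:port until a TCP connection succeeds or timeout expires.

    Returns True once a connection is accepted, False if the deadline
    passes first.  Port 514 is the standard rsh (shell) service port.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # refused, unreachable, or timed out; retry below
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A script could call wait_for_port(hostname) after the first successful
ping and add the host to PVM only when it returns True.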
	This appears to be a problem with PVM although we admit that
we are adding hosts in a nonstandard manner and are therefore pushing
the system in an unusual way.  Any help would be most welcome.
Thanks.

					Don Krieger


