Newsgroups: comp.parallel.pvm
From: pauls@iil.intel.com (Paul Schuster)
Subject: Re: building robust applications
Organization: Intel Israel (74) Ltd.
Date: 2 Jun 1996 07:51:50 GMT
Message-ID: <4orh6m$nrp@ilx018.iil.intel.com>

Graham Nash (gnash@ncube.com) wrote:
: Paul Schuster wrote:
: > 
: > I am using PVM 3.3.10 as a message passing kernel for a prototype
: > distributed data collection program.
: > 
: > One of my PVM tasks monitors the others + the pvmd's and attempts to
: > restart them if they fail. Most of the time this works fine, but I
: > occasionally see that attempts to restart a failed pvmd repeatedly
: > fail due to the existence of an old /tmp/pvmd.<UID> file. As soon as
: > I remove that file on the remote machine, the restart succeeds. (Not
: > sure if it is because of this file or the named pipe it points to).
: > 
: 
: pvmd3 tries hard to remove this file when it fails since the rules are
: that if the file exists, it cannot restart (just what you have
: discovered). I think you should look hard at why the daemon is failing.
: It cannot catch every means of being killed (kill -9, for example),
: but it does catch a lot of them. In normal circumstances, the daemon
: should not even fail (it is fairly robust itself) so maybe if this is
: happening frequently, you have discovered a bug, and if you report it,
: we will all eventually benefit.
: 
: Otherwise, you could always modify the daemon startup script (pvmd) to
: include "rm /tmp/pvmd.<uid>". This ought to do the trick, but does not
: inspire confidence in the product.
: 
: Incidentally, try running the daemon with debug tracing on, for it
: could well help you isolate any problems you are experiencing. The trace is
: stored into "/tmp/pvml.<uid>".
: 
: Hope this helps
: 
: Graham Nash

Thanks for the response.

Actually, the reason for the failure is that the application is running
over a WAN, and if the WAN fails, PVMD exits since it cannot see the master
PVMD. I will try the latest PVM patch, which tries to handle unreliable
networks. I will also try to find out why the /tmp/pvmd.<UID> file is not
removed when PVMD exits in this scenario.

BTW, would it not be sensible to change the /tmp/pvmd.<UID> file to contain
the PID of the PVMD as well as the location of the named pipe? That way
PVMD could make a better check at startup for an existing process.
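
To make the suggestion concrete, here is a rough sketch of the check I have
in mind. It is written in Python for brevity, not taken from the PVM source,
and it assumes a hypothetical two-line file format (location on the first
line, PID on the second). The function names are made up for illustration;
the current pvmd does none of this as far as I know.

```python
import os


def write_lock(path, location, pid=None):
    """Write the daemon's location and PID (hypothetical two-line format)."""
    pid = os.getpid() if pid is None else pid
    with open(path, "w") as f:
        f.write(f"{location}\n{pid}\n")


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX signal-0 probe)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not a PID
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_stale(path):
    """Return True if the file exists but does not name a live process."""
    try:
        with open(path) as f:
            lines = f.read().split("\n")
        pid = int(lines[1])
    except FileNotFoundError:
        return False  # no file at all, so nothing to clean up
    except (IndexError, ValueError):
        return True  # assumption: a file without a readable PID is stale
    return not pid_alive(pid)
```

At startup the daemon (or my monitor task) could call lock_is_stale() and
remove the file only when it returns True. The check is not airtight: a PID
can be reused by an unrelated process after a crash, so a stricter version
would also confirm that the live process really is a pvmd.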

Thanks,

Paul Schuster.

