Newsgroups: comp.parallel.pvm
From: Graham Nash <gnash@ncube.com>
Subject: Re: building robust applications
Organization: Integratek
Date: Tue, 28 May 1996 16:31:54 GMT
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <31AB2A7A.1D06@ncube.com>

Paul Schuster wrote:
> 
> I am using PVM 3.3.10 as a message passing kernel for a prototype
> distributed data collection program.
> 
> One of my PVM tasks monitors the others + the pvmd's and attempts to
> restart them if they fail. Most of the time this works fine, but I
> occasionally see that attempts to restart a failed pvmd repeatedly
> fail due to the existence of an old /tmp/pvmd.<UID> file. As soon as
> I remove that file on the remote machine, the restart succeeds. (Not
> sure if it is because of this file or the named pipe it points to).
> 

pvmd3 tries hard to remove this file when it exits, because the rule is
that the daemon will not start while the file exists (which is just what
you have discovered). I think you should look hard at why the daemon is
failing. It cannot clean up after every way of being killed (kill -9, for
example, cannot be caught), but it does handle a lot of them. Under normal
circumstances the daemon should not fail at all (it is fairly robust
itself), so if this is happening frequently you may have discovered a bug,
and if you report it, we will all eventually benefit.

Otherwise, you could always modify the daemon startup script (pvmd) to
include "rm /tmp/pvmd.<uid>" before the daemon is started. This ought to
do the trick, provided no pvmd is already running for that user on the
machine, but it does not inspire confidence in the product.
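
If you do go the "rm" route, something along these lines is a bit safer
than an unconditional remove. This is only a rough sketch, not part of the
PVM distribution: the helper names (pvmd_file, pvmd_running,
clean_stale_pvmd) are made up for illustration, and it assumes that <uid>
is the numeric user id and that "no process called pvmd3 owned by that
user" is good enough evidence that the file is stale.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of PVM.

# Path of the daemon's address file for a numeric uid.
# $1 = uid, $2 = directory (assumed to default to /tmp)
pvmd_file() {
    printf '%s/pvmd.%s\n' "${2:-/tmp}" "$1"
}

# Succeeds if the given user appears to own a running pvmd3.
# Assumption: the daemon shows up in ps under a name containing "pvmd3".
pvmd_running() {
    ps -u "$1" -o comm= 2>/dev/null | grep -q 'pvmd3'
}

# Remove the address file only when no daemon appears to be running.
# $1 = uid, $2 = directory (optional)
clean_stale_pvmd() {
    uid=$1
    f=$(pvmd_file "$uid" "${2:-/tmp}")
    if [ -e "$f" ] && ! pvmd_running "$uid"; then
        rm -f "$f"
    fi
}

# Possible use near the top of the startup script:
#   clean_stale_pvmd "$(id -u)"
```

The check before the rm matters because removing the file from under a
live daemon would leave tasks on that host unable to find it.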

Incidentally, try running the daemon with debug tracing on, for it
could well help you isolate any problems you are experiencing. The trace
is written to "/tmp/pvml.<uid>".
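
For looking at that trace from your monitoring task, a small helper like
the following may be handy. Again, this is just a sketch under
assumptions: pvml_file and show_pvmd_log are names I made up, <uid> is
taken to be the numeric user id, and the function only reads whatever the
daemon has already written; it does not turn tracing on.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of PVM.

# Path of the daemon's log file for a numeric uid.
# $1 = uid, $2 = directory (assumed to default to /tmp)
pvml_file() {
    printf '%s/pvml.%s\n' "${2:-/tmp}" "$1"
}

# Print the last N lines of the log, or complain if it is not readable.
# $1 = uid, $2 = number of lines (default 20), $3 = directory (optional)
show_pvmd_log() {
    f=$(pvml_file "$1" "${3:-/tmp}")
    if [ -r "$f" ]; then
        tail -n "${2:-20}" "$f"
    else
        echo "no readable log at $f" >&2
        return 1
    fi
}

# Possible use after a failed restart:
#   show_pvmd_log "$(id -u)" 50
```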

Hope this helps

Graham Nash

