Newsgroups: comp.parallel.pvm
From: dwhite@trout.mrj.com (David White)
Subject: PVM Startup Blues Revisited
Organization: MRJ, Inc./Oakton, Virginia, USA
Date: 1 Jun 1994 12:29:49 -0400
Message-ID: <2sid1t$gh@trout.mrj.com>



I'm *very* sorry about this.  I was unable to post until last week, and then as
soon as I posted the questions below, my news server went down and any replies
were lost.  If there is an archive site, I'd love to know about it (and how to
access the archives intelligently).  With your pardon, below is my question:

     I am having some problems with a system using PVM.  The system starts up,
and then starts PVM on all the machines in the network.  If the system is then
started on another machine, the two copies can talk.  The problem occurs if one
of the PVM daemons dies a flaming, grisly death, and a user starts the
application on that machine.  The application starts PVM on that machine, but
this new PVM doesn't know about the daemons still running on the other
machines, and tries to start them.  In any event, the two PVMs never
communicate.

     And what is particularly ugly is that if the daemon which dies is on the
machine on which the application (and therefore PVM) was first started, the 
other daemons are "orphaned".  It is then necessary to log on to every machine 
on the network and kill off the daemons manually, and clean up the pvmd.uid 
files.  If you are foolish enough to run the console on one of the other
machines, and try to halt the system, the console hangs.  Such strong master-
slave behaviour is bad news.
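
     The file half of that manual cleanup could be scripted per machine.  This
is only a sketch under assumptions: it takes the leftover file to be named
pvmd.<uid> inside a directory passed by the caller (I believe that is /tmp on
most installs, but check yours), and the two function names are made up for
illustration.  It does not kill any daemon, and it should only be run once the
daemon on that machine is known to be dead.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "<dir>/pvmd.<uid>" into buf (assumed naming, see text above).
 * Returns 0 on success, -1 if the path did not fit. */
int pvmd_file_path(char *buf, size_t len, const char *dir, uid_t uid)
{
    int n = snprintf(buf, len, "%s/pvmd.%lu", dir, (unsigned long)uid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Remove a leftover daemon file for this uid.
 * Returns 1 if a file was removed, 0 if none was there, -1 on error. */
int remove_stale_pvmd_file(const char *dir, uid_t uid)
{
    char path[1024];

    if (pvmd_file_path(path, sizeof path, dir, uid) != 0)
        return -1;
    if (access(path, F_OK) != 0)
        return 0;                       /* nothing to clean up */
    return unlink(path) == 0 ? 1 : -1;
}
```

Something like remove_stale_pvmd_file("/tmp", getuid()) would then be run on
each host, for example through rsh.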

    In considering my options, I have come up with the following:

   1.  Use pvm_notify to let the application know when a daemon dies, and have
       the application restart the daemon on that machine.  This assumes the
       message sent to the application upon the daemon's death contains somehow
       the machine on which the dead daemon resided.  It also assumes that the
       dead daemon screams on its way out, which does not appear to be the
       case, since very often the console still thinks the machine is in the
       PVM.  I bet this approach wouldn't work!

   2.  Have the application check for PVM daemons on other machines before it
       starts one itself, and if one is found, request it to restart the daemon
       on its own machine.  This has flaws also, such as the fact that even if
       it works in the general case, it will not work if the daemon which died
       is the master.  Besides, I'm not sure how to go about this in detail.

   3.  Upon startup, have the application check for the presence of PVM daemons
       on each of the machines in the network.  If found, kill them and clean 
       up.  The down side is, of course, that the application which started the
       daemons will still be around, and will scream bloody murder, filling the
       terminal with unhappy pvm library error messages.  In PVM 3.1, if the
       daemon died, any applications which were connected to it also died, but
       this does not appear to be the case with PVM 3.2.  So, not only is this
       unbearably fascist, it still wouldn't work completely.  It would have to
       be augmented with a search and destroy operation for the application
       itself on each machine, which, in addition to also being fascist, would
       be a pain, since the system actually consists of at least four different
       executables.

   4.  Chuck PVM and do the socket dance myself.  I would really rather not
       adopt this solution, since PVM is very convenient, does even more than I
       absolutely need, gives decent enough performance for the small packet
       sizes I use, handles any heterogeneity, and hides all the socket
       incantations so I can think about more productive things.  However, if I
       can't solve this problem, I'll have to do something.  Maybe #5...

   5.  Hack the PVM code itself, to do just what I want.  My stomach is jumping
       just thinking about it...  I once was forced to work on a system in
       which the mindless dweebs had hacked the X intrinsics...  They'll have 
       to shoot me next time!!!!!  I hate hacking packages!!!
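
     On the worry in option 1 about whether the death message identifies the
machine: as far as I can tell from the pvm_notify documentation, a
PvmHostDelete notification carries the task id of the pvmd that went away, not
a host name (worth confirming against the 3.2 man page).  The application
would therefore need its own table, presumably filled in earlier from
pvm_config (the hi_tid and hi_name fields of struct pvmhostinfo), to turn that
id back into a name.  The sketch below shows only that lookup; struct
host_entry and host_for_dtid are hypothetical names, and no PVM calls are made.

```c
#include <stddef.h>

/* Hypothetical cache entry, one per host, saved while the virtual
 * machine was still healthy. */
struct host_entry {
    int dtid;           /* task id of the pvmd on that host */
    const char *name;   /* host name as given in the host file */
};

/* Map the pvmd tid from a host-delete notification back to a host name.
 * Returns NULL if the tid is not in the table. */
const char *host_for_dtid(const struct host_entry *tab, size_t n, int dtid)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tab[i].dtid == dtid)
            return tab[i].name;
    return NULL;
}
```

The name that comes back is what one would hand to pvm_addhosts to try
bringing the host back.  None of this helps if the notification never arrives,
which is the second worry in option 1.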

     For a little more info, the steps the application takes at startup are:

   1.  Perform a pvm_mytid.  If it bounces, no PVM daemon exists.  Read in a
       configuration file listing the hosts to be used.  Write out a file in
       PVM's preferred format listing these hosts, and start the PVM daemon
       with a fork and execl.  Then start pvmgs with a fork and execl.  Now,
       perform the pvm_mytid.  (Actually, there *is* one more step.  Since the
       development stage tended to leave pvmd.uid files around, they are
       eliminated from all machines in the PVM.)

   2.  Start the other executables.  They in turn each perform step 1, which
       should succeed immediately.  However, if the daemon on this machine has
       died, as stated above, a whole new virtual machine will be started.
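
     For concreteness, the file-writing and fork/execl parts of step 1 look
roughly like the sketch below.  It is not the actual application code:
write_hostfile and start_daemon are illustrative names, the host file uses only
the simplest form PVM accepts (one host name per line, no options), and the
daemon path is whatever the caller passes in.  The pvm_mytid calls before and
after are left out so the fragment compiles without the PVM library.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one host name per line.
 * Returns 0 on success, -1 on any I/O error. */
int write_hostfile(const char *path, const char *const *hosts, int nhosts)
{
    FILE *fp = fopen(path, "w");
    int i;

    if (fp == NULL)
        return -1;
    for (i = 0; i < nhosts; i++) {
        if (fprintf(fp, "%s\n", hosts[i]) < 0) {
            fclose(fp);
            return -1;
        }
    }
    return fclose(fp) == 0 ? 0 : -1;
}

/* Fork, then exec the daemon with the host file as its only argument.
 * Returns the child's pid in the parent, or -1 if fork failed. */
pid_t start_daemon(const char *pvmd_path, const char *hostfile)
{
    pid_t pid = fork();

    if (pid == 0) {
        execl(pvmd_path, pvmd_path, hostfile, (char *)NULL);
        _exit(127);     /* reached only if execl itself failed */
    }
    return pid;
}
```

Nothing here can tell a dead local daemon apart from a virtual machine that was
never started, which is exactly how the second, disconnected PVM comes about.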

     Any suggestions you may have will be greatly appreciated.  I would really,
really like to avoid learning grunge socket code.  PVM is quite nice, I just
need it to be a little smarter (or *I* have to be a little smarter!).

     Many thanks in advance.

Dave White
dwhite@mrj.com

