Newsgroups: comp.parallel.pvm
From: dodd@csl.sri.com (Chris Dodd)
Subject: PVM and fault tolerance/error detection
Organization: Computer Science Lab, SRI International
Date: 11 Jul 1996 17:28:41 -0700
Message-ID: <4s467p$oed@tulip.csl.sri.com>


I've been trying to sort out some problems I've been having with PVM and
error detection.  It seems that pvm_send/pvm_recv will happily let you
send packets to, or receive from, a task that doesn't exist (or no longer
exists).  If you do a pvm_recv from a task that's died, you just block
forever.  To get
around this I've come up with the following sequence of steps to do
a `safe' recv from a specific task:

    do a pvm_notify so I'll get a message if the remote task exits
    do a pvm_tasks to find out if the task has already exited
    do a pvm_recvf to register a special message acceptor so I can
      receive EITHER a message from the task OR a task exit notification
    do a pvm_recv to actually receive a message.
    Test the message to see if it's a message from the task or a task
      exit notification (possibly for some other random task)
    do a pvm_recvf to reset the message acceptor

This seems like an awful lot of work.  Anyone have a better solution?
There are also a few potential problems:

    It appears to be the case that task exit messages in response to
      pvm_notify come from task 0x80000000, but it's not documented
      anywhere.
    It appears to be the case that pvm_tasks, when asked about a specific
      task, will always return a count of 0 (for a non-existent task)
      or 1, but it's not documented anywhere.
    A global variable called `pvm_errno' appears to exist, and setting
      it is required to get pvm_perror to print messages about the
      error you want, but, again, the documentation is silent.

Anyway, here's the code I've currently come up with to do a safe recv.
Any comments on things that could go wrong with it?

#include <pvm3.h>
/* msgtag for pvm_notify to use for task exit messages */
#define DEATH_MSG_TAG	0xDEAD
#define DEATH_MSG_TID	0x80000000

/* not documented, but seems to work */
extern int pvm_errno;

/* message acceptor for pvm_recvf: accept any task exit notification,
** otherwise apply the usual tid/tag matching (-1 is a wildcard) */
static int rf(int bufid, int tid, int tag)
{
int	rv, nbytes, buftid, buftag;

    if ((rv = pvm_bufinfo(bufid, &nbytes, &buftag, &buftid)) < 0)
	return rv;
    if (buftid == DEATH_MSG_TID && buftag == DEATH_MSG_TAG)
	return 1;
    if ((tid == -1 || tid == buftid) && (tag == -1 || tag == buftag))
	return 1;
    return 0;
}

/* call pvm_recv, except return PvmNoTask if the tid doesn't exist, or
** dies before it sends us a message */
int pvm_safe_recv(int tid, int tag)
{
int			bufid, rv, cnt, nbytes, buftid, buftag;
int			(*old_rf)(int, int, int);
struct pvmtaskinfo	*info;

    /* a wildcard receive has no specific task to watch */
    if (tid == -1)
	return pvm_recv(tid, tag);

    if ((rv = pvm_notify(PvmTaskExit, DEATH_MSG_TAG, 1, &tid)) < 0)
	return rv;
    if ((rv = pvm_tasks(tid, &cnt, &info)) < 0)
	return rv;
    if (cnt != 1)
	return (pvm_errno = PvmNoTask);
    old_rf = pvm_recvf(rf);

    for (;;) {
	if ((rv = bufid = pvm_recv(tid, tag)) < 0)
	    break;
	if ((rv = pvm_bufinfo(bufid, &nbytes, &buftag, &buftid)) < 0)
	    break;
	rv = bufid;
	if (buftid != DEATH_MSG_TID || buftag != DEATH_MSG_TAG)
	    break;		/* an ordinary message from tid */
	/* a death message: unpack the tid of the task that exited */
	if ((rv = pvm_upkint(&buftid, 1, 1)) < 0)
	    break;
	if (buftid == tid) {
	    rv = pvm_errno = PvmNoTask;
	    break; }
	/* some other task died; drop the notification and retry */
    }

    /* always restore the previous acceptor, even on the error paths */
    pvm_recvf(old_rf);
    return rv;
}
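
For anyone who wants to check the matching rule without a PVM
installation, here is a minimal sketch of the same accept/reject decision
that rf() makes, written as plain C with no PVM calls.  The names
accepts, check_examples, ANY, EXIT_SOURCE and EXIT_TAG, and the tids
0x40001/0x40002, are made up for illustration.  EXIT_SOURCE assumes a
32-bit int and relies on the undocumented observation above that exit
notifications come from task 0x80000000.

```c
#include <assert.h>
#include <limits.h>

#define ANY          (-1)      /* wildcard for tid or tag, as in pvm_recv */
#define EXIT_TAG     0xDEAD    /* tag handed to pvm_notify in the post */
#define EXIT_SOURCE  INT_MIN   /* 0x80000000 as a 32-bit int (assumption) */

/* Return 1 if a message from (src, msgtag) should be accepted by a
** receive that asked for (want_src, want_tag), else 0. */
static int accepts(int src, int msgtag, int want_src, int want_tag)
{
    /* exit notifications are always accepted, whatever was asked for */
    if (src == EXIT_SOURCE && msgtag == EXIT_TAG)
        return 1;
    /* otherwise the usual matching, with -1 as a wildcard */
    return (want_src == ANY || want_src == src) &&
           (want_tag == ANY || want_tag == msgtag);
}

/* A few cases, with hypothetical tids, that the rule should satisfy. */
static void check_examples(void)
{
    assert(accepts(0x40001, 7, 0x40001, 7) == 1);    /* exact match */
    assert(accepts(0x40002, 7, 0x40001, 7) == 0);    /* wrong sender */
    assert(accepts(0x40001, 8, 0x40001, 7) == 0);    /* wrong tag */
    assert(accepts(0x40001, 8, 0x40001, ANY) == 1);  /* tag wildcard */
    assert(accepts(EXIT_SOURCE, EXIT_TAG, 0x40001, 7) == 1);
    assert(accepts(EXIT_SOURCE, 7, 0x40001, 7) == 0); /* not an exit msg */
}
```

Note that the rule accepts an exit notification for any task, not just
the one being waited on, which is why pvm_safe_recv has to unpack the
tid and retry when it belongs to some other task.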

Chris Dodd
dodd@csl.sri.com

