Newsgroups: comp.parallel.mpi
From: gdburns@osc.edu (Greg Burns)
Subject: crash recovery (was: Re: (no subject))
Organization: Ohio Supercomputer Center
Date: 22 May 1996 15:16:59 -0400
Message-ID: <4nvp7b$b70@tbag.osc.edu>

In article <4nvmqc$a3o@portal.gmu.edu> Jaroslaw Tuszynski <jarek> writes:
>Is there any easy way to implement a timeout on the receive command, something
>like PVM's pvm_trecv? Sometimes one of my nodes crashes, for reasons unrelated
>to MPI, like not enough swap space when one of my coworkers runs something
>heavy, and all the other machines wait for the message from that machine. In
>the PVM version of my program I get a timeout and can close all the files and
>exit, instead of waiting for a whole weekend.
>Any help would be greatly appreciated.

In the particular case of not enough swap space, I would expect this
error to happen and be detected during application start-up.  The
other processes shouldn't make it past MPI_Init().

In the general case of a node crashing, LAM 6.0 will raise an error
on all surviving processes that may be blocked on a receive call from
processes on the crashed node.  If you have error handling set to
"return" (MPI_ERRORS_RETURN) on the relevant communicator, the receive
call will return an error code and you can then proceed with the
recovery strategy you describe.  To get this measure of fault
tolerance, you have to specify the -x option to lamboot (or lamgrow).

-=-
Greg Burns				gdburns@osc.edu
Ohio Supercomputer Center		http://www.osc.edu/lam.html

