Newsgroups: comp.parallel.pvm
Path: ukc!uknet!lyra.csx.cam.ac.uk!warwick!pipex!howland.reston.ans.net!EU.net!sun4nl!cs.vu.nl!newshost.cca.vu.nl!iodine.chem.vu.nl!wiesen
From: wiesen@iodine.chem.vu.nl (Gijsb. Wiesenekker)
Subject: pvm and host time outs
Message-ID: <1994Mar29.070332.19178@cca.vu.nl>
Sender: news@cca.vu.nl
Reply-To: wiesenecker@sara.nl
Organization: VU Amsterdam - dienst CCA
X-Newsreader: TIN [version 1.2 PL2]
Date: Tue, 29 Mar 1994 07:03:32 GMT
Lines: 45

We are running pvm3 on an IBM SP1 with 8 nodes.
Because the load differs from node to node,
some tasks run faster than others. This is
the situation after about two hours of wall-clock
time (columns after the task name: CPU time,
elapsed time, %CPU):

BANDPAR     00:24:35    01:55:11  21.3
BANDPAR     00:24:37    01:55:11  21.4
BANDPAR     00:24:41    01:55:12  21.4
BANDPAR     00:24:39    01:55:13  21.4
BANDPAR     00:19:20    01:55:15  16.8
BANDPAR     00:19:27    01:55:16  16.9
BANDPAR     00:19:06    01:55:17  16.6
BANDPAR     00:18:55    01:55:17  16.4
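
As a sanity check on the columns, the %CPU figure appears to be simply
CPU time divided by elapsed time. A minimal sketch of that arithmetic
(the helper names to_seconds and cpu_percent are made up for this
illustration, they are not part of any PVM tool):

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def cpu_percent(cpu, elapsed):
    """CPU time as a percentage of elapsed time, rounded to one decimal."""
    return round(100.0 * to_seconds(cpu) / to_seconds(elapsed), 1)


# First and last rows of the table above (CPU time, elapsed time).
rows = [("00:24:35", "01:55:11"), ("00:18:55", "01:55:17")]
for cpu, elapsed in rows:
    print(cpu, elapsed, cpu_percent(cpu, elapsed))
```

For these two rows the sketch gives 21.3 and 16.4, matching the third
column.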

Two minutes later I lost two tasks:

BANDPAR     00:25:02    01:57:23  21.3
BANDPAR     00:25:02    01:57:23  21.3
BANDPAR     00:25:05    01:57:24  21.4
BANDPAR     00:25:13    01:57:25  21.5
BANDPAR     00:19:34    01:57:27  16.7
BANDPAR     00:18:55    01:57:28  16.1

Inspection of the pvm log reveals that I am having
trouble with timeouts:

[t801c0000] ready   Tue Mar 29 07:10:43 1994
[t801c0000] netoutput() timed out sending to shivan1.sara.nl after 23, 182.857260
[t801c0000]  hd_dump() ref 1 t40000 n "shivan1.sara.nl" ar "RS6K" lo ""
[t801c0000]            sa 192.87.102.1:1391 mtu 4096 f 0x0 e 0 txq 0
[t801c0000]            tx 2434 rx 2895 rtt 0.006378
[t801c0000] hostfailentry() lost master host, we're screwwwed
[t801c0000] pvmbailout(0)
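
To compare these entries across the node logs, the timeout line can be
pulled apart with a small script. This is only a sketch: I am guessing
that the "23" is a retry count and the "182.857260" is the number of
seconds spent retrying, which I have not verified against the PVM
source. The names TIMEOUT_RE and parse_timeout are my own. The sample
is the timeout line from the log above, written here on one line.

```python
import re

# Pattern for the netoutput() timeout line as it appears in the log.
# Field meanings (retry count, elapsed seconds) are an assumption.
TIMEOUT_RE = re.compile(
    r"\[(?P<tid>t[0-9a-f]+)\] netoutput\(\) timed out sending to "
    r"(?P<host>\S+) after (?P<count>\d+), (?P<secs>\d+\.\d+)"
)


def parse_timeout(line):
    """Return the fields of a timeout line as a dict, or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "tid": m.group("tid"),
        "host": m.group("host"),
        "count": int(m.group("count")),      # assumed: number of retries
        "seconds": float(m.group("secs")),   # assumed: seconds before giving up
    }


sample = ("[t801c0000] netoutput() timed out sending to "
          "shivan1.sara.nl after 23, 182.857260")
print(parse_timeout(sample))
```

If that reading is right, the daemon gave up on shivan1.sara.nl after
roughly three minutes.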

Any ideas where this timeout comes from? Is this an internal PVM time
limit? If so, how can I increase it, or recover once it has been hit?

Regards,
Gijsbert Wiesenekker
Dept. of Theoretical Chemistry
Vrije Universiteit
Amsterdam

