Newsgroups: comp.parallel.pvm
From: viry@ciril.fr (Laurence Viry-Daval)
Subject: error in use of pvm
Organization: CIRIL, NANCY, FRANCE
Date: 9 Jan 1995 13:46:53 GMT
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Message-ID: <3ereod$4a0@arcturus.ciril.fr>

Hi!

Please help us understand a problem we are having with PVM.

We have an IBM SP1 9076.
We use PVM (Oak Ridge) on this machine, and a number of users have noticed the
same problem.
The message received in /tmp/pvml.xxxx is the following:

rigel1 23: more /tmp/pvml.4008
[t80080000] ready   Fri Jan  6 10:19:31 1995
[t80080000] netinput() FIN|ACK from rigel6.ciril.fr
[t80080000]  hd_dump() ref 1 t1c0000 n "rigel6.ciril.fr" ar "RS6K" lo ""
[t80080000]            sa 192.70.84.39:3962 mtu 4096 f 0x0 e 0 txq 0
[t80080000]            tx 56 rx 57 rtt 0.003680
[t80080000] netoutput() timed out sending to rigel8 after 24, 189.038572
[t80080000]  hd_dump() ref 1 t40000 n "rigel8" ar "RS6K" lo ""
[t80080000]            sa 192.70.84.41:2160 mtu 4096 f 0x0 e 0 txq 5
[t80080000]            tx 476 rx 4978 rtt 0.006154
[t80080000] hostfailentry() lost master host, we're screwwwed
[t80080000] pvmbailout(0)

or 

at 17:44:
rigel7:root> more /tmp/pvml.4008
[t80040000] ready   Wed Jan  4 17:02:00 1995
[t80040000] [tc0001]  task           3  read
[t80040000] [t200001]  task           8  read
[t80040000] [t180001]  task           6  read
[t80040000] [t1c0001]  task           7  read
[t80040000] [t140001]  task           5  read
[t80040000] [t80001]  task           2  read
[t80040000] [t100001]  task           4  read
[t80040000] netoutput() timed out sending to rigelsw5.ciril.fr after 22, 189.473250
[t80040000]  hd_dump() ref 1 t180000 n "rigelsw5.ciril.fr" ar "RS6K" lo ""
[t80040000]            sa 192.70.84.103:3683 mtu 4096 f 0x0 e 0 txq 0
[t80040000]            tx 11583 rx 1118 rtt 0.003855

- The same program can run correctly at another time. The problem depends
on the context (system, network).

- I know that if a host fails, PVM automatically detects the failure and deletes
the host from the virtual machine. That does not seem to be what happens in our
case.

- The slave cannot find the master, but the master is still running.

- What is the meaning of "netoutput() timed out sending to rigelsw5.ciril.fr
       after 22, 189.473250"?

Can somebody help us?
If so, please send your reply to the following address: viry@ciril.fr

Viry-Daval Laurence
   C I R I L
Avenue du Doyen Roubault
54500 Vandoeuvre les Nancy
Tel :83 44 44 44
Fax :83 44 02 62   
e-mail:viry@arcturus.ciril.fr   


