Newsgroups: comp.parallel.pvm
From: e8710@etlrips.etl.go.jp (Asai Yoshihiro)
Subject: Re: Error with PVM 3.3.1
Organization: Electrotechnical Laboratory, Tsukuba Science City
Date: Thu, 7 Jul 1994 08:02:02 GMT
Mime-Version: 1.0 (generated by vin2.0)
Message-ID: <k6w+T.e8710@etlrips.etl.go.jp>

On 07/07/94(01:28) pkohli@cc.gatech.edu (Prince Kohli) wrote
in <2vem3e$lhg@excepcion.cc.gatech.edu> (comp.parallel.pvm:2071/etlss2):
 |I had posted this before as a problem that I had with 3.2.2. However,
 |the same error recurs with pvm 3.3.1. Would any one know why this
 |happens? Any hint at all would be appreciated.
 |
 |Also, the following problem now occurs much later as compared to
 |pvm 3.2.2, but it does occur.
 |
 |----
 |
 |I have an application that runs on top of pvm 3.2.2 (3.3.1 now). The problem
 |is that at random times, i.e., sometimes very soon after the program
 |starts and sometimes much later, the host console will give this error:
 |
 |netoutput() timed out sending to <machine_name> after 23, 194.24140
 |hd_dump() ref 1 t100000 n <machine_name> ar "SUN4" lo ""
 |sa 130.207.114.58:3211 mtu 4096 f 0x0 e 0 txq 2
 |tx 65537 rx 0 rtt 0.003648
 |
 |The 130.207.114.58 is the address of <machine_name>.
 |
 |After this, though the pvm daemon is still running there, the master host
 |thinks it is dead and removes it from the config, and all later packets
 |from it are marked bogus packets. And all this of course screws up my
 |application.
 |

I have a similar problem, which produces the following messages in pvml.<uid>:

     netoutput() timed out sending to ribm3i after 16, 183.005826
     hd_dump() ref 1 t100000 n "ribm3i" ar "RS6K" lo ""
               sa 150.29.246.3:2353 mtu 4096 f 0x0 e 0 txq 0
               tx 7 rx 7 rtt 0.238911
     netinput() bogus pkt from 150.29.246.3:2353
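
To compare dumps like this one across runs, the fields can be pulled out of
the text with a short script. The following is only a minimal Python sketch
and is not part of PVM. The helper name parse_pvm_dump and the dictionary
keys are made up for illustration. Reading the two numbers after "after" as
a count and a time in seconds is an assumption, not something the log itself
states.

```python
import re

# Log excerpt copied from the pvml.<uid> output quoted above.
SAMPLE = '''\
netoutput() timed out sending to ribm3i after 16, 183.005826
hd_dump() ref 1 t100000 n "ribm3i" ar "RS6K" lo ""
          sa 150.29.246.3:2353 mtu 4096 f 0x0 e 0 txq 0
          tx 7 rx 7 rtt 0.238911
netinput() bogus pkt from 150.29.246.3:2353
'''


def parse_pvm_dump(text):
    """Hypothetical helper: extract a few fields from a netoutput()/hd_dump() excerpt."""
    info = {}

    # "after 16, 183.005826": assumed to be a count followed by a time in seconds.
    m = re.search(r"timed out sending to (\S+) after (\d+), ([\d.]+)", text)
    if m:
        info["host"] = m.group(1)
        info["after_count"] = int(m.group(2))
        info["after_time"] = float(m.group(3))

    # "sa 150.29.246.3:2353": address and port of the remote daemon.
    m = re.search(r"\bsa (\d+\.\d+\.\d+\.\d+):(\d+)", text)
    if m:
        info["addr"] = m.group(1)
        info["port"] = int(m.group(2))

    # Integer fields, keyed by the labels that appear in the dump.
    for key in ("mtu", "txq", "tx", "rx"):
        m = re.search(r"\b" + key + r" (\d+)", text)
        if m:
            info[key] = int(m.group(1))

    m = re.search(r"\brtt ([\d.]+)", text)
    if m:
        info["rtt"] = float(m.group(1))

    # Every "bogus pkt" line that follows the timeout.
    info["bogus_from"] = re.findall(r"bogus pkt from (\S+)", text)
    return info


if __name__ == "__main__":
    for name, value in sorted(parse_pvm_dump(SAMPLE).items()):
        print(name, value)
```

Running the same helper over the dump from the original post would make the
difference between the two cases easy to see (tx 65537 rx 0 txq 2 there,
tx 7 rx 7 txq 0 here).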

Another run with the same code and the same workstation cluster
configuration (13 RS/6000s), but with different (smaller) computational
parameters, such as the size of the dimensions, always succeeds.
The workstation cluster is homogeneous, but the effective "speed" of some
of the workstations is 2 to 3 times lower than that of the others, because
the "slower" workstations are busy with other users' scalar calculations.

I would also appreciate any information related to this problem.

Many thanks in advance.

Yoshihiro Asai
Fundamental Physics Section
Electrotechnical Laboratory
Umezono 1-1-4, Tsukuba, Ibaraki 305
Japan

Email: e8710@etlrips.etl.go.jp




