Newsgroups: comp.parallel.pvm
From: zrcc0100@nfhsg2.rus.uni-stuttgart.de (Manuela Sang)
Subject: Re: program hangs when running on many host.
Organization: Comp.Center (RUS), U of Stuttgart, FRG
Date: 27 Oct 1994 10:25:34 GMT
Message-ID: <38nv6u$hoa@info2.rus.uni-stuttgart.de>

Hi Fazilah,

In article <1994Oct26.131729.22390@leeds.ac.uk>, fazilah@scs.leeds.ac.uk (F Haron) writes:
 
> I'm running a master/slave program that does an addition of real numbers.
> The program works inconsistently, i.e.
> 
> 	i) homogeneous platform - I have no problem running on a network of
> 	 2 to 16 SGI's w'stations (haven't tested more than 16)
> 
> 	ii) shared memory multiprocessors - I *sometimes* have a problem 
> 	running it on the SGIMP (with 8 processors).  It hangs even with 4 
> 	slaves.  There are times when it gets through smoothly with 8 or even 
> 	9 slaves.
> 
> 	iii) heterogeneous platform (SGI+SGIMP) - It works only for up to 4
> 	hosts (1 SGIMP & 4 SGIs) and hangs with more hosts.
> 
> Can someone tell me what causes this problem and how to overcome it?

If it is a problem caused by the SGIMP system, I cannot help you.

But it might also be a bug in your program, most likely a kind of race
condition (in that case you would, of course, see quite different behaviour
on different machines AND with different numbers of processors, depending on
network behaviour).

I mean, if a specific message unexpectedly arrives before another, the program
can hang. On one machine (SGI) this condition might never occur because of the
particular network characteristics. On another machine (SGIMP) it appears only
sometimes, but the more processors you use, the more often it happens. In a
heterogeneous system the network behaviour changes even more, so the problem
might get worse still.

I suggest you check whether this is your problem. You can do so by taking a
closer look at the last messages before the hangup (are they really the
messages you expected to receive at that point?).

I hope this helps.

Bye. Manuela.

