Newsgroups: comp.parallel.pvm
From: mmestern@cs.uct.ac.za (Mark Mestern)
Subject: Help request with vm
Organization: University of Cape Town
Date: 8 May 1996 13:47:41 GMT
Message-ID: <4mq8lt$bet@groa.uct.ac.za>

Hi, 


I am attempting to do timing tests across pairs of machines in our
network. The program consists of a master which sends a number of
floats to a receiver which then sends the vector back again. This will
give me roundtrip times for the pair of machines. 
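
For reference, the measurement amounts to something like the sketch
below. It is not the actual master code: the PVM send/receive pair is
replaced by a plain callback so that the loop can be read (and run) on
its own, and the names Echo and mean_roundtrip_seconds are made up for
illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for one round trip: takes the vector the master
// sends and returns whatever the receiver sends back. In the real program
// this would be a pvm_send followed by a pvm_recv and the unpack calls.
using Echo = std::function<std::vector<double>(const std::vector<double>&)>;

// Time `reps` round trips of an n-element vector.
// Returns the mean time per round trip in seconds, or -1.0 if a reply
// comes back with the wrong number of elements.
double mean_roundtrip_seconds(const Echo& echo, std::size_t n, int reps) {
    assert(reps > 0);
    std::vector<double> data(n, 1.0);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        std::vector<double> back = echo(data);
        if (back.size() != data.size()) {
            return -1.0;  // the whole message was not returned
        }
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    return elapsed.count() / reps;
}
```

Calling it with a callback that simply returns its argument measures
only the loop overhead. The times I am after come from plugging in the
real send and receive between a pair of machines.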

The problem I have is that on certain combinations of machines the
program runs fine for a little while and then hangs.

With debugging information on, this is an example of what the master
prints when the program hangs.

---------------------------------------------------------------------------
libpvm [t200009]: mxinput() pcb t80023 fr_len=0 fr_dat=+0 n=16
libpvm [t200009]: mxinput() read=16
libpvm [t200009]: mxinput() pcb t80023 fr_len=16 fr_dat=+0 n=2788
libpvm [t200009]: mxinput() read=2788
libpvm [t200009]: mxinput() pkt src t80023 len 2788 ff 2
master: after recv bufid = 7
end of loop
begin of loop
master: before send
libpvm [t200009]: mxinput() pcb t80023 fr_len=0 fr_dat=+0 n=16 
---------------------------------------------------------------------------

And this is the section of its printout of a pvm_send when it doesn't
hang.

---------------------------------------------------------------------------
libpvm [t200009]: mxinput() read=16
libpvm [t200009]: mxinput() pcb t80023 fr_len=16 fr_dat=+0 n=2788
libpvm [t200009]: mxinput() read=2788
libpvm [t200009]: mxinput() pkt src t80023 len 2788 ff 2
master: after recv bufid = 6
end of loop
begin of loop
master: before send
libpvm [t200009]: mxfer() dst t80023 n=4096
libpvm [t200009]: mxfer() wrote 4096
libpvm [t200009]: mxfer() dst t80023 n=4080
libpvm [t200009]: mxfer() wrote 4080
---------------------------------------------------------------------------

The difference appears to be that in the hanging case mxinput() is
called when the send occurs, whereas in the working case mxfer() is
called. The program only hangs when mxinput() is called at that point.


This is what the segment of the slave (which bounces the message back)
looks like.


	cerr << "slave: before rec from  master " << endl;
	bufid = pvm_recv(-1, -1);
	pvm_bufinfo(bufid, (int*)0, (int*)0, &dtid);
	pvm_upkint(&size, 1, 1); 
	pvm_upkdouble(data, size, 1);

	pvm_initsend(ENCODING); 
	pvm_pkint(&size, 1, 1); /* problem lines */
	pvm_pkdouble(data, size, 1); /* problem lines */
	pvm_freebuf(pvm_getrbuf()); 
	pvm_send(dtid, 2);


The problem is even more mysterious for two reasons:

1 - Despite the problem apparently being with the sending phase of the
master, removing the lines in the slave marked as "problem lines" will
prevent the program from hanging. Of course then the whole message is
not returned and the timing info is not what I want.

2 - On other combinations of machines this does not occur at all.

The machines are all Sun SparcClassics running SunOS 5.4, apart from
one Sparc 10, and the problem occurs on that machine as well. I can't
see any significant difference between the combinations of machines
that work and those that do not.

I've been battling with this strange behaviour for some time now.
Unless I can fix or prevent this sort of thing from occurring, I
doubt I can use PVM 3.3.10 for my project.

I'd really appreciate any advice or insights you might have into this
problem.

Thank you
Mark Mestern

P.S.

The only other info I can give that might be relevant is the tail of
the master's output as reported by truss when it hangs.
---------------------------------------------------------------------------
xinput() pcb t80016 fr_len=0 fr_dat=+0 n=16
write(2, " x i n p u t ( )   p c b".., 44)      = 44
read(7, 0x0080B710, 16)         (sleeping...)
 
---------------------------------------------------------------------------


