Newsgroups: comp.parallel.pvm
From: robinson@hobbes.par.univie.ac.at (Guy Robinson)
Subject: Re: PVM out of memory
Organization: University of Vienna
Date: 14 Aug 1995 17:16:18 GMT
Message-ID: <40o0d2$aq0@osiris.wu-wien.ac.at>

In article <DDB2y8.8o3@hpcvsnz.cv.hp.com>, jking@cv.HP.COM (Jonathan King) writes:
|> I'm currently debugging a large parallel application ported to PVM, and am running across 4 nodes.  The program is very long running (esp. on workstations), and eventually (could be a couple of days into a run) PVM dies with the error:
|> 
|> libpvm [t4000d]: fr_new() can't get memory
|> Segmentation Fault - core dumped
|> 

It might be useful to see whether the application's memory demands are creeping up
with time. I suggest there could be an unread message somewhere in the code. Each
iteration this little bit of memory gets left behind. Not having been read, it's
still buffered.

Some things to check for this:

Does each run fail at the same iteration? What happens when you run on fewer
processors? This will depend on your program and the messages it exchanges, but
you should be able to think it through.

If you run two processes per node, does it fail earlier in the iteration sequence?
If it does, this might suggest each node has a leftover message. If it's only
slightly earlier then it could be only one node.
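
A back-of-envelope version of that argument, as a sketch only. It assumes
(my assumption, nothing measured) that each leaking process leaves a fixed
number of bytes behind per iteration and that the processes on a node draw
on the same pool of memory. The function name and its parameters are made
up for illustration.

```c
/*
 * Rough model, not a measurement: every leaking process on a node leaves
 * `leak_bytes` buffered per iteration, and the node dies once `mem_bytes`
 * of buffer space is used up. Returns the iteration count at that point,
 * or 0 if nothing leaks (the model then predicts no failure).
 */
unsigned long long iterations_until_exhausted(unsigned long long mem_bytes,
                                              unsigned long long leak_bytes,
                                              unsigned int leaking_procs)
{
    unsigned long long per_iter = leak_bytes * leaking_procs;

    if (per_iter == 0)
        return 0;   /* no leak, so no predicted failure point */
    return mem_bytes / per_iter;
}
```

Under this model, going from one to two leaking processes per node halves
the iteration at which the run dies. If only one process is leaking, adding
a second non-leaking process should move the failure point much less.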

A colleague of mine once implemented a global reduction that had this problem.
Each processor sent its result to the host, which performed the test and returned
the result. This was then converted to run without a host: the local value was
used to start the reduction and the loop reduced. However, the original send
remained. After about 24 million global tests there were enough outstanding
messages for the code to die.
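
A toy model of that bug, in case it helps to see the shape of it. This is
not PVM code and not my colleague's code: the mailbox type and all the
function names are invented stand-ins, and a "message" is just a counter.
The point is only that one unmatched send per iteration grows without bound.

```c
/* Toy stand-in for a task's receive queue; NOT the PVM API. */
typedef struct {
    unsigned long pending;   /* messages sent here but not yet received */
} mailbox;

static void mailbox_send(mailbox *mb)
{
    mb->pending++;
}

/* Returns 1 if a message was consumed, 0 if the queue was empty. */
static int mailbox_recv(mailbox *mb)
{
    if (mb->pending == 0)
        return 0;
    mb->pending--;
    return 1;
}

/*
 * Run `iters` global tests. Each test does one matched send/receive pair
 * (standing in for the real reduction traffic). If `stale_send` is
 * nonzero, the old "send result to host" call is also made, and nothing
 * ever receives it. Returns the number of messages still buffered.
 */
unsigned long run_reduction(unsigned long iters, int stale_send)
{
    mailbox peer = {0};      /* partner in the host-less reduction */
    mailbox old_host = {0};  /* destination of the leftover send */
    unsigned long i;

    for (i = 0; i < iters; i++) {
        mailbox_send(&peer);          /* reduction message ... */
        (void)mailbox_recv(&peer);    /* ... and its matching receive */
        if (stale_send)
            mailbox_send(&old_host);  /* leftover send: never received */
    }
    return peer.pending + old_host.pending;
}
```

With the stale send left in, run_reduction(n, 1) leaves n messages queued;
with it removed, run_reduction(n, 0) leaves none. In the actual program, a
non-blocking pvm_probe(-1, -1) at the end of an iteration is one way to see
whether anything is still sitting in the queue.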


Should add this to comp.parallel.folklore along with superlinear speedup.

Hope this helps. 


|> Now, each node has AT LEAST 128 Meg of ram on board, + 1 gig of swap, and the program, in any one iteration should not be coming anywhere CLOSE to that number.  But, since it is a very long running process, if there is some memory that PVM doesn't free from old messages or something like that, the used memory could be creeping up.  Has anyone else had any problems of this sort?  The program works fine up to the point, results look good, but this is a frustrating debug.  It has died both on HPs and SUNs, with both 3.3.7 and 3.3.8.
|> 
|> Any clues?
|> 
|> 'preciate it.
|> Jon King
|> jking@cv.hp.com

