Newsgroups: comp.parallel.pvm
From: knut@vltava.uni-paderborn.de (Knut Menzel)
Reply-To: knut@uni-paderborn.de
Subject: PVM Bug Report for Shared Memory Architectures
Keywords: PVM, SGIMP, SUNMP, Shared Memory
Organization: University of Paderborn
Date: 5 Sep 1994 08:43:51 GMT
Message-ID: <34elo7$l21@news.uni-paderborn.de>

Dear colleagues,

We believe there is a bug in the shared-memory version of
PVM Releases 3.3.2 and 3.3.3. The details are given in the
Bug Report Form below. We have also sent the Bug Report Form
to the Oak Ridge National Laboratory at pvm@msr.epm.ornl.gov.

We would appreciate any help with this problem. Please also
send comments to the email address below.

With best wishes,
-- Knut Menzel

--------------------------------------------------------------------------
 Knut Menzel                     University of Paderborn     
 email:   knut@uni-paderborn.de  Department of Computer Science FB17     
 office:  E3.128                 Paderborn Center for Parallel Computing   
 Tel:     +49 5251 603325        Warburger Str. 100     
 Telefax: +49 5251 603436        D-33098 Paderborn     
--------------------------------------------------------------------------

==== Cut here ============================================================

PVM version 3 Bug Report Form
08 June 1994

* The exact version, patch level you're using (e.g. 3.3.0).

  PVM Release 3.3.2 and 3.3.3 for shared memory architectures

* The machine type(s) you're using, hardware and software (e.g. DEC
  3000/500 running OSF 1.2).

  1. Sun Sparc Server 1000, 8 processors, 256 MB main memory.
  2. SGI/Onyx, Deskside, Reality Engine 2, 4 Processors R4400, 256 MB main memory
   
* The machine architecture PVM chooses for you (e.g. ALPHA).

  1. SUNMP
  2. SGIMP

* A short description of the problem (what happens, when, etc.).
  Include error messages (don't edit them to summarize).

  Situation: There are 3 processes, one master process and two slave
	processes. The master sends the problem to the slaves, and the
	slaves solve that problem while communicating with one another.
	During this computational phase the slaves send status reports
	back to the master.

  Communication: The message rate is several thousand per second (> 4000).
	The size of one message is between 8 and 1000 bytes.

  What happens: During the computation the following error messages occurred
	in the pvmt.<userid> file:

	The first error is always of this kind:
	"ref = -1 on page <integer>"  ; pvmshmem.c, line: 188

	After this, further errors of the above kind occur with different
	<integer> values, along with error messages of this kind:
	"peer_send(): outgoing buffer full\n" ; lpvmshmem.c, line: 745

	Either the program runs until the pvmt.<userid> file exceeds the
	disk limit, which is several hundred megabytes, or the program
	crashes because of the errors described above.

  Debugging the software: Debugging the software with standard Unix debuggers
	showed that messages vanish, i.e. they are sent by the master but
	never received by the slaves. In addition, we suspect that messages
	already sent but not yet received are partially overwritten by
	other messages.

  Delaying of messages: When the software is run unchanged, the program
	often crashes after several seconds. With a delay system call
	inserted, the program runs for more than 30 minutes.
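
  The delay workaround above could be sketched roughly as follows. This
  is only an illustration, not the code from our program: the names
  send_status_report, delay_us, send_throttled and gap_for_rate are
  hypothetical, the send routine is a stub standing in for the real PVM
  calls, and the choice of nanosleep() and of the delay length is an
  assumption, since neither is specified above.

```c
#define _POSIX_C_SOURCE 199309L  /* expose nanosleep() under strict C modes */
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Number of messages handed to the (stubbed) transport so far. */
static long sent_count = 0;

/* HYPOTHETICAL stand-in for the real send path; in the actual program
 * this is where the PVM packing and send calls would go. */
static void send_status_report(int seq)
{
    (void)seq;
    sent_count++;
}

/* Sleep for usec microseconds, resuming if a signal interrupts the sleep. */
static void delay_us(long usec)
{
    struct timespec req, rem;

    assert(usec >= 0);
    req.tv_sec  = usec / 1000000L;
    req.tv_nsec = (usec % 1000000L) * 1000L;
    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
        req = rem;
}

/* Gap in microseconds between sends that caps the rate at msgs_per_sec
 * (assumed pacing rule; integer division rounds the gap down). */
static long gap_for_rate(long msgs_per_sec)
{
    assert(msgs_per_sec > 0);
    return 1000000L / msgs_per_sec;
}

/* Send n reports, pausing gap_usec after each one.
 * Returns the number of reports sent by this call. */
static long send_throttled(int n, long gap_usec)
{
    int i;

    for (i = 0; i < n; i++) {
        send_status_report(i);
        if (gap_usec > 0)
            delay_us(gap_usec);
    }
    return (long)i;
}
```

  In a slave's reporting loop, a call such as
  send_throttled(n, gap_for_rate(4000)) would leave a gap of 250
  microseconds after each message. Such pacing only lowers the message
  rate; it is a way to narrow down the problem, not a fix for it.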

* A record of what you did to make it happen.

  Compile the software package with GNU gcc or with Sun/SGI cc.
  Run it with 3 processes (1 master, 2 slaves).
  The behaviour is independent of the debugging settings of the PVM server.

* General comments

  The same errors also occur with a self-designed distributed raytracing
  program, on the Sun Server as well as on the SGI/Onyx, with more than
  4 processes. The situation, communication pattern and results are the
  same as described above.

  We don't think that this is an operating system problem, since the
  errors occur on both architectures, SUNMP and SGIMP.




