Newsgroups: comp.parallel.mpi
From: Jaideep Ray <jaray@nubis.rutgers.edu>
Subject: Re: SIGSEGV-Error using MPICH
Organization: Rutgers Univ.
Date: 20 Mar 1996 22:41:25 GMT
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <4iq1il$4o1@dziuxsolim.rutgers.edu>

Matthias Linke <mlinke> wrote:

>We run MPICH on a cluster of RS6000-WS. Starting my executable using mpirun -v
>shows the following output.
>
>running /h07/tfb013/pvm3/bin/RS6K/MPIALL on 4 rs6000 p4 processors
>Created /h07/tfb013/pvm3/bin/RS6K/PI22293
>
>p0_26180:  p4_error: Found a dead connection while looking for messages: 1
>bm_list_21829:  p4_error: interrupt SIGINT: 2
>rm_l_0_16555:  p4_error: interrupt SIGINT: 2
>p2_14250:  p4_error: interrupt SIGINT: 2
>rm_l_0_24352:  p4_error: interrupt SIGINT: 2
>p3_26911:  p4_error: interrupt SIGINT: 2
>p1_18516:  p4_error: interrupt SIGSEGV: 11
>rm_l_779447913_13653:  p4_error: interrupt SIGINT: 2
>

	Hi !

	There are 2 things to check.

	* Run one of the MPICH test programs like pi3.f or cpi.c to see
	  whether your cluster's OK.

	* If it is, the fault is in your code. See if you're exceeding array
	  bounds or accessing memory which you haven't allocated. The
	  SIGSEGV reported by p1_18516 is a segmentation violation, and it
	  would also explain lines like
		bm_list_21829:  p4_error: interrupt SIGINT: 2
	  Once one process has a seg. violation, the other processes in the
	  job are sent a signal to interrupt them (SIGINT). Signals are
	  defined in /usr/include/sys/signal.h (at least on the SGIs; might
	  be different on other systems).


