Newsgroups: comp.parallel.pvm
From: gross@noether.UCSC.EDU (Mike Gross)
Subject: Re: pvm slave startup failures
Organization: University of California, Santa Cruz
Date: 8 Apr 1995 19:33:15 GMT
Message-ID: <3m6odr$8jd@darkstar.UCSC.EDU>

In <3lv8ov$f7c@grover.jpl.nasa.gov> sam@kalessin.jpl.nasa.gov (Sam Sirlin) writes:

>I'm running (fortran) pvm on a sun 4.1.3 network using a master/slaves
>topology. The slaves are large though, maybe 20 meg. What I'm seeing
>is that if too many slaves try to startup on one machine, it seems to
>require too much memory and just hangs. There are various fixes I
>could do, such as restrict the number of spawnings, etc, but is there
>some way to at least make the failure more benign?

A few ideas:

(1)  Run smaller cases while you're debugging, e.g. cut your largest arrays
     in half.  You didn't say what your application is, so I'm presuming it
     scales in a predictable manner.  I debug my Poisson equation solver with
     a very small density/potential array, such as 10x10x10, which lets me
     run all of the slaves on the same host if I need to.

(2)  Run fewer slaves.  You already thought of this one.

(3)  Nice the *HECK* out of your slaves.  You can do this most quickly by
     hacking the debugger scripts and globally changing "dbx" to
     "/usr/bin/nice -10 dbx".  Use the full pathname, since some shells have
     nice as a builtin with a different syntax (csh's builtin wants "+10"
     to lower the priority).  I suspect your lockup is not really a memory
     problem, but a matter of your many jobs hogging all the resources (in
     particular, the swapper) at the expense of the shell.
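
For idea (3), the global change can be made with sed instead of by hand.
This is only a sketch: the function name nice_dbx is made up for illustration,
and I'm assuming the script to edit is the one PVM ships as
$PVM_ROOT/lib/debugger, so check where yours actually lives.  The function
reads a script on standard input and writes the edited copy to standard output.

```shell
#!/bin/sh
# nice_dbx (hypothetical helper): copy a debugger script from stdin to
# stdout, replacing every "dbx" with "/usr/bin/nice -10 dbx".
# The full pathname avoids any shell builtin called nice.
nice_dbx() {
    sed 's|dbx|/usr/bin/nice -10 dbx|g'
}

# Possible use (the path is an assumption, adjust to your installation):
#   nice_dbx < $PVM_ROOT/lib/debugger > debugger.niced
```

Run it on the original script only.  A second pass over already edited output
would insert the nice prefix twice, and the pattern also hits "dbx" inside
longer words, so look the result over before installing it.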

>This problem also
>seems to break the debugger, since there you have to run on one
>machine. Purify is able to run though.

It sure seems to break Sun's dbx (which has no process management, at least
under 4.1.3).  Use gdb.  Then you can run your slaves on different hosts, log
into each one (yuck), and attach the debugger to the already running process.
The syntax is "gdb executable pid", where the pid can be found in the output
of "ps augxww|grep $USER".  If you start your master program under the debugger,
you can hold it before it sends anything, so that all your slaves sit waiting
for input in pvmfrecv while you start the debuggers.  I do this regularly, with
five hosts in the pool.
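
To save some typing on each host, the ps/grep step can be wrapped up roughly
like this.  It is a sketch, not something PVM or gdb provides: attach_cmds is
a made-up name, and it assumes BSD-style "ps augxww" output with the owner in
the first column and the pid in the second.  It only prints the gdb command
lines; it does not start gdb for you.

```shell
#!/bin/sh
# attach_cmds EXECUTABLE (hypothetical helper): read "ps augxww"-style output
# on stdin and print one "gdb EXECUTABLE PID" line for every process owned by
# $USER whose ps line mentions EXECUTABLE.
attach_cmds() {
    exe=$1
    # Keep our own processes, keep those mentioning the executable,
    # and drop any grep processes that show up in the listing.
    grep "^$USER " | grep "$exe" | grep -v grep |
    while read owner pid rest; do
        echo "gdb $exe $pid"
    done
}

# Possible use on each host ("slave" stands for your slave program's name):
#   ps augxww | attach_cmds slave
```

The match is a plain substring match on the whole ps line, so pick an
executable name that is not also part of some other command you are running.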

An alternative:
If you are running X11, write a short shell script containing the following
command line, and spawn the script instead of your slaves (don't forget to
give yourself execute permission on the shell script).

xterm -n `hostname` -T `hostname` -display display -e dbx executable

where "display" is the X server you are sitting at and "executable" is your
slave program.  You will have to run "xhost +hostname" on that X server for each
host you are running a slave on (or you can use xauth, especially if your home
directory is shared between all of the hosts).  Do not use a bare "xhost +",
which turns off all X access control and can be a serious security hole.  The
-n and -T options are not required, but they set the icon name and the window
title of each xterm, so that you can tell them apart.
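
The wrapper script could be put together along these lines.  Treat it as a
sketch: xterm_dbx_cmd is a made-up name, and the display and executable are
whatever you pass in.  It only builds and prints the xterm command line, so
you can check it before letting a script run it for real.

```shell
#!/bin/sh
# xterm_dbx_cmd DISPLAY EXECUTABLE [HOST] (hypothetical helper): print the
# xterm command line that runs dbx on EXECUTABLE, with the icon name and
# window title both set to HOST.  HOST defaults to the output of hostname.
xterm_dbx_cmd() {
    display=$1
    exe=$2
    host=${3:-$(hostname)}
    echo "xterm -n $host -T $host -display $display -e dbx $exe"
}

# Possible use ("myws:0" and "slave" are placeholders):
#   xterm_dbx_cmd myws:0 slave
```

In the script you actually spawn, you would run the printed command rather
than echo it.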

Good luck.

Mike Gross
Physics Dept.
Univ of California
Santa Cruz, CA 95064
gross@physics.ucsc.edu

