Newsgroups: comp.parallel.mpi
From: gdburns@osc.edu (Greg Burns)
Subject: Re: Installing LAM on SGI IRIX 6.2
Organization: Ohio Supercomputer Center
Date: 28 Jul 1996 08:08:47 -0400
Message-ID: <4tfl8g$94r@tbag.osc.edu>


In response to Steve Simonds:

>I made the changes that Eric Salo suggested and then undid all that and applied
>the patches and rebuilt lamd.  All is good.

...not to imply that Eric's changes are wrong.  They are exactly right.
The patch might be preferable in case there are further patches to
those files.

>When I run my program it just hangs reading from stdin.  The job read a
>parameter file from stdin then starts running.  I am hung in the read so it
>never starts.   I have tried 
>  mpirun n0 salvo < salt.prm - no luck.
>  mpirun n0 -i <path>/salt.prm - no luck.
>
>Are there problems trying to read from stdin?  I seem to print OK to stdout.

I suggest a careful reading of the mpi(7), tstdio(3), and mpirun(1) manual
pages to understand the I/O possibilities under LAM.  The -i option,
for example, only affects LAM's remote POSIX I/O library.  It does not affect
direct UNIX input.

>Also, the window where I issue the command mpirun seems to be hung up.  I can
>type a <cr> and get a prompt but if I type a command it runs very slowly.  any
>lam command, like lamclean, mpimsg, etc, just hangs.  If I do a lamclean from
>another window the first window is now OK - as if the job is trying to read
>input from the command line.

LAM remote POSIX I/O does not permit terminal reads in 6.0.  If you are
reading the terminal (from which mpirun was executed) via UNIX fd 0,
you must use the -w option on mpirun, otherwise it is like reading the
terminal from a background job (ignoring SIGTSTP).
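
So if salvo really must read its parameters through UNIX fd 0, the
first invocation from the post would become (a sketch, using the file
and program names quoted above):

```shell
# -w lets the application read the terminal mpirun was started from
mpirun -w n0 salvo < salt.prm
```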

Do not assume LAM commands are hung when they fail to return within a
couple of seconds.  By default, most LAM/MPI commands poll the entire
multicomputer, and if one or more of those nodes is busy crunching,
the commands can be slow.

>If I run the job without mpirun it appears to work - it gets started, reads the
>input, prints some output then runs forever in a large mathematical loop.

In 6.0, the cwd for both direct local UNIX and remote POSIX I/O is the
directory where you typed lamboot, not the directory where you typed mpirun.
This is not too annoying if you are running a single job (cd before
lamboot, not after) but it is a shortcoming we intend to remove.

Again, see the aforementioned manual pages and the LAM/MPI document,
"MPI Primer / Developing with LAM".

>when I open a file from within an MPI thread, where does the system expect that
>file to reside?  i.e. what is the full default path to that file?

This is the cwd issue.

>I'll start a new thread.
>this all involves LAM 6.0 with 15 patches

>Now I get the following error message -
>
> MPI_Isend: internal MPI error: GER overflow (rank 0, comm 0) 
>
>What is GER?

See the mpi(7) and mpirun(1) manual pages or the LAM/MPI document.
GER is Guaranteed Envelope Resources.  The error check stops you from
running a job with so many processes that it could later fail
mysteriously for lack of message envelope (not data) buffer space.

>I tried increasing buffer size with bfctl but that has no effect.  In fact it
>doesn't change the value.

mpirun automatically adjusts the lamd buffer space according to your
job configuration.

>Then I run the job with mpirun and do a bfstate -l n0 and it's back to default.
>NODE      DEST      EVENT    TYPE     LENGTH
>n0 (o)    max space = 8388608, used space = 0

>It appears that mpirun resets the buffer space ?? Is that true?

mpirun sets the buffer space to max(GER requirement, default).

>If so, how can I increase the buffer space, assuming that is the cause of the
>MPI_Isend error.

A loop of Isends can potentially break any GER if they are not received
fast enough.  (You can also run out of MPI_Requests, a more obvious
system resource error.)  If you want to keep the GER protection, you
have to increase the GER.  In 6.0, this means changing the MPI_GER macro
in the Config/config file and rebuilding LAM.  If you want to fly
without GER protection, use -nger with mpirun.

-=-

Our biggest support problem is user confusion over I/O.  LAM 6.1
will likely have major new I/O capabilities that will eliminate this
confusion and take stdio services in a distributed host environment
to a new level.

Ohio Supercomputer Center also distributes a portable (written in MPI)
parallel I/O library for MPI (used with MPI objects: communicators
and datatypes).  It is called MPI Cubix and is available from our
web site.

-=-

You can also send questions about LAM, its related tools, or MPI Cubix I/O
to lam@tbag.osc.edu

-=-
Greg Burns				gdburns@osc.edu
Ohio Supercomputer Center		http://www.osc.edu/lam.html

