Newsgroups: comp.parallel.pvm
From: manchek@thud.CS.UTK.EDU (Bob Manchek)
Subject: Signals and Threads and PVM (long and somewhat rambling)
Organization: Computer Science Dept, University of Tennessee, Knoxville
Date: 16 Feb 1995 19:14:45 GMT
Message-ID: <3i0875INN3l7@CS.UTK.EDU>


Since there has been a lot of talk about signals...

I'm interested in making libpvm thread- and signal-safe, but there are
a number of decisions to be made and implementation issues to resolve,
not the least of which is keeping the resulting code portable.

Thread safety means that any thread in a multi-threaded program would
be able to call libpvm functions with predictable results, which
doesn't work right now (version 3.3.x).  Signal safety means that you
could catch an interrupt and call libpvm functions in the handler
function, which you shouldn't really do right now.

It's possible to work around most places where you'd want to use
interrupts, but the solution is often ugly.  [ Don't forget that for
simple purposes, such as multiplexing input, you can get by with
pvm_getfds() or pvm_trecv(). ]

The pvm_sendsig() function wasn't built for general use.  The problem
is that you can send a signal, but can't safely do much with it except
set a flag or call exit or something like that.  About the only thing
it's good for is sending SIGTERM to a task, which is what pvm_kill()
does.  There's also the problem of mapping signal numbers between
machines; pvm_sendsig() doesn't attempt to do it, we just assume they
match.  Fortunately, INT is 2 and KILL is 9 and TERM is 15 anywhere I
can think of.
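
To make the "set a flag" advice concrete, here is a minimal sketch in
plain POSIX C.  It does not touch libpvm at all, and the names
got_term, on_term, install_term_flag and term_requested are made up
for illustration.  The handler only records that SIGTERM arrived; the
main loop polls the flag and does the real work (including any PVM
calls) in normal context.

```c
#define _POSIX_C_SOURCE 200809L   /* for sigaction() under strict -std modes */
#include <signal.h>
#include <string.h>

/* Set by the handler, polled by normal code.  volatile sig_atomic_t is
 * the type the C standard lets a handler store into safely. */
static volatile sig_atomic_t got_term = 0;

static void on_term(int signo)
{
    (void)signo;
    got_term = 1;             /* no libpvm calls, no stdio, no malloc */
}

/* Install the handler.  Returns 0 on success, -1 on failure. */
int install_term_flag(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Poll this from the main loop, where it is safe to clean up and
 * leave the virtual machine before exiting. */
int term_requested(void)
{
    return got_term != 0;
}
```

A task would call install_term_flag() once at startup and test
term_requested() between units of work.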

There are two main issues that have to be dealt with to make PVM
multi-thread safe:  Global state and reentrancy.  Global state includes
simple things like the last error code and more complex things like the
PVM message buffer heap.  Reentrancy means being able to call a
function again while you're already in it.

Global state can be reduced by passing more parameters to certain
functions, e.g. doing away with the "current" send and receive
buffers.  Per-task options such as error handling aren't as frequently
modified as individual message buffers, and so might be left global.
This will result in one or two parameters being added to some
functions, such as pack(), unpack() and send().  I propose to create
a few new functions and keep the old ones, implemented in terms of the
new ones (no, they wouldn't rust and fall off immediately).  Or, one
could tie global information to a thread identifier, making it global
in the context of a thread, and preserving the current programming
model.  While this makes simple code look cleaner, I think the right
way to go eventually is to remove any global state.
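
A rough sketch of the "new functions plus old wrappers" idea follows.
Everything in it is hypothetical: struct msgbuf, mb_init, mb_pkint and
old_pkint are invented names, and the layout has nothing to do with
the real libpvm buffers.  (The real pvm_pkint() takes a pointer, a
count and a stride; the one-int signature here just keeps the example
short.)  The point is only that the new call names its buffer
explicitly, and the old call becomes a one-line wrapper around it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical explicit message buffer; not the real libpvm layout. */
struct msgbuf {
    unsigned char data[256];
    size_t len;
};

void mb_init(struct msgbuf *mb)
{
    mb->len = 0;
}

/* New-style call: the buffer is a parameter, so no global is touched.
 * Returns 0 on success, -1 if the value does not fit. */
int mb_pkint(struct msgbuf *mb, int value)
{
    if (mb->len + sizeof value > sizeof mb->data)
        return -1;
    memcpy(mb->data + mb->len, &value, sizeof value);
    mb->len += sizeof value;
    return 0;
}

/* Old-style call kept for compatibility: it packs into the implicit
 * "current" send buffer by forwarding to the new function. */
static struct msgbuf current_sbuf;

int old_pkint(int value)
{
    return mb_pkint(&current_sbuf, value);
}
```

Two threads that each own a struct msgbuf can pack at the same time
without stepping on each other; only callers of old_pkint() still
share state.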

Reentrancy is a tougher nut to crack.  First, consider just signals and
not threads.  The obvious solution is to block signals around each PVM
function (this includes pack and unpack).  If you really need it, you
can do this yourself [ by running the libpvm source through "sed" ] and
things will just work.  Of course, you won't be able to receive a
signal if you're blocked in send or receive, which might well be just
when you need one.  In addition, constantly blocking and unblocking
signals is very expensive in some versions of unix.
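
For anyone who wants to try the blocking approach, this is roughly
what the wrapper looks like with POSIX sigprocmask().  It is a sketch
under assumptions: library_body() is a stand-in for a non-reentrant
libpvm routine, SIGUSR1 is the only signal of interest, and the
program is single-threaded.  The body raises the signal on itself to
imitate one arriving in mid-call.

```c
#define _POSIX_C_SOURCE 200809L   /* for sigaction() and sigprocmask() */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr1_count = 0;

static void on_usr1(int signo)
{
    (void)signo;
    usr1_count++;
}

int install_usr1(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Stand-in for a non-reentrant library routine.  Returns how many
 * handler runs it has observed by the time it finishes. */
static int library_body(void)
{
    raise(SIGUSR1);               /* a signal "arrives" in mid-call */
    return (int)usr1_count;
}

/* Block SIGUSR1, run the body, then restore the previous mask.
 * Returns the handler count seen inside the protected region. */
int guarded_call(void)
{
    sigset_t block, saved;
    int seen;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &saved);
    seen = library_body();
    sigprocmask(SIG_SETMASK, &saved, NULL);   /* pending signal fires here */
    return seen;
}

int usr1_seen(void)
{
    return (int)usr1_count;
}
```

Note the two extra system calls per library call, which is where the
cost mentioned above comes from.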

We would like to build a solution into libpvm.  If we get rid of global
data, and use a global "I'm in a PVM function" flag as Phil
(phil@msr.epm.ornl.gov) just suggested, we can let signals happen at
any time.  PVM functions called in a signal handler would then be able
to take special action if necessary, such as queuing work to be done
later when control returns to the normal context.  I think this applies
mainly to sending messages.  The protocol driver is not (of course)
reentrant.  If you are in the middle of sending a packet and start
sending another, things will get botched.
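
Here is one way the flag-plus-queue idea could look, as a sketch only.
None of these names exist in libpvm: lib_send, really_send, in_lib and
the ring buffer are invented, and really_send() just logs the tag
instead of driving a protocol.  The fire_once switch is purely for the
demo; it makes a signal arrive in the middle of a send.  The sketch
assumes a single thread and handlers that do not nest.

```c
#define _POSIX_C_SOURCE 200809L   /* for sigaction() */
#include <signal.h>
#include <string.h>

#define QSIZE 8

static volatile sig_atomic_t in_lib = 0;  /* "I'm in a library call" */
static volatile sig_atomic_t head = 0;    /* advanced only by the drain loop */
static volatile sig_atomic_t tail = 0;    /* advanced only when queuing */
static volatile int ring[QSIZE];          /* deferred message tags */

static int sent[32];                      /* order in which tags went out */
static int nsent = 0;
static int fire_once = 0;                 /* demo only: signal in mid-send */

/* Stand-in for the non-reentrant protocol driver. */
static void really_send(int tag)
{
    if (fire_once) {
        fire_once = 0;
        raise(SIGUSR1);                   /* handler runs here, in mid-send */
    }
    sent[nsent++] = tag;
}

int lib_send(int tag)
{
    if (in_lib) {                         /* reentered: queue it for later */
        int next = (tail + 1) % QSIZE;
        if (next == head)
            return -1;                    /* queue full */
        ring[tail] = tag;
        tail = next;
        return 0;
    }
    in_lib = 1;
    really_send(tag);
    for (;;) {
        while (head != tail) {            /* drain with the flag still set */
            really_send(ring[head]);
            head = (head + 1) % QSIZE;
        }
        in_lib = 0;
        if (head == tail)
            break;
        in_lib = 1;       /* something was queued just before the flag dropped */
    }
    return 0;
}

static void on_usr1(int signo)
{
    (void)signo;
    lib_send(2);                          /* a library call from the handler */
}

/* Demo: send tag 1 and get interrupted by a handler that sends tag 2.
 * Returns 0 if tag 2 was deferred and went out after tag 1. */
int demo_deferred_send(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    fire_once = 1;
    lib_send(1);
    return (nsent == 2 && sent[0] == 1 && sent[1] == 2) ? 0 : -1;
}
```

The re-check after clearing in_lib covers a signal that queues
something between the end of the drain and the flag being dropped.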

But I don't think unix signal handlers are the right way to go.  Signal names
and semantics are not the same on all machines.  What happens if
someone links with, e.g., the optional system-5 libc instead of the
BSD-almost-compatible one?  Do we handle all those cases?  Although
SIGUSR1 is available almost everywhere, it's not on some non-unix
machines, and it's only one signal anyhow.  Also, signals are not
reliably delivered - repeat occurrences can be dropped if they're not
serviced quickly enough.  And, many functions in libc used by PVM or an
application program are still not signal-safe, so you can't go calling
them in signal handlers.  Finally, signal contexts are not a great
place to do lots of computing (which is what happens when you start
using big libraries).  In signal context, you might have a short stack
area with no automatic extension, or other problems that you don't
have in a normal context.
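
The "repeat occurrences can be dropped" point is easy to see for
yourself.  The following sketch (plain POSIX, nothing to do with
libpvm; count_delivered is an invented name) raises SIGUSR1 several
times while it is blocked and then counts how often the handler runs
once it is unblocked.  Ordinary signals are not queued on the systems
I know of, so the count is typically 1 however many were raised, but
treat that as an assumption about your platform and measure it.

```c
#define _POSIX_C_SOURCE 200809L   /* for sigaction() and sigprocmask() */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t hits = 0;

static void count_hit(int signo)
{
    (void)signo;
    hits++;
}

/* Raise SIGUSR1 n times while it is blocked, unblock it, and report
 * how many times the handler actually ran.  Returns -1 on error. */
int count_delivered(int n)
{
    struct sigaction sa;
    sigset_t block, saved;
    int i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = count_hit;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    hits = 0;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &saved);
    for (i = 0; i < n; i++)
        raise(SIGUSR1);           /* each one only marks the signal pending */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    return (int)hits;
}
```

If each signal stood for one message to service, the difference
between n and the return value is work that silently went missing.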

The point is, using signals is an ugly can of worms to open up, and I
don't think it buys us much.

If people want to write interrupt-driven PVM programs, I think it might
be wise to provide a more general and well-defined mechanism.  For
example, one could mark certain message tags as interrupt messages and
bind functions to be called when an interrupt message arrives at the
task.  [ This is how the libpvm control messages work right now, but
they don't really _interrupt_ the task and aren't accessible to user
programmers. ]  This would allow the programmer to create an arbitrary
number of interrupt classes, and service them with functions written
like normal PVM code.  To really interrupt the task, PVM would probably
have to use something like SIGUSR1 or SIGIO to learn that an interrupt
message is waiting, but that would hopefully be hidden by the
implementation.  Message-handling code in libpvm would be signalled
when an interrupt message became available, and would read the
message in a normal context and then call the message handler
function.  But, there's still the problem that the program may be busy
somewhere other than in libpvm.  In that case, the only choices are to
either call the handler function in the signal context, or wait to
deliver it until the next libpvm function is called.
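
To give the tag-binding idea some shape, here is a small sketch of a
handler table and its dispatcher.  It is not an existing libpvm
interface: bind_interrupt_tag, dispatch_message and the msg_handler
type are invented for this post, and messages are reduced to a tag
plus an opaque body.  It covers only the "deliver at the next libpvm
call" case, i.e. dispatch_message() is assumed to run in normal
context.

```c
#include <stddef.h>

#define MAXHANDLERS 16

typedef void (*msg_handler)(int tag, const void *body, size_t len);

static struct {
    int tag;
    msg_handler fn;
} table[MAXHANDLERS];
static int ntable = 0;

/* Bind a function to a message tag.
 * Returns 0 on success, -1 if the table is full. */
int bind_interrupt_tag(int tag, msg_handler fn)
{
    if (ntable >= MAXHANDLERS)
        return -1;
    table[ntable].tag = tag;
    table[ntable].fn = fn;
    ntable++;
    return 0;
}

/* Called by the library for each arriving message.  Returns 1 if a
 * bound handler consumed it, 0 if it should be queued for an ordinary
 * receive instead. */
int dispatch_message(int tag, const void *body, size_t len)
{
    int i;

    for (i = 0; i < ntable; i++) {
        if (table[i].tag == tag) {
            table[i].fn(tag, body, len);
            return 1;
        }
    }
    return 0;
}

/* Example handler: a "bail out on command" message just sets a flag. */
static int bail_out = 0;

static void on_bail(int tag, const void *body, size_t len)
{
    (void)tag;
    (void)body;
    (void)len;
    bail_out = 1;
}
```

A program would call bind_interrupt_tag(99, on_bail) once, and any
message with tag 99 would then set bail_out instead of landing in the
normal receive queue.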

One question I have for the general public is how one would use this
facility, other than for really simple applications such as a "bail out
on command" message or a remote memory fetch.  Are your own data
structures going to be tough enough to be mooshed around by reentrant
code?  What about memory leaks?

So, that brings us to threads.  Instead of interrupt message handler
functions, what about a more general scheme such as interrupt handler
threads?  On receiving an interrupt message, libpvm would fork a new
thread to handle it.  The message-marshalling thread in libpvm that
creates it would already exist (from the first call to install an
interrupt handler, or before that).
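
In POSIX threads terms (an assumption; no thread package has been
chosen), the thread-per-message idea might look like the sketch below.
struct intr_msg, handle_msg and dispatch_in_thread are hypothetical
names, and the "work" done by the handler is a placeholder.  On some
systems you need to compile with -pthread.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical interrupt message: a tag, some input, and a result
 * slot filled in by the handler thread. */
struct intr_msg {
    int tag;
    int payload;
    int result;
};

/* Runs in its own thread, in normal (non-signal) context. */
static void *handle_msg(void *arg)
{
    struct intr_msg *m = arg;

    m->result = m->payload * 2;   /* placeholder for real handler work */
    return NULL;
}

/* What the marshalling thread would do for each interrupt message:
 * start a thread to service it.  Returns 0 on success, otherwise the
 * error number from pthread_create(). */
int dispatch_in_thread(struct intr_msg *m, pthread_t *tid)
{
    return pthread_create(tid, NULL, handle_msg, m);
}
```

The caller keeps the message alive until it has joined the thread
with pthread_join(); a real dispatcher would have to settle who owns
the message and when it is freed.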

We could remove global state from libpvm, and build a thread
dispatching mechanism.  But we have to have a thread package portable
enough (or make it removable enough) to function on any of the
machines on which PVM can run.  I hope we could do this by making a
thin glue layer that would interface between PVM and any thread
package, giving us access to the few lock, thread-create, etc.
primitives that we need.  Last I heard, Honbo Zhou at ORNL was trying
to figure out how to do this.
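
As a sketch of what such a glue layer could look like (my guess, not
a description of anyone's actual work), the library would be written
against a small table of function pointers, with one backend per
thread package.  struct thread_glue and all the function names are
invented here.  Two backends are shown: POSIX threads, and a "no
threads" one that makes the layer removable on machines without a
thread package.

```c
#include <pthread.h>
#include <stdlib.h>

/* The handful of primitives the library needs from a thread package. */
struct thread_glue {
    void *(*lock_new)(void);
    void  (*lock)(void *lk);
    void  (*unlock)(void *lk);
    void  (*lock_free)(void *lk);
    int   (*spawn)(void *(*fn)(void *), void *arg);   /* 0 on success */
};

/* Backend 1: POSIX threads. */
static void *pt_lock_new(void)
{
    pthread_mutex_t *m = malloc(sizeof *m);

    if (m != NULL && pthread_mutex_init(m, NULL) != 0) {
        free(m);
        m = NULL;
    }
    return m;
}
static void pt_lock(void *lk)      { pthread_mutex_lock(lk); }
static void pt_unlock(void *lk)    { pthread_mutex_unlock(lk); }
static void pt_lock_free(void *lk) { pthread_mutex_destroy(lk); free(lk); }
static int pt_spawn(void *(*fn)(void *), void *arg)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, fn, arg);

    if (rc == 0)
        pthread_detach(t);
    return rc;
}
const struct thread_glue glue_pthreads = {
    pt_lock_new, pt_lock, pt_unlock, pt_lock_free, pt_spawn
};

/* Backend 2: no thread package.  Locks are no-ops and "spawn" runs
 * the function to completion in the caller. */
static int no_lock_token;
static void *no_lock_new(void)      { return &no_lock_token; }
static void no_lock(void *lk)       { (void)lk; }
static void no_unlock(void *lk)     { (void)lk; }
static void no_lock_free(void *lk)  { (void)lk; }
static int no_spawn(void *(*fn)(void *), void *arg)
{
    fn(arg);
    return 0;
}
const struct thread_glue glue_none = {
    no_lock_new, no_lock, no_unlock, no_lock_free, no_spawn
};

/* Library code is written once, against the table. */
static int counter = 0;

int bump_counter(const struct thread_glue *g, void *lk)
{
    int v;

    g->lock(lk);
    v = ++counter;
    g->unlock(lk);
    return v;
}
```

Porting to another thread package would then mean writing five small
functions and one table, without touching the rest of the library.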

Then, there are other questions such as:  What should the semantics of
the thread dispatcher be?  Do you allocate a single thread to handle a
message class (another doesn't start until the previous one terminates),
or do you start as many threads as you receive messages?  I haven't
thought much about this yet.

-b

-- 
/ Robert Manchek                University of Tennessee     /
/                               Computer Science Department /
/ (615)974-8295                 Ayres Hall #104             /
/ manchek@CS.UTK.EDU.           Knoxville TN  37996-1301    /
/     http://www.netlib.org/utk/people/BobManchek.html      /



