Newsgroups: comp.parallel.mpi
From: solt@cs.uiuc.edu (David Solt)
Subject: Re: Performance problems using mpi on the SGI Power Challenge
Organization: University of Illinois at Urbana-Champaign
Date: 14 Oct 1996 21:47:50 GMT
Message-ID: <53uce6$9nq@vixen.cso.uiuc.edu>

In article <53pe0e$629@murrow.corp.sgi.com> salo@mrjones.engr.sgi.com (Eric Salo) writes:
>> I am working on my own implementation of MPI on the SGI Power
>> Challenge.  My approach is to continue to use gang scheduling, but to
>> allow processes to cooperate in data transfer.  Our initial tests show
>> that this works well in most circumstances.  Any data transfers over a
>> constant (MY_MPI_WORK_UNIT :)) are split up into chunks (size of
>> chunks depends on the number of available spinning processors) and
>> placed in the work queues of any spinning processors.
>
>I don't see how this could work unless your implementation was multi-
>threaded. Is it? And are you ready to publish any hard numbers yet? It
>sounds interesting...

The processes are spawned with sproc.  I believe this is how your MPI
works on the Power Challenge?  I use sproc without sharing the same
address space.  Each "process" then attaches every other process's
stack somewhere in its own address space.  The heap is shared by
intercepting calls to malloc and turning them into calls to usmalloc.
When sending data, the address of the data is checked.  If a process
is sending a global or constant address, the data is first copied into
a shared area (such as the heap).  If the address is in a stack
region, other processes access it through the address at which they
attached that process's stack.  Finally, if the address is in the
shared heap, it is used directly.  I assume that you do something
similar to this, since I was told that you do single-copy data
transfers as well, and for tests with large data sizes our MPIs take
the same time.

So, I don't know whether the SGI community refers to these kinds of
processes as "threads", since they do not completely share a single
address space the way pthreads do, but since they coordinate in much
the same way as threads (with locks, semaphores, and barriers), I
think the answer is yes.

As for progress... little is implemented so far.  Most of the
point-to-point routines are done, however; I have not worked on
buffered sends or sendrecv.  I am just now starting to venture into
the collective routines and how this same approach can be applied
there.  Only MPI_COMM_WORLD is available right now, and only the basic
data types.  I hope to steal many of the non-communicating calls from
mpich if we ever decide to put a real package together.

As for numbers... I am working on two last things before publishing
anything.  One is to see how this works in conjunction with the
different combinations of gang vs. individual scheduling and blocking
vs. spinning.  In this context, it is not even clear when a good time
to block is... the obvious case is when you are waiting for a send or
recv to be posted and your work queue is empty, but it is not clear
whether a blocked process should be counted as available for doing
work or not.  The second is to determine the best way to set the
variable I described above as MY_MPI_WORK_UNIT.  Currently I have
hand-tuned it for various benchmarks, but some work is necessary to
determine a good fixed value, or a way to set it dynamically.
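
For reference, the splitting itself is simple.  Here is a sketch of
how a transfer over MY_MPI_WORK_UNIT might be divided among the
spinning processors; the threshold value and the one-chunk-per-spinner
policy are illustrative assumptions, not the tuned values:

```c
#include <stddef.h>

#define MY_MPI_WORK_UNIT 4096   /* hand-tuned threshold; value illustrative */

/* Split an nbytes transfer into nspin + 1 chunks (one per spinning
   processor plus the sender itself), writing each chunk's offset and
   length into off[] and len[].  Returns the number of chunks.
   Transfers at or below the work unit are left whole. */
static int split_transfer(size_t nbytes, int nspin,
                          size_t off[], size_t len[])
{
    if (nbytes <= MY_MPI_WORK_UNIT || nspin <= 0) {
        off[0] = 0;
        len[0] = nbytes;
        return 1;
    }
    int nchunks = nspin + 1;
    size_t base = nbytes / nchunks, rem = nbytes % nchunks;
    size_t pos = 0;
    for (int i = 0; i < nchunks; i++) {
        off[i] = pos;
        len[i] = base + (i < (int)rem ? 1 : 0);  /* spread the remainder */
        pos += len[i];
    }
    return nchunks;
}
```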

Any ideas of good benchmarks to test things out on?  Of course the
problem now is that I have a very limited subset of MPI to work with.

Dave Solt
 

