Newsgroups: comp.parallel.mpi
From: sanders@ira.uka.de (Peter Sanders)
Subject: Re: MPI_Alltoall() on Cray T3D
Organization: Universitaet Karlsruhe, Germany
Date: 18 Nov 1996 17:11:00 +0100
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Message-ID: <t1rg227v0x7.fsf@i90s25.ira.uka.de>

>   > I guess I'd really appreciate it if somebody could mention the exact
>   > way the schedule is generated.
>
> OK. Imagine the processes within a communicator arranged in a ring. Every
> process tells its immediate neighbour to the right where it would like its
> neighbour to write the data. It then writes data to the correct place on
> its left neighbour. Repeat this n times, stepping one process further
> round the ring each step. No attempt is made to avoid collision on links,
> only to minimise the number of processors writing to a particular
> processor at any step. We experimented with several options for the
> collective operations, and found that the cost of synchronisation
> associated with some of the more "optimal" algorithms significantly
> outweighed any link contention overheads.
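If I read that description correctly, the schedule can be simulated as
follows (a hypothetical sketch, not the actual T3D code): in step s,
process i writes its block for process (i - s) mod P directly, so each
step every processor is written to by exactly one other processor.

```python
# Hypothetical sketch of the ring schedule described above (not the
# actual T3D implementation): in step s, process i delivers its block
# for process (i - s) mod P. The targets in each step form a
# permutation, so at most one processor writes to any given processor
# per step -- exactly the property the quoted text claims.

def ring_alltoall(P):
    received = [set() for _ in range(P)]   # sources seen at each process
    for s in range(P):                     # step 0 is the local copy
        writers_per_target = [0] * P
        for i in range(P):
            target = (i - s) % P
            received[target].add(i)        # block from i lands at target
            writers_per_target[target] += 1
        # one writer per target in every step: no receiver contention
        assert all(w == 1 for w in writers_per_target)
    return received

result = ring_alltoall(5)
# after P steps every process holds one block from every source
assert all(blocks == set(range(5)) for blocks in result)
```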

These ring-based algorithms sound fine for large messages or small P (P
= number of processors).  But what about short messages? There are
quite simple algorithms using hypercubic communication patterns that
need only log P message-exchange phases per processor.  (It might turn
out to be a little more difficult for general P, though.)
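For illustration, here is a hypothetical sketch of such a hypercubic
(store-and-forward) pattern, assuming P is a power of two: in phase k,
process i exchanges with partner i XOR 2^k, forwarding every block
whose destination still differs from i in bit k. Only log2(P) message
startups per processor are needed, which is what dominates the cost for
short messages.

```python
# Hypothetical sketch of a hypercube all-to-all for P a power of two.
# In phase k, process i exchanges with partner i XOR 2^k; a block
# destined for dst is forwarded iff dst and i differ in bit k. After
# the phase, bit k of every block's location matches bit k of its
# destination, so after log2(P) phases every block has arrived.

def hypercube_alltoall(P):
    # buffers[i] holds (src, dst) blocks currently sitting at process i
    buffers = [[(i, d) for d in range(P)] for i in range(P)]
    phases, bit = 0, 1
    while bit < P:
        nxt = [[] for _ in range(P)]
        for i in range(P):
            for src, dst in buffers[i]:
                if (dst ^ i) & bit:          # bit k still wrong: forward
                    nxt[i ^ bit].append((src, dst))
                else:                        # bit k already right: keep
                    nxt[i].append((src, dst))
        buffers, phases, bit = nxt, phases + 1, bit << 1
    return buffers, phases

bufs, phases = hypercube_alltoall(8)
assert phases == 3                           # log2(8) exchange phases
for i, b in enumerate(bufs):
    assert sorted(b) == [(s, i) for s in range(8)]
```

The trade-off: each block may travel up to log P hops, so the total
volume moved grows roughly by a factor of (log P)/2 over direct
delivery, which is why this only pays off when messages are short and
startup latency dominates.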

For example, this would be crucial for the previously discussed cases
where we need a length-one MPI_Alltoall in order to initialize the
count and displacement arrays needed for an MPI_Alltoallv.
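Concretely (a sketch with made-up counts, not real code): each process
initially knows only how much it will send to every other process. A
count-1 MPI_Alltoall of those integers amounts to transposing the count
matrix, giving each process its receive counts, from which the receive
displacements for MPI_Alltoallv follow by a prefix sum.

```python
# Sketch with made-up data: the length-one all-to-all that precedes an
# MPI_Alltoallv. sendcounts[i][j] is what process i will send to j; a
# count-1 MPI_Alltoall is logically a transpose of this matrix, giving
# each process its recvcounts and hence its recv displacements.

def setup_alltoallv(sendcounts):
    P = len(sendcounts)
    # MPI_Alltoall with count 1 == transpose of the count matrix
    recvcounts = [[sendcounts[j][i] for j in range(P)] for i in range(P)]
    rdispls = []
    for counts in recvcounts:
        d, off = [], 0
        for c in counts:                 # exclusive prefix sum
            d.append(off)
            off += c
        rdispls.append(d)
    return recvcounts, rdispls

sc = [[0, 2, 1],
      [3, 0, 0],
      [1, 1, 2]]
rc, rd = setup_alltoallv(sc)
assert rc[0] == [0, 3, 1]    # process 0 receives 3 items from process 1
assert rd[0] == [0, 0, 3]    # and places them at offset 0
```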

Are there any implementations with an optimized version of
Alltoall for short messages?

Regards,

Peter Sanders

University of Karlsruhe

