Newsgroups: comp.parallel.mpi
From: llewins@msmail4.hac.com (Lloyd J Lewins)
Subject: Re: Any code to do Transpose of matrix on the SP2 using MPI?
Organization: Hughes Aerospace Electronics Co.
Date: Mon, 04 Dec 1995 10:48:04 -0800
Message-ID: <llewins-0412951048040001@x-147-16-95-58.es.hac.com>

In article <30BCC8B7.41C67EA6@hydra.cfm.brown.edu>, Wai Sun Don
<wsdon@hydra.cfm.brown.edu> wrote:

> Hi, I am new to MPI.  I am wondering if there exists any library code
> to do transpose of an NxM matrix distributed over P processors on the
> SP2? If so, where can I get a copy of  such code?
> thanks you 
> 
> -- 
> Wai Sun Don
> Visiting Associate Professor (Research)

2-D Matrix Transpose Using MPI

    For efficient execution of multidimensional signal processing
algorithms on typical distributed memory parallel machines, two issues
must be addressed:

    First, the data-set must be distributed amongst the processors such
that the data required by the next phase of the computation is stored in
the local memory of the processor that will use it.
This is necessary because the time it takes to access data stored in a
remote memory (even in a shared memory architecture) is typically several
orders of magnitude higher than the time it takes to access data stored in
the local memory. Note: in general it is not an adequate solution to
replicate the entire data-set in each local memory, because the entire
data-set is often too large to store in a single node's memory.

    Second, the piece of the data-set stored in local memory must be
laid out in a manner optimal for the next phase of the computation. This
is necessary because modern memory hierarchies (caches, page-mode DRAM,
etc.) deliver much higher performance when accessed in an optimal pattern
than when accessed in a pathological, or even random, pattern.

    For example, consider the well-known algorithm for performing a
two-dimensional FFT (N x M): first perform N one-dimensional FFTs along
the rows of the data-set, then M one-dimensional FFTs down the columns of
the data-set. During the first phase of this computation,
the optimal layout of the data-set will be to distribute the N rows evenly
amongst the P available processors. In addition, if a processor has more
than one row, the elements of each row should be stored in contiguous
memory locations.

    During the second phase of the 2-D FFT computation, the optimal layout
of the data-set will be to distribute the M columns evenly amongst the P
available processors, with the elements of each column stored in
contiguous memory locations.

    Thus, between phase 1 and phase 2 of the 2-D FFT, the data-set must be
redistributed amongst the P processors, and the layout of the data in
local memory must be changed from row-major to column-major order. This
operation is known as "corner turning".

    When the number of rows and the number of columns are both exact
multiples of the number of processes, the MPI function MPI_Alltoall can
perform exactly this redistribution and re-layout in a single call. This
is achieved by using custom MPI datatypes to specify row-major order at
the source, and column-major order at the destination.

    In more detail, let us consider corner turning a 16 x 8 matrix of
COMPLEX data elements, distributed over four processors. We will assume
that the matrix is laid out initially as described above for phase 1 of a
2-D FFT, i.e.:

    Process 1:
      D[1,1], D[1,2]...D[1,8], D[2,1], D[2,2]...D[2,8]...D[4,1], D[4,2]...D[4,8]

    Process 2:
      D[5,1], D[5,2]...D[5,8], D[6,1], D[6,2]...D[6,8]...D[8,1], D[8,2]...D[8,8]

    etc. for process 3 and 4

After the corner turn, the data should be distributed as follows:

    Process 1:
      D[1,1], D[2,1]...D[16,1], D[1,2], D[2,2]...D[16,2]

    Process 2:
      D[1,3], D[2,3]...D[16,3], D[1,4], D[2,4]...D[16,4]

    etc. for Process 3 and 4

    To perform this operation using MPI, we must first create two derived
datatypes: one describing the data layout, and extent, of a source
subarray, which will be sent to every other node, and the other describing
the data layout, and extent, of a destination subarray, which will be
received from each node.

    In the above case, the source subarray will consist of four rows and
two columns, with an extent equal to the size of two COMPLEX elements. The
destination subarray will consist of two rows and four columns, with an
extent equal to the size of four COMPLEX elements. The following code
fragment creates the necessary MPI datatypes and performs the corner turn:

    typedef double complex[2];
    MPI_Datatype COMPLEX;
    MPI_Datatype temp;
    MPI_Datatype sourceSubarray;
    MPI_Datatype destColumn;
    MPI_Datatype destSubarray;
    MPI_Datatype types[2];
    int blens[2];
    MPI_Aint displ[2];
    complex srcbuf [4][8];   /* local piece: 4 rows of 8 columns         */
    complex destbuf[2][16];  /* local piece: 2 columns of 16, contiguous */
    
    /* First create a COMPLEX datatype (not predefined in C!). The
       elements are pairs of doubles, so the base type is MPI_DOUBLE. */
    MPI_Type_contiguous (2, MPI_DOUBLE, &COMPLEX);
    
    /* Now, create the source subarray - four rows, each consisting of
       two columns. Note, the stride between rows takes into account the
       size of the whole row in the source array (8 columns). */
    MPI_Type_vector (4, 2, 8, COMPLEX, &temp);
    
    /* Then set the extent of the subarray so that it strides across
       the source array */
    blens[0] = 1;
    blens[1] = 1;
    displ [0] = 0;
    displ [1] = 2 * sizeof (complex); /* skip to next two columns */
    types[0] = temp;
    types[1] = MPI_UB;
    MPI_Type_struct (2, blens, displ, types, &sourceSubarray);
    MPI_Type_free (&temp);
    MPI_Type_commit (&sourceSubarray);
    
    /* To re-layout the incoming data, we must create the destination
       data structure so that it iterates along columns before rows.
       Thus we must use a different construction from that shown above!
      
       First create a type holding the two elements of one row, which
       land one column length (16 elements) apart in the destination */
    MPI_Type_vector  (2, 1, 16, COMPLEX, &destColumn);
    
    /* Then replicate this 4 times, one COMPLEX apart, to cover the four
       consecutive rows of the destination subarray. */
    MPI_Type_hvector (4, 1, sizeof (complex), destColumn, &temp);
    MPI_Type_free (&destColumn);
    
    /* Finally set the extent of the subarray so that it strides across
       the destination array */
    blens[0] = 1;
    blens[1] = 1;
    displ [0] = 0;
    displ [1] = 4 * sizeof (complex); /* skip down four rows */
    types[0] = temp;
    types[1] = MPI_UB;
    MPI_Type_struct (2, blens, displ, types, &destSubarray);
    MPI_Type_free (&temp);
    MPI_Type_commit (&destSubarray);
    
    /* Now we can use MPI_Alltoall to actually transpose the matrix
       (comm is the communicator containing the four processes) */
    MPI_Alltoall (srcbuf,  1, sourceSubarray, 
                  destbuf, 1, destSubarray, comm);

--------------------------------------------------------------------------
Lloyd J Lewins                                  Mail Stop: RE/R1/B507
Hughes Aerospace and Electronics Co.            P.O. Box 92426
                                                Los Angeles, CA 90009-2426
Email: llewins@msmail4.hac.com                  USA
Tel: 1 (310) 334-1145
Any opinions are not necessarily mine, let alone my employer's!!

