Newsgroups: comp.parallel
From: prins@cs.unc.edu (Jan Prins)
Subject: Re: Parallel Prefix
Organization: The University of North Carolina at Chapel Hill
Date: 2 Apr 1995 14:45:17 -0400
Message-ID: <3lrkgd$am6@usenet.srv.cis.pitt.edu>

Robert van de Geijn <rvdg@cs.utexas.edu> wrote:
        >The parallel prefix can be illustrated as follows:
        >
        >Before:
        >
        >let x = ( x_0 , x_1 , ... x_p-1 )
        >with x_i assigned to processor i.
        >
        >After
        >
        >let y = ( y_0 , y_1 , ... y_p-1 )
        >with y_i assigned to processor i.
        >with y_i = x_0 + x_1 + ... + x_i
        >
        >We have no problems identifying applications for this.  However,
        >consider the case where x_i and y_i are all vectors, and this
        >operation is done simultaneously, ELEMENTWISE.  [...]
        >
        >We would very much appreciate references to applications
        >as well as implementations.

A nice use of the elementwise parallel prefix operation on vectors occurs
in parallel radix sort.  Suppose we choose the radix to be 256, so that
each iteration in the radix sort arranges the inputs according to their
value in an 8-bit field.  Each processor looks through its local values
and counts the number of occurrences of each "digit" between 0 and 255,
yielding a vector of 256 counts in each processor.  To convert these local
counts into a global destination for each value we must compute the
parallel prefix sum of each of the 256 vector elements across the
processors.  The last processor then holds the global total for each
digit; broadcasting these totals lets every processor offset its prefix
sums so that all "1" digits follow all "0" digits, etc.

This radix sort algorithm is described and implemented in Blelloch et al.,
"Comparison of Sorting Algorithms for the CM-2" (SPAA '91), and is also
used in Blelloch and Zagha, "Radix Sort for Vector Multiprocessors"
(Supercomputing '91).  Both of these papers can be obtained via
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/scandal/public/www/alg/sort.html

Multiple simultaneous parallel prefix operations of this sort can be
implemented in exactly the same way as a global vector sum, with the
attendant performance improvements over the element-at-a-time parallel
prefix operation.  We have used these techniques on MasPar machines in the
implementation of radix sort, and to implement parallel prefix operations
on large vectors that are decomposed over processors following a cyclic
data distribution.
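
As an illustration of treating the whole vector as the unit of
communication, the following Python sketch simulates an inclusive scan
across p processors by recursive doubling, where each step combines entire
vectors elementwise.  Recursive doubling is an assumption chosen here for
illustration; it is one common scheme and not necessarily what was used on
the MasPar.  The function name is hypothetical.

```python
def vector_scan_recursive_doubling(vectors):
    """Elementwise inclusive prefix sum across simulated processors.

    vectors[i] is the vector held by processor i.  In each round,
    processor i adds in the whole vector held by processor i - step,
    so every vector element takes part in the same set of rounds.
    Returns (result, rounds).
    """
    p = len(vectors)
    current = [list(v) for v in vectors]
    rounds = 0
    step = 1
    while step < p:
        nxt = []
        for i in range(p):
            if i >= step:
                # "Receive" the partner's vector and combine elementwise.
                nxt.append([a + b for a, b in zip(current[i - step], current[i])])
            else:
                nxt.append(current[i])
        current = nxt
        step *= 2
        rounds += 1
    return current, rounds
```

With five processors holding two-element vectors, the scan completes in
three rounds regardless of the vector length; only the size of each
message grows.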

Jan Prins
http://www.cs.unc.edu/~prins/

