Newsgroups: comp.parallel.mpi
From: dbs@masc.rice.edu (David Brian Serafini)
Subject: MPI/CH bug in ALL_REDUCE with user-defined operator?
Organization: Dept. of Math. Sciences, Rice University
Date: 16 Jun 1995 03:26:30 GMT
Message-ID: <3rqtl6$lta@larry.rice.edu>


I'm trying to use MPICH 1.0.8 on SGI Power machines running IRIX 6.0.1 and
6.0.2 (64-bit OS).

The problem I have is with MPI_OP_CREATE and MPI_ALLREDUCE.
I'm programming in Fortran.  The MPI standard (May 1994) says that the
Fortran interface to OP_CREATE should use an integer to store the
MPI_Op handle.  I figured out that it needs to be integer*8 on
IRIX64, but I'm still getting an error from ALLREDUCE:

> 0 - Error in MPI_ALLREDUCE : Invalid operation
> Aborting program!
> (null) 11
> p2p_error is not fully cleaning up at present
> Process 27621 (pdsappmpi-dbg) terminated


The call that fails is:

      call MPI_ALLREDUCE( MYBEST ,YOURBEST ,SIZEE,MPI_DOUBLE_PRECISION
     &                   ,PDSWAPOP ,MPI_COMM_WORLD ,INFO )

where PDSWAPOP is the integer*8 variable that was set by:

      call MPI_OP_CREATE( PDSWAPFCN ,.TRUE. ,PDSWAPOP ,INFO )

I've run this in the debugger and I can see that after calling OP_CREATE,
PDSWAPOP has a non-zero value that looks like a 32-bit pointer (the upper 32
bits are zero), and that the value passed into ALLREDUCE is the same as the
value returned by OP_CREATE.  I've stepped into the ALLREDUCE code, but I'm not
much of a C hacker so I haven't been able to figure out what's going wrong.  It
seems that somewhere in the process of following pointers and doing table
lookups it comes up with a zero value, and I think that's what it is
complaining about.  The original code, with the integer*4 declaration for
PDSWAPOP, had the same error, but there PDSWAPOP really was 0 because it only
got the upper 4 bytes of the actual value.
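
The integer*4 case can be sketched in a few lines of standalone C.  This is
not MPICH code: the function names and the handle value are made up for
illustration, and it assumes the library writes the handle as a 64-bit
big-endian word at the address of PDSWAPOP, which I have not confirmed.

```c
#include <assert.h>
#include <stdint.h>

/* Lay out a 64-bit value in big-endian byte order (assumed here to match
   how the handle is stored on the SGI). */
static void store_be64(unsigned char buf[8], uint64_t v)
{
    for (int i = 0; i < 8; i++)
        buf[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Read four bytes as a big-endian 32-bit integer. */
static uint32_t load_be32(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}

/* What an integer*4 variable would hold if an 8-byte handle were written
   at its address: only the first (upper) four bytes. */
uint32_t int4_view_of_handle(uint64_t handle)
{
    unsigned char buf[8];
    store_be64(buf, handle);
    return load_be32(buf);
}

/* Hypothetical handle values, chosen only to show the effect. */
void self_check(void)
{
    /* A pointer-like value with zero upper 32 bits reads back as 0. */
    assert(int4_view_of_handle(0x10012340ULL) == 0);
    /* Only bits above bit 31 survive the truncation. */
    assert(int4_view_of_handle(0x0000000100000000ULL) == 1);
}
```

With the integer*8 declaration all eight bytes are kept, so this particular
truncation does not apply to the failure I am seeing now.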

This exact same code (without the integer*8 declaration) works correctly with
IBM's MPI on the SP2.

This has failed on two different SGI MPICH installations at separate sites.
I didn't build either one; I'm just trying to use them.
I don't usually blame my bugs on the system software, but in this case I can't
think of any other explanation.  

Anyone have any ideas?  Has anyone used OP_CREATE successfully in Fortran on a
64-bit machine?  Should I just find another implementation?

I'll try to keep up with the newsgroup, but please email me any responses
just to make sure I see them.

Thanks in advance,

David   <dbs@caam.rice.edu>

-- 
David B. Serafini           <dbs@masc.rice.edu>           (713)527-8101 x2855
Rice Univ., Computational & Appl. Math Dept,  P.O.Box 1892,  Houston TX 77251
Computer Sciences Corp., NASA/Ames Res.Ctr. MS-T27A-1, Moffett Field CA 94035
   <The views expressed do not represent those of Rice U., NASA or CSC.>

