Newsgroups: comp.parallel.mpi
From: ZJ <zjw@cfdrc.com>
Subject: Re: Help needed to run MPICH on a cluster of SGI workstations
Organization: CFD Research Corporation
Date: Mon, 18 Mar 1996 14:43:06 -0600
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <314DCADA.41C6@cfdrc.com>

Jaideep Ray wrote:
> 
> ZJ <zjw@cfdrc.com> wrote:
> >Hi, there
> >
> >I have a cluster of SGI workstations and had no problems configuring each machine.
> >I ran cpi successfully on each machine with several processes.  However, I could not run
> >a program across different machines.  I set up the .rhosts file so that I can
> >use rsh to log in to the workstations in the cluster.
> >
> >When I ran tstmachines, I got:
> >
> >
> >Errors while trying to run ls /usr/people/p4815/mpich/bin/machines/foo
> >Unexpected response from glasgow.cfdrc.com:
> >--> UX:ls: ERROR: Cannot access /usr/people/p4815/mpich/bin/machines/foo: No such file
> >or directory
> >Unexpected response from glasgow.cfdrc.com:
> >--> UX:ls: ERROR: Cannot access /usr/people/p4815/mpich/bin/machines/foo: No such file
> >or directory
> >
> >2 errors were encountered while testing the machines list for sgi
> >Only these machines seem to be available
> >    china.cfdrc.com
> >    china.cfdrc.com
> >    china.cfdrc.com
> >
> 
>         Looks strange.
> 
>         1) Which workstation did you run the successful runs on ?
>            china.cfdrc.com ?
>

I ran the code on both china and glasgow successfully with multiple processes.
 
>         2) If that's so, did you try to run the cluster run from
>            china.cfdrc.com ?
> 

When I tried a run across the cluster, started from china (or from glasgow), I had the problem.

>         3) log into glasgow and see if /usr/people/p4815/mpich/bin/machines/
>            exists.
>
The file does exist.
 
>         4) What does your mpich/util/machines/machines.sgi look like ?
>            should look like -
> 
>                 china.cfdrc.com
>                 glasgow.cfdrc.com
>                 <----other machines in the cluster, one per line --->
>

You are right!

However, when I used the "secure server", my code ran successfully on the
cluster, though tstmachines still failed.
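
For anyone hitting the same thing, here is a minimal sketch of the kind of
check tstmachines appears to make, judging only from its output above: run
ls on a path on every host in the machines list and flag unexpected
responses. It is written in Python and is not part of MPICH. The function
names, the treatment of '#' lines as comments, and the rule "anything on
stderr counts as a failure" are my own assumptions. It also lists hosts
that appear more than once in the list.

```python
import subprocess
from collections import Counter


def read_machines(text):
    """Return the host names found in machines-file text, one per line.

    Blank lines and lines starting with '#' are skipped; that is an
    assumption about the file format, not something taken from MPICH.
    """
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


def duplicate_hosts(hosts):
    """Map each host that is listed more than once to its count."""
    return {h: n for h, n in Counter(hosts).items() if n > 1}


def check_host(host, path, run=subprocess.run):
    """Run `ls path` on host through rsh and return (ok, message).

    rsh does not reliably pass back the remote exit status, so this
    sketch treats any text on stderr as a failure (an assumption).
    The `run` argument exists only so the call can be replaced in tests.
    """
    try:
        result = run(["rsh", host, "ls", path],
                     capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    err = result.stderr.strip()
    if err:
        return False, err
    return True, result.stdout.strip()
```

To use it, pass the contents of mpich/util/machines/machines.sgi to
read_machines(), then call check_host() for each host with a path that
should exist on all of them. A host that fails here but works under the
secure server would point at the rsh setup rather than at MPICH itself.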

>         Keep me posted.
> 
>         Ray

Thanks, Ray.

ZJ

