Newsgroups: comp.parallel.pvm
From: le_saux@IRO.UMontreal.CA (Eric Le Saux)
Subject: Timing Distributed Processes
Keywords: pvm timing distributed processes
Organization: Universite de Montreal
Date: Tue, 16 Aug 1994 19:25:41 GMT
Message-ID: <Cun7Au.Evq@IRO.UMontreal.CA>

   We want to time master-slave PVM programs.  All the slaves run on
different but identically configured machines (Sparc 5).

   Let's suppose we are alone on the network; then a real-time clock
gives quite stable measurements.  It can be used to compare runs
on different problem sizes.

   The problem is that most of the time, those machines are used for
other purposes (as you would have guessed).  The usual thing to do in
those circumstances is for each slave to lower its own priority
(with setpriority()).

   In that context, wall-clock time can still be used if you want to
know how much time it really takes under the prevailing conditions.

   ...But that is not what we want.

   Much more adequate for us would be the cpu time.  Every slave can
return to the master its own elapsed cpu time.  We can then write
some simple equations:

            n
           __
   1) M +  \  S[i]
           /_  
           i=1


           n
   2) M + max (S[i])
          i=1


   where M is the master's cpu time,
         S[i] the ith slave's cpu time,
         and n is the number of slaves.


   Equation (1) gives all the cpu time used, as if everything were
done sequentially.  That may be useful.

   But since we want a "parallel" time measurement, every "parallel"
second of cpu time should be counted once only, and equation (2) is
better in that respect.  It assumes that slaves *could* really overlap
in time under real conditions.  It will take into account the time of
packing and unpacking PVM message buffers, but not the transmission
time from machine to machine.


   So, to sum up: we currently have two solutions for timing
distributed programs:

   1)  Isolate a group of machines during off-hours and use real-time
       measurements of our master program.

   2)  Share resources by running slaves at lower priority, at any
       time of the day, and use a function of cpu time for our
       measurements.

   (Maybe a mix of the two, even though the two kinds of measurement
    would not be directly comparable).

   I would prefer something based on cpu time, since I would not
have to deal with external performance degradation due, for example,
to the system backing up its disks!


   So if you have any comments on all this, particularly if you have a
better function based on cpu time, I would greatly appreciate it.



                           Eric LE SAUX

                           Center for Research on Transportation
                           Montreal, Quebec, Canada

