Newsgroups: comp.parallel,comp.parallel.pvm
From: nfotis@theseas.ntua.gr (Nick C. Fotis)
Subject: Re: Distributed Processing - How to measure speedups? (FOLLOWUP)
Organization: National Technical University of Athens, Greece
Date: Mon, 1 Aug 1994 16:06:14 GMT
Message-ID: <Ctv62E.A1n@dcs.ed.ac.uk>



Hello all,

I wish to express many thanks to the people who sent me their replies
to my question about benchmarking distributed systems. It seems this is
a hot topic and under intensive research.

My original question, the replies, and my remarks (where they exist,
set off in brackets '[', ']') follow below.

I'm eager to collect more material on this - if you have anything to add,
send me mail (if I'm slow in replying, don't worry - I'm going on holiday
for 3-4 weeks soon).

Greetings,
Nick.


====== ORIGINAL QUESTION ========

: Greetings to all,

: a professor here asked a very hard question (for me):

: - How can / should I measure the efficient execution of programs in a
:   heterogeneous network?

: He cannot (AFAIK) use Unix time(1) (or something like it) any more to
: get the real CPU time, and if we use PVM (or something similar), the
: efficiency becomes a very nebulous target.

: We don't know what to measure anymore - the CPU seconds spent in each CPU
: are rather irrelevant, as we may have CPUs from 30 SPECfp to 300 SPECfp each
: - and the network delays aren't the same on each machine.

: We cannot isolate the network, since it's not our own, and wall-clock
: time is not an adequate metric, since he does research on efficient parallel
: algorithms (till now on homogeneous, shared-memory machines).

: We would appreciate advice, pointers to FTP-able reports, etc.
: If we get anything interesting, I'll post a summary follow-up.

: Thanks in advance,
: Nick.

[  ce107@cfm.brown.edu (C. Evangelinos) sent me an electronic version of a
   technical report on benchmarking distributed systems. The title of the
   report is

   Efficiency Evaluation of Some Parallelization
   Tools on a Workstation Cluster Using the NAS
   Parallel Benchmarks

   Florian Sukup
   Computing Centre
   Vienna Institute of Technology
   sukup@dvz.tuwien.ac.at

   ACPC/TR 94-2 January 1994

  The electronic version of the report contained code implementing
  the NAS benchmarks on the following platforms:

	PVM 3.x, PVM 2.x, p4, Express, Linda

  Since C. Evangelinos cannot recall where he got this archive (he got it via
  the WWW), you can either:

   a. Get in touch with the author, or
   b. FTP the electronic version from ftp.ntua.gr [ 147.102.1.1 ], directory
	/pub/nfotis, file tool_eval.tar.gz (sorry, no compress - our line
	is really slow, so we use gzip instead of compress in transferred
	files)

  -- nfotis ]


======== REPLIES =========

Date: Wed, 20 Jul 1994 10:09:10 +0800
From: fineman@rex.Eng.Sun.COM (Charlie Fineman)


Well, the typical way to judge efficiency is to determine how much
overhead is spent just managing the concurrent agents (startup/teardown
time and, of course, communication). I would suggest you instrument the
communication calls to start with and compare that time to the time spent
doing actual computation. You could do this a little quicker (but at a loss
of information) by using getrusage on Unix and comparing system and blocked
time against CPU time to see how much time is consumed by non-user activities.


Being able to isolate the network isn't
necessary unless you are trying to measure the efficiency of the
distributed computing framework (which is a worthwhile goal, but didn't
seem to be what you were after).
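As a minimal sketch of the getrusage comparison described above (written
in modern Python purely for illustration; `work` is a hypothetical
stand-in for the real computation):

```python
import resource
import time

def work():
    # Hypothetical stand-in for the real computation being measured.
    return sum(i * i for i in range(200000))

wall_start = time.time()
ru_start = resource.getrusage(resource.RUSAGE_SELF)
work()
ru_end = resource.getrusage(resource.RUSAGE_SELF)
wall_end = time.time()

user_t = ru_end.ru_utime - ru_start.ru_utime   # time spent in user code
sys_t = ru_end.ru_stime - ru_start.ru_stime    # time spent in the kernel
wall_t = wall_end - wall_start
blocked_t = max(wall_t - user_t - sys_t, 0.0)  # waiting: I/O, scheduling, etc.

print("user %.3fs  system %.3fs  blocked %.3fs" % (user_t, sys_t, blocked_t))
```

A large blocked or system component relative to user time suggests the run
is dominated by non-user activity, which is the signal Fineman describes.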

==========


Date: Wed, 20 Jul 1994 10:45:41 +0600
From: glover@tngstar.cray.com (Roger Glover)


In article <Ct8JBz.KDJ@dcs.ed.ac.uk>, you write:
|> 
|> We cannot isolate the network, since it's not our own, and the wall-clock
|> time is not adequate metric, since he does research on efficient parallel
|> algorithms (till now on homogeneous, shared memory machines)
|> 

I am not sure I understand; are you saying that
wall-clock time is inadequate because of the possible
variation in network load?

If so, one solution would be to run the same job many
times and calculate a confidence interval around the
(harmonic?) mean.   If you have a means of measuring
system/network load you might use it as a correlating
factor.
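The repeated-run approach can be sketched as follows (Python, with a
made-up list of wall-clock timings; the normal approximation is used for
the interval, though a t-distribution would be better for small samples):

```python
import math

# Hypothetical wall-clock times (seconds) from repeated runs of the same job.
timings = [12.1, 11.8, 13.4, 12.6, 11.9, 14.2, 12.3, 12.0]

n = len(timings)
# The harmonic mean weights the faster (less network-loaded) runs more heavily.
harmonic_mean = n / sum(1.0 / t for t in timings)

# Approximate 95% confidence interval around the arithmetic mean.
mean = sum(timings) / n
var = sum((t - mean) ** 2 for t in timings) / (n - 1)
half_width = 1.96 * math.sqrt(var / n)

print("harmonic mean %.2fs, mean %.2fs +/- %.2fs"
      % (harmonic_mean, mean, half_width))
```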

There are other methods for performing and evaluating
experiments in uncontrolled, non-laboratory
conditions, but they all involve repetition and
statistical analysis.  A good text on probability and
statistics might be helpful here.  I recommend:

	Walpole and Myers, _Probability and Statistics
	for Scientists and Engineers_.  Macmillan
	Publishing Co., Inc., New York, New York, USA.
	ISBN:  0-02-424110-5

==========

From: <venkates@egr.msu.edu>
Date: Wed, 20 Jul 1994 11:42:56 +0500


Hi
I have been faced with an identical dilemma. I ended up using wall-clock
time as an indicator, since I believe any communication delays that occur
are due to the distribution. I could not come up with much else. I would
appreciate it if you could post/email me a summary of the responses you get.

Thanks
Venkatesh Gopinath
Electrical Engineering Dept
Michigan State U.

==========


Date: Wed, 20 Jul 94 14:10:04 BST
From: <gme@sys.uea.ac.uk>

Nick,

I agree with your observation that measuring performance of programs in a
heterogeneous system is difficult. It is a question to which I have given
consideration only recently.

I also agree with your comments concerning the use of CPU time. I am of the
opinion that CPU time is only meaningful in a dedicated (both machines and
network) homogeneous system. As I do not have this, I have never used CPU time.

From what I have discovered, there are basically two ways to go when
considering heterogeneous systems:

1. Modify or `clarify' existing speedup definitions. Interesting papers I've
read in this area include:

V. Donaldson, F. Berman, R. Paturi, Program Speedup in a Heterogeneous
Computing Network, Journal of Parallel and Distributed Computing, 21, 1994.

R. Barr, Reporting Computational Experiments with Parallel Algorithms: Issues,
Measures, and Experts' Opinions, ORSA Journal on Computing, 5:1, 1993.

A large number of speedup definitions I have seen involve the use of elapsed
time. Thus the usefulness of the calculated speedup will be reduced when
obtained in a shared (with other users) environment and/or a shared network.

2. Develop completely new approaches. An interesting paper I've read in this
area is:

M. Crovella, T. J. LeBlanc, The Search for Lost Cycles: A New Approach to
Parallel Program Performance Evaluation, 1993, University of Rochester
Computer Science Department technical report 479.

Hope this is of interest. Let me know if you have any comments or find
anything else interesting.

Gareth.

==========

Date: Wed, 20 Jul 1994 10:59:15 -0400
From: Doug Elias <elias@TC.Cornell.EDU>


Heya...

i'm including here the tex-src to the Methodology section of the Park
Bench Working Group report on benchmarking parallel computing
platforms and applications; you can get the whole thing from netlib...


However, i just tried to ftp to netlib.ornl.gov, and it refused to
accept an anonymous login, so your only other alternative is to use
"xnetlib", send email to netlib@cs.utk.edu for the source if you don't
already have it.  Anyway, what you're looking for (once you get to
netlib) is "pbwg"...

[ I haven't yet tried, but I'll try as soon as I can. Notice that the
  correct place is ftp.netlib.org, and the stuff has moved to pub/parkbench.
  It's oodles of stuff! -- nfotis ]

The section on Methodology follows my .signature, please let me know
if you have problems getting the full report and i'll see what i can
do to get it for you and send it to you somehow.

 [ Just time, give me more time! ;-) nfotis ]

doug

[ I couldn't format the TeX source, but I can read it - I suggest using
 something like LameTeX to strip the codes, and to make the text more
 readable - nfotis ]

------------------------------------------------------------------------------
%      PARKBENCH REPORT (second draft), File:        method4.tex
%------------------------------------------------------------------------
%file method4.tex
%compiled by David Bailey for methodology subcommittee
%text below submitted by Roger Hockney to methodology subcommittee

\chapter{Methodology}

\section{Introduction}
The conclusions drawn from a benchmark study of computer performance
depend not only on the basic timing results obtained, but also on
the way these are interpreted and converted into performance figures.
The choice of performance metric may itself influence the
conclusions. For example, do we want the computer that generates the
most megaflop per second (or has the highest Speedup), or the computer
that solves the problem in the least time? It is now well known
that high values of the first metrics do not necessarily imply the
second property. This confusion can be avoided by choosing a more
suitable metric that reflects solution time directly, for example
either the Temporal, Simulation or Benchmark performance, defined below.
This issue of the sensible choice of performance metric is becoming
increasingly important with the advent of massively parallel computers,
which have the potential for very high megaflop rates, but
much more limited potential for reducing solution time.


\section{Time Measurement}

In parallel computing we are concerned with the distribution of computational
work to multiple processors that execute simultaneously, that is to say in
parallel. The objective of the exercise is to reduce the elapsed wall-clock
time to solve or complete a specified task or benchmark. The elapsed 
wall-clock time means the time that would be measured on an external clock
that records the time-of-day or even Greenwich mean time (GMT), between the
start and finish of the benchmark. We are not concerned with the origin of 
the time measurement, since we are taking a difference, but it is important
that the time measured would be the same as that given by a difference between
two measurements of GMT, if it were possible to make them. It is important
to be clear about this, because many computer clocks (e.g. Sun Unix function
ETIME) measure elapsed CPU-time, which is the total time that the process
or job which calls it has been executing in the CPU. Such a clock does not
record time (i.e. it stops ticking) when the job is swapped out of the CPU. 
It does not record, therefore, any wait-time which must be included if we 
are to assess correctly the performance of a parallel program.
 
Two low-level benchmarks are provided in the PARKBENCH suite to test the
precision and accuracy of the clock that is to be used in the benchmarking.
These should be run first, before any benchmark measurements are made.
They are:
\begin{enumerate}
\item TICK1 - measures the precision of the clock by measuring the time 
              interval between ticks of the clock. A clock is said to
              tick when it changes its value.
\item TICK2 - measures the accuracy of the clock by comparing a given
              time interval measured by an external wall-clock (the
              benchmarker's wrist watch is adequate) with the same
              interval measured by the computer clock. This tests the
              scale factor used to convert computer clock ticks to seconds, 
              and immediately detects if a CPU-clock is incorrectly being 
              used.
\end{enumerate} 
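A TICK1-style precision measurement can be sketched in a few lines
(shown in Python purely as illustration; the actual PARKBENCH codes are
Fortran, and the function name here is my own):

```python
import time

def clock_precision(samples=1000):
    """Estimate the smallest observable change ('tick') of time.time()."""
    ticks = []
    for _ in range(samples):
        t0 = time.time()
        t1 = time.time()
        while t1 == t0:          # spin until the clock value changes
            t1 = time.time()
        ticks.append(t1 - t0)
    return min(ticks)

print("clock ticks about every %.3g seconds" % clock_precision())
```

A TICK2-style check would then time a known external interval (even a
wrist watch will do) against the same clock to verify its scale factor.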


The fundamental measurement made in any benchmark is the elapsed wall-clock
time to complete some specified task. All other performance figures are 
derived from this basic timing measurement. The benchmark time, $T(N;p)$, 
will be a function of the problem size, $N$, and the number of processors, 
$p$. Here, the problem size is represented by the vector variable, $N$, 
which stands for a set of parameters characterising the size of the 
problem: e.g. the number of mesh points in each dimension, and the
number of particles in a particle-mesh simulation. Benchmark problems of
different sizes can be created by multiplying all the size parameters by
suitable powers of a single scale factor, thereby increasing the spatial and
particle resolution in a sensible way, and reducing the size parameters to
a single size factor (here called $\alpha$). 

We believe that it is most important
to regard execution time and performance as a function of at least the two
variables $(N,p)$, which define a parameter plane. Much confusion has arisen
in the past by attempts to treat performance as a function of a single
variable, by taking a particular path through this plane, and not stating
what path is taken. Many different paths may be taken, and hence many different
conclusions can be drawn. It is important, therefore, always to define the 
path through the performance plane, or better as we do here, to study the 
shape of the two-dimensional performance hill. In some cases there may 
even be an optimum path up this hill.


\section{Units and Symbols}

A rational set of units and symbols is essential for any numerate
science including benchmarking. The following extension of the
internationally agreed SI system of physical units \cite{SI75} is 
made to accommodate the needs of computer benchmarking.

\medskip
New symbols and units: 
\begin{enumerate}
\item flop : number of floating-point operations
\item mref : number of memory references (reads or writes)
\item barr : number of barrier operations
\item b    : number of binary digits (bits)
\item B    : number of bytes (groups of 8 bits)
\item sol  : number of solutions or executions of benchmark
\item ${\rm w}_{32}$ : number of words (number of bits per word as 
             subscript, here 32). Symbol is lower case (W means watt)
\end{enumerate}
Note that flop and mref are both inseparable four-letter symbols.
The character case is significant in all unit symbols, so that e.g. Flop,
Mref, $W_{64}$ are incorrect. Unit symbols should always be printed in
roman type, to contrast with variable names, which are printed in italic.
Because 's' is the SI unit for seconds, unit symbols, like 'sheep', do not
take 's' in the plural.

\medskip
SI provides the standard prefixes:
\begin{enumerate}
\item k    : kilo meaning $10^3$
\item M    : mega meaning $10^6$
\item G    : giga meaning $10^9$
\item T    : tera meaning $10^{12}$
\end{enumerate}
This means that we cannot use M to mean $1024^2$ (the binary mega) as is 
often done in describing computer memory capacity, e.g. 256 MB. We can 
however introduce the new prefix:
\begin{enumerate}
\item K    : meaning 1024, then use a subscript 2 to indicate the binary
             versions
\item ${\rm M}_2$    : binary mega $1024^2$
\item ${\rm G}_2$    : binary giga $1024^3$
\item ${\rm T}_2$    : binary tera $1024^4$
\end{enumerate}
In most cases the difference between the mega and the binary mega (4\%) 
is probably unimportant, but it is important to be unambiguous. In this
way one can continue with existing practice if the difference doesn't 
matter, and have an agreed method of being more exact when necessary.
For example, the above memory capacity was probably intended to mean
$256 {\rm M_2 B}$.

As a consequence of the above, an amount of computational work involving
$4.5 \times 10^{12}$ floating-point operations is correctly written as 
4.5 Tflop. Note that the unit symbol Tflop is never pluralised with an
added 's', and it is therefore incorrect to write the above as 4.5 Tflops 
which could be confused with a rate per second. The most frequently used 
unit of performance, millions of floating-point operations per second 
is correctly written Mflop/s, in analogy to km/s. The slash is necessary 
and means 'per',  because the 'p' is an integral part of the unit symbol 
'flop' and cannot also be used to mean 'per'.  


\section{Floating-Point Operation Count}

Although we discourage the use of millions of floating-point
operations per second as a performance metric, it can be a useful 
measure if the number of floating-point operations, $F(N)$, 
needed to solve the benchmark problem is carefully defined.

For simple problems (e.g. matrix multiply) it is sufficient to use a
theoretical value for the floating-point operation count (in this case
$2n^3$ flop, for $n \times n$ matrices) obtained by inspection of the
code or consideration of the arithmetic in the algorithm. For more complex
problems containing data-dependent conditional statements, an empirical method
may have to be used.  The sequential version of the benchmark code defines
the problem and the algorithm to be used to solve it. Counters can be inserted 
into this code or a hardware monitor used to count the number of floating-point
operations. The latter is the procedure followed by the {\sc PERFECT} Club 
\cite{Berr89}. In either case a decision has to be made regarding the number
of flop that are to be credited for different types of floating-point 
operations, and we see no good reason to deviate from those chosen by 
McMahon \cite{Ma88} when the Mflop/s measure was originally defined. 
These are:

\begin{table}[h]
\centering
\begin{tabular}{ll}
add, subtract, multiply		& 1 flop \\
divide, square-root		& 4 flop \\
exponential, sine etc.		& 8 flop \\
{\sc IF(X .REL. Y)}		& 1 flop \\
\end{tabular}
\end{table}

Some members of the committee felt that these numbers, derived in the 1970s,
no longer correctly reflected the situation on current computers. However,
since these numbers are only used to calculate a nominal benchmark flop-count,
it is not so important that they be accurate. The important thing is that they 
do not change, otherwise all previous flop-counts would have to be 
renormalised. In any case, it is not possible for a single set of ratios to
be valid for all computers and library software. I (rwh) suggest the committee
stays with the above ratios until such time as they become wildly wrong
and extensive research provides us with a more realistic set. 
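As an illustration of applying these weights, a nominal flop count for a
code fragment might be accumulated as follows (a Python sketch with names
of my own choosing, not part of the PARKBENCH suite):

```python
# McMahon weights for converting raw operation counts to nominal flop.
FLOP_WEIGHTS = {
    "add": 1, "subtract": 1, "multiply": 1,
    "divide": 4, "sqrt": 4,
    "exp": 8, "sin": 8,
    "compare": 1,            # IF(X .REL. Y)
}

def nominal_flop(op_counts):
    """op_counts: mapping from operation name to number of executions."""
    return sum(FLOP_WEIGHTS[op] * n for op, n in op_counts.items())

# e.g. a loop performing 10^6 multiplies, 10^6 adds and 10^5 divides:
counts = {"multiply": 1000000, "add": 1000000, "divide": 100000}
print("nominal count: %.1f Mflop" % (nominal_flop(counts) / 1e6))
```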


We distinguish two types of operation count. The first is the nominal 
benchmark floating-point operation count, $F_B(N)$, which is found in the 
above way from the defining Fortran77 sequential code. The other is the
actual number of floating-point operations performed by the hardware
when executing the distributed multi-node version, $F_H(N;p)$, which may be 
greater than the nominal benchmark count, due to the distributed version 
performing redundant arithmetic operations. Because of this, the hardware 
flop-count may also depend on the number of processors on which the benchmark
is run, as shown in its argument list. 


\section{Performance Metrics}

Given the time of execution $T(N;p)$ and the flop-count $F(N)$ several different
performance measures can be defined. Each metric has its own uses, and gives 
different information about the computer and algorithm used in the benchmark.
It is important therefore to distinguish the metrics with different names,
symbols and units, and to understand clearly the difference between them.
Much confusion and wasted work can arise from optimising a benchmark with
respect to an inappropriate metric. The principal performance metrics are:

\subsection{Temporal Performance}

If we are interested in comparing the
performance of different algorithms for the solution of the same problem, then
the correct performance metric to use is the {\it Temporal Performance},
$R_T$, which is defined as the inverse of the execution time
\begin{equation}
                            R_T(N;p)=T^{-1}(N;p)              \label{Eqn(1)}
\end{equation}
The units of temporal performance are, in general, solutions per second
(sol/s), or some more appropriate absolute unit such as 
timesteps per second (tstep/s). With this metric we can be sure
that the algorithm with the highest performance executes in the least time,
and is therefore the best algorithm. We note that the number of flop does not
appear in this definition, because the objective of algorithm design is not
to perform the most arithmetic per second, but rather it is to solve a given
problem in the least time, regardless of the amount of arithmetic involved.
For this reason the temporal performance is also the metric that a 
computer user should employ to select the best algorithm to solve his problem, 
because his objective is also to solve the problem in the least time, and he 
does not care how much arithmetic is done to achieve this.
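For example, comparing two hypothetical algorithms for the same problem
(a Python sketch; the timings are invented):

```python
# Hypothetical execution times (seconds) of two algorithms on one problem.
times = {"algorithm A": 8.0, "algorithm B": 5.0}

# Temporal performance R_T = 1/T, in solutions per second (sol/s).
r_t = {name: 1.0 / t for name, t in times.items()}

# The algorithm with the highest R_T necessarily executes in the least time.
best = max(r_t, key=r_t.get)
print("%s is best: %.3f sol/s" % (best, r_t[best]))
```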

\subsection{Simulation Performance}

A special case of temporal performance occurs for simulation programs in which
the benchmark problem is defined as the simulation of a certain period of
physical time, rather than a certain number of timesteps. In this case we speak
of the {\em Simulation Performance} and use units such as {\em simulated days
per day} (written sim-d/d or 'd'/d) in weather forecasting, where the 
apostrophe is used to indicate 'simulated'; or {\em simulated
pico-seconds per second} (written sim-ps/s or 'ps'/s) in electronic device
simulation. It is important to use simulation performance rather than
timestep/s if one is comparing different simulation algorithms which may 
require different sizes of timestep for the same accuracy (for example an
implicit scheme that can use a large timestep, compared with an explicit
scheme that requires a much smaller step). In order to maintain numerical
stability, explicit schemes also require the use of a smaller timestep as 
the spatial grid is made finer. For such schemes the simulation performance 
falls off dramatically as the problem size is increased by introducing 
more mesh points in order to refine the spatial resolution: the doubling 
of the number of mesh-points in each of three dimensions can reduce the 
simulation performance by a factor near 16 because the timestep must also 
be approximately halved. Even though the larger problem will generate more 
Megaflop per second, in forecasting, it is the simulated days per day 
(i.e. the simulation performance) and not the Mflop/s, that matter to the user.

As we see below, benchmark performance is also measured in terms of the amount
of arithmetic performed per second or Mflop/s. However it is important to
realise that it is incorrect to compare the Mflop/s achieved by two algorithms 
and to conclude that the algorithm with the highest Mflop/s rating is the best
algorithm. This is because the two algorithms may be performing quite 
different amounts of arithmetic during the solution of the same problem.
The temporal performance metric, $R_T$, defined above, has been introduced
to overcome this problem, and provide a measure that can be used to compare
different algorithms for solving the same problem. However, it should be 
remembered that the temporal performance only has the same meaning within the 
confines of a fixed problem, and no meaning can be attached to a
comparison of the temporal performance on one problem with the temporal
performance on another.

\subsection{Benchmark Performance}

In order to compare the performance of a computer on one benchmark with 
its performance on another, account must be taken of the different amounts of 
work (measured in flop) that the different problems require for their solution.
Using the flop-count for the benchmark, $F_B(N)$, we can
define the {\em Benchmark Performance} as
\begin{equation}
                            R_B(N;p)=F_B(N)/{T(N;p)}           \label{Eqn(2)}
\end{equation}
The units of benchmark performance are Mflop/s (benchmark name), where we 
include the name of the benchmark in parentheses to emphasise that the 
performance may depend strongly on the problem being solved, and to emphasise 
that the values are based on the nominal benchmark flop-count. In other 
contexts such performance figures would probably be quoted as examples of the 
so-called {\em sustained} performance of a computer. We feel that the use of 
this term is meaningless unless the problem being solved and the degree of 
code optimisation is quoted, because the performance is so varied across 
different benchmarks and different levels of optimisation. Hence we favour 
the quotation of a selection of benchmark performance figures, rather than a 
single sustained performance, because the latter implies that the quoted 
performance is maintained over all problems.

Note also that the flop-count $F_B(N)$ is that for the defining sequential 
version of the benchmark, and that the same count is used to calculate $R_B$ 
for the distributed-memory (DM) version of the program, even though the DM 
version may actually perform
a different number of operations.  It is usual for DM programs to perform more
arithmetic than the defining sequential version, because often numbers are
recomputed on the nodes in order to save communicating their values from a
master processor. However such calculations are redundant (they have already
been performed on the master) and it would be incorrect to credit them to the
flop-count of the distributed program. 

Using the sequential flop-count in the
calculation of the DM programs benchmark performance has the additional 
advantage that it is possible to conclude that, for a given benchmark,
the implementation that has the highest benchmark performance is the best 
because it executes in the least time.  This would not necessarily be the 
case if a different $F_B(N)$ were used for different implementations of the 
benchmark. For example, the use of a better algorithm which obtains the
solution with less than $F_B(N)$ operations will show up as higher benchmark
performance. For this reason it should cause no surprise if the benchmark 
performance occasionally exceeds the maximum possible hardware performance. 
To this extent benchmark performance Mflop/s must be understood
to be nominal values, and not necessarily exactly the number of operations
executed per second by the hardware, which is the subject of the next
metric. The purpose of benchmark performance is to compare different
implementations and algorithms on different computers for the solution of
the same problem, on the basis that the best performance means the least
execution time. For this to be true $F_B(N)$ must be kept the same for
all implementations and algorithms.  


\subsection{Hardware Performance}

If we wish to compare the observed performance with the theoretical 
capabilities of the computer hardware, we must compute the actual number of
floating-point operations performed, $F_H(N;p)$, and from it the actual
{\em Hardware Performance} 
\begin{equation}
                            R_H(N;p)=F_H(N;p)/{T(N;p)}             \label{Eqn(3)}
\end{equation}
The hardware performance also has the units Mflop/s, and will have the same 
value as the benchmark performance for the sequential version of the benchmark. 
However, the hardware performance may be higher than the benchmark performance 
for the distributed version, because the hardware performance gives credit for 
redundant arithmetic operations, whereas the benchmark performance does not.
Because the hardware performance measures the actual floating-point operations
performed per second, unlike the benchmark performance, it can never exceed
the theoretical peak performance of the computer.

Assuming a computer with multiple-CPUs each with multiple arithmetic pipelines,
delivering a maximum of one flop per clock period, the theoretical peak value
of hardware performance is
\begin{equation}
   r^*= \frac{fl.pt.pipes/CPU}{clock.period}\times number.CPUs   \label{Eqn(4)}
\end{equation}
with units of Mflop/s if the clock period is expressed in microseconds. By 
comparing the measured hardware performance, $R_H(N;p)$, with the theoretical 
peak performance, we can assess the fraction of the available performance that 
is being realised by a particular implementation of the benchmark.
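For instance, evaluating equation (4) for a hypothetical machine with 2
floating-point pipes per CPU, a 6 ns clock and 64 CPUs, against an equally
hypothetical measured $R_H$ (all figures invented for illustration):

```python
# Theoretical peak per equation (4): pipes per CPU / clock period * CPUs.
pipes_per_cpu = 2
clock_period_us = 0.006        # 6 ns, expressed in microseconds
n_cpus = 64

r_star = pipes_per_cpu / clock_period_us * n_cpus   # Mflop/s

r_h_measured = 3200.0          # hypothetical measured hardware performance
fraction = r_h_measured / r_star
print("peak %.0f Mflop/s, realised %.1f%%" % (r_star, 100 * fraction))
```

Since $R_H$ counts operations actually executed, the fraction can never
exceed 1, unlike the nominal benchmark performance.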


\subsection{Speedup, Efficiency and Performance per Node}

\begin{verbatim}
It was agreed that this subsection be redrafted by David Bailey.
The first draft text is retained until the substitute is ready

--------------------------  START OLD TEXT  ---------------------------------
\end{verbatim}

We do not favour the use of any of the popular performance metrics: 
speedup, efficiency or performance per node; because all these are 
either easily misinterpreted or obscure important effects. 
The speedup of a benchmark code 
is defined as the ratio of the $p$-processor temporal performance to the 
single-processor temporal performance. It is a very useful and convenient 
measure if we are concerned with the optimisation of a particular code in 
isolation, because its value can easily be compared with the maximum possible 
speedup, namely the number of processors being used. We can thereby assess
how much of the potential hardware performance is being utilised. However
benchmarking is to do with comparing the performance of different computers,  
and all the above three metrics are unsuitable for this purpose.

Speedup compares the performance of a code with itself, and might therefore 
be called an introspective, or even incestuous measure. Problems can therefore 
arise (see below), and incorrect conclusions can be drawn, if we try to use 
speedup to compare different algorithms on the same computer, or the same 
algorithm on different computers. This is because speedup is a relative 
measure (it is defined as the ratio of two performances), and therefore all 
knowledge of the absolute performance has been lost. 
Benchmarking, however, is concerned with the comparison of the absolute 
performance of computers, and therefore the use of a relative measure like
speedup is not very useful, and can be positively misleading.  
For example, we do not wish to conclude that a computer with a large number of
slow processors and therefore high value of speedup, is faster than another 
with fewer processors and therefore with a lower speedup, if in fact the 
reverse is the case, because the processors on the second computer are so much 
faster. Only by adopting absolute measures of performance with physical units 
involving inverse time, can one avoid this type of false conclusion.
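The pitfall can be made concrete with invented numbers (a Python sketch;
both machines and all timings are hypothetical):

```python
# Machine X: 100 slow processors; machine Y: 10 fast processors.
# t1 = single-processor time, tp = time on all processors (seconds).
machines = {
    "X": {"t1": 1000.0, "tp": 20.0},
    "Y": {"t1": 100.0,  "tp": 12.5},
}

for name, m in machines.items():
    speedup = m["t1"] / m["tp"]
    r_t = 1.0 / m["tp"]          # absolute temporal performance, sol/s
    print("%s: speedup %.0f, R_T %.3f sol/s" % (name, speedup, r_t))

# X shows the higher speedup (50 vs 8), yet Y solves the problem sooner;
# the absolute metric R_T ranks the machines correctly.
```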

Speedup is not even useful for comparing the performance of one algorithm with 
another on the same computer, because it is not necessarily true that the 
algorithm with the highest speedup executes in the least time (see, e.g. 
~\cite{Cvet90}). One can only be sure that this is the case if the 
single-processor temporal performance of both algorithms is the same, which 
is most unlikely. If the single-processor performances of the two algorithms 
are different and we compare the speedups of the two algorithms, then we are 
comparing the performance of the two algorithms measured in different units.
This is like comparing the speeds of two cars, one measured in m.p.h. and the
other in cm/s. Such a comparison has no validity either for cars or 
algorithms. Computers and algorithms can only safely be compared in terms
of their absolute performance in solving a problem. The most unambiguous
measure is the temporal performance, which is the inverse of 
the time of execution, or the related simulation performance. 

The benchmark performance per node might seem to be an attractive 
metric because it is an absolute measure which
can be related directly to the hardware performance of a node. However,
it has the major defect that it hides the point at which the absolute 
performance begins to decrease as the number of processors increases. If we
plot benchmark performance against number of processors, this point is
clearly visible as a maximum, however if the same data is plotted as
performance per node, all we see is a very uninteresting monotonically 
falling line, and the important maximum has disappeared. The efficiency, 
which is defined as the speedup divided by the number of processors, is 
doubly condemned because it is a relative measure and hides the maximum.

\begin{verbatim}
---------------------------  END OLD TEXT  ---------------------------------
\end{verbatim}

\section{Performance Database}

\begin{verbatim}
It was agreed that this subsection be redrafted by Jack Dongarra.

--------------------------  START OLD TEXT  ---------------------------------
\end{verbatim}

The database of benchmark performance results should be based on an
extension of the excellent X-window display demonstrated by Jack Dongarra
at the March 1993 PBWG meeting. 

\begin{verbatim}
---------------------------  END OLD TEXT  ---------------------------------

----------------------  SOME PROPOSED NEW TEXT  ----------------------------
\end{verbatim}

At present each benchmark measurement for a particular problem size $N$ and
processor number $p$ is represented by one line in the database, with
variable-length fields chosen by the benchmark writer as suitable and 
comprehensive enough to describe the conditions of the benchmark run. The 
fields, separated by a marker (|), include the benchmarker's name and e-mail, 
the computer location and date, the hardware specification, the compiler date 
and optimisation level, $N$, $p$, $T(N,p)$, $R_B(N,p)$, and other metrics as 
deemed appropriate by the benchmark writer. Ideally, the line for the database 
would be produced automatically as output by the benchmark program itself.
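As a sketch, a benchmark could emit such a line itself; all field values below are invented placeholders, and the exact field set and order are whatever the benchmark writer chooses:

```python
# Sketch of a benchmark emitting its own '|'-separated database line.
# Every field value here is a hypothetical placeholder.
fields = [
    "J. Smith <jsmith@example.edu>",  # benchmarker name and e-mail
    "Site X, 1994-03-15",             # computer location and date
    "32-node mesh, 50 MHz",           # hardware specification
    "f77 1993-11, -O3",               # compiler date and optimisation level
    "1024",                           # problem size N
    "16",                             # processor count p
    "12.7",                           # T(N,p) in seconds
    "83.2",                           # R_B(N,p) in Mflop/s
]
line = "|".join(fields)
print(line)
```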

\begin{verbatim}
----------------------  END PROPOSED NEW TEXT  ----------------------------
\end{verbatim}


\section{Interactive Graphical Interface}

The Southampton Group has agreed to provide an interactive graphical front 
end to the PARKBENCH database of performance results. To achieve this,
the basic data held in the Performance Data Base should be values of
$T(N,p)$ for at least 4 values of problem size $N$, each for sufficient
$p$-values (say 5 to 10) to determine the trend of variation of performance
with number of processors for constant problem size. It is important that
there be enough $p$-values to see Amdahl saturation, if present, or any 
peak in performance followed by degradation. A graphical interface is
really essential to allow this multidimensional data to be viewed in any
of the metrics defined above, as chosen interactively by the user.
The user could also be offered (by suitable interpolation) a display of 
the results in various scaled metrics, in which the problem size is 
expanded with the number of processors.

In order to encompass as wide a range of performance and number of 
processors as possible, a log-scale on both axes is unavoidable, and
the format and scale range should be kept fixed as long as possible
to enable easy comparison between graphs. A three-cycle by three-cycle
log-log graph with range 1 to 1000 in both $p$ and Mflop/s would cover
most needs in the immediate future. Examples of such graphs are to be
found in \cite{Hoc92,Add93}. 

A log/log graph is also desirable because the size and shape of the Amdahl 
saturation curve is the same wherever it is plotted on such a graph. 
That is to say there is a universal Amdahl curve that is invariant to 
its position on any log/log graph. Amdahl saturation is a two-parameter 
description of any of the performance metrics, $R$, as a function of $p$ 
for fixed $N$, which can be expressed by
\begin{equation}
                   R = \frac{R_\infty}{(1 + \phalf/p)}
\end{equation}
where $R_\infty$ is the saturation performance approached as $p \rightarrow 
\infty$ and \phalf is the number of processors required to reach half
the saturation performance. The graphical interface should allow this
universal Amdahl curve to be moved around the graphical display, and
be matched against the performance curves. The changing values of the two 
parameters \Rphalf should be displayed as the Amdahl curve is moved.
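As a numerical sketch of this two-parameter description (the values chosen for $R_\infty$ and \phalf below are invented):

```python
import math

# Two-parameter Amdahl description: R(p) = R_inf / (1 + p_half / p).
def amdahl(p, r_inf, p_half):
    return r_inf / (1.0 + p_half / p)

# At p = p_half the performance is exactly half the saturation value:
assert abs(amdahl(8.0, 100.0, 8.0) - 50.0) < 1e-12

# Shape invariance on log/log axes: scaling R_inf and p_half only
# translates the curve, so log-slopes between corresponding points match.
a = math.log(amdahl(2.0, 100.0, 8.0)) - math.log(amdahl(1.0, 100.0, 8.0))
b = math.log(amdahl(20.0, 1000.0, 80.0)) - math.log(amdahl(10.0, 1000.0, 80.0))
assert abs(a - b) < 1e-12
```

This is why a single template curve can be slid around a log/log display and matched against any measured performance curve.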

As more experience is gained with performance analysis, that is to say
the fitting of performance data to parametrised formulae, it is to be
expected that the graphical interface will allow more complicated formulae
to be compared with the experimental data, perhaps allowing 3 to 5
parameters in the theoretical formula. But, as yet, we do not know what
these parametrised formulae should be.


\section{Benchmarking Procedure and Code Optimisation}

Manufacturers will always feel that any benchmark not tuned specifically
by themselves is an unfair test of their hardware and software. This is
inevitable and from their viewpoint it is true. NASA have overcome this 
problem by only specifying the problems (the NAS paper-and-pencil 
benchmarks \cite{naspar2}) and leaving the manufacturers to write the 
code, but in many circumstances this would require unjustifiable effort
and take too long. It is also a perfectly valid question to ask how a
particular parallel computer will perform on existing parallel code, and
that is the viewpoint of PARKBENCH. 

The benchmarking procedure is to run the distributed PARKBENCH suite on
an 'as-is' basis, making only such non-substantive changes as are required 
to make the code run (e.g. changing the names of header files to a local
variant). The as-is run may use the highest level of automatic compiler
optimisation that works, but the level used and compiler date should be
noted in the appropriate section of the performance database entry.      

After completing the as-is run, which gives a base-line result, any form of 
optimisation may be applied to show the particular computer to its best 
advantage, up to completely rethinking the algorithm, and rewriting
the code. The only requirement on the benchmarker is to state what has been
done. However, remember that, even if the algorithm is changed, the official
flop-count $F_B(N)$ that is used in the calculation of the nominal benchmark
Mflop/s, $R_B(N,p)$, does not change. In this way a better algorithm will show up
with a higher $R_B$, as we would want it to, even though the hardware 
Mflop/s is likely to be little changed.
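A tiny numerical sketch (invented flop-count and timings) of how the fixed official flop-count rewards a faster algorithm:

```python
# Nominal benchmark Mflop/s uses the fixed official flop-count F_B(N),
# so a faster algorithm earns a higher R_B even if it performs fewer
# actual floating-point operations.  All numbers are invented.
F_B = 2000.0                       # official flop-count for size N, in Mflop
t_original, t_better = 40.0, 25.0  # execution times in seconds

R_B_original = F_B / t_original    # 50 Mflop/s nominal
R_B_better   = F_B / t_better     # 80 Mflop/s nominal
assert R_B_better > R_B_original   # better algorithm -> higher nominal R_B
```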

Typical steps in optimisation might be:
\begin{enumerate} 
\item explore the effect of different compiler optimisations on a single 
      processor, and choose the best for the as-is run.
\item perform the as-is run on multiple processors, using enough values 
      of $p$ to determine any peak in performance or saturation.
\item return to single processor and optimise code for vectorisation,
      if a vector processor is being used. This means restructuring loops 
      to permit vectorisation.
\item continue by replacing selected loops with optimal assembly-coded 
      library routines (e.g. BLAS where appropriate).
\item replace the whole benchmark by a tuned library routine with the
      same functionality.
\item replace the whole benchmark with a locally written version with the same 
      functionality, possibly using an entirely different algorithm 
      that is better suited to the architecture.
\end{enumerate} 
% ----------------------------------------------------------------------------


==========

Date: Wed, 20 Jul 1994 10:31:45 -0500
From: Amitabh B Sinha <sinha@sal.cs.uiuc.edu>


I read your posting in comp.parallel about measuring speedups on
networks of workstations. I think the problem is very difficult to
solve because one cannot have a good measure of global time: I have
some post-execution methods to do an approximate measure, but they
are not very accurate. I would be very interested in any responses
you get to your posting.

Thanks,
Amitabh Sinha


==========


Date: Thu, 21 Jul 1994 11:01:07 -0400
From: chrisc@thumper.bellcore.com (Christopher D Carothers)


	Nick, most workstation compilers should allow you
to build your program with profiling turned on. Then use
"prof" or "gprof" to get an execution profile for each machine.
You may have to perform some normalizations on the data to
account for the different clock cycle times of the various machines.
You should be able to determine where bottlenecks exist in your
program. This should be a good starting point. A profiler will
not give you network transmission delays, but you should be able
to perform a specific test, such as a message ping-pong test between
each pair of machines, to find this information out and see
if it correlates with the profiler information.

[ Very unlikely to give meaningful results, IMHO -- we'll think about it 
	- nfotis ]
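The ping-pong test suggested above can be sketched as follows. This demo uses a local socket pair with a thread standing in for the remote machine; between real hosts, each end would run on its own machine. Message size and round count are arbitrary choices.

```python
import socket, threading, time

def echo(conn, rounds, size):
    # Echo each fully received message straight back to the sender.
    for _ in range(rounds):
        buf = b""
        while len(buf) < size:
            buf += conn.recv(size - len(buf))
        conn.sendall(buf)

a, b = socket.socketpair()
rounds, size = 100, 1024          # arbitrary test parameters
t = threading.Thread(target=echo, args=(b, rounds, size))
t.start()

msg = b"x" * size
start = time.perf_counter()
for _ in range(rounds):
    a.sendall(msg)                # ping
    buf = b""
    while len(buf) < size:        # wait for the full pong
        buf += a.recv(size - len(buf))
elapsed = time.perf_counter() - start
t.join()
print("mean round-trip time: %.6f s" % (elapsed / rounds))
```

Repeating this between every pair of machines, for a few message sizes, gives the latency/bandwidth picture that the profiler cannot.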

==========


Date: Thu, 21 Jul 94 10:48:51 PDT
From: Bill Baker <bbaker@apache.tricity.wsu.edu>


We are working with implementations of parallel algorithms here in a
heterogeneous pvm environment.  To measure speedup we decided, like you,
that this environment would not give us meaningful results.  For
the speedup measurements only we used a homogeneous environment of
all the same machines, in our case HP 710s.  We had access to about
10 of this type of machine.  I don't have any suggestions for you
in the case of only having 1 or 2 machines of any given type.  

I would be interested in any other suggestions that people might have
and would appreciate knowing what you eventually decide to do.

[ we're still thinking about it - nfotis ]
Bill.

==========


Date: 	Mon, 25 Jul 1994 12:37:50 +0200
From: Thomas Schnekenburger <schneken@informatik.tu-muenchen.de>

I think the following paper gives an answer to your question.

@conference{Schnekenburger93c,
	author =    {Schnekenburger, Thomas},
	title =     {Efficiency of Parallel Programs in Multi-Tasking Environments},
	booktitle = {Performance Evaluation of Parallel Systems PEPS'93},
	year =      {1993},
	pages =     {75--82},
}

A similar technical report

@TechReport{Schnekenburger93tr,
	author =      {Schnekenburger, Thomas},
	title =       {A Definition of Efficiency of Parallel Programs
in Multi-Tasking Environments},
	institution = {Technical University of Munich},
	year =        {1993},
	number =      {SFB 342/3/93 A},
}

is available by anonymous ftp in   

ftp.informatik.tu-muenchen.de:local/lehrstuhl/paul/parmod/doc/PEPS93-Schnekenburger.ps.Z

[ I haven't tried to get it yet - Germany doesn't have very fast lines, just like Greece :-( ]

======== END TEXT =====
-- 
Nick (Nikolaos) Fotis         National Technical Univ. of Athens, Greece
HOME: 16 Esperidon St.,       InterNet : nfotis@theseas.ntua.gr
      Halandri, GR - 152 32   UUCP:    pythia!theseas!nfotis
      Athens, GREECE          FAX: (+30 1) 77 84 578



