Newsgroups: comp.arch,comp.parallel,comp.sys.super
From: Mark Crovella <crovella@csb.bu.edu>
Subject: Summary of Performance Prediction BOF at Supercomputing 94
Organization: Computer Science Department, Boston University, Boston, MA, USA
Date: Mon, 9 Jan 1995 13:36:51 GMT
Message-ID: <3ek5nd$9v5@news.bu.edu>

At Supercomputing '94, over 35 researchers met in a Birds of a Feather
session on performance prediction.  The session was well received and
generated considerable discussion during and after.  This short report
summarizes the session and the discussion it generated.

This session was very stimulating to me and I want to thank all the
attendees.  A special note of thanks goes to the speakers, each of whom
provided excellent presentations despite the time constraints of the
abbreviated format.

Mark Crovella            
Asst Prof, Dept of Computer Science, Boston University
crovella@bu.edu                 http://cs-www.bu.edu/faculty/crovella/Home.html
-------------------------------------------------------------------------


			     Summary Report
	  Birds of a Feather Session on Performance Prediction
			   Supercomputing '94

Performance prediction --- the estimation of running time or other
application metrics in advance of execution --- appears to be a
promising tool for developers of supercomputing applications.  It can be
used to accelerate program design, evaluate the benefits of porting to a
new machine, and explore application scalability.  This report
summarizes the Birds of a Feather session that brought together over 35
researchers to discuss current directions and issues in performance
prediction at Supercomputing '94.

Speakers were:
	- Ulrich Kremer, Department of Computer Science, Rice University
	- Gordon Lyon, Parallel Processing Group, NIST
	- Ko-Yang Wang, Distributed Compilation Group, IBM T. J. Watson
		Research Center
	- Thomas Fahringer, Institute for Software Technology and
		Parallel Systems, University of Vienna
	- Jaspal Subhlok, Department of Computer Science, CMU
	- Alistair Dunlop, Department of Computer Science, U. Southampton
	- Tao Yang, Department of Computer Science, University of
		California at Santa Barbara
	- Brian Van Voorst, Scientific Computing Branch, NASA Ames
		Research Center
	- Roger Hockney, University of Southampton 

Each speaker gave a short (10 minute) summary of current work and
interests; abstracts of the talks appear at the end of this document.
Following these short presentations, discussion was opened
up and touched on the following topics:

	- "Users": compilers, programmers, both
	- Accuracy: relative performance vs. absolute predictions
	- Parallel vs. Serial performance prediction
	- Static analysis vs. Dynamic measurements vs. Simulation
	- Levels of abstraction: instruction level vs. block level

Details of the discussions follow.

Users of Performance Prediction.   
--------------------------------
Performance prediction techniques were discussed for two "users":
compilers and programmers.  Compilers need performance prediction to
select effective optimization techniques from the many available,
including program transformations.  Programmers need performance
prediction to select efficient program designs, make cost-benefit
tradeoffs, and study scalability.

However, these two uses are not sharply distinguishable.  Sophisticated
compilers perform what amounts to program redesign, for example, when
data layouts are automatically determined.  In addition, users may
provide hints or assistance to the compiler in its optimization process.
As a result, the line between performance prediction for compilers and
performance prediction for programmers is blurry.

Accuracy of Performance Prediction.  
-----------------------------------
Performance prediction accuracy ranges from a simple relative ordering
of alternative designs to precise prediction of running times.
Relative ordering seems to be used mainly in the compiler world, where
the choice of whether to apply an optimization is a binary decision.
Highly accurate prediction is especially useful for programmers, for
example when studying the cost/benefit of a port to a new machine.

Unfortunately, highly accurate parallel performance prediction usually
requires accurate serial performance prediction.  This is becoming an
increasingly difficult problem as processors become more complex.   In
particular, the use of superscalar designs makes the execution time of
serial code dependent on instruction scheduling.   The importance of
accurate serial code prediction is increasing as performance grows more
"brittle", that is, as the difference between the performance of
efficiently schedulable serial code and inefficiently schedulable code
increases. 

Techniques in Performance Prediction.
-------------------------------------
A wide range of techniques was discussed, ranging from analysis based
on a small set of measurements, to simulation (both at the machine level
and at the processor level), to extensive static analysis of source
code.   Dynamic, machine-based analysis seems more appropriate for the
programmer; simulation is used by both programmers and compilers; and
static analysis is a natural tool for use by compilers.  

Another distinction was noted between techniques that operate at the
instruction level, versus those that operate on blocks of code as a
group.  Instruction level techniques seemed to be used mainly in the
compiler world;  analysis at such a fine level of detail seems to be too
much for the programmer.

Current Issues in Performance Prediction.
-----------------------------------------
Quality of cross-machine prediction was cited as a strong concern.
Until accurate cross-machine methods are developed, performance
prediction will have only narrow application.   The need for appropriate
machine benchmarks, capturing the right machine characteristics, was
considered part of this problem.

The difficulty of developing accurate parallel performance prediction
while serial performance prediction itself becomes increasingly
difficult was also discussed.  In addition, the increasing complexity
of compilers is making performance prediction much harder to perform
at the source level.

Prospects for Performance Prediction.
-------------------------------------
It would seem that performance prediction could become increasingly
important in assisting machine acquisition decisions.  The difficulty
of choosing a parallel machine is partially due to the uncertainty
associated with the expected performance of important applications.

Another prospect is monitoring for performance bugs.  When an
application's predicted performance is based on accurate machine
models, deviation from that prediction can indicate the presence of a
performance bug.  Cross-checking application performance against
prediction could thus be used to identify such bugs.
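A minimal sketch of this cross-checking idea (the function name and the
20% tolerance are illustrative choices, not part of any tool discussed
at the session):

```python
# Flag a possible performance bug when measured time deviates from the
# model's prediction by more than a tolerance (an assumed 20% here).
def check_performance(predicted_secs, measured_secs, tolerance=0.20):
    """Return True if the run deviates enough to suggest a performance bug."""
    deviation = abs(measured_secs - predicted_secs) / predicted_secs
    return deviation > tolerance

# Example: model predicts 10 s; the run took 14 s -> worth investigating.
print(check_performance(10.0, 14.0))  # True: 40% off the prediction
print(check_performance(10.0, 10.5))  # False: within tolerance
```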

Abstracts of Presentations.
---------------------------

			     Ulrich Kremer
	    Department of Computer Science, Rice University
	    Performance Prediction for Automatic Data Layout

High Performance Fortran (HPF) is rapidly gaining acceptance as a
language for parallel programming. Since data layout is the key
intellectual step in writing an efficient HPF program, tools for
automatic data layout and performance estimation will be crucial if the
language is to find general acceptance in the scientific community.  An
automatic data layout assistant tool for HPF-like languages is currently
under development as part of the D System at Rice University.  The data
layout assistant uses performance estimates at different stages in the
data layout selection process. I will give a short overview of the
current framework for automatic data layout and discuss the different
requirements for performance prediction.


			      Gordon Lyon
		    Parallel Processing Group, NIST
	       Performance Sensitivities in Parallel Code

Over the last three years NIST researchers have explored a novel
approach to analyzing performance sensitivities of parts of parallel
(MIMD) programs.  Based upon statistically designed experiments, the
technique has a special setup method that greatly simplifies its
application.  Size of the host system or the overall program is
immaterial.  The approach scales very well and comparisons can be made
of code adapted to very different architectures, e.g. shared-memory
versus distributed-memory.  S-Check is a new tool that embodies the NIST
technique.  A programmer first selects pieces of code that might be
bottlenecks, say X and Y.  S-Check treats the program as a transfer
function R with multiple parameters, R(X,Y).  The tool automatically
makes and times (offline) trials dictated by the demands of the
underlying statistical analysis.  From these response measurements,
S-Check develops an approximate Taylor expansion of the transfer
function R in terms of X, Y and an interaction term "XY".  Term
coefficients reveal execution sensitivities of the specimen to code
changes.  The evaluations are quantitative.
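The statistical machinery behind this can be sketched with a 2^2 full
factorial design: the transfer function R(X,Y) is approximated as
c0 + cX*x + cY*y + cXY*x*y, and the coefficients fall out of four timed
trials.  The timings below are made-up numbers for illustration, not
NIST data, and this is only the simplest two-factor case of the
technique:

```python
# Fit R(X, Y) ~ c0 + cX*x + cY*y + cXY*x*y from four timed trials,
# where x, y = -1/+1 select the original vs. modified version of code
# pieces X and Y (a 2^2 full factorial design).
def factorial_effects(r_mm, r_pm, r_mp, r_pp):
    """Coefficients from responses at (x,y) = (-,-), (+,-), (-,+), (+,+)."""
    c0  = (r_mm + r_pm + r_mp + r_pp) / 4.0   # mean response
    cX  = (-r_mm + r_pm - r_mp + r_pp) / 4.0  # main effect of X
    cY  = (-r_mm - r_pm + r_mp + r_pp) / 4.0  # main effect of Y
    cXY = ( r_mm - r_pm - r_mp + r_pp) / 4.0  # X-Y interaction
    return c0, cX, cY, cXY

# Example: runtimes (seconds) for the four variant combinations.
c0, cX, cY, cXY = factorial_effects(12.0, 8.0, 10.0, 7.0)
print(cX, cY, cXY)  # a large |cX| means code piece X dominates sensitivity
```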


			      Ko-Yang Wang
    Distributed Compilation Group, IBM T. J. Watson Research Center
       Performance Prediction and Automatic Compiler Optimization

I will briefly discuss my work on performance prediction for
superscalar-based sequential and parallel architectures.  I will discuss
the uses of performance estimation with symbolic manipulation and their
applications in automatic program optimization.


			    Thomas Fahringer
Institute for Software Technology and Parallel Systems, University of Vienna
  The P3T, a Performance Estimator for Vienna Fortran and HPF Programs

The P3T is an interactive performance estimator that assists users in
performance tuning of scientific Fortran programs. It successfully
guides both programmer and compiler in the search for efficient data
distribution strategies and profitable program transformations.  Four of
the most critical performance aspects of parallel programs are
estimated: load balance, cache locality, communication and computation
overhead. The P3T is an integrated tool of the Vienna Fortran
Compilation System, which enables the estimator to aggressively exploit
considerable knowledge about the compiler's analysis information and
code restructuring strategies.  The user can interactively specify
new architectures on which the performance estimates produced by the
P3T are based.  An advanced graphical user interface allows the user
to filter and visualize performance data at various levels of detail.


			     Jaspal Subhlok
		  Department of Computer Science, CMU
  Performance prediction and debugging for compiler generated programs

In parallel languages like HPF, the programmer provides a high level
description of the parallelism while the compiler generates the actual
parallel code.  This presents a problem for performance tools since the
executing program is different from the source program.  One situation
where the compiler needs performance prediction occurs when task (or
functional) and data parallelism are mixed in the same program, and the
compiler has to predict the mapping for the best performance. We show
that accurate modeling and performance prediction can be achieved by
using the compiler to guide the collection and interpretation of profile
information. For example, the compiler is able to distinguish between
replicated and parallel computations and recognize implicit
communication in the program, which cannot be done only with runtime
profiling. The result is that we are able to build an accurate
performance model at a low cost.  We demonstrate a tool that we use for
compiler driven performance analysis and discuss how it is used to
predict good mappings of a parallel program onto the nodes of a parallel
machine.


			    Alistair Dunlop
       Department of Computer Science, Southampton University, UK

Within the ESPRIT project P6643 (PPPE) we have developed a method for
estimating the execution time of message passing Fortran programs on
distributed memory machines. Our approach is based on static program
analysis and limited program simulation. The static analysis phase is
used to determine the program execution path, array sizes and array
reference information. Following the static analysis phase, the
performance estimator generates a simulation driver program.  This
driver program replicates the control flow of the original Fortran
program but simulates the memory references only. The simulation driver
program is subsequently linked with a machine-specific simulation
library, and the code is executed. The expected minimum and maximum
execution times are estimated together with a detailed breakdown of
functional unit use within the processor. Initial predictions on a CM5
have a high correlation with actual observed execution times.


				Tao Yang
Department of Computer Science, University of California at Santa Barbara
	       Task Scheduling and Performance Prediction

Many scientific applications can be modeled as acyclic or cyclic task
graph computations. The performance predicted by task scheduling can
identify the impact of communication overhead and guide program
partitioning.  In this talk, we will discuss automatic task scheduling
techniques and a system for scheduling and code generation on
message-passing machines, and our experiences in applying these
techniques in scientific computing.
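The flavor of this kind of prediction can be illustrated with a toy
list-scheduling sketch (not the system described in the talk): map a
task DAG onto processors, charging a communication delay whenever a
task reads data from a predecessor placed on a different processor.
The task costs and the diamond DAG below are invented for illustration.

```python
# Greedy earliest-finish-time list scheduling over a task DAG; the
# returned makespan is the predicted running time, including any
# inter-processor communication delays the placement incurs.
def list_schedule(tasks, deps, cost, comm, nprocs):
    """Place tasks (given in topological order); return (placement, makespan)."""
    finish, place = {}, {}
    proc_free = [0.0] * nprocs
    for t in tasks:
        best = None
        for p in range(nprocs):
            # Data from predecessors on other processors arrives late.
            ready = max([finish[d] + (comm if place[d] != p else 0.0)
                         for d in deps.get(t, [])] + [0.0])
            start = max(ready, proc_free[p])
            if best is None or start + cost[t] < best[0]:
                best = (start + cost[t], p)
        finish[t], place[t] = best[0], best[1]
        proc_free[best[1]] = best[0]
    return place, max(finish.values())

# Diamond DAG: a -> b, a -> c, then b, c -> d; unit communication cost.
cost = {"a": 2, "b": 3, "c": 3, "d": 2}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
place, makespan = list_schedule(["a", "b", "c", "d"], deps, cost, 1.0, 2)
print(makespan)  # 8.0 on this DAG with two processors
```

Varying the communication cost in such a model exposes exactly the kind
of overhead impact that scheduling-based prediction is meant to reveal.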


	     Sekhar R. Sarukkai, Pankaj Mehra and Jerry Yan
		     presented by Brian Van Voorst
	 Scientific Computing Branch, NASA Ames Research Center
Tools for Automated Performance Prediction of Message-passing Parallel Programs

Performance prediction and scalability analysis tools are essential for
developing scalable parallel programs.  Unfortunately, most tools rely
on hand-built models supplied by the programmer to predict performance,
thus restricting the use of such tools. In this presentation, we
highlight our experience in attempting to build automated models of
deterministic parallel programs. We use static compiler tools and
dynamic run-time information to extract computation and communication
models of programs
that are then analysed.  Model analysis can be performed using symbolic
analysis or simulations.  With the help of a few examples we show that
predicted execution-time using symbolic analysis is useful (and
reasonably accurate) for determining regions of performance saturation,
while simulation is invaluable for extracting more detailed performance
information and in explaining the reason for saturation.




