             The STARFISH Parallel file-system simulator
       (Simulation Tool for Advanced Research in File Systems)

                              David Kotz
                          Dartmouth College
                             Version 3.0
                             October 1996

A very rough sketch of the structure of the simulator.

(If you can, I suggest using the tags facility of vi or emacs to hop
around in the code; the makefile is set up to create the emacs TAGS
file.)

Most parameters are set at compile time by using the config program to
edit the *.param files.  'Configfile' and 'ParamHelp' control the way
config does this.  [setparam is a shell script that can be used to
change some parameters; I use this when generating lots and lots of
versions of proteus, with all different parameters.  The catch is that
setparam does not fix "derived" parameters, as defined in Configfile,
so it can lead to incorrect param files in some cases.]

The simulator outputs some status and debugging information to stdout,
which I usually capture for reference.  All measurement data is
written to a (binary) .sim file, which is later extracted by the
'simex' program, derived from the original proteus 'stats' program.
'stats' can also be run interactively, based on info in 'Graphfile'
(though it seems to be complaining about my Graphfile nowadays; it
works if I cut out the last few graph definitions).

driver.c is the main program.  This runs on processor 0, gets things
initialized, runs one experiment, and quits.  Thus processor 0 is
always occupied although usually idle.  Driver reads three final
parameters from the command line: pattern, Nio, and Ncomp.  Pattern
decides the particular experiment (access pattern); Nio is the number
of IOPs, and Ncomp is the number of CPs.

The driver starts up an appropriate worker thread on each CP or IOP,
via worker.ca.  The IOP worker threads (iop.ca) initialize the IOP and
exit.  The CP worker threads (cp.ca) initialize the CP and then run
the requested pattern.  The driver thread on processor 0 participates
in some of the same barriers and thus detects when the initialization
is complete, and when the experiment is complete.  In this way, driver
can measure the elapsed time (although I typically use more detailed
times measured in cp.ca for most things).  Then the driver starts a
shutdown thread on each IOP, via worker.ca, and exits.

I should note that I have changed Proteus in a few ways.  
 -- In particular, I added a non-preemptive FIFO thread scheduler.
The implications are significant: because it is non-preemptive, I
don't have to worry about concurrency control in many situations.  I
can assume that many operations run to completion.  Unfortunately, in
some of the IOP code (particularly iopfs-cache) this gets EXTREMELY
hairy, because you have to prove that you'll never block during a
critical period.
 -- I added Sanjay's extension to 64-bit time values.
 -- I changed the sleep timer implementation from a structure that
woke up every small interval and looked at the sleep queue to one that
used an event to wake itself only when the next sleep-queue event
needed to awake.  This is MUCH faster for long sleeps.
 -- I tweaked some of the network behavior.
 -- fixed several bugs
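The sleep-timer change above can be sketched roughly like this (a
hypothetical illustration with invented names; the real change is
inside Proteus): keep the sleep queue sorted by wake time and schedule
exactly one event, at the earliest deadline, instead of polling the
queue every small interval.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the event-driven sleep timer (names invented).
 * Instead of a periodic tick that polls the sleep queue, keep the queue
 * sorted by wake time and schedule one event at the earliest deadline. */

typedef struct Sleeper {
    long wake_time;              /* simulated time at which to wake */
    struct Sleeper *next;
} Sleeper;

static Sleeper *sleep_queue = NULL;  /* sorted, earliest first */
static long next_event = -1;         /* time of the one scheduled wakeup; -1 = none */

static void schedule_event(long t) { next_event = t; }  /* stand-in for Proteus */

/* Insert in sorted order; reschedule the single wakeup event only if
 * the new sleeper becomes the head (wakes earlier than everyone else). */
void sleep_until(Sleeper *s)
{
    Sleeper **pp = &sleep_queue;
    while (*pp && (*pp)->wake_time <= s->wake_time)
        pp = &(*pp)->next;
    s->next = *pp;
    *pp = s;
    if (sleep_queue == s)
        schedule_event(s->wake_time);
}

/* The event handler pops every sleeper whose deadline has passed, then
 * schedules one new event for the next deadline (if any). */
int wake_ready(long now)
{
    int woken = 0;
    while (sleep_queue && sleep_queue->wake_time <= now) {
        sleep_queue = sleep_queue->next;
        woken++;                 /* real code would resume the thread */
    }
    schedule_event(sleep_queue ? sleep_queue->wake_time : -1);
    return woken;
}
```

For a long sleep this costs two events (schedule and wake) instead of
one poll per interval, which is where the big speedup comes from.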

cp.ca contains code that defines each of the basic file access
patterns (see pattern.c and pattern.h for a list).  After jumping
through a big table, it sets up and executes a particular pattern.
For the matrix patterns it does some calculations to decide how to
distribute the records of the file across the processors, allocates
the user buffer, and then makes calls on the CP file system code to
arrange the data transfer.  It also contains the two-phase I/O code;
again, each pattern has its own function, which calls Whole_rw_block
to read the data, then sends the data to the right places by sending
messages; or vice versa.

cpfs.ca includes one of cpfs-*.ca, and defines the CP side of the file
system.  There are several, although only 'none' and 'direct' have
been used in papers I have published so far (Oct '96).  In fact, the
others are crufty at this point and probably no longer work, or even
compile.  All but 'direct' support basic Read and Write commands,
which transfer some arbitrary section of the file to or from the user
buffer.  To do this they break the request into smaller requests to
send to the file system code on the IOPs.  Because the file is striped
across disks block by block, and the disks are distributed round-robin
among IOPs, each IOP request is for at most one block.  cpfs-direct
does not support these interfaces; instead, cpfs-direct supports only
whole-file collective transfers, where the pattern is specified as an
option.  After a barrier synchronization, the request is passed on to
all the IOPs (which must be iopfs-general) for disk-directed I/O.
Stub versions of these cpfs-direct functions are found in
cpfs-direct-stubs.ca, to be used in combination with one of the
simpler file systems.  These stubs accomplish the same access patterns
by calling on the primitive CPFS Read and Write routines.
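The striping arithmetic above can be sketched as follows (a
hypothetical illustration; the names are invented, and the real code
lives in the cpfs-*.ca and disklayout files).  With D disks striped
block by block and the disks dealt round-robin among N IOPs, file
block b lives on disk b mod D, owned by IOP (b mod D) mod N, at
logical block b div D within that disk:

```c
#include <assert.h>

/* Hypothetical sketch of the striping arithmetic (invented names).
 * File blocks are striped round-robin across D disks, and the D disks
 * are themselves dealt out round-robin among N IOPs. */

typedef struct {
    int disk;        /* which disk holds the block            */
    int iop;         /* which IOP that disk is attached to    */
    int disk_block;  /* logical block number within that disk */
} BlockLoc;

BlockLoc locate_block(int file_block, int ndisks, int niops)
{
    BlockLoc loc;
    loc.disk = file_block % ndisks;        /* block-by-block striping    */
    loc.iop = loc.disk % niops;            /* disks round-robin to IOPs  */
    loc.disk_block = file_block / ndisks;  /* position within the disk   */
    return loc;
}
```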

There are a few 'special' patterns.  Some are for testing, and are
found in tests.ca.  Another does an out-of-core LU decomposition, and
is found in lu.ca.  The LU program uses FSread_lu and FSwrite_lu, much
like the other collective, high-level access functions; these calls
are either passed through to the IOP (if using cpfs-direct) or
implemented by cpfs-direct-stubs in terms of FSread and FSwrite.

iop.ca contains the IOP worker code, which is mostly a matter of
initializing and cleaning up the IOP file system.  This code is a bit
complicated by the QueueRequest option, which I have never really
used. 

iopfs.ca includes one of the iopfs-*.ca file system definitions.
There are a few, but I have only used the 'cache' (traditional
caching) and 'general' (disk-directed I/O) file systems.  iopfs-block
is pretty dumb and archaic.  The IOP file systems are much more
complicated than most of the CP file systems, because they have a lot
of state set up once and then used by incoming requests.  Note that
some of the data structures are per IOP and some are per disk.  Each
arriving request starts a new thread, and usually ends with a reply
back to the requesting CP.

iopfs-cache is pretty complicated, but basically it works like this.
There is a fixed pile of buffers.  Each buffer is on one of several
lists: active, for buffers that are being used by some thread right
now; inactive, for buffers that have data but are not active; and
free, buffers with no data.  Buffers become active (and are "claimed")
when a thread wants to use the data in that buffer; when the thread is
finished with the buffer, it unclaims the buffer; when the last claim
disappears, the buffer moves to the inactive list.  Threads looking
for a buffer for a block that is not in the cache look first on the
free list.  If the free list is empty (normal once things get going),
it looks on the inactive list.  If it can find a clean buffer, it
grabs it, changes it for the new block, claims it, and makes it
active.  If it can only find a dirty buffer or one with outstanding
I/O, it adds itself to a waitlist for that buffer and sleeps.
Eventually, when that buffer is available, this thread will wake up.
If it can find nothing, it puts a dummy entry on the active list and
sleeps; eventually some other thread will notice the dummy entry and
hand this thread a buffer.  Other threads looking for the same block
will always find the dummy entry and sleep on the same queue.
Eventually, the sleeping thread either gets the buffer it was waiting
for (because there are no claimers and the I/O has completed), or (if
another thread came along wanting the block that is in the buffer it
was waiting for) it gets some other buffer (because that thread steals
back its buffer and then finds this one another).  Whew.  That's just
a sketch!  This is a complicated structure, but it manages the
concurrency pretty well.
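The claim/unclaim bookkeeping at the heart of this can be boiled down
to something like the following (hypothetical names; the real code in
iopfs-cache.ca and buflist.ca also handles waitlists, dirty buffers,
and outstanding I/O):

```c
#include <assert.h>

/* Hypothetical sketch of the claim/unclaim bookkeeping in iopfs-cache
 * (invented names).  A buffer is active while any thread holds a
 * claim on its data; when the last claim disappears, it moves to the
 * inactive list, still holding valid data. */

enum BufState { BUF_FREE, BUF_INACTIVE, BUF_ACTIVE };

typedef struct {
    int claims;             /* how many threads are using the data now */
    enum BufState state;
} Buffer;

/* A thread that wants the data in a buffer claims it, making it active. */
void claim(Buffer *b)
{
    b->claims++;
    b->state = BUF_ACTIVE;
}

/* When the last claim disappears, the buffer moves to the inactive
 * list: valid data, but no thread using it. */
void unclaim(Buffer *b)
{
    if (--b->claims == 0)
        b->state = BUF_INACTIVE;
}
```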

Otherwise iopfs-cache uses one-block readahead (if there is a handy
buffer available) and a WriteFull policy (specifically, it writes the
buffer when the number of bytes written to the buffer is some multiple
of the size of the buffer; usually, that's when the buffer is fully
written).
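The WriteFull test itself is tiny; a sketch (invented name, and the
real policy lives in iopfs-cache.ca):

```c
#include <assert.h>

/* Hypothetical sketch of the WriteFull test (invented name).  A buffer
 * is flushed when the cumulative bytes written to it reach a multiple
 * of the buffer size -- normally, exactly when it becomes full. */

int writefull_should_flush(long bytes_written, long bufsize)
{
    return bytes_written > 0 && bytes_written % bufsize == 0;
}
```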

iopfs-general.ca implements disk-directed I/O, and goes with
cpfs-direct.ca.  After computing some parameters that help figure out
the distribution pattern, it calls IOPread_whole with an 'empty'
function or IOPwrite_whole with a 'fill' function.  The _whole
functions tell the disk driver to preread/prewrite, which means that
we provide the range of blocks that will be read or written, so that
the disk driver can get organized.  Then we spawn two threads for each
disk, each with one buffer.  Each buffer thread repeatedly fills a
block and empties a block.  For reading, this means calling
DiskDriverReadNext to fill its buffer with some block from the disk
(remember, the disk driver knows the list of blocks, and gets to choose
the order of transfer), and then calling the empty function (a
parameter) to split that buffer into Memput messages to the CPs.  For
writing, this means calling DiskDriverWhatNext to find out what block
should be written next, calling the fill function (a parameter) to
fill the buffer via Memget messages sent to the CPs, and then calling
DiskDriverWriteNext to deliver the filled buffer to the disk driver
for writing.
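Each buffer thread's loop might be sketched like this (hypothetical
signatures; the real DiskDriverReadNext, DiskDriverWhatNext, and
DiskDriverWriteNext interfaces in diskdriver.ca differ, and the stubs
at the bottom exist only so the sketch can be exercised):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of one disk-directed buffer thread in
 * iopfs-general (invented names and signatures).  Two such threads per
 * disk run one of these loops, each with its own one-block buffer. */

typedef int  (*NextBlockFn)(int disk, void *buf, int *blockno);
typedef void (*MoveFn)(int disk, int blockno, void *buf);

/* Reading: the disk driver chooses the block order; 'empty' scatters
 * each filled buffer to the CPs via Memput messages. */
void read_loop(int disk, void *buf, NextBlockFn disk_read_next, MoveFn empty)
{
    int blockno;
    while (disk_read_next(disk, buf, &blockno))  /* driver picks order */
        empty(disk, blockno, buf);               /* Memput pieces to CPs */
}

/* Writing: ask the driver which block it wants next, gather that block
 * from the CPs via Memget ('fill'), then hand the full buffer back. */
void write_loop(int disk, void *buf, NextBlockFn disk_what_next,
                MoveFn fill, MoveFn disk_write_next)
{
    int blockno;
    while (disk_what_next(disk, NULL, &blockno)) {
        fill(disk, blockno, buf);             /* Memget pieces from CPs  */
        disk_write_next(disk, blockno, buf);  /* full buffer to driver   */
    }
}

/* Tiny stand-ins for testing: pretend the disk has 3 blocks to read. */
static int blocks_left = 3, emptied = 0;
static int stub_next(int disk, void *buf, int *blockno)
{
    (void)disk; (void)buf;
    *blockno = 3 - blocks_left;
    return blocks_left-- > 0;
}
static void stub_empty(int disk, int blockno, void *buf)
{
    (void)disk; (void)blockno; (void)buf;
    emptied++;
}
```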

The iopfs-general empty and fill functions are given the disk number,
the block number, the buffer, and some precomputed parameters.  They
figure out where in the file they are using some of the parameters,
and then call the pattern-specific 'mapchunk' function to map the file
chunk to a CP and offset within the CP.  Then they can use Memput or
Memget to transfer the data.  The 'empty' function could just blast
out the Memputs as fast as it wants, but I've found that we can
sometimes overflow the input FIFOs at the CPs so I wait for the CP to
ACK before sending another Memput; I do allow concurrent Memputs to
different destinations, but only one outstanding to any particular
destination.  Global variables allow me to avoid waiting for the ACKs
within the same 'empty' call, or even within the same thread, allowing
maximum asynchrony.    The 'fill' function has a similar restriction
of one outstanding Memget per CP from each IOP, but here of course we
have to wait for the response before we can finish the 'fill'
function, because we need to wait for the data.
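The one-outstanding-Memput-per-destination rule might look like this
in miniature (invented names; in the real code the flags are global,
so the restriction spans 'empty' calls and threads, and the ACK
arrives in an interrupt handler):

```c
#include <assert.h>

/* Hypothetical sketch of the per-destination Memput flow control
 * (invented names).  At most one Memput may be outstanding to any
 * given CP, but Memputs to different CPs proceed concurrently. */

#define MAX_CPS 64
static int outstanding[MAX_CPS];   /* 1 = a Memput to this CP awaits its ACK */

/* Try to send a Memput to CP 'cp'.  Returns 1 if sent, 0 if the
 * caller must sleep until the ACK handler clears the flag. */
int try_memput(int cp)
{
    if (outstanding[cp])
        return 0;                  /* previous Memput to this CP not ACKed */
    outstanding[cp] = 1;           /* fire off the Memput */
    return 1;
}

/* Called from the ACK interrupt handler. */
void memput_acked(int cp)
{
    outstanding[cp] = 0;
}
```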

[A new addition in STARFISH 3.0 is the "queued" memput and memget
functions.  These are used both by iopfs-general and by the
two-phase-I/O code in cp.ca.  These versions of the functions just
take the request and append them to an outgoing buffer.  Each
destination has its own outgoing buffer.  When the outgoing buffer is
"full", i.e., the next request cannot fit into the buffer, it is sent to
the destination.  (For memgets "full" means that the reply message
will be full.)  Once the message is sent out the buffer can be reused.
But, for flow control reasons mentioned above, we don't actually send
more than one outstanding request to any given destination.  Because
the reply from the destination is handled by an interrupt handler,
there are lots of potential race conditions and so the code is fairly
complex.]
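The batching can be sketched like so (invented names and an arbitrary
buffer size; the real code also handles the ACK in an interrupt
handler, which is where the races come from):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of the "queued" Memput batching (invented names
 * and size).  Requests for one destination accumulate in that
 * destination's outgoing buffer; when the next request cannot fit,
 * the batch is sent and the buffer reused. */

#define OUTBUF_SIZE 64

typedef struct {
    char buf[OUTBUF_SIZE];
    int  used;    /* bytes of queued requests               */
    int  sends;   /* how many messages went over the network */
} OutBuf;

/* Queue one request of 'len' bytes (len <= OUTBUF_SIZE); flush first
 * if it would not fit.  The real code would also wait here for the
 * previous message's ACK before sending, for flow control. */
void queued_put(OutBuf *ob, const char *req, int len)
{
    if (ob->used + len > OUTBUF_SIZE) {  /* "full": next request won't fit */
        ob->sends++;                     /* send the batch to the destination */
        ob->used = 0;                    /* buffer can now be reused */
    }
    memcpy(ob->buf + ob->used, req, len);
    ob->used += len;
}
```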

The IOP file system calls diskdriver functions to interact with the
disk device.  The disk driver is multithreaded, with one thread per
disk.  Each disk thread waits for requests on its own queue, serving
them in FIFO order.  A preread request is special, causing the disk
thread to switch to a special preread queue for a while.  Once the
disk thread has a request, it maps the request (which is a logical
file block number within this disk) to a physical disk block number,
calls the disk device, and blocks until that request is complete.
When it wakes up, it cleans up and wakes up anyone sleeping waiting
for this request to complete (if any).

A note about the disk queue.  There are two queueing disciplines.  The
most basic is DISK_FCFS, where the disk-request queue is a plain FIFO
queue.  (Actually, implemented as a dualq.)  Otherwise, I use the new
(in 3.0) cyclic-scan algorithm implemented in diskq.ca.  Here, each
request is added to one of two lists: the current pass, for all
incoming requests that are to a higher sector number than the current
location, and the next pass, for all incoming requests that are for a
lower sector number than the current head position.  There is also a
special short-cut path for the very common case of an empty request
queue, with one thread waiting for something to be put in the queue:
the inserting thread simply hands the new item to the waiting thread,
and wakes it up.  There is a bit of special code to handle a few cases
where "special" requests like "sync" need to be appended to the
current list of requests. 
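The two-list insertion and the pass switch might be sketched as
follows (hypothetical names; see diskq.ca for the real code, including
the empty-queue short cut and the "sync" special cases, which are
omitted here):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the two-list cyclic-scan queue in diskq
 * (invented names).  Requests at or past the current head position
 * join the current pass; requests behind it wait for the next pass.
 * Each list is kept sorted by sector. */

typedef struct Req {
    long sector;
    struct Req *next;
} Req;

typedef struct {
    Req *current;    /* this sweep: sectors at or past the head */
    Req *upcoming;   /* next sweep: sectors behind the head     */
    long head;       /* current head position, in sectors       */
} DiskQ;

static void sorted_insert(Req **list, Req *r)
{
    while (*list && (*list)->sector <= r->sector)
        list = &(*list)->next;
    r->next = *list;
    *list = r;
}

void cscan_insert(DiskQ *q, Req *r)
{
    if (r->sector >= q->head)
        sorted_insert(&q->current, r);   /* still ahead of the head  */
    else
        sorted_insert(&q->upcoming, r);  /* behind: wait for next pass */
}

/* Take the next request; when the current pass empties, the next pass
 * becomes the current one (the head sweeps back to the start). */
Req *cscan_pop(DiskQ *q)
{
    Req *r;
    if (!q->current) {
        q->current = q->upcoming;
        q->upcoming = NULL;
    }
    r = q->current;
    if (r) {
        q->current = r->next;
        q->head = r->sector;
    }
    return r;
}
```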

The disk device code is described in Dartmouth tech report
PCS-TR94-220.  It uses the Proteus event system to schedule its own
events, and in that sense is very different from the other code in
this simulator.  It is not a thread, it is not cycle counted, it does
not run on a particular processor; it just schedules Proteus events.
Unfortunately the glue required to do this (diskevent.c) is messier
than you'd hope.  The events come pretty fast, about two per sector,
so this can cause pretty slow execution when there's a lot of I/O. 


UserData: a very important trick is defined in userdata.h.  To save
memory (RAM and swap space) I don't always want to allocate full-sized
buffers.  So, if I define REAL_DATA (userdata.param), all the buffers
that hold user data (user buffers in cp.ca, file system buffers in
cpfs-*.ca and iopfs-*.ca, message buffers that get shipped around)
will be allocated to their full size and will actually contain data.
That data will really be copied, and in fact the disk device code will
open a Unix file (DISKxx) for each disk, and read and write data.  I
use this in patterns where the actual data values matter, like LU
decomposition.   In most of my other patterns, however, the actual data
values don't matter, and I'm just wasting time and space moving bytes
around.  (Although you might note that cp.ca will fill the buffer with
some dummy data for verification purposes, when REAL_DATA is defined;
see verify2.c). 

So, if REAL_DATA is not defined (sometimes I call this FAKE_DATA), all
those things that hold user data will NOT be allocated to full size
(in fact, each buffer will be one integer, regardless of size), and
the diskdevice won't bother with reading and writing data to Unix
files.  To hide this stuff, I use a typedef UserData to represent any
user data, and have a pile of macros and functions that PRETEND to
allocate, free, initialize, copy, message-pass, or zero these UserData
regions, and count simulated time exactly the same as if they were
doing the operation on real data.  All of this was verified through
careful tracing of time spent in critical pieces of code, to make sure
that the code behaved exactly the same way with both real and fake
data.
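In miniature, the trick looks something like this (invented names; see
userdata.h for the real macros, and note that the real choice is made
at compile time by REAL_DATA, not by a runtime flag):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the UserData trick (invented names).  Without
 * REAL_DATA, a "buffer" of any size is really just one int, but every
 * operation on it is charged the same simulated time as the full-size
 * operation would be. */

static int real_data = 0;          /* stand-in for compile-time REAL_DATA */
static long simulated_cycles = 0;  /* stand-in for the Proteus cycle count */

/* How much memory a user-data buffer of nbytes actually occupies. */
size_t ud_alloc_size(size_t nbytes)
{
    return real_data ? nbytes : sizeof(int);
}

/* Copy (or pretend to copy) nbytes; either way, charge the same time. */
void ud_copy(void *dst, const void *src, size_t nbytes, long cycles_per_byte)
{
    if (real_data)
        memcpy(dst, src, nbytes);   /* bytes actually move */
    /* fake data: touch nothing, but simulated time passes just the same */
    simulated_cycles += (long)nbytes * cycles_per_byte;
}
```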

message.ca: There are some handy message-passing functions.  The most
basic are ThreadRequest, QueueRequest, and InstantRequest.
ThreadRequest sends a message to the other side, where the interrupt
handler starts a new thread to execute the function whose pointer is
given in the header of the message. (A lot like Active Messages.)  The
requester continues without blocking.  --- QueueRequest is similar,
except that there is a set of buffers on the receiving node, and the
incoming message is stuffed into one of those buffers, and linked into
a queue.  Presumably some existing thread will pull it off the queue.
--- InstantRequest is also similar, except that no thread is started,
the function is executed directly by the interrupt handler (with
interrupts still off).  (Even more like Active Messages.)  This is
handy for quick little things like Memgets.

No reply is implied by any of the above message passing, but if the
request contained enough information, the recipient can reply with
Reply or ReplyACK.  The original requester should have sent its own
processor number (reply_to) and an action pointer (reply_at) to the
recipient, and these are given to the Reply function, along with some
data (no data for ReplyACK).  The 'action pointer' is a pointer to a
'reply action' in the original requester.  A reply action contains a
few things: a pointer where data should be deposited when the reply
arrives, a 'skip' count of bytes to be stripped from the front of the
reply before depositing data, a flag to indicate that the reply has
arrived, and a thread ID of a thread that wants to be woken when the
data arrives.  This reply action is wonderfully flexible, because it
allows the requester to specify how it will deal with the reply
without having to tell the other end anything about it; in particular,
the reply action can be changed while the request is outstanding.  For
example, the reply action may be initialized with no thread ID; if the
message arrives before any thread waits for it, the flag is set so
that a thread arriving later will see it, and if a thread decides it
must wait for that reply, it can set the thread ID to itself and go to
sleep.  I've even used it in situations where one thread 'steals' the
wait from another, by substituting its own thread ID.  Another example
is that the data pointer can be changed while the request is
outstanding, eg, to redirect a prefetch request so that the data goes
directly to a user buffer instead of to the cache buffer.  
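A reply action might be sketched as the following structure and
arrival handler (invented field names; message.ca holds the real
definition):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a reply action (invented names).  The
 * requester keeps this structure locally and may change any field
 * while the request is outstanding -- redirect the data pointer,
 * set or steal the waiting thread, and so on. */

typedef struct {
    char *data;     /* where arriving reply data should be deposited */
    int   skip;     /* bytes to strip from the front of the reply    */
    int   arrived;  /* set when the reply has come in                */
    int   waiter;   /* thread ID to wake on arrival (0 = nobody yet) */
} ReplyAction;

/* What the interrupt handler does when the reply message arrives:
 * strip the first 'skip' bytes, deposit the rest, set the flag, and
 * report which thread (if any) should be woken. */
int handle_reply(ReplyAction *ra, const char *msg, int len)
{
    if (ra->data && len > ra->skip)
        memcpy(ra->data, msg + ra->skip, len - ra->skip);
    ra->arrived = 1;
    return ra->waiter;
}
```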

Finally, of course, there are Memput and Memget.  These transfer data
directly between a message and a user buffer, much like a Reply does,
but without requiring a specific Reply Action to be communicated to
the other end.  Instead, the recipient of a Memput or Memget specifies
a base address in advance, so the sender need only send an offset.
Memput is a function called by the sender, which sends a special
message interpreted by Memput_handler; Memput_handler in turn uses
ReplyACK to send an ACK for flow-control purposes.  Memget is
different.  The requester (the 'getter', if you will) uses
InstantRequest to send a Memget request; InstantRequest calls Memget
on the 'gettee' side, which then uses Reply to send the requested data
back to the getter, who of course has set up a reply action telling
where to put the data.
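The base-plus-offset addressing might be sketched as follows (invented
names; the real Memput_handler also sends the ReplyACK described
above):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the Memput addressing scheme (invented
 * names).  The recipient registers a base address once; every
 * subsequent Memput message then carries only an offset and the data,
 * so no reply action need cross the network. */

static char *memput_base = NULL;

void memput_set_base(char *base)
{
    memput_base = base;
}

/* What the recipient's handler does with an arriving Memput message.
 * The real handler would follow this with a ReplyACK for flow control. */
void memput_handler(long offset, const char *data, int len)
{
    memcpy(memput_base + offset, data, len);
}
```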

----------------- the main program
driver.c
worker.ca, worker.h 

------------- defines the access patterns
pattern.c, pattern.h

------------- tricks to use REAL data or FAKE data
userdata.c, userdata.h, userdata.param

-------------- the CP, and CP file systems
cp.ca
cpfs.ca, cpfs.h, cpfs.param   includes one of the following:
cpfs-none.ca		no caching
cpfs-single.ca		single buffer on each CP
cpfs-double.ca		double buffer, D+1 buffers for D disks
cpfs-thread.ca		cleaner structure, using a thread for each buffer
cpfs-direct.ca		for disk-directed I/O
cpfs-direct-stubs.ca	used with none, single, double, and thread
permute.ca, permute.h	support for two-phase I/O in cp.ca

------------ LU decomposition 'pattern'
lu.ca, lu.param	    does LU decomposition, called by cp.ca

-------------- the IOP, and IOP file systems
iop.ca
iopfs.ca, iopfs.h, iopfs.param   includes one of the following:
iopfs-none.ca		no caching
iopfs-block.ca		(archaic) one buffer per CP per disk 
iopfs-cache.ca		full-blown cache 
iopfs-direct-stubs.ca	used with none, block, and cache
iopfs-general.ca	disk-directed I/O

------------ describing the file
file.h, file.param

--------------- model of the disks
disk.h, disk.param, diskmodel.param

diskdriver.ca, diskdriver.h   called by iopfs-* to access the disk

diskq.ca, diskq.h	      disk-scheduling queue

disklayout.ca, disklayout.h   called by iopfs-* to define layout

diskdevice.c, diskdevice.h    includes one of the following
diskdevice-model.c	      the real model
diskdevice-trivial.c	      faster, simpler model for debugging
diskdevice-dfk.h
diskdevices.h

diskevent.c, diskevent.h      interface from diskdevice to proteus event system

---------------  message passing support
message.ca, message.h, message.param
protocol.h

--------------- auxiliary files:
dfk.h			      my personal favorite #defines
dmcache.h, dmcache.param      needed by everything
proteus.h		      basic definitions for using proteus 
aux.c, aux.h		      non-cycle-counted misc support functions
util.ca, util.h		      cycle-counted misc support functions
userdebug.c		      proteus snapshot hook: calls to debug functions
tests.ca, tests.h	      simple tests of internal functionality
time.h			      defines the measures of time
dummy.param		      used by Configfile

------------ some synchronization and communication functions
barrier.ca, barrier.h	      barrier synchronization
ready.ca, ready.h	      essentially, half of a barrier
broadcast.ca, broadcast.h     broadcast communication
condition.ca, condition.h     condition variables (synchronization)

------------ some ADTs
dualq.ca, dualq.h	      ADT 'dual-queue', a special kind of queue
pool.ca, pool.h		      ADT 'pool', an unordered bag of stuff
buflist.ca, buflist.h	      ADT list of buffers, used by iopfs-cache
queue-cyc.ca, queue-noncyc.c, queue.h	your basic queue (cycle counted or not)

------------ some files for measurements
stats.c, stats.h    defines the metrics we keep
user-events.h	    defines events that we'll be recording to events.sim

----------------------------------------------------------
There are some auxiliary PROGRAMS, compiled separately.

unstripe.c	reads DISK?? files created when REAL_DATA is used, 
		unstripes the data found in those files, and 
		spits out the data, in the correct order, to stdout.
		usable only with contiguous layout

verify-run	runs simulator and verify2
verify.c	a simple verify program, out of date
verify2.c	reads the data file (produced by unstripe) to see if 
		the data makes sense; good for verifying write patterns.

lutest.c	reads the data file before and after the LU pattern
		does a sequential LU decomposition and compares results.
----------------------------------------------------------

David Kotz
Assistant Professor
Department of Computer Science
Dartmouth College
6211 Sudikoff Laboratory
Hanover, NH  03755-3510 USA
email: dfk@cs.dartmouth.edu
URL: http://www.cs.dartmouth.edu/~dfk/
603-646-1439

