March 16, 1989

How the various subsystems of the RAPID testbed work, as of version 4,
the Mach port of version 3. Version 3 was used for ICPP '89 data.

Replacement algorithm
=====================
  This is a direct implementation of the RU-set scheme described in
the papers. Each processor maintains a local RU-set (recently-used
set) of sectors, a linked list in malloc'd memory. When a sector is
referenced, it is moved to the front of the list. The list is limited
in length to the size of the local 'working set'. If the list would
get too long, the last (least recently used) entry is removed.
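
A minimal sketch of the move-to-front list described above. This is not
the RAPID code; the names ru_node, ru_set, and ru_reference are made up
for illustration, and the real list may hold frames rather than plain
sector numbers.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names; the notes do not give the real ones. */
struct ru_node {
    int sector;
    struct ru_node *next;
};

struct ru_set {
    struct ru_node *head;   /* most recently used entry first */
    int length;             /* current number of entries */
    int limit;              /* size of the local working set */
};

/*
 * Record a reference to `sector`. Returns the sector that fell off the
 * end of the list, or -1 if nothing was removed.
 */
int ru_reference(struct ru_set *set, int sector)
{
    struct ru_node **pp = &set->head;
    struct ru_node *n;

    assert(set->limit >= 1);

    /* Look for the sector; unlink it if it is already on the list. */
    while (*pp != NULL && (*pp)->sector != sector)
        pp = &(*pp)->next;
    if (*pp != NULL) {
        n = *pp;
        *pp = n->next;
    } else {
        n = malloc(sizeof *n);
        if (n == NULL)
            abort();
        n->sector = sector;
        set->length++;
    }

    /* Move (or insert) the entry at the front. */
    n->next = set->head;
    set->head = n;

    /* Too long: remove the last entry. */
    if (set->length > set->limit) {
        int evicted;

        pp = &set->head;
        while ((*pp)->next != NULL)
            pp = &(*pp)->next;
        evicted = (*pp)->sector;
        free(*pp);
        *pp = NULL;
        set->length--;
        return evicted;
    }
    return -1;
}
```

With a limit of 1 every reference to a new sector removes the previous
one, so the list search does no useful work.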

Unfortunately, all of my current tests use a local RU-set size of 1,
so this process is overly complicated. I will replace the above with a
toss-immediately strategy in the next version.

In the code, the RU-sets are referred to as working sets.

The global RU-set is represented by counters: each frame has a counter
telling how many local sets contain this frame. While the counter is
nonzero the frame status is normally IN_WS. When the frame enters a
local set, the counter is incremented; when it leaves a local set, the
counter is decremented. When the counter hits zero, the frame has left
the global set, so it is placed on the old-frame-queue and is eligible
for replacement.

Since frames are pulled off the old frame queue in FIFO (LRU) order, a
frame may re-enter the global set while on the queue. It remains on
the queue, though it should not be there. Thus the status of the frame
remains OLDFRAME to indicate this; it will not be put back on the
queue if the global RU-set counter again hits zero. 

To obtain a frame for use in fetching a new sector, a frame is pulled
from the old frame queue. If it has re-entered the global RU-set (ie,
its counter is non-zero), it is discarded (status changed to IN_WS)
and another is pulled from the queue. This is repeated until one is
found. The frame's status then becomes PREFETCHED, if for prefetching,
or TRANSITION, if for a demand fetch.

If it is for a demand fetch, the status becomes IN_WS after the I/O is
complete and the frame is in the local working set. Otherwise, it
changes from PREFETCHED to IN_WS when the sector is first used by a
process.
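
The counter and queue handling above can be sketched as follows. This
is a single-threaded illustration only: the status names come from
these notes, but NFRAMES, the ring-buffer queue, and the function names
are assumptions, and the real code needs locking or atomic operations
that are left out here.

```c
#include <assert.h>

#define NFRAMES 8               /* hypothetical frame table size */

enum frame_status { NONFRAME, OLDFRAME, IN_WS, PREFETCHED, TRANSITION };

struct frame {
    enum frame_status status;
    int ws_count;               /* number of local sets holding this frame */
};

struct frame frames[NFRAMES];

/* Old-frame-queue as a FIFO ring buffer of frame indices. */
static int oldq[NFRAMES];
static int oldq_head, oldq_len;

static void oldq_put(int f)
{
    assert(oldq_len < NFRAMES);
    oldq[(oldq_head + oldq_len) % NFRAMES] = f;
    oldq_len++;
}

static int oldq_get(void)
{
    int f;

    if (oldq_len == 0)
        return -1;
    f = oldq[oldq_head];
    oldq_head = (oldq_head + 1) % NFRAMES;
    oldq_len--;
    return f;
}

/* Frame f enters one local set. An OLDFRAME frame stays OLDFRAME. */
void frame_enter(int f)
{
    frames[f].ws_count++;
}

/* Frame f leaves one local set; queue it if it left the global set. */
void frame_leave(int f)
{
    assert(frames[f].ws_count > 0);
    if (--frames[f].ws_count == 0 && frames[f].status != OLDFRAME) {
        frames[f].status = OLDFRAME;    /* not already on the queue */
        oldq_put(f);
    }
}

/* Get a frame for a fetch. Returns -1 if the queue runs dry. */
int frame_get(int for_prefetch)
{
    int f;

    while ((f = oldq_get()) >= 0) {
        if (frames[f].ws_count != 0) {
            frames[f].status = IN_WS;   /* re-entered global set: discard */
            continue;
        }
        frames[f].status = for_prefetch ? PREFETCHED : TRANSITION;
        return f;
    }
    return -1;
}
```

Because an OLDFRAME frame is never queued a second time, each frame is
on the queue at most once and the ring buffer cannot overflow.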

Changing the size of the working set
====================================
  On startup and shutdown the working set expands and contracts to fit
the number of processes with the file open. Frames are added or
removed as necessary. 

To add frames, the indices of the frames are pulled from the
non-frame-queue, space is allocated, the status is changed to
OLDFRAME, and they are placed on the old-frame-queue.

To remove frames, frames are pulled from the old-frame-queue
(marking as IN_WS and discarding any that have re-entered the global
RU-set), marked as NONFRAME, freed, and their indices put back on the
non-frame-queue. 

A problem arises when there are frames filled with prefetched buffers
that have not yet been used, and thus there are not enough on the
old-frame-queue to be freed. When the working set contracts, the
prefetch limit also decreases. If it decreases below the number of
unused frames, then we must wait until the unused frames are used
before they can be freed. In the meanwhile no new prefetching can
occur, until the number of unused frames drops below the prefetch
limit. 

The solution is not to wait, but to mark the number that need to be
removed, and to have the first process that notices that prefetching
is again possible free up that many frames. That is, once enough
frames have been used to satisfy both the number that need to be
removed and one for the prefetchers to do some prefetching, both will
happen. This means checking this prefetch_remove counter whenever you
get the right to prefetch.

This is all a hassle and will be removed in the next version.

Prefetching technique
=====================
  The basic prefetching action occurs as follows. Prefetching is done
as a coroutine to the main program. It can suspend itself anywhere in
the code and be resumed when the main code has time again. At any
time, if the prefetch action is not possible or involves a wait, it
fails and suspends itself. Sometimes it gives up what it has obtained
to allow others to have it.

First, it checks to see if there is work available. This is a quick
test involving the reference string, to see that there are sectors
needing prefetch.

If that succeeds, it tries to get the right to prefetch. This is done
by increasing the prefetch_unused counter and checking it against the
prefetch limit. If it exceeds the limit, we can't prefetch. If not, we
have the right to prefetch. This means we WILL be able to get a frame
to prefetch into. 
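
A sketch of the right-to-prefetch test, using C11 atomics as a stand-in
for whatever atomic add the testbed uses. The counter name
prefetch_unused is from these notes; the function names, the limit
value, and the decision to back the counter out on failure are
assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int prefetch_unused;  /* frames prefetched but not yet used */
static int prefetch_limit = 2;      /* hypothetical value */

/* Try to reserve the right to prefetch one sector. */
bool prefetch_acquire(void)
{
    int before = atomic_fetch_add(&prefetch_unused, 1);

    if (before + 1 > prefetch_limit) {
        /* Over the limit: undo our increment and fail (assumed). */
        atomic_fetch_sub(&prefetch_unused, 1);
        return false;
    }
    return true;
}

/* Give the right back, e.g. when no work turns up afterwards. */
void prefetch_release(void)
{
    int before = atomic_fetch_sub(&prefetch_unused, 1);

    assert(before > 0);
    (void)before;
}
```

Incrementing first and checking afterwards means two processes can
never both take the last slot, at the cost of a brief overshoot that
the loser undoes.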

Then we get work to prefetch, as described in the reference string
section. If this fails now, we give up the right to prefetch and fail. 

Then we try to lock the sector. If it has been locked by someone else
(presumably for demand fetching), we fail. 

Then we get a frame to prefetch, by pulling it off the old frame
queue, as described above.

Then we issue the I/O request, update the sector map entry, and quit.

Sector Map entry
================
The sector map entry contains the following fields:
    OID sector_OID;			/* OID of frame object containing sector */
    ALOCK sector_lock;		/* ALOCK is a short */
    short sector_status;		/* status of the sector */
    short sector_frame;		/* index of frame containing sector, or -1 */
    short use_count;		/* count of number of users */
    OID disk_fake;			/* address of the sector in fake disk */
    unsigned long whenready;	/* time when I/O completes for this sector */

The sector_OID is the same OID as the frame containing the sector.
This allows quick re-use of the information when there is a hit.
The sector_lock was used by the old ramfile code for locking between
processes to avoid write conflicts. Not used by base-level RAPID code.
The sector_status is described below. The sector_frame is the frame
index of the frame holding the sector. The use_count is the number of
processes actually USING the sector information (reading or writing)
at this moment. The disk_fake is a pointer to a block where the
backing store for the sector is kept; with a real disk, this would be
a disk address. The whenready entry is the time when the I/O on the
sector completes, used mostly when the sector is hit to wait for any
outstanding I/O. 

Sector Map entry status
=======================
The status is made up of several bits: VALID, DIRTY, NONZERO,
CHANGING, and PREFETCHED. The initial status of all sectors is
NONZERO, indicating there is some nonzero data on disk for this
sector. Of course, if the sector is all zeros (and so not really kept
on disk), even this bit will be clear, allowing for optimizing disk
I/O (by not doing it).

All work with the sector status is done with macros that use
Atomic_ior and Atomic_and to change a bit or set of bits atomically.
There are also macros for checking certain bits.

When the sector is in a frame, it is VALID. If it has been written, it
is DIRTY. If it was brought in as the result of a prefetch, but has
not yet been used, it is PREFETCHED. Finally, as a lock for the sector
status, it can be CHANGING. Anything that wants to change the status
of the sector (eg VALID to inVALID, inVALID to VALID, inVALID to VALID
and PREFETCHED), must first lock it by setting the CHANGING bit. It
must then wait for the use_count to go to zero, to allow any activity
in the sector's buffer to complete. 

Normal activity in a VALID sector proceeds as follows. To read a
sector, check to see if it is VALID and not CHANGING. If so, increment
the use_count and check again. If it is still VALID and not CHANGING,
you have the right to use it; when done, decrement the use_count. If
either check finds it CHANGING, you fail: undo any increment of the
use_count and wait until it stops changing. If, when it stops
changing, it is not VALID, start a demand fetch.

To write a sector, acquire the right to use it as for a read, but also
set the dirty bit.
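
The read and write protocol might look like this with C11 atomics
standing in for Atomic_ior and Atomic_and. The bit values, the macro
and function names, and the use of atomic_int (the real fields are
shorts) are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit values are assumed; the notes name the bits but not their values. */
#define S_VALID      0x01
#define S_DIRTY      0x02
#define S_NONZERO    0x04
#define S_CHANGING   0x08
#define S_PREFETCHED 0x10

struct sector {
    atomic_int status;
    atomic_int use_count;
};

/* Stand-ins for the Atomic_ior / Atomic_and based macros. */
#define SET_BITS(s, bits)   atomic_fetch_or(&(s)->status, (bits))
#define CLEAR_BITS(s, bits) atomic_fetch_and(&(s)->status, ~(bits))
#define USABLE(s) \
    ((atomic_load(&(s)->status) & (S_VALID | S_CHANGING)) == S_VALID)

/*
 * Try to gain the right to use the sector's buffer. On failure the
 * caller waits for CHANGING to clear or starts a demand fetch.
 */
bool sector_begin_use(struct sector *s)
{
    if (!USABLE(s))
        return false;
    atomic_fetch_add(&s->use_count, 1);
    if (!USABLE(s)) {                   /* status changed under us */
        atomic_fetch_sub(&s->use_count, 1);
        return false;
    }
    return true;
}

/* Done with the buffer. */
void sector_end_use(struct sector *s)
{
    int before = atomic_fetch_sub(&s->use_count, 1);

    assert(before > 0);
    (void)before;
}

/* Writing is the same as reading, plus setting the dirty bit. */
bool sector_begin_write(struct sector *s)
{
    if (!sector_begin_use(s))
        return false;
    SET_BITS(s, S_DIRTY);
    return true;
}
```

The second check is what makes this safe against a status changer: the
changer sets CHANGING and then waits for use_count to reach zero, so a
user that incremented too late sees CHANGING and backs out.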

Reference String
================
  The reference string is produced by the driver once per run. Thus
each run must be only for one pattern type. When the pattern is
global, a single copy is stored in shared memory; when it is local,
there are several separate strings in private memory. Each string has
the following attributes:
    boolean global;			/* global (T) or local (F) pattern */
    unsigned int *sectors;	/* array of sector numbers */
    unsigned int *portions;	/* array of indices that are end of portion */
    unsigned short nsectors;	/* length of sectors[] */
    unsigned short nportions;	/* length of portions[] */
    unsigned short next;		/* index of next sector to read */
    unsigned short nextp;	/* index of next sector to prefetch */
    unsigned short cur_portion_end; /* index of end of current portion */
    short nextp_lock;		/* lock on nextp */
    boolean portion_limit;	/* limit nextp to current portion? */
There is a list of sectors, and a list of the indices in the sectors
array that are the ends of portions, if appropriate. There is
also the number of sectors and portions. There are two indices, one
for the next sector to read (used by the driver), and one for the
next sector to be prefetched (used by the prefetch daemon). The
cur_portion_end is set by the driver for the use of the prefetch
daemon. nextp_lock is used to lock nextp, since atomic_add is
insufficient to regulate the changes that need to be made to it.
Finally, the portion_limit flag tells the prefetch daemon whether
prefetching is limited to the current portion. 
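
One plausible reading of why nextp needs a lock: the bounds check and
the increment have to happen together, which a bare atomic add cannot
do. The sketch below uses a cut-down version of the structure and a
C11 atomic_flag spin lock in place of the short-based nextp_lock. The
function name is made up, and treating cur_portion_end as an inclusive
index is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Cut-down reference string; only the fields the sketch needs. */
struct refstring {
    unsigned int *sectors;           /* array of sector numbers */
    unsigned short nsectors;         /* length of sectors[] */
    unsigned short nextp;            /* index of next sector to prefetch */
    unsigned short cur_portion_end;  /* assumed to be inclusive */
    atomic_flag nextp_lock;          /* stand-in for the real lock */
    bool portion_limit;              /* limit nextp to current portion? */
};

/* Claim the next sector to prefetch; returns -1 if there is no work. */
long refstring_claim_prefetch(struct refstring *r)
{
    long sector = -1;

    while (atomic_flag_test_and_set(&r->nextp_lock))
        ;                            /* spin until we hold the lock */

    /* Check both limits and advance nextp as one step. */
    if (r->nextp < r->nsectors &&
        (!r->portion_limit || r->nextp <= r->cur_portion_end)) {
        sector = (long)r->sectors[r->nextp];
        r->nextp++;
    }

    atomic_flag_clear(&r->nextp_lock);
    return sector;
}
```

With portion_limit set, the prefetcher stops at the end of the current
portion until the driver moves cur_portion_end forward.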

Driver
======
  The driver reads in several lines of a file, each describing a test.
Generally, these lines represent a test with prefetching, in which
case it also does the corresponding non-prefetch base case. It creates
a reference string once, as above, and uses the same string on all the
tests in the file. This makes tests fairly comparable. Of course, for
lw, gw, lps, gps the pattern is always the same. 

I/O simulation
==============
  The I/O is simulated with a separate module. A buffer is allocated
for each sector that serves as a fake backing store, although this is
currently turned off with a #define, to save memory. There are a
number of disks allocated, each represented by a structure. Each has a
number representing the time that it will complete all current
requests to that disk. By atomically adding the fixed disk response
time that we use (30 msec) to this time, the time that a new request
will complete can be determined.  This also allows for highly
concurrent scheduling of the disk. The I/O routine can then either
wait for that time to come around and return, or return the time that
it will be ready to the caller so the caller can do other things (ie
prefetching) while the disk I/O completes.
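
The completion-time bookkeeping can be sketched with a single atomic
add per request. The 30 msec figure is from these notes; the structure
and function names are made up. The sketch does only the add described
above, so it says nothing about how a disk that has been idle for a
while gets its time brought up to the present.

```c
#include <assert.h>
#include <stdatomic.h>

#define DISK_RESPONSE_MSEC 30UL   /* fixed disk response time */

struct disk {
    atomic_ulong busy_until;      /* msec: when all queued requests finish */
};

/* Queue one request on the disk and return its completion time. */
unsigned long disk_request(struct disk *d)
{
    /* fetch_add returns the old value; our request ends 30 msec later. */
    return atomic_fetch_add(&d->busy_until, DISK_RESPONSE_MSEC)
           + DISK_RESPONSE_MSEC;
}
```

Each caller gets a distinct completion time without taking a lock. The
returned value is what would be stored in whenready when the caller
goes off to do other work instead of waiting.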

Statistics
==========
 Statistics are maintained in many places. Elogs monitor most every
activity. The disk structures have disk statistics. Each local
processor keeps some stats in its rfd (file descriptor), and a few
more are kept in the global descriptor. These are collected at the end
by the driver, which averages them and prints them out.

Server
======
 In the Chrysalis version the server was a separate system-wide
process that opened, closed, and cleaned up files. In the Mach version I
made this a monitor inside the uniform system processes of the
program. It is a monitor in order to maintain the simple serial nature
of the server, but to allow all processes easy procedure-call
interfaces to the server, and to allow them to share data. The entire
thing will disappear in the next version.

The file
========
Currently the file is represented on disk, and opened by all the
processes when they open the ramfile. This maintains the permanent
state of the file that used to be held by the ramfile object in
Chrysalis. But now I only keep the inode in there, reconstructing the
sector maps and frame tables from scratch each time. When a process
opens it (through the serialized monitor, remember) it checks the
first word of the file. If null, it is the first to open the file,
so it allocates space for the inode, reads it in, and does all the
work of setting it up. Then the pointer to the inode is written to the
first word of the file and the file is closed. Later processes will
open the file, find the first word non-null, and use that word as the
inode pointer. Still other locks and flags are used to serialize more
work after the security of the monitor is left. It is a real hack and
a real mess to open a ramfile. This will all be rewritten, not using
the disk file at all (possibly using a disk file that describes the
file parameters). 

On close, a similar process is necessary to rewrite the inode and to
make the first word NULL again. 
