Newsgroups: comp.sys.transputer
From: krste@ICSI.Berkeley.EDU (Krste Asanovic)
Subject: Re: sneaky suspicion wanted
Organization: International Computer Science Institute, Berkeley, CA, U.S.A.
Date: 12 Dec 1994 19:55:18 GMT
Message-ID: <3ci9r6$5gr@agate.berkeley.edu>

This is due to an interaction between the transputer instruction
prefetcher and the memory accesses caused by the increments of i in
your loop.

I wrote a short note on this that appeared in a 1989 edition of the
Occam users group newsletter. The following is my original mail
message, slightly edited with the corrections Geraint Jones added to
the printed version.

-- 
Krste Asanovic                                email: krste@icsi.berkeley.edu
International Computer Science Institute,     phone: +1 (510) 642-4274 x143
Suite 600, 1947 Center Street,                  fax: +1 (510) 643-7684
Berkeley, CA 94704-1198, USA                   http://www.icsi.berkeley.edu

To: gec-rl-hrc!transputer@edu.cornell.tn.tcgould
Subject: Re: Transputer timing
From: krste <krste@uk.co.gec-rl-hrc>
Date: Wed, 24 May 89 10:53:00 GMT
Message-Id:  <8905241056.aa01776@lemon.gec-rl-hrc.co.uk>
Original-Sender: transputer-request@edu.cornell.tn.tcgould

Dear Erick

After our meeting last week I returned to find your question about
transputer timings on the mailing list. I am posting this reply via the
list as I think this might be of general interest.

Briefly, the original question was over execution times for a loop coded
as follows:

        ldc     10000
        stl     2

.align  /* Put L0 on a word boundary. */

L0:
        /* Insert variable number of "ldc 0" instructions. */

        ldl     2
        adc     -1
        stl     2
        ldl     2
        eqc     0
        cj      L0

The number of "ldc 0" instructions is varied and the total execution time
for the run is measured. Below are Erick's original timings (for a Parsytec
card?) together with some I gathered on the Tadpole card in our Sun. The
execution time *decreases* with added "ldc 0" instructions!

Number of | Erick's       | Tadpole card
"ldc 0"s  | timings in us | timings in us
----------+---------------+--------------
    0     |    16832      |  20352
    1     |    15422      |  19328
    2     |    14398      |  17856
    3     |    15870      |  17856
    4     |    18430      |  22848
    5     |       ?       |  21888

As Erick suggested, this occurs due to the operation of the instruction
pre-fetch buffer. Below, I've attempted a more detailed analysis based on
my own understanding of the transputer's architecture but would appreciate
any more precise information (Inmos?).

Instructions are pre-fetched in parallel with program execution. On a given
cycle, the CPU is taking instructions from one 4-byte buffer while the
pre-fetch may be filling another. When the CPU takes the last instruction
from its buffer, the buffers are swapped and another instruction pre-fetch
cycle is started. The transputer has only one memory space (i.e. non-Harvard
architecture) and so instruction pre-fetches compete with data accesses for
the memory bus.

When a jump occurs, an instruction fetch must occur before the CPU can begin
executing code at the new location. However, a second pre-fetch cycle is
started immediately to fill the pre-fetch buffer. If the user code contains
instructions which access memory (ldl, stl etc.) then these will be delayed
until after the second pre-fetch cycle completes.

If the "cj" instruction is the last byte in a word, then a new pre-fetch is
started even if the jump is taken. This pre-fetch will be for the instructions
five bytes after the "cj" byte and must be completed before the instructions
at the jump destination can be fetched. The "cj" instruction takes 4 cycles
if taken and these can overlap with the useless pre-fetch, so this end of the
pipeline break is not too disastrous.
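
To see which variants of the loop put the "cj" on the last byte of a word,
the byte layout can be worked out by hand. The short Python sketch below does
the same arithmetic. It assumes the usual one-byte-per-prefix encoding: "ldc 0",
"ldl 2", "stl 2" and "eqc 0" take one byte each, while "adc -1" and the
backward "cj L0" each need an "nfix" prefix and so take two. The function
names are my own, for illustration only.

```python
# Byte sizes assumed for the loop body: one byte per prefix or opcode.
LOOP_BODY = [
    ("ldl 2", 1),
    ("adc -1", 2),   # nfix + adc (negative operand)
    ("stl 2", 1),
    ("ldl 2", 1),
    ("eqc 0", 1),
    ("cj L0", 2),    # nfix + cj (backward jump of fewer than 16 bytes)
]
WORD_BYTES = 4       # L0 is word-aligned, so offsets are counted from L0


def cj_byte_in_word(num_ldc):
    """Return the position (0..3) within its word of the final "cj" byte."""
    total_bytes = num_ldc + sum(size for _, size in LOOP_BODY)
    return (total_bytes - 1) % WORD_BYTES


def cj_ends_word(num_ldc):
    """True if the "cj" opcode byte is the last byte of a word."""
    return cj_byte_in_word(num_ldc) == WORD_BYTES - 1


if __name__ == "__main__":
    for n in range(6):
        print(n, cj_byte_in_word(n), cj_ends_word(n))
```

Under these assumptions the "cj" lands on the last byte of a word when the
number of "ldc 0" instructions is 0 or 4, and not for 1, 2, 3 or 5.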

I've drawn some diagrams below which show the interaction between the
pre-fetch's use of the memory bus and the program's data accesses, and these
illustrate how extra instructions can cause faster execution of the loop.
The diagrams are for the Tadpole card which has a 20MHz T414 with
250ns DRAM cycles. Each character position horizontally represents
one clock cycle (50ns). Instruction codes are written vertically and
are placed on the cycle on which they're read. The periods labelled
"execution" show when the transputer is running program code. The asterisks
show when either prefetch or the program's data access is using the memory
bus.

First with no "ldc" instructions, then with 3 "ldc" instructions:

cycle no.           11111111112222222222333333333344
          012345678901234567890123456789012345678901

inst      L    l          nas    l          enc
          0    d          fdt    d          qfj
          :    l          icl    l          ciL
               2          x 2    2          0x0
execution           .............     ............
data acc.           ******  *****     ******
prefetch  **********             *****         *****  takes 42 cycles


inst      L    llll       nas       l     enc
          0    dddd       fdt       d     qfj
          :    cccl       icl       l     ciL
               0002       x 2       2     0x0
execution      ...  ........   .................
data acc.           ******     ***********
prefetch  **********      *****           *****       takes 38 cycles

These diagrams are not 100% accurate; I believe greater overlap occurs than
these diagrams suggest (I haven't the time to set up a logic
analyser and Inmos keep quiet about the transputer micro-architecture).
However, they are within a cycle or two of the measured figures.
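
As a rough cross-check, the measured Tadpole totals can be converted into
cycles per loop iteration and set against the two diagram counts. The Python
sketch below uses only figures already given (10000 iterations, 50ns cycles,
the Tadpole column of the table). It assumes the two set-up instructions
before the loop contribute a negligible amount to each total. The names are
illustrative only.

```python
CYCLE_NS = 50        # one clock cycle of the 20MHz T414
ITERATIONS = 10000   # loop count loaded by "ldc 10000"

# Tadpole card timings in microseconds, keyed by the number of "ldc 0"s.
TADPOLE_US = {0: 20352, 1: 19328, 2: 17856, 3: 17856, 4: 22848, 5: 21888}

# Cycle counts read off the two diagrams (0 and 3 "ldc 0" instructions).
DIAGRAM_CYCLES = {0: 42, 3: 38}


def cycles_per_iteration(total_us):
    """Convert a total run time in microseconds to cycles per iteration."""
    return total_us * 1000 / (ITERATIONS * CYCLE_NS)


if __name__ == "__main__":
    for n, total_us in sorted(TADPOLE_US.items()):
        measured = cycles_per_iteration(total_us)
        line = f"{n} ldc: {measured:.1f} cycles/iteration"
        if n in DIAGRAM_CYCLES:
            line += f" (diagram: {DIAGRAM_CYCLES[n]})"
        print(line)
```

This gives about 40.7 cycles per iteration with no "ldc 0" instructions and
about 35.7 with three, against 42 and 38 in the diagrams.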

CONCLUSIONS

In general, the "ldc" instructions can be replaced with any instruction which
doesn't access external memory. These will then be executed in parallel with
the second instruction pre-fetch. A compiler could re-arrange code to take
advantage of this, by placing code which doesn't reference memory at the head
of word-aligned looping constructs. Also, within a section of code it is
theoretically possible to optimise code (perhaps by adding "pfix 0" nops)
so that instruction pre-fetches occur when non-memory instructions are
executed, but this is very tricky and linked to a given hardware
implementation's memory cycle.

Unnecessary pre-fetches at the tail of a loop are avoided if the "cj"
instruction is not the last byte of a word. This might also be tricky to
arrange and doesn't gain you much given that "cj" takes four cycles in any
case.

The second execution trace shows how well the transputer can saturate its
memory bus, but also shows how tightly its performance is coupled to
the speed of external memory (how do you fit a Lisp system into internal
RAM?). Expect to see caches on future transputers.....


+---------------------------------------------------------------------------+
| Krste Asanovic                        email: krste@uk.co.gec-rl-hrc       |
| GEC Hirst Research Centre             phone: +44 (1) 908 9662             |
| East Lane                             fax:   +44 (1) 904 7582             |
| Wembley                               telex: 923429 GECLAB G              |
| Middlesex                                                                 |
| HA9 7PP                                                                   |
| England                                                                   |
+---------------------------------------------------------------------------+

