Newsgroups: comp.parallel
From: muller@sdcc33.ucsd.edu (Keith Muller)
Subject: SCI Out of Gas at Starting Gate?
Keywords: SCI
Organization: University of California, San Diego
Date: Tue, 28 Nov 1995 18:26:04 GMT
Message-ID: <49fk7s$mtc@sdcc12.ucsd.edu>

While doing some very rough performance estimates the other day with a grad
student about using SCI rings to create a CC-NUMA SMP, some very odd results
came up.  Does anyone see what is wrong in either the assumptions or the
calculations?

The press indicates Intel will be delivering powerful PentiumPro
processor chip sets based upon a higher speed (100 MHz?) bus sometime
in 1997.  One can assume, based upon Intel's recent track record, that the
processor chips will be running at close to 300 MHz by that time.

Since you will be able to put together a quad-PentiumPro SMP with the
Intel-supplied chipset (or purchase a complete motherboard),
you would think that using SCI to bridge the processor-memory buses to make a
larger SMP (though it would be a CC-NUMA architecture) would be reasonable.

Well, not so fast. The following (very) simple analysis indicates that
SCI may not be able to support a NUMA system using 1997 Intel microprocessors.

Here are the assumptions:

o PentiumPro issues 1 memory reference/cycle
o 300 MHz clock rate
o Second level cache miss rate of 1%
o SMP programming model (no optimization for NUMA)

For the purposes of this analysis, all the assumptions err on the
conservative side.  

For an 8-way system (two SCI-connected 4-way PentiumPro nodes):

o  300 MHz * 4 CPUs * 1 memory reference/cycle = 1200M memory refs/sec
o  1% of 1200M = 12 Million cache misses/second/quad PentiumPro
o  12 Million * .5 = 6M cache cycles/sec/quad PentiumPro on SCI
               ^^^ probability of cache line being on the remote node
o  6M * 2 nodes * 4 SCI transactions/cache cycle = 48M SCI transactions/sec
                  ^^^^^ extremely conservative estimate!
o  Transaction Latency (Max Avg.) = 1/48M = 20.8 ns
o  Minimum Bw required = (3*16 + 80) * 12M = 1536 MB/s
                          ^^^    ^^ 3 small packets, 1 large
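The arithmetic above can be re-run with a short Python sketch.  The function
and parameter names are made up for illustration; the default values are the
assumptions listed above, with the 16- and 80-byte packet sizes read off the
bandwidth line.  With these inputs, 1/48M works out to about 20.8 ns per
transaction and the bandwidth to 1536 MB/s.

```python
def sci_load(clock_hz=300e6, cpus_per_node=4, refs_per_cycle=1.0,
             l2_miss_rate=0.01, remote_fraction=0.5, nodes=2,
             txns_per_miss=4, small_pkts=3, small_pkt_bytes=16,
             large_pkts=1, large_pkt_bytes=80):
    """Back-of-the-envelope SCI ring load for a 2-node CC-NUMA box.

    All parameter names are hypothetical; the defaults mirror the
    assumptions in the post.
    """
    # Memory references issued per node per second.
    refs_per_node = clock_hz * cpus_per_node * refs_per_cycle
    # Second-level cache misses per node per second.
    misses_per_node = refs_per_node * l2_miss_rate
    # Misses whose cache line lives on the other node, so they cross SCI.
    remote_per_node = misses_per_node * remote_fraction
    remote_total = remote_per_node * nodes
    # SCI transactions per second across the whole ring.
    txns_per_sec = remote_total * txns_per_miss
    # Bytes moved per remote miss: a few small packets plus one large one.
    bytes_per_miss = small_pkts * small_pkt_bytes + large_pkts * large_pkt_bytes
    return {
        "misses_per_node": misses_per_node,
        "remote_total": remote_total,
        "txns_per_sec": txns_per_sec,
        "ns_per_txn": 1e9 / txns_per_sec,
        "mb_per_sec": remote_total * bytes_per_miss / 1e6,
    }


if __name__ == "__main__":
    for name, value in sci_load().items():
        print(f"{name:>16}: {value:,.1f}")
```

Changing one default at a time (say, the miss rate or the remote fraction)
shows quickly which assumption the result is most sensitive to.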

It appears that the cache traffic alone will choke an SCI ring in 1997.

Based on the above results it would seem that SCI would be restricted to being
a powerful but less interesting distributed shared memory interconnect.

Of course a 2GB/s SCI standard (is one coming out soon?) would solve this in
the short term (if this were even possible to build today). However, it seems
that higher SCI speeds would only delay the inevitable, considering how fast
CPU clock speeds are increasing.
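To put rough numbers on "delay the inevitable", here is a small Python sketch
that scales the same estimate with CPU clock rate and finds the clock at which
a given ring bandwidth is used up.  The 1000 and 2000 MB/s ring figures are
assumptions chosen for illustration, the function names are hypothetical, and
the ring is treated as a single shared channel, which is a simplification.

```python
def required_mb_per_s(clock_mhz, cpus_per_node=4, miss_rate=0.01,
                      remote_fraction=0.5, nodes=2,
                      bytes_per_miss=3 * 16 + 80):
    """SCI bandwidth (MB/s) needed at a given CPU clock (MHz).

    Assumes 1 memory reference per cycle, as in the estimate above.
    """
    # Millions of remote misses per second across all nodes.
    remote_misses_m = (clock_mhz * cpus_per_node * miss_rate
                       * remote_fraction * nodes)
    return remote_misses_m * bytes_per_miss


def saturating_clock_mhz(ring_mb_per_s, **kwargs):
    """CPU clock (MHz) at which the required bandwidth equals the ring's."""
    # Required bandwidth is linear in clock rate, so divide by MB/s per MHz.
    return ring_mb_per_s / required_mb_per_s(1.0, **kwargs)


if __name__ == "__main__":
    for ring in (1000.0, 2000.0):  # assumed ring bandwidths, MB/s
        mhz = saturating_clock_mhz(ring)
        print(f"{ring:.0f} MB/s ring saturates at about {mhz:.0f} MHz")
```

Because the required bandwidth is linear in clock rate under these
assumptions, doubling the ring speed only doubles the clock rate at which the
ring saturates.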

Any clues what we are doing wrong here?

Keith Muller
muller@cs.ucsd.edu
University of California, San Diego 


