Newsgroups: comp.parallel
From: pfister@austin.ibm.com (Pfister)
Subject: Re: SCI Out of Gas at Starting Gate?
Keywords: SCI
Organization: IBM Austin, RISC System/6000 Division
Date: Thu, 30 Nov 1995 15:18:50 GMT
Message-ID: <49ki0q$1ct6@ausnews.austin.ibm.com>


In article <49fk7s$mtc@sdcc12.ucsd.edu>, muller@sdcc33.ucsd.edu (Keith Muller) writes:
> 
> While doing some very rough performance estimates the other day with a grad 
> student about using SCI rings to create a CC-NUMA SMP some very odd results
> came up.  Does anyone see what is wrong in either assumptions or calculations?
[snip]
> Here are the assumptions:
> 
> o PentiumPro issues 1 memory reference/cycle
> o 300 MHz clock rate
> o Second level cache miss rate of 1%
> o SMP programming model (no optimization for NUMA)
[snip]
> For an 8-way system (dual SCI connected 4-way PentiumPro):
[snip]
> o  Minimum Bw required = (3*16 + 80) * 12M = 1536 MB/s
>                           ^^^    ^^ 3 small packets, 1 large
> 
> It appears that the cache traffic alone will choke an SCI ring in 1997.
[snip] 
> Any clues what we are doing wrong here?

IMHO: assumption 4, use of just SCI, and assumption 1.

NUMA systems must assume that something is done to increase the node-locality
of reference.  If that is done, they can scream; if not, the inter-node
bandwidth and latency restrictions make them die compared to SMPs.

The "something" need not always be explicit application coding.  That's clearly
possible, but OS things can help a lot without explicit application
modifications.  For example, it will obviously help a lot to allocate a
process's memory on the node where it's running.  Damfino what
numeric assumptions to make about locality at this time.
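Just to make the sensitivity concrete, here's a back-of-envelope sketch (Python, purely illustrative -- the locality fractions are made-up numbers, and the 1536 MB/s baseline is the figure from the quoted 8-way estimate):

```python
# How node-locality of reference cuts inter-node traffic.
# If a fraction `locality` of second-level cache misses is satisfied
# on the local node, only the remainder crosses the SCI ring.

def remote_bandwidth(base_mb_s, locality):
    """Bandwidth that must cross the interconnect when a fraction
    `locality` of misses stays on-node."""
    return base_mb_s * (1.0 - locality)

base = 1536.0  # MB/s, from the quoted 8-way estimate
for loc in (0.0, 0.5, 0.9, 0.99):
    print("locality %4.0f%% -> %7.1f MB/s remote" % (loc * 100,
          remote_bandwidth(base, loc)))
```

At 90% locality the required ring bandwidth drops by an order of magnitude, which is why the "something" matters so much.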

What I mean by "just SCI" is this.  Take a look at the system Sequent announced,
which is right along the lines of what you describe above:  NUMA-coupled 4-way
SMPs, Intel-based.  In addition to the (SCI (-like?)) NUMA stuff, they tacked a
32 MegaByte (that is not a typo) cache on each SMP node to hold data from other
nodes.  Obviously this is going to cut down on inter-node traffic.  How much,
in practice, I've no idea.  I'd be interested to find out whether they do.
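A rough model (Python sketch; both parameters are pure guesses, since as I said I've no idea what a 32 MB node cache achieves in practice) is that locality and a node cache compound multiplicatively:

```python
# Inter-node traffic left after (a) a fraction `locality` of misses
# stays on the local node and (b) a per-node cache of remote data
# absorbs a fraction `cache_hit_rate` of what's left.
# The 50% / 80% figures below are assumptions, not measurements.

def surviving_traffic(base_mb_s, locality, cache_hit_rate):
    """Bandwidth that still crosses the ring after both effects."""
    return base_mb_s * (1.0 - locality) * (1.0 - cache_hit_rate)

# 1536 MB/s is the quoted 8-way estimate.
print("%.1f MB/s" % surviving_traffic(1536.0, 0.5, 0.8))
```

Even modest values for either knob take a big bite out of the 1536 MB/s figure.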

Finally, I think your assumption of 1 memory reference/cycle is on the high
side.  This depends on the application, of course; some technical codes can
require >1 operand per FLOP.  But a reasonable average is around 30-40% of all
instructions being loads or stores (see Hennessy & Patterson).  That more than
halves the bandwidth requirement right off the bat, if you assume
1 cycle/instruction.  It's not enough to fix the problem, however, which is why
I started with the other issues.
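Redoing the quoted estimate with a 35% load/store fraction (the Hennessy & Patterson ballpark) instead of 1 reference/cycle, and keeping all the other quoted assumptions:

```python
# Revised bandwidth estimate for one 4-way node.
# Assumes, as in the quoted calculation: 300 MHz, 1 instruction/cycle,
# 1% second-level miss rate, and (3*16 + 80) = 128 bytes of SCI packet
# traffic per miss.  The 0.35 refs/instruction is the H&P ballpark.

clock_hz = 300e6
refs_per_instr = 0.35          # loads+stores per instruction (assumption)
miss_rate = 0.01
cpus_per_node = 4
bytes_per_miss = 3 * 16 + 80   # 3 small packets + 1 large, per the quote

misses_per_sec = clock_hz * refs_per_instr * miss_rate * cpus_per_node
bw_mb_s = misses_per_sec * bytes_per_miss / 1e6
print("%.1f MB/s" % bw_mb_s)
```

That comes to about 538 MB/s rather than 1536 MB/s -- still a lot for one ring, hence the other issues above.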

Greg
-- 
________________________________________________________________________
Greg Pfister           |     My Opinion Only     | Phone: (512) 838-8338
pfister@austin.ibm.com | Sic Crustulum Frangitur |   Fax: (512) 838-5989
