Newsgroups: comp.parallel
From: shore@dinah.tc.cornell.edu (Melinda Shore)
Subject: KSR reliability (was Re: SMP vs. MPP)
Keywords: MPP, SMP, PVP
Organization: No Mountain Software
Date: Mon, 2 Jan 1995 19:24:59 GMT
Message-ID: <3e9jq9$1l2k@theory.tc.cornell.edu>

In article <mjrD1n3BL.KuL@netcom.com> mjr@netcom.com (Mark Rosenbaum) writes:
>Much has been said of the business problems of KSR but I would
>be interested in hearing from KSR users and ex-employees about how
>well the system was doing technically. My understanding is that
>there was a reliability problem that was either hardware or 
>software related.

The KSR/1-128 here at the Theory Center is doing quite
well.  We're seeing uptimes in excess of a month under
significant load.  The software problems (and there have
been some significant ones) are pretty much what you'd
expect to see in a young product - I wouldn't say that
they've been any worse than those in early releases of
Unicos, for example.

The one thing that's been disturbing is the high rate of
component failure.  There were some obvious fabrication
problems that have since been corrected, and there's been
excessive sensitivity to heat stress.  Also, there's
obviously going to be a scaling problem - if the
expected MTBF of, say, an aprd is approximately 10 years,
then on a 128-cell machine you might reasonably expect an
aprd failure every month or so (although I'll grant that
the failures aren't going to distribute randomly).
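
For anyone who wants to check the back-of-the-envelope estimate above,
here is a rough sketch of the arithmetic. It assumes one aprd per cell,
independent failures, and a constant failure rate, none of which is
quite true of a real machine. The function names are made up for
illustration.

```python
import math

MONTHS_PER_YEAR = 12.0

def system_mtbf_months(component_mtbf_years, n_components):
    """Expected time between failures anywhere in the machine, in months.

    Assumes n_components identical parts failing independently at a
    constant rate, so the failure rates simply add.
    """
    return component_mtbf_years * MONTHS_PER_YEAR / n_components

def prob_at_least_one_failure(component_mtbf_years, n_components, window_months):
    """Probability of one or more failures in a window of the given length.

    Uses the exponential (constant-rate) model, which is an assumption,
    not a measured property of the hardware.
    """
    rate_per_month = n_components / (component_mtbf_years * MONTHS_PER_YEAR)
    return 1.0 - math.exp(-rate_per_month * window_months)

if __name__ == "__main__":
    # 10-year part MTBF, 128 parts (assumed one per cell)
    print(system_mtbf_months(10, 128))            # 0.9375 months
    print(prob_at_least_one_failure(10, 128, 1))  # chance of a failure in any given month
```

Under those assumptions the machine-wide figure comes out a bit under a
month, and clustered failures (heat, bad fabrication batches) would
change the spacing without changing the long-run average much.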

All in all I've been more impressed with the machine and
the software than I have with anything else I've seen in
the past few years (although I haven't seen an SCI-based
machine yet).  I'm *really* disappointed to see the
technology disappear because of corporate bungling.
--
        Melinda Shore - No Mountain Software - shore@tc.cornell.edu
                       I don't speak for Cornell.
          If you send me harassing email, I'll probably post it


