Newsgroups: comp.parallel
From: sanders@i90s19.ira.uka.de (Peter Sanders)
Subject: Re: Collective Communication
Organization: Universitaet Karlsruhe, Germany
Date: 21 Mar 1995 16:51:25 GMT

In article <3kkc6k$2eb@usenet.srv.cis.pitt.edu> Robert van de Geijn <rvdg@cs.utexas.edu> writes:

   Date: 17 Mar 1995 22:39:55 -0600
   Organization: CS Dept, University of Texas at Austin

   In article 11965, Roger Butenuth and Peter Sanders write:

      In article <3jkn8n$i05@usenet.srv.cis.pitt.edu> rvdg@cs.utexas.edu
      (Robert van de Geijn) writes:

      |>                   vector        NX        InterCom      ratio
      |>     Operation     length       (sec)       (sec)     (NX/InterCom) 
      |>   -----------------------------------------------------------------
      |>
      |>     Broadcast     8 bytes      0.0017      0.0014        1.21
      |>                 64K bytes      0.0356      0.0069        5.18
      |>                  1M bytes      0.5788      0.0493       11.75
      |>
      |>   Global Sum      8 bytes      0.0032      0.0029        1.10
      |>     to all      64K bytes      0.3780      0.0195       19.35 
      |>                  1M bytes      5.9353      0.1791       33.15

      We were surprised by this performance data, because the operating
      system Cosy (Concurrent Operating SYstem) with the PIGSeL library is
      about as fast as the Paragon, considering it is running on Transputers
      (T805, 30 MHz) with only 30 MIPS / 2 MFLOPS / 1.7 MB/s bandwidth on
      their links.
[...]
			  vector        NX        InterCom      Cosy
	    Operation     length       (sec)       (sec)        (sec)
	  -------------------------------------------------------------

	  Global Sum      8 bytes      0.0032      0.0029      0.0062
	    to all                                             ======


   Notice that this is a performance number for SMALL message length, for
   which bandwidth is meaningless.

Yes. But there are many applications for which bandwidth IS meaningless,
because collective operations are used there mainly for coordination
purposes.  If bandwidth were your primary objective, this might not
always be justified.
One thing we wanted to find out with our posting is what makes the
Paragon so slow.  Apparently the start-up overhead is the limiting
factor for short messages. But how high is it for your application?
Wildly differing and confusing values are given for different
benchmarks.  Does anybody know how NX's overhead can be further
subdivided, for example into kernel entry, copying, synchronization,
buffer maintenance, etc.?
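Why startup dominates for short messages can be sketched with the usual
linear cost model T(n) = alpha + beta*n (alpha = per-message startup,
beta = per-byte time).  The constants below are illustrative assumptions,
NOT measured NX or InterCom values:

```python
# Linear cost model for a point-to-point message:
#     T(n) = alpha + beta * n
# alpha is the per-message startup overhead, beta the per-byte time.
# The constants are illustrative assumptions, not measured values.

ALPHA = 50e-6          # assumed startup overhead: 50 microseconds
BETA = 1.0 / 50e6      # assumed per-byte time: 50 MB/s sustained bandwidth

def message_time(n_bytes, alpha=ALPHA, beta=BETA):
    """Predicted transfer time for an n-byte message."""
    return alpha + beta * n_bytes

for n in (8, 64 * 1024, 1024 * 1024):
    t = message_time(n)
    print(f"{n:>8} bytes: {t * 1e6:10.1f} us, "
          f"startup share = {ALPHA / t:6.1%}")
```

With these (assumed) parameters, startup is essentially all of the cost
at 8 bytes and nearly irrelevant at 1 MB, which is why the short-message
and long-message rankings of two libraries can differ so much.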

   All this indicates that the transputer network has a communication
   startup twice as high as the Paragon's.
 
No. Our system is limited by hop-to-hop routing latency, which is about
1000 times higher than on the Paragon (store-and-forward versus
wormhole routing). The Cosy router needs about 30 microseconds to
forward a packet to the next node.
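The difference between the two routing schemes can be sketched with the
standard latency models.  Only Cosy's 30 us per-hop delay and the
1.7 MB/s link bandwidth come from the text; the hop count and the
wormhole-side parameters are assumptions for illustration:

```python
# Latency models for the two routing schemes (a sketch; only Cosy's
# 30 us per-hop delay and 1.7 MB/s link bandwidth are from the text,
# the wormhole parameters and hop counts are assumptions).

def store_and_forward(n_bytes, hops, t_hop=30e-6, bw=1.7e6):
    # Each intermediate node receives the whole packet before
    # forwarding it, so the full transmission time is paid per hop.
    return hops * (t_hop + n_bytes / bw)

def wormhole(n_bytes, hops, t_hop=30e-9, bw=175e6):
    # Only the header is delayed at each hop; the packet body
    # pipelines behind it, so transmission time is paid once.
    return hops * t_hop + n_bytes / bw

# Example: a 1 KB packet crossing 10 hops (both figures hypothetical).
print(f"store-and-forward: {store_and_forward(1024, 10) * 1e6:8.1f} us")
print(f"wormhole:          {wormhole(1024, 10) * 1e6:8.1f} us")
```

The model makes the point in the posting concrete: with store-and-forward
routing the whole packet transmission time multiplies with the hop count,
so the network, not the software startup, bounds the latency.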

   I always thought that one of the advantages of transputer
   networks was that communication startup was relatively small....

That depends on the functionality you want.  If you strictly follow
the CSP-like programming model of the Transputer, with physically
placed channels, you have negligible startup overhead.  But Cosy is
meant to be a portable, high-level operating system for parallel
computers supporting fully transparent routing and real multitasking
(without polling network interfaces or other things that waste CPU
cycles). Considering this, a startup overhead lower than NX's is quite
good for a machine with a 20--100 times lower CPU performance
(2 MFLOPS).
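The point about CPU speed can be made concrete by expressing startup
overhead in clock cycles rather than seconds.  The 30 MHz T805 clock is
from the text above and the Paragon's i860 runs at 50 MHz; the overhead
figures themselves are hypothetical round numbers, not measurements:

```python
# Normalize software startup overhead by CPU speed: the same cycle
# budget costs much more wall-clock time on a slower processor.
# Clocks: 30 MHz T805 (from the text), 50 MHz i860 (Paragon).
# The overhead figures are hypothetical round numbers.

def overhead_in_cycles(overhead_sec, clock_hz):
    return overhead_sec * clock_hz

t805_cycles = overhead_in_cycles(100e-6, 30e6)   # hypothetical 100 us
i860_cycles = overhead_in_cycles(60e-6, 50e6)    # hypothetical 60 us
print(f"T805: {t805_cycles:.0f} cycles, i860: {i860_cycles:.0f} cycles")
```

In this (hypothetical) example both systems spend the same number of
cycles on startup, even though the slower machine shows a longer
wall-clock overhead, which is the sense in which a sub-NX startup time
on 30 MHz hardware is a good result.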

   Could the authors kindly provide performance numbers for 64K bytes and
   1M bytes?

Cosy achieves 800 KB/s for 64 KB messages for multicasts (52% of peak
bandwidth vs. 6% for InterCom).  The multicast code can handle
arbitrary sets of processes and arbitrary networks (e.g. grid,
hypercube, perfect shuffle etc.).


Peter Sanders, Roger Butenuth

