The Silicon Graphics Power Challenge

Machine type	Shared-memory multi-processor
Models	Power Challenge L, XL
Operating system	IRIX (SGI's Unix variant)
Compilers	Fortran 77, C, C++, Pascal
Vendor's information Web page	http://www.sgi.com/Products/hardware/Power/

System parameters:

Model	Model L	Model XL
Clock cycle	13.3 ns	13.3 ns
Theor. peak performance
Per proc. (64-bit)	300 Mflop/s	300 Mflop/s
Maximal (64-bit)	1.8 Gflop/s	5.4 Gflop/s
Main memory	<=6 GB	<=16 GB
Memory bandwidth
Proc. to cache/proc.	1.2 GB/s	1.2 GB/s
Main memory/cache	1.2 GB/s	1.2 GB/s
No. of processors	2-6	2-18
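
The peak figures in the table can be reproduced from the clock cycle and the processor count. The following is a minimal sketch, not vendor data: it assumes 4 floating-point results per clock cycle (two floating-point units, each doing a combined multiply-add, as described in the remarks), and the function name is hypothetical.

```python
def peak_mflops(clock_cycle_ns: float, flops_per_cycle: int) -> float:
    """Theoretical peak of one processor in Mflop/s."""
    clock_mhz = 1000.0 / clock_cycle_ns  # a 13.3 ns cycle is about 75 MHz
    return clock_mhz * flops_per_cycle


# Assumption: 4 flops per cycle (2 units x multiply-add), inferred from the remarks.
FLOPS_PER_CYCLE = 4
per_proc = peak_mflops(13.3, FLOPS_PER_CYCLE)

# Maximum processor counts taken from the table above.
max_gflops = {model: per_proc * procs / 1000.0
              for model, procs in (("L", 6), ("XL", 18))}

print(f"per processor: {per_proc:.0f} Mflop/s")
for model, gflops in max_gflops.items():
    print(f"Model {model}: {gflops:.1f} Gflop/s")
```

The small differences from the tabulated values come only from rounding the clock cycle to 13.3 ns.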

Remarks:

The Power Challenge systems are shared-memory multiple-instruction multiple-data (MIMD) parallel computers: several different instructions can execute at the same time, each operating on different data items. All data are stored in a single shared memory from which the processors draw the data items they need and to which they write back their results. As in most high-performance systems, the main problem is to supply the CPUs with data, and to transport the results back, at a rate that keeps them continuously busy. The Power Challenge is no exception. Data are transported between the main memory and the CPUs by a central bus, the so-called POWERpath-2 bus, which is 256 bits wide and has a bandwidth of 1.2 GB/s. This is very fast as buses go, but without special provisions it still could not possibly deliver the data rates that the CPUs require. These provisions are present in the form of data and instruction caches for each of the CPUs.
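
To get a feeling for the gap between the bus bandwidth and what the CPUs could consume, the sketch below estimates the memory traffic of a simple vector update y = y + a*x running at peak speed with all operands coming from main memory. The choice of kernel is an assumption made for illustration only and does not come from the vendor; the function name is hypothetical.

```python
BYTES_PER_WORD = 8  # 64-bit operands


def required_bandwidth_gb_s(peak_mflops: float, loads: int, stores: int, flops: int) -> float:
    """Memory traffic in GB/s needed to run a kernel at peak speed without caches."""
    bytes_per_flop = (loads + stores) * BYTES_PER_WORD / flops
    return peak_mflops * 1e6 * bytes_per_flop / 1e9


BUS_GB_S = 1.2       # POWERpath-2 bandwidth from the text
PEAK_MFLOPS = 300.0  # per-processor peak from the table

# Assumed kernel: y[i] = y[i] + a * x[i], i.e. 2 loads, 1 store, 2 flops per element.
one_cpu = required_bandwidth_gb_s(PEAK_MFLOPS, loads=2, stores=1, flops=2)
all_cpus = 18 * one_cpu  # a fully configured Model XL

print(f"one CPU : {one_cpu:.1f} GB/s needed, bus delivers {BUS_GB_S} GB/s")
print(f"18 CPUs : {all_cpus:.1f} GB/s needed, bus delivers {BUS_GB_S} GB/s")
```

Under these assumptions a single processor would already ask for three times the bus bandwidth, which is why operands that are reused from cache are essential.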

The Power Challenge series uses MIPS R8000 RISC processors (formerly called the TFP processor, standing for Tremendous Floating Point) with a nominal peak speed of 300 Mflop/s. Although the clock rate of this processor is half that of its predecessor, the R4400, its peak performance is four times higher. Because this processing speed makes the demand for data even higher than that of the R4400, there is a special extra cache, called the "Streaming cache", of up to 16 MB. This very large cache is meant to reduce the bus traffic as much as possible. All floating-point operations are done by streaming the operands from this large off-chip cache to the floating-point registers. In contrast to the R4400, the R8000 can perform a combined multiply-add operation, which in many cases doubles the operation speed. In addition, the number of floating-point functional units is doubled with respect to the R4400; together these should explain the four-fold increase in performance over this predecessor.

Power Challenge systems can be coupled by HiPPI channels to form a cluster of systems, called a Power Challenge Array by SGI. Such a cluster uses very efficient "shared-memory" PVM and MPI implementations that look the same to the user both within a single Power Challenge system and between systems. This could be used to solve extremely large application problems. SGI intends to extend this technique by providing faster coupling and switching between the systems. The same trend is seen with other vendors (see 3.4.6 and 3.3.5 (SX-4)).

Parallelisation is done either automatically by the (Fortran or C) compiler or explicitly by the user, mainly through the use of directives. As synchronisation, etc., has to be done via memory, the parallelisation overhead is fairly large. In fact, experiments reported in [bmtut] show that a distributed-memory implementation of the same problem can be much faster on a single Power Challenge.

Measured Performances: On an SGI Power Challenge Array equipped with 128 processors, a performance of 26.7 Gflop/s was measured when solving a dense linear system of order 53,000 [linpackbm].
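
As a rough check, the measured figure can be put next to the aggregate theoretical peak. This sketch assumes that all 128 processors have the 300 Mflop/s peak listed in the table, which the measurement report quoted here does not state, and it uses the conventional operation count 2/3 n^3 + 2 n^2 for solving a dense system of order n. The function name is hypothetical.

```python
def linpack_flops(n: int) -> float:
    """Conventional operation count for solving a dense linear system of order n."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


MEASURED_GFLOPS = 26.7
PROCESSORS = 128
PEAK_PER_PROC_GFLOPS = 0.3  # assumption: per-processor peak from the table

aggregate_peak = PROCESSORS * PEAK_PER_PROC_GFLOPS
efficiency = MEASURED_GFLOPS / aggregate_peak
run_time_s = linpack_flops(53_000) / (MEASURED_GFLOPS * 1e9)

print(f"aggregate peak : {aggregate_peak:.1f} Gflop/s")
print(f"efficiency     : {efficiency:.1%}")
print(f"implied runtime: {run_time_s / 60:.0f} minutes")
```

Under these assumptions the run reaches roughly 70% of the aggregate peak and takes about an hour.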

Jack Dongarra
Sat Feb 10 15:12:38 EST 1996