Machine type | Shared-memory multi-processor |
---|---|
Models | Power Challenge L, XL |
Operating system | IRIX (SGI's Unix variant) |
Compilers | Fortran 77, C, C++, Pascal |
Vendor's information Web page | http://www.sgi.com/Products/hardware/Power/ |
System parameters:
Model | Model L | Model XL |
---|---|---|
Clock cycle | 13.3 ns | 13.3 ns |
Theor. peak performance | | |
Per proc. (64-bit) | 300 Mflop/s | 300 Mflop/s |
Maximal (64-bit) | 1.8 Gflop/s | 5.4 Gflop/s |
Main memory | <=6 GB | <=16 GB |
Memory bandwidth | | |
Proc. to cache/proc. | 1.2 GB/s | 1.2 GB/s |
Main memory/cache | 1.2 GB/s | 1.2 GB/s |
No. of processors | 2-6 | 2-18 |
Remarks:
The Power Challenge systems are shared-memory multiple-instruction multiple-data (MIMD) computers: several different instructions can execute at the same time, each operating on its own data items. All data reside in a single shared memory from which the processors fetch the operands they need and to which the results are written back. As in most high-performance systems, the main problem is supplying the CPUs with data and transporting the results back at a rate that keeps them continuously busy, and the Power Challenge is no exception. Data is transported between the main memory and the CPUs over a central bus, the so-called POWERpath-2 bus, which is 256 bits wide and has a bandwidth of 1.2 GB/s. This is very fast as buses go, but even so the data rates the CPUs require could not be met without special provisions. These provisions take the form of data and instruction caches for each of the CPUs.
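To make the gap concrete, a back-of-the-envelope estimate can compare the bandwidth one processor would need with what the bus provides. The following is a minimal C sketch; the vector update, the 8-byte word size, and the assumption that nothing is cached are illustrative choices, not measured data:

```c
#include <stdio.h>

/* Rough estimate of the main-memory traffic generated by the vector
   update  y[i] = y[i] + a*x[i]  when nothing is cached: per iteration
   two loads (x[i], y[i]) and one store (y[i]) of 64-bit words, against
   two flops (one multiply, one add). */
int main(void)
{
    double peak_flops   = 300e6;    /* nominal peak of one R8000 (from the table) */
    double bus_bw       = 1.2e9;    /* POWERpath-2 bus bandwidth in bytes/s       */
    double flops_per_it = 2.0;      /* multiply + add per iteration               */
    double bytes_per_it = 3 * 8.0;  /* two loads + one store of 8-byte operands   */

    /* Bandwidth one processor would need to run this loop at peak speed. */
    double needed = peak_flops / flops_per_it * bytes_per_it;

    printf("needed by 1 CPU : %.1f GB/s\n", needed / 1e9);
    printf("bus provides    : %.1f GB/s\n", bus_bw / 1e9);
    printf("CPUs the bus could feed at peak: %.2f\n", bus_bw / needed);
    return 0;
}
```

Under these assumptions a single processor running such a loop at peak speed would need about 3.6 GB/s, three times the bus bandwidth, which is exactly the gap the caches have to bridge.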
The Power Challenge series uses MIPS R8000 RISC processors (formerly called the TFP processor, for True Floating Point) with a nominal peak speed of 300 Mflop/s. Although the clock rate of this processor is half that of its predecessor, the R4400, its performance is four times higher. Because at this processing speed the demand for data is even greater than on the R4400, each processor has a special extra cache, the so-called ``streaming cache'', of up to 16 MB. This cache is very large and should reduce the bus traffic as much as possible: all floating-point operations stream their operands from this large off-chip cache into the floating-point registers. Unlike the R4400, the R8000 can execute a combined multiply-add operation, which in many cases doubles the operation speed. In addition, the floating-point functional units are doubled with respect to the R4400, which explains the four-fold increase in performance over its predecessor.
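As a check on the figures in the table above (a sketch using only the numbers quoted in this section): a clock cycle of 13.3 ns corresponds to a clock frequency of about 75 MHz, and with two floating-point pipelines each completing one multiply-add (2 flops) per cycle one obtains

75 MHz x 2 pipelines x 2 flops = 300 Mflop/s per processor,

so the maximal configurations give 6 x 300 Mflop/s = 1.8 Gflop/s (Model L) and 18 x 300 Mflop/s = 5.4 Gflop/s (Model XL).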
Power Challenge systems can be coupled by HiPPI channels to form a cluster of systems, using very efficient ``shared-memory'' PVM and MPI implementations that, for the user, work homogeneously both within a single Power Challenge system and between systems. Such clusters, called Power Challenge Arrays by SGI, could be used for the solution of extremely large application problems. SGI intends to extend this technique by providing faster coupling and switching between the systems, a trend that can also be observed with other vendors (see sections 3.4.6 and 3.3.5, the SX-4).
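The message-passing interface is the same whether the processes run within one Power Challenge or are spread over a HiPPI-coupled array. A minimal MPI sketch in C (the program itself is invented here for illustration, not taken from SGI documentation) could look as follows:

```c
#include <stdio.h>
#include <mpi.h>

/* Each process computes a partial result; the results are combined with
   MPI_Reduce.  The same source runs unchanged whether the processes are
   placed on one shared-memory Power Challenge or distributed over a
   HiPPI-coupled Power Challenge Array. */
int main(int argc, char **argv)
{
    int rank, size;
    double local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = (double) rank;                    /* stand-in for real work */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes: %g\n", size, total);

    MPI_Finalize();
    return 0;
}
```

Because the implementation chooses the transport (shared memory within a system, HiPPI between systems), the program itself stays the same for the user.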
Parallelisation is done either automatically by the (Fortran or C) compiler or explicitly by the user, mainly through the use of directives. As synchronisation, etc., has to be done via memory, the parallelisation overhead is fairly large. In fact, experiments reported in [bmtut] show that a distributed-memory implementation of the same problem can be much faster, even on a single Power Challenge.
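To illustrate the kind of loop the parallelising compiler targets (a sketch only; the loop and program are invented here, and whether the compiler actually parallelises it depends on the options used and on its dependence analysis):

```c
#include <stdio.h>
#include <stdlib.h>

/* The iterations of this loop are independent, so an auto-parallelising
   compiler (or a user directive) can divide them over the processors.
   The fork/join and the end-of-loop synchronisation go through shared
   memory, which is the source of the parallelisation overhead mentioned
   above; for short loops it can outweigh the gain. */
void scale_add(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

int main(void)
{
    int n = 1000000, i;
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);

    for (i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
    scale_add(n, 0.5, x, y);
    printf("y[0] = %g\n", y[0]);   /* expect 2.5 */

    free(x);
    free(y);
    return 0;
}
```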
Measured Performances:
On an SGI Power Challenge Array equipped with 128 processors, a performance of 26.7 Gflop/s was measured when solving a dense linear system of order 53,000 [linpackbm].