
The IBM 9076 SP2

Machine type           RISC-based distributed-memory multi-processor cluster
Models                 IBM 9076 SP2
Operating system       AIX (IBM's Unix variant)
Connection structure   Omega switch
Compilers              XL Fortran (Fortran 90), HPF, XL C, C++
Vendor's information Web page  www.rs6000.ibm.com/hardware/largescale/index.html
Year of introduction   1997

System parameters:

Model                        9076 SP2
Clock cycle                  6.25 ns
Theor. peak performance:
  Per proc. (64-bit)         640 Mflop/s
  Maximal                    variable (see remarks)
Main memory:
  Memory/node                <= 1--2 GB (see remarks)
  Memory/maximal             <= 1 TB
Communication bandwidth:
  Point-to-point             150 MB/s
No. of processors            8--512

Remarks:

Presently, three types of node processor are available for the SP2: P2SC thin nodes, P2SC wide nodes, and the 604e ``High node''. The latter is in fact a cluster of up to 8 604e processors connected to a common memory in SMP fashion. These processors have a clock cycle of 5 ns but can deliver at most 2 floating-point results per cycle, while the P2SC nodes can deliver 4 floating-point results per cycle. As the fastest of the P2SC nodes has a clock cycle of 6.25 ns, it has a peak performance of 640 Mflop/s, while a single 604e processor can attain at most 400 Mflop/s. The 604e-based systems seem to be targeted more at the commercial market, whereas the P2SC-based systems are primarily meant for the technical/scientific market. The parameter list above refers to the presently fastest P2SC processor.
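The peak figures quoted follow directly from the cycle time and the number of floating-point results delivered per cycle; as a worked form of the numbers above:

\[ R_{\mathrm{peak}} = \frac{\mbox{results/cycle}}{t_{\mathrm{cycle}}}, \qquad \frac{4}{6.25\;\mathrm{ns}} = 640\;\mbox{Mflop/s}, \qquad \frac{2}{5\;\mathrm{ns}} = 400\;\mbox{Mflop/s}. \]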

SP2 configurations are housed in columns that can each contain 8--16 processor nodes, depending on the type of node employed: a thin node occupies half the space of a wide node. Although the processors in these nodes are basically the same, there are some differences. At the time of writing no wide nodes with a 6.25 ns clock were available yet; the fastest in this class feature a clock cycle of 7.4 ns, giving a peak speed of 540 Mflop/s. Also, thin nodes with a clock cycle of 8.3 ns are still around, with a peak speed of 480 Mflop/s.

Wide nodes have twice as many Micro Channel slots as thin nodes (8 instead of 4). Furthermore, the maximum memory of a wide node is 2 GB, whereas the maximum for thin nodes is 1 GB. IBM envisions the wide node more or less as a server for a column and recommends configurations of one wide node packaged with 14 thin nodes per column (although this may differ with the needs of the user). The SP2 is accessed through a front-end control workstation that also monitors system failures; failing nodes can be taken offline and exchanged without interrupting service. In addition, fileservers can be connected to the system, while every node can have up to 2 GB of local disk. This can greatly speed up applications with significant I/O requirements.

Although we mentioned only the highest-speed communication option, the High Performance Switch, a wide range of other options can be chosen instead: Ethernet, Token Ring, FDDI, etc., are all possible. The new version of the High Performance Switch is greatly improved over the former model, with a point-to-point bandwidth of 150 MB/s instead of the former 40 MB/s. Unfortunately, IBM's online (semi)technical information is not very helpful in providing more detail on the other properties of the switch, so, for instance, a bisection bandwidth cannot be given at this point. The High Performance Switch has some redundancy built into it for greater reliability. Its structure is that of a multi-stage crossbar (an Omega switch).
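As an illustration of how such a point-to-point figure is typically measured in practice, the following is a minimal MPI ping-pong sketch (MPI is supported on the SP2, see below). The message size and repetition count are arbitrary illustrative choices, not SP2-specific values:

/*
 * Minimal MPI ping-pong sketch to estimate point-to-point
 * bandwidth between two nodes.  Illustrative only: message size
 * and repetition count are arbitrary.  Run with two processes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NBYTES (1024 * 1024)   /* 1 MB message */
#define REPS   100

int main(int argc, char **argv)
{
    int rank, i;
    double t0, t1, mbytes;
    char *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = (char *) malloc(NBYTES);

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        /* 2 * REPS messages of NBYTES each crossed the link */
        mbytes = 2.0 * REPS * NBYTES / (1024.0 * 1024.0);
        printf("bandwidth: %.1f MB/s\n", mbytes / (t1 - t0));
    }

    free(buf);
    MPI_Finalize();
    return 0;
}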

Applications can be run using PVM or MPI. High Performance Fortran is also supported, both in a proprietary version and with a compiler from the Portland Group. IBM uses its own PVM version from which the XDR data-format converter has been stripped; this results in lower overhead at the cost of generality. The MPI implementation, MPI-F, is likewise optimised for the SP2 systems.
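To make the XDR remark concrete, the sketch below shows where the encoding choice enters an ordinary PVM program: pvm_initsend(PvmDataDefault) converts data via XDR for heterogeneous networks, while pvm_initsend(PvmDataRaw) sends the bytes unconverted, which is in effect what IBM's stripped PVM achieves on the homogeneous SP2. The task name ``worker'' is a hypothetical example:

/*
 * Hedged PVM sketch: the encoding argument of pvm_initsend()
 * controls whether XDR conversion is applied to packed data.
 * The "worker" task name is a hypothetical example.
 */
#include <stdio.h>
#include "pvm3.h"

#define N 1024

int main()
{
    int tid, i;
    double data[N];

    pvm_mytid();                     /* enrol this task in PVM */

    for (i = 0; i < N; i++)
        data[i] = (double) i;

    /* spawn one copy of a worker task somewhere on the machine */
    if (pvm_spawn("worker", (char **) 0, PvmTaskDefault,
                  (char *) 0, 1, &tid) != 1) {
        fprintf(stderr, "spawn failed\n");
        pvm_exit();
        return 1;
    }

    /* PvmDataRaw: no XDR conversion, hence lower overhead, but
       only correct between machines with the same data format */
    pvm_initsend(PvmDataRaw);
    pvm_pkdouble(data, N, 1);
    pvm_send(tid, 1);                /* message tag 1 */

    pvm_exit();
    return 0;
}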

Measured Performances: In [2] a performance of 151.8 Gflop/s is reported for a 460-processor machine with 8.33 ns P2SC thin nodes, solving a dense linear system of order 61000. A 128-processor system with 6.25 ns processors solved a similar problem of order 39000 at a speed of 57.2 Gflop/s, which amounts to an efficiency of 70%.
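The quoted efficiency for the 128-processor system is simply the ratio of measured speed to aggregate peak:

\[ E = \frac{57.2\;\mbox{Gflop/s}}{128 \times 0.640\;\mbox{Gflop/s}} = \frac{57.2}{81.92} \approx 0.70. \]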






Aad van der Steen
Mon Feb 16 09:51:27 MET 1998