The IBM 9076 SP2

Machine type	RISC-based distributed-memory multi-processor cluster
Models	IBM 9076 SP2
Operating system	AIX (IBM's Unix variant)
Connection structure	Dependent on type of connection (see remarks)
Compilers	XL Fortran, XL C, XL C++
Vendor information Web page	http://ibm.tc.cornell.edu/ibm/pps/sp2/sp2.html

System parameters:

Model	9076 SP2
Clock cycle	15 ns
Theor. peak performance
Per Proc. (64-bit)	267 Mflop/s
Maximal (64-bit)	34.1 Gflop/s
Memory/node	64-512/2048 MB (see remarks)
Communication bandwidth
Point-to-point	20+ MB/s
Bisectional	25 GB/s
No. of processors	8-128

Remarks:

The computational nodes of the SP2 are based on RS/6000 processors with a clock cycle of 15 ns. This amounts to a peak performance of about 267 Mflop/s per node, because the floating-point units of the SP2 processors can deliver up to 4 results per cycle. The SP2 configurations are housed in columns that can each contain 8-16 processor nodes, depending on the type of node employed. There are two types: thin nodes and wide nodes. Although the processors in these nodes are basically the same, there are some differences. Wide nodes have twice as many microchannel slots as thin nodes (8 instead of 4). Furthermore, the maximum memory of a wide node is 2 GB, whereas the maximum for a thin node is 512 MB. More important in terms of performance, the data cache of a wide node is four times larger than that of a thin node (256 KB instead of 64 KB) and its memory bus is twice as wide (8 instead of 4 words/cycle). These last two differences explain why a performance gain of a factor of 1.5 has been observed for wide nodes over thin nodes. However, the newer Thin-node2 is almost identical to a wide node except for the number of microchannel slots, and its performance is very similar to that of a wide node (see Measured Performances).

IBM envisions the wide node more or less as a server for a column and recommends configurations of one wide node packaged with 14 thin nodes per column (although this may differ with the needs of the user). The SP2 is accessed through a front-end control workstation that also monitors system failures. Failing nodes can be taken off line and exchanged without interrupting service. In addition, fileservers can be connected to the system, while every node can have up to 2 GB. This can greatly speed up applications with significant I/O requirements.
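The peak figures in the system parameters follow directly from the clock cycle and the number of results per cycle. The short Python sketch below only redoes that arithmetic with the numbers quoted in this section; the function and constant names are illustrative and not part of any IBM tool.

```python
# Peak-performance arithmetic for the SP2, using only figures quoted in the text.
CLOCK_CYCLE_NS = 15.0     # clock cycle from the system parameters
RESULTS_PER_CYCLE = 4     # floating-point results per cycle, per the remarks
MAX_PROCESSORS = 128      # largest configuration listed in the parameters


def peak_mflops_per_node(cycle_ns: float, results_per_cycle: int) -> float:
    """Return the theoretical peak of one node in Mflop/s."""
    clock_mhz = 1000.0 / cycle_ns  # a 15 ns cycle is about 66.7 MHz
    return clock_mhz * results_per_cycle


def peak_gflops_system(per_node_mflops: float, nodes: int) -> float:
    """Return the theoretical peak of `nodes` nodes in Gflop/s."""
    return per_node_mflops * nodes / 1000.0


per_node = peak_mflops_per_node(CLOCK_CYCLE_NS, RESULTS_PER_CYCLE)
system = peak_gflops_system(per_node, MAX_PROCESSORS)
print(f"per node: {per_node:.1f} Mflop/s")
print(f"{MAX_PROCESSORS} nodes: {system:.1f} Gflop/s")
```

The exact per-node value is 266.67 Mflop/s, so 266 and 267 Mflop/s are the same figure truncated and rounded, and 128 such nodes give the 34.1 Gflop/s listed as the maximal peak.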

There is a choice in the way communication is done: Ethernet, Token Ring, FDDI, etc., are all possible. It is also possible to connect the processors by an optional high-speed switch with a bandwidth of 40 MB/s. Depending on the communication type, the bandwidth can therefore range from 1 to 40 MB/s. The high-speed switch has some redundancy built into it for greater reliability. Its structure is that of a multi-stage crossbar (Ω-switch).
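To give a feeling for what the 1-40 MB/s range means in practice, the following Python sketch estimates the time to move one message at both ends of that range. It is a bandwidth-only model: start-up latency and contention are ignored, 1 MB is assumed to be 10^6 bytes, and the 8 MB message size is a hypothetical example rather than a figure from this report.

```python
# Bandwidth-only transfer-time estimate; start-up latency is deliberately ignored.
MB = 1_000_000  # assumption: 1 MB = 10**6 bytes


def transfer_seconds(message_bytes: int, bandwidth_mb_per_s: float) -> float:
    """Return the time in seconds to move a message at the given bandwidth."""
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return message_bytes / (bandwidth_mb_per_s * MB)


message = 8 * MB  # hypothetical example: one million 64-bit words
slow = transfer_seconds(message, 1.0)    # low end of the quoted range
fast = transfer_seconds(message, 40.0)   # optional high-speed switch
print(f"at  1 MB/s: {slow:.2f} s")
print(f"at 40 MB/s: {fast:.2f} s")
```

For short messages the neglected latency dominates, so this model is only meaningful for large transfers.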

Applications can be run using PVM or Express. The FORGE 90 MIMDizer can be used to assist in parallelising code by generating the necessary calls to PVM or Express communication routines. Under Express, Fortran 77 or 90, C, and C++ can be used. High Performance Fortran is also supported. IBM uses its own PVM version from which the data format converter XDR has been stripped. This results in lower overhead at the cost of generality. Recently an optimised version of MPI has also become available.
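The trade-off behind dropping XDR can be illustrated without PVM itself. The Python sketch below is not IBM's implementation; it only contrasts sending doubles in the sender's native byte layout with sending them in the XDR representation (8-byte IEEE 754, big-endian), using the standard `struct` module. The sample values are arbitrary.

```python
import struct
import sys

values = [1.0, 2.5, -3.75]  # arbitrary sample data
count = len(values)

# Native layout: bytes are sent exactly as the sending machine stores them.
# No conversion is needed, but the receiver must use the same byte order.
native = struct.pack(f"={count}d", *values)

# XDR-style layout: doubles are always written big-endian, so any host
# can decode them, at the cost of a conversion on hosts that differ.
portable = struct.pack(f">{count}d", *values)

decoded = list(struct.unpack(f">{count}d", portable))
same_bytes = native == portable  # True only on a big-endian host
print(f"host byte order: {sys.byteorder}, conversion was a no-op: {same_bytes}")
```

On a cluster whose nodes all share one data layout the portable encoding buys nothing, which is the sense in which stripping it trades generality for lower overhead.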

Measured Performances: In [linpackbm] a performance of 88.4 Gflop/s is reported for solving a dense linear system of order N=73,500 with 512 Thin-node2 nodes. From [nasbm] it appears that at 128 nodes the Thin-node2 is consistently slower than the Wide-node1. The differences range from 4 to 20%, with an average of about 9%. The Wide-node1 times for the Class B problems EP, MG, CG, FT, IS, LU, SP, and BT are 4.99, 2.46, 25.44, 14.52, 1.98, 47.8, 54.8, and 67.0 seconds, respectively.
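The Linpack result can be put next to the theoretical peak with a few lines of Python. This is a back-of-the-envelope sketch that assumes each Thin-node2 has the same 15 ns clock and 4 results per cycle as the nodes described in the remarks; the efficiency figure is derived here and is not quoted in the benchmark report.

```python
# Fraction of theoretical peak reached by the quoted Linpack run.
LINPACK_GFLOPS = 88.4                      # measured value quoted above
NODES = 512                                # Thin-node2 nodes used in that run
PEAK_MFLOPS_PER_NODE = 1000.0 / 15.0 * 4   # assumption: 15 ns clock, 4 results/cycle

peak_gflops = NODES * PEAK_MFLOPS_PER_NODE / 1000.0
efficiency = LINPACK_GFLOPS / peak_gflops
print(f"peak: {peak_gflops:.1f} Gflop/s, efficiency: {efficiency:.1%}")
```

Under that assumption the run reaches roughly 65% of the theoretical peak of the 512 nodes.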

Jack Dongarra
Sat Feb 10 15:12:38 EST 1996