Machine type | RISC-based distributed-memory multi-processor |
---|---|
Models | SR11000 H1 |
Operating system | AIX (IBM's Unix variant) |
Connection structure | Multi-dimensional crossbar (see remarks) |
Compilers | Fortran 77, Fortran 95, Parallel Fortran, C, C++ |
Vendor's information Web page | www.hitachi.co.jp/Prod/comp/hpc/SR_e/11ktop_e.html |
Year of introduction | 2003 |
System parameters:
Model | SR11000 H1 |
---|---|
Clock cycle | 1.7 GHz |
Theor. peak performance | |
Per node (64-bit) | 108.8 Gflop/s |
Maximal | 27.8 Tflop/s |
Main memory | |
Memory/node | ≤ 64 GB |
Memory/maximal | 16.4 TB |
No. of nodes | 4--256 |
Communication bandwidth | |
Point-to-point | 12 GB/s (bidirectional) |
Remarks:
The SR11000 is the fourth generation of Hitachi's distributed-memory parallel systems. It replaces its predecessor, the SR8000 (see Systems Disappeared from the List). At present only one model is available, the SR11000 H1, in contrast to the SR8000, of which no fewer than four models with different processor clock speeds and bandwidths existed.
The basic node processor is a 1.7 GHz POWER4+ from IBM. Unlike in the former SR2201 and SR8000 systems, no modification of the processor is made to fit it for Hitachi's Pseudo Vector Processing, a technique that enabled the processing of very long vectors without the detrimental effects that normally occur when out-of-cache data access is required. Presumably Hitachi now relies on advanced prefetching of data to bring about the same effect.
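Hitachi's prefetching mechanism is not documented here; as a general illustration of the technique, the following sketch hides memory latency in a long out-of-cache vector update with explicit prefetch hints. The GCC/Clang builtin `__builtin_prefetch` and the prefetch distance `PFDIST` are stand-ins, not SR11000 specifics:

```c
/* Illustrative only: software prefetching for a long, out-of-cache
 * vector update. __builtin_prefetch is a GCC/Clang builtin used here
 * as a stand-in for the compiler-inserted prefetches presumed above;
 * PFDIST is an assumed tuning parameter, not an SR11000 figure.     */
#include <stddef.h>

#define PFDIST 64                 /* elements to prefetch ahead */

void daxpy_prefetch(size_t n, double a,
                    const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PFDIST < n) {
            __builtin_prefetch(&x[i + PFDIST], 0, 0);  /* read stream  */
            __builtin_prefetch(&y[i + PFDIST], 1, 0);  /* write stream */
        }
        y[i] = a * x[i] + y[i];   /* the actual vector operation */
    }
}
```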
The peak performance per basic processor, or IP, is attained with two simultaneous multiply/add instructions, giving a speed of 6.8 Gflop/s per IP on the SR11000. However, 16 basic processors are coupled to form one processing node, all addressing a common part of the memory. For the user this node is the basic computing entity, with a peak speed of 108.8 Gflop/s. Hitachi refers to this node configuration as COMPAS, Co-operative Micro-Processors in single Address Space. In fact this is a kind of SMP clustering, as discussed in the sections on the main architectural classes and ccNUMA machines. A difference with most of these systems is that the individual processors in a cluster node are not directly accessible to the user. Every node also contains an SP, a system processor that performs system tasks and manages the communication with other nodes and a range of I/O devices.
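The arithmetic behind these figures, with all inputs taken from the text above:

```c
/* Arithmetic behind the quoted peak figures; all inputs from the text. */
#include <stdio.h>

int main(void)
{
    double clock_ghz     = 1.7;  /* POWER4+ clock frequency        */
    int    fma_per_cycle = 2;    /* simultaneous multiply/adds     */
    int    flops_per_fma = 2;    /* one multiply plus one add      */
    int    ips_per_node  = 16;   /* IPs coupled into a COMPAS node */
    int    max_nodes     = 256;  /* largest configuration          */

    double ip_gflops   = clock_ghz * fma_per_cycle * flops_per_fma;
    double node_gflops = ip_gflops * ips_per_node;
    double max_tflops  = node_gflops * max_nodes / 1000.0;

    printf("per IP:   %.1f Gflop/s\n", ip_gflops);    /* 6.8   */
    printf("per node: %.1f Gflop/s\n", node_gflops);  /* 108.8 */
    printf("maximal:  %.2f Tflop/s\n", max_tflops);   /* 27.85, quoted
                                                         as 27.8 in the
                                                         table above   */
    return 0;
}
```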
The SR11000 has a multi-dimensional crossbar with a single-directional link speed of 12 GB/s. Here, too, IBM technology is used: the interconnect is built on IBM's Federation switch fabric, albeit in a different topology than IBM employs for its own p690 servers. For configurations of 4--8 nodes the cross-section of the network is 1 hop; for 16--64 nodes it is 2 hops; and from 128-node systems onward it is 3 hops (encoded, for reference, in the helper below).
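A small hypothetical helper that merely encodes these quoted cross-sections:

```c
/* Encodes the network cross-sections quoted above for the SR11000. */
int crossbar_hops(int nodes)
{
    if (nodes <= 8)  return 1;   /* 4--8 nodes   */
    if (nodes <= 64) return 2;   /* 16--64 nodes */
    return 3;                    /* 128+ nodes   */
}
```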
As in some other systems, such as the Cray XD1, the AlphaServer SC, and the late NEC Cenju-4, one is able to directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation, this should allow for writing distributed programs with very low parallelisation overhead (see the MPI sketch below).
Of course the usual communication libraries like PVM and MPI are provided. When MPI is used, it is possible to address the individual IPs within the nodes.
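The interface to the direct remote-memory access mentioned above is not specified here; a portable way to express the same pattern is MPI-2 one-sided communication. A minimal sketch, assuming the system's MPI supports these calls (run with at least two ranks):

```c
/* Sketch: one rank writes directly into another rank's exposed memory
 * via MPI-2 one-sided communication. Whether Hitachi's MPI maps these
 * calls onto the hardware remote-access path is an assumption.        */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0;          /* window buffer, one double per rank */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose 'local' on every rank for remote access. */
    MPI_Win_create(&local, (MPI_Aint)sizeof(double), (int)sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);       /* barrier-like synchronisation       */
    if (rank == 0) {             /* rank 0 writes into rank 1's memory */
        double v = 42.0;
        MPI_Put(&v, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);       /* completes the transfer             */

    if (rank == 1)
        printf("rank 1 received %g\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```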
Furthermore, within one node it is possible to use OpenMP on the individual IPs. Mostly this is less efficient than the automatic parallelisation done by Hitachi's compiler, but when one expresses coarser-grained task parallelism via OpenMP, a performance gain can be attained (see the sketch below).
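A minimal sketch of such coarser-grained OpenMP parallelism within a node; the work routine is a placeholder:

```c
/* Sketch: two coarse-grained tasks run concurrently on the IPs of one
 * node via OpenMP sections. solve_part() is a placeholder work unit.  */
#include <stdio.h>
#include <omp.h>

static void solve_part(int part)
{
    printf("part %d on thread %d\n", part, omp_get_thread_num());
}

int main(void)
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        solve_part(0);
        #pragma omp section
        solve_part(1);
    }
    return 0;
}
```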
Hitachi provides its own numerical libraries for solving dense and sparse linear systems, FFTs, etc. As yet it is not known whether third-party numerical libraries like NAG and IMSL are available.
It is expected that a POWER5-based SR11000 will come onto the market early in 2005.
Measured Performances:
The SR11000 was introduced at the end of 2003. A few systems have been sold in Japan in the meantime, but as yet no performance results have been published for the system.