Machine type | RISC-based ccNUMA system. |
---|---|
Models | HP 9000 SuperDome. |
Operating system | HP-UX (HP's usual Unix flavour) |
Connection structure | Crossbar |
Compilers | Fortran 77, Fortran 90, Parallel Fortran, HPF, C, C++ |
Vendor's information Web page | http://www.hp.com/products1/servers/scalableservers/superdome/ |
Year of introduction | 2000, 2004 with PA-RISC 8800. |
System parameters:
Model | HP 9000 SuperDome |
---|---|
Clock cycle | 1 GHz |
Theor. peak performance | |
Per proc. (64-bits) | 8 Gflop/s |
Maximal (64-bits) | 512 Gflop/s |
Main memory | |
Memory/node | ≤ 64 GB |
Memory/maximal | 1 TB |
No. of processors | ≤ 64 |
Communication bandwidth | |
aggregate (global) | 64 GB/s |
(cell—backplane) | 8 GB/s |
(within cell, see below) | 16 GB/s |
Remarks:
The Superdome replaced the Exemplar V2600 system, which has been withdrawn by HP (see section Systems Disappeared from the List). The connection structure of the Superdome is significantly improved over that of the former V2600. The Superdome has a 2-level crossbar: one level within a 4-processor cell and another level that connects the cells through the crossbar backplane. Every cell connects to the backplane at a speed of 8 GB/s and the global aggregate bandwidth for a fully configured system is therefore 64 GB/s.
As said, the basic building block of the Superdome is the 4-processor cell. All data traffic within a cell is controlled by the Cell Controller, a 10-port ASIC. It connects to the four local memory subsystems at 16 GB/s, to the backplane crossbar at 8 GB/s, and to two ports that each serve two processors at 6.4 GB/s/port. As each processor houses two CPU cores, the available bandwidth per CPU core is 1.6 GB/s. As in the SGI Altix systems, cache coherency in the Superdome is secured by using directory memory. The NUMA factor for a full 64-processor system is by HP's account very modest: only 1.8.
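The per-core bandwidth quoted above follows from dividing the port bandwidth over the cores that share a port. A minimal sketch of that arithmetic, using only the figures from the text (the constant and function names are illustrative, not HP terminology):

```python
# Figures quoted in the text; all names here are illustrative.
PORT_BANDWIDTH_GBS = 6.4      # GB/s per processor port of the Cell Controller
PROCESSORS_PER_PORT = 2       # each port serves two processors
CORES_PER_PROCESSOR = 2       # each PA-RISC 8800 houses two CPU cores


def bandwidth_per_core(port_gbs, procs_per_port, cores_per_proc):
    """Share of the port bandwidth available to one CPU core, in GB/s."""
    return port_gbs / (procs_per_port * cores_per_proc)


per_core = bandwidth_per_core(
    PORT_BANDWIDTH_GBS, PROCESSORS_PER_PORT, CORES_PER_PROCESSOR
)
print(f"{per_core:.1f} GB/s per CPU core")
```

This assumes the port bandwidth is shared evenly by the four cores behind it, which is the simplification the text itself makes.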
The PA-RISC 8800 processors run at a clock frequency of 1 GHz. Each processor contains two processor cores, each of which contains 2 floating-point units that can execute a combined floating multiply-add instruction. In favourable circumstances 8 flops/cycle can therefore be achieved, giving a Theoretical Peak Performance of 8 Gflop/s per processor. This amounts to a peak speed of 512 Gflop/s for a full configuration. Because a shared-memory parallel model is supported over the entire system, OpenMP can be employed on the total of 64 processors (128 CPU cores).
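The peak figures can be reproduced from the quantities given in the text. The sketch below counts a fused multiply-add as 2 flops, which is the usual convention and the one implied by the 8 flops/cycle figure; the variable names are illustrative:

```python
# Quantities taken from the text; names are illustrative.
CLOCK_GHZ = 1.0               # PA-RISC 8800 clock frequency
CORES_PER_PROCESSOR = 2       # dual-core processor
FPUS_PER_CORE = 2             # floating-point units per core
FLOPS_PER_FMA = 2             # one multiply plus one add per multiply-add
MAX_PROCESSORS = 64           # full configuration

# Flops per cycle for one processor (both cores together).
flops_per_cycle = CORES_PER_PROCESSOR * FPUS_PER_CORE * FLOPS_PER_FMA

# Theoretical peak in Gflop/s: flops/cycle times cycles per nanosecond.
peak_per_processor = flops_per_cycle * CLOCK_GHZ
peak_system = peak_per_processor * MAX_PROCESSORS

print(f"{flops_per_cycle} flops/cycle per processor")
print(f"{peak_per_processor:g} Gflop/s per processor")
print(f"{peak_system:g} Gflop/s for {MAX_PROCESSORS} processors")
```

These are upper bounds that assume every floating-point unit issues a multiply-add in every cycle.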
The Superdome can be partitioned into different complexes that run with different processors, e.g., the Itanium 2. In that case the same backplane can be used but the cells are of a different type. In theory one can therefore have a mixed HP 9000 Superdome and Integrity Superdome system (see below).
Measured Performances:
For the new model with the dual-core PA-RISC 8800 processors, no performance results (in the HPC realm) are known to the author; the system has been on the market since April 2004. In [42] a speed of 756 Gflop/s is reported for solving a full linear system of unspecified size. This result was achieved on an older 8-way coupled system with a total of 512 PA-RISC 8700+ processors at 875 MHz. As the Theoretical Peak Performance of such a cluster is 1792 Gflop/s, the efficiency is 42%.
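The efficiency figure is the measured speed divided by the theoretical peak. A minimal sketch of the calculation follows; the value of 4 flops/cycle per PA-RISC 8700+ processor is not stated in the text but is inferred here from the quoted peak of 1792 Gflop/s for 512 processors at 875 MHz, and the function names are illustrative:

```python
def peak_gflops(n_processors, clock_ghz, flops_per_cycle):
    """Theoretical peak performance in Gflop/s."""
    return n_processors * clock_ghz * flops_per_cycle


def efficiency(measured_gflops, peak):
    """Fraction of the theoretical peak that was actually achieved."""
    return measured_gflops / peak


# 512 processors at 875 MHz; 4 flops/cycle is an inferred assumption.
peak = peak_gflops(n_processors=512, clock_ghz=0.875, flops_per_cycle=4)

# 756 Gflop/s is the measured speed reported in [42].
eff = efficiency(756.0, peak)

print(f"peak = {peak:g} Gflop/s, efficiency = {eff:.0%}")
```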