
The Cray Inc. X1.

Machine type: Shared-memory multi-vector processor.
Models: X1 (cluster).
Operating system: UNICOS (Cray Unix variant).
Connection structure: Crossbar.
Compilers: Fortran 90, C, C++, Co-Array Fortran, UPC.
Vendor's information Web page: www.cray.com/products/x1/
Year of introduction: 2002.

System parameters:

Model                        Cray X1
Clock cycle                  800 MHz
Theor. peak performance
  Per Proc. (64 bits)        3.2/12.8 Gflop/s
  Maximal                    819 Gflop/s
Memory                       <= 1 TB
No. of processors            16--64 (MSP, see below)
Memory bandwidth
  Memory-Cache               34.1 GB/s
  Cache-CPU                  76.8 GB/s

Remarks:

The hardware structure of the Cray X1 is almost identical to that of the former Cray SV1ex (see the Disappeared Systems section). The clock frequency was raised from 500 to 800 MHz, and both the amount of memory and the bandwidth to the processors were increased. Each processor board contains 4 CPUs, each of which can deliver a peak rate of 4 floating-point operations per cycle, amounting to a theoretical peak performance of 3.2 Gflop/s per CPU. However, 4 CPUs can be coupled across CPU boards to form a so-called Multi Streaming Processor (MSP), a processing unit with an effective theoretical peak performance of 12.8 Gflop/s. A CPU used on its own is called a Single Streaming Processor (SSP). The reconfiguration into MSPs and/or single CPUs can be done dynamically as the workload dictates. The vector start-up time for single CPUs is shorter than for MSPs, so single CPUs may be preferable for short vectors, while MSPs should have the advantage for programs containing long vectors. MSP mode is regarded as the standard mode of operation. This is also visible in the processor counts given in Cray's data sheets: the maximum number within one cabinet is 64 MSPs, which is equivalent to 256 SSPs. The current Cray optimisation documentation states that the Cray Programming Environment is not yet optimised for SSP processing, which is not to say that suitable programs would not run efficiently in SSP mode.
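The peak figures quoted above follow directly from the clock frequency and the number of floating-point results per cycle. The following minimal sketch reproduces that arithmetic; the constant and function names are illustrative only and do not come from any Cray documentation.

```python
# Figures taken from the text; all names here are illustrative.
CLOCK_GHZ = 0.8             # 800 MHz clock frequency
FLOPS_PER_CYCLE = 4         # peak floating-point operations per cycle per CPU
CPUS_PER_MSP = 4            # single CPUs coupled into one MSP
MAX_MSPS_PER_CABINET = 64   # maximum number of MSPs within one cabinet


def peak_gflops(n_cpus: int) -> float:
    """Theoretical peak in Gflop/s for n_cpus single CPUs."""
    return n_cpus * FLOPS_PER_CYCLE * CLOCK_GHZ


cpu_peak = peak_gflops(1)                                        # single CPU
msp_peak = peak_gflops(CPUS_PER_MSP)                             # one MSP
cabinet_peak = peak_gflops(CPUS_PER_MSP * MAX_MSPS_PER_CABINET)  # full cabinet

print(f"CPU: {cpu_peak:.1f}  MSP: {msp_peak:.1f}  cabinet: {cabinet_peak:.1f} Gflop/s")
```

The cabinet value is the 819 Gflop/s listed as the maximal performance in the system parameters.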

The relative bandwidth both from memory to the CPU boards and from the cache to the CPUs has improved considerably in comparison to the predecessor SV1ex: from memory to the CPU board, 5.3 8-byte operands can be transferred per cycle. From the cache, the peak bandwidth to the CPUs is 12 8-byte operands per cycle, enough to sustain dyadic operations. The cache structure is rather complex: each of the 4 SSPs on a board has its own 16 KB 2-way set-associative L1 data and instruction cache. The L1 data cache only stores scalar data. The L2 cache is 2 MB in size and is shared by the SSPs on the CPU board.
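The operand counts above can be checked against the bandwidths in the system parameters by dividing each bandwidth by the clock frequency and the operand size. The sketch below assumes that 1 GB/s means 10^9 bytes per second; the function name is made up for illustration. The last lines rest on one further reading that the text does not state explicitly: a dyadic operation is taken to move three operands (two inputs, one result), and the cache bandwidth is compared with 4 such operations per cycle.

```python
def operands_per_cycle(bandwidth_gb_s: float,
                       clock_ghz: float = 0.8,
                       operand_bytes: int = 8) -> float:
    """Number of operands transferred per clock cycle.

    Assumes 1 GB/s = 1e9 bytes/s, so GB/s divided by GHz gives bytes per cycle.
    """
    return bandwidth_gb_s / (clock_ghz * operand_bytes)


memory_to_board = operands_per_cycle(34.1)   # Memory-Cache figure, about 5.3
cache_to_cpu = operands_per_cycle(76.8)      # Cache-CPU figure, 12

# Assumed reading: 2 inputs + 1 result per dyadic operation, 4 operations per cycle.
operands_needed = 3 * 4

print(f"memory -> board: {memory_to_board:.1f} operands/cycle")
print(f"cache  -> CPU:   {cache_to_cpu:.1f} operands/cycle (needed: {operands_needed})")
```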

New features that are less visible to the user are adherence to the IEEE 754 floating-point arithmetic standard and a new vector instruction set that can make better use of new features such as caches and the addressability and synchronisation of remote nodes. The latter matter because every cabinet can be regarded as a node in a cluster, of which a maximum of 64 can be configured in what is called a modified 2-D torus topology. Cray itself regards a board with 4 MSPs as a "node". Each node has two connections to the outside world: odd and even nodes are connected in pairs, and the other connection from the board goes via a switch to the other boards, so that at most two hops are required to reach any other MSP in the cabinet. The aggregate bandwidth in such a fully populated cabinet is 400 GB/s. Multi-cabinet configurations extend the 2-D torus into a 3-D torus structure, much like the late Cray T3E. Latency and bandwidth data for point-to-point communication are not provided, but various measurements in an MPI environment have been done; see the Measured Performances below.
On a 4-SSP CPU board OpenMP can be employed. When accessing other CPU boards one can use Cray's shmem library for one-sided communication, MPI, Co-Array Fortran, etc.
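
To give a feeling for what a torus interconnect implies for distances, the sketch below counts hops between two positions in a plain torus with wrap-around links in every dimension. It is only an illustration: the modifications Cray applies to the 2-D torus are not described here and are not modelled, and the 8 x 8 and 8 x 8 x 4 layouts are hypothetical choices, not documented configurations.

```python
from typing import Sequence


def torus_hops(a: Sequence[int], b: Sequence[int], dims: Sequence[int]) -> int:
    """Minimal hop count between positions a and b in a plain torus.

    In each dimension a message can travel either way around the ring,
    so the distance is the shorter of the direct and the wrap-around path.
    """
    hops = 0
    for x, y, size in zip(a, b, dims):
        direct = abs(x - y)
        hops += min(direct, size - direct)
    return hops


# Hypothetical 8 x 8 layout (plain 2-D torus).
near_corner = torus_hops((0, 0), (7, 7), (8, 8))   # wrap-around links make this short
farthest_2d = torus_hops((0, 0), (4, 4), (8, 8))   # diametrically opposite position

# Hypothetical 8 x 8 x 4 layout (plain 3-D torus).
farthest_3d = torus_hops((0, 0, 0), (4, 4, 2), (8, 8, 4))

print(near_corner, farthest_2d, farthest_3d)
```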

Measured Performances: In [42] a speed of 5895 Gflop/s is reported for solving a linear system of order 494,592 on a 504-processor (MSP) machine at Oak Ridge National Laboratory. This amounts to an efficiency of 91%.
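
The efficiency quoted above is the measured speed divided by the theoretical peak of the 504 MSPs, taking 12.8 Gflop/s per MSP. A minimal sketch of that calculation, with an illustrative function name:

```python
def efficiency(measured_gflops: float, n_msps: int,
               msp_peak_gflops: float = 12.8) -> float:
    """Fraction of the theoretical peak that was actually attained."""
    return measured_gflops / (n_msps * msp_peak_gflops)


ornl_efficiency = efficiency(5895, 504)
print(f"{ornl_efficiency:.1%}")
```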
A more extensive evaluation of the ORNL system is reported in [10]. Here a point-to-point bandwidth of 13.9 GB/s was found within a 4-MSP node and 11.9 GB/s between nodes. MPI latencies for small messages were 8.2 and 8.6 µs, respectively. For shmem and Co-Array Fortran the latencies were only 3.8 and 3.0 µs.
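
To see how these latencies and bandwidths combine, one can use the common first-order model in which the transfer time is the latency plus the message size divided by the bandwidth. This model is an assumption introduced here for illustration, not something reported in [10]; it also assumes that 1 GB/s means 10^9 bytes per second and that the quoted bandwidths hold for the message size considered.

```python
def transfer_time_us(message_bytes: float, latency_us: float,
                     bandwidth_gb_s: float) -> float:
    """Estimated point-to-point transfer time in microseconds.

    Simple linear model: time = latency + size / bandwidth.
    1 GB/s corresponds to 1e3 bytes per microsecond.
    """
    return latency_us + message_bytes / (bandwidth_gb_s * 1e3)


# MPI figures quoted in the text: (latency in microseconds, bandwidth in GB/s).
WITHIN_NODE = (8.2, 13.9)
BETWEEN_NODES = (8.6, 11.9)

for size in (8, 1_000_000):  # one 8-byte operand and a 1 MB message
    t_in = transfer_time_us(size, *WITHIN_NODE)
    t_out = transfer_time_us(size, *BETWEEN_NODES)
    print(f"{size:>9} bytes: {t_in:8.2f} us within node, {t_out:8.2f} us between nodes")
```

Under this model small messages are dominated by the latency, which is where the lower shmem and Co-Array Fortran latencies would matter most.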




Aad van der Steen
Tue Oct 12 16:26:22 CEST 2004