Machine type | Shared-memory multi-vector processor. |
---|---|
Models | X1 (cluster). |
Operating system | UNICOS (Cray Unix variant). |
Connection structure | Crossbar. |
Compilers | Fortran 90, C, C++, Co-Array Fortran, UPC. |
Vendors information Web page | www.cray.com/products/x1/ |
Year of introduction | 2002. |
System parameters:
Model | Cray X1 |
---|---|
Clock cycle | 800 MHz |
Theor. peak performance | |
Per Proc. (64 bits) | 3.2/12.8 Gflop/s |
Maximal | 819 Gflop/s |
Memory | <= 1 TB |
No. of processors | 16–64 (MSP, see below) |
Memory bandwidth | |
Memory-Cache | 34.1 GB/s |
Cache-CPU | 76.8 GB/s |
Remarks:
The hardware structure of the Cray X1 is almost identical to that of the former Cray SV1ex (see the Disappeared Systems section). The clock frequency has been raised from 500 to 800 MHz, and both the amount of memory and the bandwidth to the processors have been increased. Each processor board contains 4 CPUs, each of which can deliver a peak rate of 4 floating-point operations per cycle, amounting to a theoretical peak performance of 3.2 Gflop/s per CPU. However, 4 CPUs can be coupled across CPU boards to form a so-called Multi-Streaming Processor (MSP), a processing unit with an effective theoretical peak performance of 12.8 Gflop/s. A single CPU is correspondingly called a Single-Streaming Processor (SSP). The reconfiguration into MSPs and/or single CPUs can be done dynamically as the workload dictates. The vector start-up time is smaller for single CPUs than for MSPs, so single CPUs may be preferable for short vectors, while MSPs should be advantageous for programs containing long vectors.
MSP mode is regarded as the standard mode of operation. This is also reflected in the processor counts given in Cray's data sheets: the maximum number within one cabinet is 64 MSPs, which is equivalent to 256 SSPs. The present Cray optimisation documentation states that the Cray Programming Environment is not yet optimised for SSP processing, which is not to say that suitable programs would not run efficiently in SSP mode.
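The peak figures quoted above follow from simple arithmetic on the clock rate. The following is a minimal sketch in Python; the function and constant names are illustrative only and do not come from any Cray documentation.

```python
# Theoretical peak performance of the Cray X1, derived from the figures
# in the text. All names here are hypothetical helpers for illustration.

CLOCK_GHZ = 0.8                 # 800 MHz clock
FLOPS_PER_CYCLE_PER_CPU = 4     # peak floating-point operations per cycle per CPU (SSP)
CPUS_PER_MSP = 4                # 4 CPUs are coupled into one MSP
MAX_MSPS_PER_CABINET = 64       # maximum MSP count in one cabinet


def peak_gflops(clock_ghz, flops_per_cycle, units=1):
    """Peak rate in Gflop/s for `units` identical processing units."""
    return clock_ghz * flops_per_cycle * units


ssp_peak = peak_gflops(CLOCK_GHZ, FLOPS_PER_CYCLE_PER_CPU)
msp_peak = peak_gflops(CLOCK_GHZ, FLOPS_PER_CYCLE_PER_CPU, CPUS_PER_MSP)
cabinet_peak = msp_peak * MAX_MSPS_PER_CABINET
ssps_per_cabinet = MAX_MSPS_PER_CABINET * CPUS_PER_MSP

print(f"SSP peak:     {ssp_peak:.1f} Gflop/s")
print(f"MSP peak:     {msp_peak:.1f} Gflop/s")
print(f"Cabinet peak: {cabinet_peak:.1f} Gflop/s ({ssps_per_cabinet} SSPs)")
```

The cabinet value agrees with the "Maximal" entry of 819 Gflop/s in the system parameters table, and the SSP count with the 256 SSPs mentioned in the text.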
The relative bandwidth, both from memory to the CPU boards and from the cache to the CPUs, has improved considerably in comparison to the predecessor SV1ex: from memory to the CPU board, 5.3 8-byte operands can be transferred per cycle. From the cache, the peak bandwidth to the CPUs is 12 8-byte operands per cycle, enough to sustain dyadic operations. The cache structure is rather complex: each of the 4 SSPs on a board has its own 16 KB 2-way set-associative L1 data and instruction cache. The L1 data cache stores only scalar data. The L2 cache is 2 MB in size and is shared by the SSPs on the CPU board.
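The operand counts above can be recovered from the bandwidths in the system parameters table by dividing by the operand size and the clock rate. A small sketch, assuming that 1 GB/s means 10^9 bytes per second and that the operand counts are per clock cycle (the helper name is hypothetical):

```python
# Convert the tabulated bandwidths into 8-byte operands per clock cycle.
# Assumption: 1 GB/s = 1e9 bytes/s; operand counts are per cycle.

CLOCK_HZ = 800e6        # 800 MHz
OPERAND_BYTES = 8       # one 64-bit operand


def operands_per_cycle(bandwidth_gb_s, clock_hz=CLOCK_HZ,
                       operand_bytes=OPERAND_BYTES):
    """Number of operands that can be moved per clock cycle."""
    bytes_per_cycle = bandwidth_gb_s * 1e9 / clock_hz
    return bytes_per_cycle / operand_bytes


memory_to_cache = operands_per_cycle(34.1)   # Memory-Cache bandwidth
cache_to_cpu = operands_per_cycle(76.8)      # Cache-CPU bandwidth

print(f"Memory to cache: {memory_to_cache:.1f} operands/cycle")
print(f"Cache to CPU:    {cache_to_cpu:.1f} operands/cycle")
```

With the table values this gives roughly 5.3 and 12 operands per cycle, matching the figures in the text.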
New features that are less visible to the user are adherence to the IEEE 754
floating-point arithmetic standard and a new vector instruction set that can
make better use of new features such as the caches and the addressability and
synchronisation of remote nodes. The latter are needed because every cabinet
can be regarded as a node in a cluster, of which a maximum of 64 can be
configured in what is called a modified 2-D torus topology. Cray itself regards
a board with 4 MSPs as a "node". Each node has two connections to the outside
world: odd and even nodes are connected in pairs, and the other connection from
the board goes via a switch to the other boards, so that at most two hops are
required to reach any other MSP in the cabinet. The aggregate bandwidth in such
a fully populated cabinet is 400 GB/s. Multi-cabinet configurations extend the
2-D torus into a 3-D torus structure, much like that of the late Cray T3E.
Latency and bandwidth data for point-to-point communication are not provided,
but various measurements in an MPI environment have been done; see the
Measured Performances below.
On a 4-SSP CPU board OpenMP can be employed. When accessing other CPU boards
one can use Cray's shmem library for one-sided communication, MPI,
Co-Array Fortran, etc.
Measured Performances:
In [42] a speed of 5895 Gflop/s is reported for solving a linear system of
order 494,592 on a 504-processor (MSP) machine at Oak Ridge National
Laboratory. This amounts to an efficiency of 91%. A more extensive evaluation
of the ORNL system is reported in [10]. Here a point-to-point bandwidth of
13.9 GB/s was found within a 4-MSP node and 11.9 GB/s between nodes. MPI
latencies for small messages were 8.2 and 8.6 µs, respectively. For shmem and
Co-Array Fortran the latencies were only 3.8 and 3.0 µs.
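The quoted efficiency is the measured speed divided by the theoretical peak of the 504 MSPs. The sketch below reproduces it and also estimates message transfer times from the measured MPI figures. The transfer-time estimate uses a simple linear model (latency plus message size divided by bandwidth), which is an assumption made here for illustration and is not taken from [10]; all names are hypothetical.

```python
# Efficiency of the linear-system run and a rough MPI transfer-time
# estimate. The linear cost model is an assumption, not a result from [10].

MSP_PEAK_GFLOPS = 12.8      # theoretical peak per MSP
N_MSPS = 504                # processors used in the run reported in [42]
MEASURED_GFLOPS = 5895.0    # reported speed

efficiency = MEASURED_GFLOPS / (N_MSPS * MSP_PEAK_GFLOPS)
print(f"Efficiency: {efficiency:.0%}")


def transfer_time_s(n_bytes, latency_us, bandwidth_gb_s):
    """Estimated time to send n_bytes: latency + size / bandwidth.

    Assumes 1 GB/s = 1e9 bytes/s and a purely linear cost model.
    """
    return latency_us * 1e-6 + n_bytes / (bandwidth_gb_s * 1e9)


# Measured MPI values: (latency in microseconds, bandwidth in GB/s).
INTRA_NODE = (8.2, 13.9)    # within a 4-MSP node
INTER_NODE = (8.6, 11.9)    # between nodes

for size in (8, 1_000_000):
    t_intra = transfer_time_s(size, *INTRA_NODE)
    t_inter = transfer_time_s(size, *INTER_NODE)
    print(f"{size:>9} bytes: intra-node {t_intra * 1e6:.1f} us, "
          f"inter-node {t_inter * 1e6:.1f} us")
```

Under this model, small messages are dominated by the latency term, while the bandwidth difference between intra-node and inter-node transfers only becomes noticeable for large messages.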