Machine type | RISC-based distributed-memory multi-processor |
---|---|
Models | T3E-1350 |
Operating system | UNICOS/mk (micro kernel-based Unix) |
Connection structure | 3-D Torus |
Compilers | Fortran 77, Fortran 90, HPF, ANSI C, C++. |
Vendor's information Web page | www.cray.com/products/systems/crayt3e/ |
Year of introduction | T3E-1200E: 1998; T3E-1350: 2000 |
System parameters:
Model | T3E-1350 |
---|---|
Clock cycle | 675 MHz |
Theor. peak performance | |
Per proc. (64-bits) | 1.35 Gflop/s |
Maximal | 2938 Gflop/s |
Main memory | |
Memory/node | <= 512 MB |
Memory/maximal | <= 1 TB |
No. of processors | 40-2176 |
Communication bandwidth | |
Point-to-point | 325 MB/s |
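
The peak figures in the table can be reproduced from the clock rate. The following is a minimal sketch; it assumes two 64-bit floating-point results per clock cycle (one add and one multiply), which is not stated in the table itself, and the function name is illustrative only.

```python
CLOCK_HZ = 675e6        # clock rate from the table (675 MHz)
FLOPS_PER_CYCLE = 2     # assumption: one add + one multiply per cycle
MAX_PROCESSORS = 2176   # largest configuration from the table


def peak_gflops(processors, clock_hz=CLOCK_HZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in Gflop/s for a given number of processors."""
    return processors * clock_hz * flops_per_cycle / 1e9


per_proc = peak_gflops(1)
maximal = peak_gflops(MAX_PROCESSORS)

print(f"Per processor: {per_proc:.2f} Gflop/s")
print(f"Maximal:       {maximal:.1f} Gflop/s")
```

Under this assumption the maximal value comes out at 2937.6 Gflop/s, which the table rounds to 2938 Gflop/s.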
Remarks:
The T3E is the second generation of DM-MIMD systems from Cray. Its name follows that of its predecessor, the T3D, whose name referred to its interconnection structure: a 3-D torus. The T3E retains this interconnection structure. The T3E-1350 systems are all liquid cooled, in contrast to the earlier T3E systems, of which configurations of up to 128 processors could rely on air cooling.
The T3E uses the DEC Alpha 21164 for its computational tasks. In 2000, the T3E-1350 was introduced, which uses the latest 21164A processors at a clock rate of 675 MHz. Cray stresses that the processors are encapsulated in such a way that they can easily be exchanged for any other (faster) processor as soon as one becomes available, without affecting the macro-architecture of the system. In practice, however, this is not likely to happen.
Each node in the system contains one processing element (PE), which in turn contains a CPU, memory, and a communication engine that takes care of communication between PEs. The bandwidth between nodes is quite high: 325 MB/s. Like its predecessor, the T3D, the T3E has hardware support for fast synchronisation; for example, barrier synchronisation takes only one cycle per check. The node also contains a set of E-registers and streaming registers that allow aggressive prefetching to ease the processor/memory bottleneck. An interesting additional feature is the availability of 32 contexts per processor, which opens the door to multiprocessing.
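
To illustrate how PEs are connected in a 3-D torus, the sketch below computes the six nearest neighbours of a node and the minimal hop count between two nodes. The torus shape `(8, 8, 8)` and the function names are hypothetical and chosen for illustration only; the sketch says nothing about the actual routing hardware.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbours of `coord` in a 3-D torus of shape `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # Wrap around at the edges: this is what makes the mesh a torus.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimal number of hops between nodes `a` and `b`, taking wrap-around into account."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


dims = (8, 8, 8)  # hypothetical 512-node configuration
corner = torus_neighbors((0, 0, 0), dims)
print(corner)
print(torus_hops((0, 0, 0), (7, 7, 7), dims))  # opposite corner is close because of wrap-around
print(torus_hops((0, 0, 0), (4, 4, 4), dims))  # the farthest node in an 8x8x8 torus
```

Because of the wrap-around links, no two nodes in such an 8x8x8 torus are more than 12 hops apart, whereas a mesh without wrap-around would need up to 21.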
The T3E has distributed I/O. In the air-cooled systems one I/O channel can be configured for every 8 PEs, and in the liquid-cooled systems one I/O channel per 16 nodes. The maximum bandwidth of a channel is about 1 GB/s; the actual speed will be on the order of 500 MB/s.
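
As a rough illustration of what these ratios imply, the sketch below computes the maximum number of I/O channels and an aggregate bandwidth estimate for a given system size. It assumes that partial groups of PEs do not get a channel and that channel bandwidths simply add up, so the result is an optimistic upper bound; the function names are illustrative.

```python
PES_PER_CHANNEL = {"air": 8, "liquid": 16}  # ratios quoted in the text
PEAK_GBS = 1.0        # maximum bandwidth per channel (about 1 GB/s)
SUSTAINED_GBS = 0.5   # realistic speed per channel (on the order of 500 MB/s)


def max_io_channels(pes, cooling):
    """Maximum number of configurable I/O channels (assumes rounding down)."""
    return pes // PES_PER_CHANNEL[cooling]


def aggregate_io_gbs(pes, cooling, per_channel_gbs=SUSTAINED_GBS):
    """Upper bound on aggregate I/O bandwidth in GB/s, assuming perfect scaling."""
    return max_io_channels(pes, cooling) * per_channel_gbs


print(max_io_channels(128, "air"), aggregate_io_gbs(128, "air"))
print(max_io_channels(512, "liquid"), aggregate_io_gbs(512, "liquid"))
```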
The T3E supports various programming models. Apart from PVM and MPI for message passing and HPF for data distribution, a Cray proprietary one-sided communication library, the so-called shmem library, can be employed for message passing. In addition, the BSP library (see [12]), also a one-sided message passing library, is available. The shmem library is implemented close to the hardware and shows a very low latency of only 1.6 µs.
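
To get a feeling for what a latency of 1.6 µs means in practice, one can use the common linear model in which the transfer time is the latency plus the message size divided by the bandwidth. The sketch below combines the shmem latency with the 325 MB/s point-to-point bandwidth from the table; that combination is an assumption made here for illustration (with 1 MB taken as 10^6 bytes), not a measured property of the shmem library.

```python
LATENCY_S = 1.6e-6       # shmem latency quoted in the text
BANDWIDTH_BPS = 325e6    # point-to-point bandwidth, assuming 1 MB = 10**6 bytes


def transfer_time(nbytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Estimated transfer time in seconds under the linear model."""
    return latency + nbytes / bandwidth


def effective_bandwidth(nbytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Bandwidth actually achieved (bytes/s) for a message of `nbytes` bytes."""
    return nbytes / transfer_time(nbytes, latency, bandwidth)


def half_performance_bytes(latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Message size at which half of the asymptotic bandwidth is reached."""
    return latency * bandwidth


for size in (8, 520, 65536):
    print(f"{size:6d} bytes: {effective_bandwidth(size) / 1e6:7.1f} MB/s")
```

Under this model, messages of about 520 bytes already reach half of the asymptotic bandwidth, which is why a low latency matters most for fine-grained communication.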
There are some differences in the available configurations between the T3E-1200 and the T3E-1350. In the T3E-1200 the amount of memory per node ranges from 64 MB to 2 GB, while in the 1350 model there is only a choice between 256 and 512 MB per node. Furthermore, there is an air-cooled model of the T3E-1200 (up to 128 PEs), while the larger configurations are liquid cooled. The T3E-1350 is available only in liquid-cooled configurations, which start at 40 processors and can be extended in modules of 8 processors. The 1200 systems start at 6 processors, and modules of 4 or 8 processors can be added.
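
The T3E-1350 sizing rule can be expressed as a small check. This is a sketch under the assumption that every multiple of 8 processors above the 40-processor minimum, up to the 2176-processor maximum from the table, is a valid configuration; the helper names are hypothetical.

```python
MIN_PES = 40     # smallest T3E-1350 configuration
STEP_PES = 8     # module size
MAX_PES = 2176   # largest configuration listed in the table


def is_valid_t3e1350_size(processors):
    """True if `processors` is reachable from 40 PEs by adding 8-PE modules."""
    return MIN_PES <= processors <= MAX_PES and (processors - MIN_PES) % STEP_PES == 0


def next_valid_t3e1350_size(processors):
    """Smallest valid configuration with at least `processors` PEs."""
    if processors > MAX_PES:
        raise ValueError("request exceeds the maximum configuration")
    if processors <= MIN_PES:
        return MIN_PES
    modules = -(-(processors - MIN_PES) // STEP_PES)  # ceiling division
    return MIN_PES + modules * STEP_PES


print(is_valid_t3e1350_size(128))    # a 128-PE system fits the rule
print(next_valid_t3e1350_size(100))  # a request for 100 PEs is rounded up
```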
Measured Performances:
In [6] a speed of 1.127 Tflop/s is reported for the solution of a dense linear system of order 148,800 on a T3E-1200 with 1488 processors. The efficiency for such an exercise is 63%. The same source quotes a speed of 113.9 out of 172.8 Gflop/s on a 128-processor T3E-1350, giving an efficiency of 66% for solving a linear system of order 89,088.
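
The efficiencies quoted above are the measured speed divided by the theoretical peak of the configuration. The sketch below reproduces both figures; the peak of 1.2 Gflop/s per processor for the T3E-1200 is an assumption inferred from the model name and is not given in the table, while the 1.35 Gflop/s for the T3E-1350 is taken from it.

```python
def efficiency(measured_gflops, processors, peak_per_proc_gflops):
    """Fraction of the theoretical peak performance that was achieved."""
    return measured_gflops / (processors * peak_per_proc_gflops)


# T3E-1200: 1.127 Tflop/s on 1488 processors (1.2 Gflop/s per processor assumed)
eff_1200 = efficiency(1127.0, 1488, 1.2)

# T3E-1350: 113.9 Gflop/s on 128 processors (1.35 Gflop/s per processor from the table)
eff_1350 = efficiency(113.9, 128, 1.35)

print(f"T3E-1200: {eff_1200:.1%}")
print(f"T3E-1350: {eff_1350:.1%}")
```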