The Cray Inc. T3E.

Machine type	RISC-based distributed-memory multi-processor
Models	T3E-1350
Operating system	UNICOS/mk (micro kernel-based Unix)
Connection structure	3-D Torus
Compilers	Fortran 77, Fortran 90, HPF, ANSI C, C++.
Vendor's information Web page	www.cray.com/products/systems/crayt3e/
Year of introduction	T3E-1200E: 1998; T3E-1350: 2000.

System parameters:

Model	T3E-1350
Clock cycle	675 MHz
Theor. peak performance
Per proc. (64-bits)	1.35 Gflop/s
Maximal	2938 Gflop/s
Main memory
Memory/node	<= 512 MB
Memory/maximal	<= 1 TB
No. of processors	40-2176
Communication bandwidth
Point-to-point	325 MB/s
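
The peak figures in the table above follow from simple arithmetic. The sketch below reproduces them under one assumption that the table does not state explicitly: two floating-point results per clock cycle, inferred from 1.35 Gflop/s at 675 MHz. The function and constant names are illustrative only.

```python
# Reproduce the theoretical peak figures from the system parameters table.
CLOCK_MHZ = 675              # clock rate, from the table
FLOPS_PER_CYCLE = 2          # assumption: inferred from 1.35 Gflop/s / 675 MHz
MAX_PROCESSORS = 2176        # largest configuration, from the table
MAX_MEMORY_PER_NODE_MB = 512 # largest memory per node, from the table


def peak_per_processor_gflops(clock_mhz=CLOCK_MHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak of one processor in Gflop/s."""
    return clock_mhz * flops_per_cycle / 1000.0


def peak_system_gflops(n_processors):
    """Theoretical peak of a system with n_processors processors in Gflop/s."""
    return n_processors * peak_per_processor_gflops()


def max_memory_gb(n_processors, mb_per_node=MAX_MEMORY_PER_NODE_MB):
    """Total memory in GB (1 GB = 1024 MB) for a given node count."""
    return n_processors * mb_per_node / 1024.0


if __name__ == "__main__":
    print("Peak per processor (Gflop/s):", peak_per_processor_gflops())
    print("Peak, full system (Gflop/s): ", round(peak_system_gflops(MAX_PROCESSORS)))
    print("Memory, full system (GB):    ", max_memory_gb(MAX_PROCESSORS))
```

With 2176 processors of 512 MB each, the total comes to 1088 GB, which is consistent with the rounded "1 TB" in the table.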

Remarks:

The T3E is the second generation of DM-MIMD systems from Cray. Its name follows lexically on that of its predecessor, the T3D, whose name referred to its interconnection structure: a 3-D torus. The T3E retains this interconnection structure. The T3E-1350 systems are all liquid-cooled, in contrast to the earlier T3E systems, of which configurations of up to 128 processors could rely on air cooling.
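
To illustrate what a 3-D torus means for connectivity, the following minimal sketch computes the six nearest neighbours of a node and the minimal hop count between two nodes. It models the topology only; the actual routing in the T3E is not described here, and the 8 x 8 x 8 dimensions and function names are hypothetical.

```python
# Topology sketch of a 3-D torus: each ring wraps around in x, y and z.
# The dimensions used below are hypothetical, not a real T3E configuration.

def torus_neighbours(coord, dims):
    """Return the six nearest neighbours of node `coord` in a torus of size `dims`."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimal number of hops from a to b, taking the shorter way round each ring."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical 512-node torus
    print(torus_neighbours((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (7, 4, 1), dims))
```

Because of the wrap-around links, node (0, 0, 0) is a direct neighbour of node (7, 0, 0), and no two nodes in a ring of length n are more than n/2 hops apart in that dimension.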

The T3E uses the DEC Alpha 21164 processor for its computational tasks. In 2000 the T3E-1350 was introduced, which uses the latest 21164A processors at a clock rate of 675 MHz. Cray stresses that the processors are encapsulated in such a way that they can easily be exchanged for any other (faster) processor as soon as one becomes available, without affecting the macro-architecture of the system. In practice, however, this is not likely to happen.

Each node in the system contains one processing element (PE), which in turn contains a CPU, memory, and a communication engine that takes care of communication between PEs. The bandwidth between nodes is quite high: 325 MB/s. Like its predecessor, the T3D, the T3E has hardware support for fast synchronisation; for example, barrier synchronisation takes only one cycle per check. The node also contains a set of E-registers and streaming registers that allow for aggressive prefetching to ease the processor/memory bottleneck. An interesting additional feature is the availability of 32 contexts per processor, which opens the door for multiprocessing.

The T3E has distributed I/O. In the air-cooled systems one I/O channel can be configured for every 8 PEs; in the liquid-cooled systems this is one I/O channel per 16 nodes. The maximum bandwidth of a channel is about 1 GB/s; the actual speed will be in the order of 500 MB/s.

The T3E supports various programming models. Apart from PVM and MPI for message passing and HPF for data distribution, Cray's proprietary one-sided communication library, the so-called shmem library, can be employed for message passing. In addition, the BSP library (see [12]), also a one-sided message passing library, is available. The shmem library is implemented close to the hardware and shows a very low latency of only 1.6 µs.
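
To get a feeling for what a latency of 1.6 µs and a point-to-point bandwidth of 325 MB/s mean for messages of different sizes, one can use the common first-order model t(n) = latency + n / bandwidth. The sketch below is such an estimate, not a measurement: combining the shmem latency with the hardware point-to-point bandwidth, and taking 1 MB as 10^6 bytes, are both assumptions made here for illustration.

```python
# First-order message cost model: time = latency + size / bandwidth.
# Assumption: shmem latency and hardware point-to-point bandwidth are combined
# in one model; real transfers will deviate from this simple estimate.
LATENCY_S = 1.6e-6          # shmem latency quoted in the text
BANDWIDTH_B_PER_S = 325e6   # point-to-point bandwidth, assuming 1 MB = 10**6 bytes


def transfer_time_s(n_bytes):
    """Estimated time in seconds to transfer a message of n_bytes."""
    return LATENCY_S + n_bytes / BANDWIDTH_B_PER_S


def effective_bandwidth_mb_s(n_bytes):
    """Estimated achieved bandwidth in MB/s for a message of n_bytes."""
    return n_bytes / transfer_time_s(n_bytes) / 1e6


def half_performance_bytes():
    """Message size at which half of the asymptotic bandwidth is reached."""
    return LATENCY_S * BANDWIDTH_B_PER_S


if __name__ == "__main__":
    for size in (8, 520, 65536):
        print(size, "bytes:", effective_bandwidth_mb_s(size), "MB/s")
    print("Half-performance message size:", half_performance_bytes(), "bytes")
```

Under this model, half of the asymptotic bandwidth is already reached for messages of about 520 bytes, which is what makes a low-latency one-sided library attractive for fine-grained communication.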

There are some differences in the available configurations between the T3E-1200 and the T3E-1350. In the T3E-1200 the amount of memory per node ranges from 64 MB to 2 GB, while in the 1350 model the only choice is between 256 and 512 MB per node. Furthermore, there is an air-cooled model of the T3E-1200 (up to 128 PEs), while the larger configurations are liquid-cooled. The T3E-1350 comes only in liquid-cooled configurations, which start at 40 processors and can be extended in modules of 8 processors. The 1200 systems start at 6 processors, and modules of 4 or 8 processors can be added.
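
The configuration rules for the T3E-1350 can be written down as a small check. This is only a sketch of the rules as stated in this section (40 to 2176 processors in steps of 8, and 256 or 512 MB per node); the function names are hypothetical and no vendor configuration tool is implied.

```python
# Configuration rules for the T3E-1350 as stated in the text.
MIN_PES = 40                  # smallest configuration
MAX_PES = 2176                # largest configuration
MODULE_PES = 8                # processors are added in modules of 8
MEMORY_CHOICES_MB = (256, 512)


def is_valid_t3e1350_size(n_pes):
    """True if n_pes is reachable from 40 processors in modules of 8."""
    return MIN_PES <= n_pes <= MAX_PES and (n_pes - MIN_PES) % MODULE_PES == 0


def total_memory_gb(n_pes, mb_per_node):
    """Total memory in GB (1 GB = 1024 MB) for a valid T3E-1350 configuration."""
    if not is_valid_t3e1350_size(n_pes):
        raise ValueError("not a valid T3E-1350 processor count: %d" % n_pes)
    if mb_per_node not in MEMORY_CHOICES_MB:
        raise ValueError("memory per node must be 256 or 512 MB")
    return n_pes * mb_per_node / 1024.0


if __name__ == "__main__":
    for n in (32, 40, 44, 128, 2176):
        print(n, is_valid_t3e1350_size(n))
    print(total_memory_gb(128, 512), "GB")
```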

Measured performance:
In [6] a speed of 1.127 Tflop/s is reported for the solution of a dense linear system of order 148,800 on a T3E-1200 with 1488 processors, which amounts to an efficiency of 63%. The same source quotes a speed of 113.9 Gflop/s out of a theoretical peak of 172.8 Gflop/s on a 128-processor T3E-1350 for solving a linear system of order 89,088, giving an efficiency of 66%.
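
The efficiencies quoted above are simply the measured speed divided by the theoretical peak of the configuration. The sketch below redoes that arithmetic. The per-processor peak of 1.35 Gflop/s for the T3E-1350 comes from the system parameters; the 1.2 Gflop/s used for the T3E-1200 is an assumption that is not stated in this section.

```python
# Efficiency = measured speed / (number of processors * peak per processor).
PEAK_T3E1350_GFLOPS = 1.35  # per processor, from the system parameters
PEAK_T3E1200_GFLOPS = 1.2   # per processor; assumption, not given in this section


def efficiency(measured_gflops, n_processors, peak_per_proc_gflops):
    """Fraction of the theoretical peak that was actually attained."""
    return measured_gflops / (n_processors * peak_per_proc_gflops)


if __name__ == "__main__":
    e1200 = efficiency(1127.0, 1488, PEAK_T3E1200_GFLOPS)
    e1350 = efficiency(113.9, 128, PEAK_T3E1350_GFLOPS)
    print("T3E-1200, 1488 processors: %.0f%%" % (100 * e1200))
    print("T3E-1350,  128 processors: %.0f%%" % (100 * e1350))
```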

Aad van der Steen
Mon Nov 3 09:57:02 CET 2003