
The Cray Research Inc. T3E.

Machine type                     RISC-based distributed-memory multi-processor
Models                           T3E, T3E-900, T3E-1200
Operating system                 UNICOS/mk (micro kernel-based Unix)
Connection structure             3-D Torus
Compilers                        Fortran 77, Fortran 90, HPF, ANSI C, C++
Vendor's information Web page    www.cray.com/products/systems/crayt3e/
Year of introduction             T3E, T3E-900: 1996, T3E-1200: 1997

System parameters:

Model                      T3E            T3E-900        T3E-1200
Clock cycle                3.3 ns         2.2 ns         1.67 ns
Theor. peak performance
  Per proc. (64-bits)      600 Mflop/s    900 Mflop/s    1200 Mflop/s
  Maximal                  1229 Gflop/s   1843 Gflop/s   2458 Gflop/s
Main memory
  Memory/node              <= 2 GB        <= 2 GB        <= 2 GB
  Memory/maximal           <= 4 TB        <= 4 TB        <= 4 TB
Communication bandwidth    300 MB/s       300 MB/s       300 MB/s
No. of processors          16-2048        6-2048         6-2048
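
The peak figures in the table can be reproduced from the clock cycle times. The sketch below is only an illustration: the factor of 2 floating-point results per cycle is an assumption inferred from the table (3.3 ns per cycle against 600 Mflop/s), not something stated in the text, and the function names are made up for this example. Because the cycle times are rounded, the per-processor results are only approximately equal to the nominal values.

```python
# ASSUMPTION: 2 floating-point results per clock cycle, inferred from the table.
FLOPS_PER_CYCLE = 2
MAX_PROCESSORS = 2048  # largest configuration listed in the table

# Clock cycle times in nanoseconds, as given in the table (rounded values).
CYCLE_NS = {"T3E": 3.3, "T3E-900": 2.2, "T3E-1200": 1.67}
# Nominal per-processor peak in Mflop/s, as given in the table.
NOMINAL_MFLOPS = {"T3E": 600, "T3E-900": 900, "T3E-1200": 1200}


def peak_mflops_per_proc(cycle_ns, flops_per_cycle=FLOPS_PER_CYCLE):
    """Peak Mflop/s of one processor: clock frequency in MHz times flops per cycle."""
    clock_mhz = 1000.0 / cycle_ns
    return flops_per_cycle * clock_mhz


def peak_gflops_system(mflops_per_proc, processors=MAX_PROCESSORS):
    """Aggregate peak in Gflop/s for a given number of processors."""
    return mflops_per_proc * processors / 1000.0


for model, cycle in CYCLE_NS.items():
    derived = peak_mflops_per_proc(cycle)
    maximal = peak_gflops_system(NOMINAL_MFLOPS[model])
    print(f"{model}: ~{derived:.0f} Mflop/s per processor from the cycle time, "
          f"{maximal:.0f} Gflop/s for {MAX_PROCESSORS} processors")
```

The "Maximal" row is computed here from the nominal per-processor values, which is how the table's 1229, 1843 and 2458 Gflop/s come out after rounding.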

Remarks:

The T3E is the second generation of DM-MIMD systems from CRI. Its name follows that of its predecessor, the T3D, whose name referred to its connection structure: a 3-D torus. In this respect the T3E still has the same interconnection structure as the T3D. In many other respects, however, there are considerable differences. A first and important difference is that a front-end system is no longer required (although it is still possible to connect to a Cray T90). Systems with up to 128 processors are air-cooled. The larger ones, with 256-2048 processors, are liquid-cooled.

Just like the Avalon A12, the T3E uses DEC Alpha RISC processors for its computational tasks: the 21164 in the T3E and the 21164A in the T3E-900. Recently, the T3E-1200 has been introduced, which uses 21164A processors with a clock cycle of only 1.67 ns but is identical in every other respect to the T3E-900. Cray stresses that the processors are encapsulated in such a way that they can easily be exchanged for any other (faster) processor as soon as one becomes available, without affecting the macro-architecture of the system.

Each node in the system contains one processing element (PE), which in turn contains a CPU, memory, and a communication engine that takes care of communication between PEs. The bandwidth between nodes is quite high: 300 MB/s. Like the T3D, the T3E has hardware support for fast synchronisation. For example, barrier synchronisation takes only one cycle per check.
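
To make the 3-D torus connection structure concrete, the following sketch computes the nearest neighbours of a node and the minimal hop count between two nodes when every dimension wraps around. It illustrates the topology only: the 8 x 8 x 8 dimensions, the coordinate scheme and the function names are hypothetical, and the text does not describe how the T3E actually numbers its nodes or routes messages.

```python
def torus_neighbours(coord, dims):
    """Return the six nearest neighbours of node `coord` in a 3-D torus of size `dims`."""
    neighbours = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # The modulo provides the wraparound link that turns a mesh into a torus.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbours.append(tuple(n))
    return neighbours


def torus_hops(a, b, dims):
    """Minimal number of hops between nodes a and b, using wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


DIMS = (8, 8, 8)  # hypothetical 512-node configuration
print(torus_neighbours((0, 0, 0), DIMS))
print(torus_hops((0, 0, 0), (7, 7, 7), DIMS))
```

In a torus the corner nodes (0, 0, 0) and (7, 7, 7) are only three hops apart, whereas in a mesh without wraparound links they would be 21 hops apart.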

Most of the changes in the transition from the T3D to the T3E have taken place in the microarchitecture. First, there is only one CPU per node instead of two, which removes a source of asymmetry between processors. Second, the new node processor has a 96 KB 3-way set-associative secondary cache, which may relieve some of the data fetching problems that were present in the T3D, where only a primary cache was present. Third, the Block Transfer Engine has been replaced by a set of E-registers and streaming registers that are much more flexible and that remove some odd restrictions on the size of shared arrays and the number of processes when using Cray-specific PVM. An interesting additional feature is the availability of 32 contexts per processor, which opens the door for multiprocessing.

In the T3D all I/O had to be handled by the front-end, a system of at least the Cray Y-MP/E class. In the T3E distributed I/O is present. One I/O channel can be configured for every 8 PEs in the air-cooled systems and for every 16 PEs in the liquid-cooled systems. The maximum bandwidth of a channel is about 1 GB/s; the actual speed will be in the order of 700 MB/s.
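
As a rough illustration of what these ratios imply, the sketch below counts the I/O channels that could be configured and multiplies by the quoted per-channel speed. It assumes that every group of PEs is actually given a channel and that the channel bandwidths simply add up; neither assumption is stated in the text, and the function names are invented for this example.

```python
PES_PER_CHANNEL_AIR = 8       # air-cooled systems (up to 128 processors)
PES_PER_CHANNEL_LIQUID = 16   # liquid-cooled systems (256-2048 processors)
ACTUAL_GB_PER_S = 0.7         # "in the order of 700 MB/s" per channel
MAXIMUM_GB_PER_S = 1.0        # "about 1 GB/s" per channel


def max_io_channels(pes, liquid_cooled):
    """Upper bound on the number of configurable I/O channels."""
    pes_per_channel = PES_PER_CHANNEL_LIQUID if liquid_cooled else PES_PER_CHANNEL_AIR
    return pes // pes_per_channel


def aggregate_gb_per_s(channels, per_channel=ACTUAL_GB_PER_S):
    """ASSUMPTION: channel bandwidths add up linearly."""
    return channels * per_channel


for pes, liquid in ((128, False), (2048, True)):
    channels = max_io_channels(pes, liquid)
    print(f"{pes} PEs: up to {channels} channels, "
          f"~{aggregate_gb_per_s(channels):.1f} GB/s aggregate under these assumptions")
```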

The T3E supports various programming models. Apart from PVM3 and MPI for message passing and HPF for data distribution, a Cray proprietary one-sided communication library, the so-called shmem library, can be employed for message passing. In addition, the BSP library (see [7]), also a one-sided message passing library, is available. The shmem library is implemented close to the hardware and shows a very low latency of only 1.6 µs.

Measured Performances: In [2] a speed of 670 Gflop/s is reported for the solution of a dense linear system of order 128832 on a T3E-900 with 1320 processors. This corresponds to an efficiency of 56% relative to the theoretical peak performance of that configuration.
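
The quoted efficiency can be checked against the per-processor peak from the system parameters. This is a minimal sketch of that arithmetic; the function names are illustrative only.

```python
def peak_gflops(processors, mflops_per_proc):
    """Theoretical peak of a configuration in Gflop/s."""
    return processors * mflops_per_proc / 1000.0


def efficiency(measured_gflops, processors, mflops_per_proc):
    """Measured speed as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops(processors, mflops_per_proc)


# T3E-900: 900 Mflop/s per processor, 1320 processors, 670 Gflop/s measured.
peak = peak_gflops(1320, 900)
eff = efficiency(670, 1320, 900)
print(f"peak = {peak:.0f} Gflop/s, efficiency = {eff:.1%}")
# prints: peak = 1188 Gflop/s, efficiency = 56.4%
```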






Aad van der Steen
Thu Feb 12 13:12:37 MET 1998