Machine type | RISC-based distributed-memory multi-processor |
---|---|
Models | T3E, T3E-900, T3E-1200 |
Operating system | UNICOS/mk (micro kernel-based Unix) |
Connection structure | 3-D Torus |
Compilers | Fortran 77, Fortran 90, HPF, ANSI C, C++. |
Vendor's information Web page | www.cray.com/products/systems/crayt3e/ |
Year of introduction | T3E, T3E-900: 1996, T3E-1200: 1997. |
System parameters:
Model | T3E | T3E-900 | T3E-1200 |
---|---|---|---|
Clock cycle | 3.3 ns | 2.2 ns | 1.67 ns |
Theor. peak performance | |||
Per proc. (64-bits) | 600 Mflop/s | 900 Mflop/s | 1200 Mflop/s |
Maximal | 1229 Gflop/s | 1843 Gflop/s | 2458 Gflop/s |
Main memory | |||
Memory/node | <= 2 GB | <= 2 GB | <= 2 GB |
Memory/maximal | <= 4 TB | <= 4 TB | <= 4 TB |
Communication bandwidth | 300 MB/s | 300 MB/s | 300 MB/s |
No. of processors | 16-2048 | 6-2048 | 6-2048 |
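
The peak-performance rows can be cross-checked against the clock cycle with a short calculation. This is only a sketch: it assumes two floating-point results per clock cycle, a figure the table does not state but which is consistent with its per-processor values. The function names are illustrative.

```python
# Sketch: relate the clock cycle to the theoretical peak performance rows.
# Assumption (not stated in the table): 2 floating-point results per cycle.
FLOPS_PER_CYCLE = 2
MAX_PROCESSORS = 2048  # upper end of the "No. of processors" row

clock_cycle_ns = {"T3E": 3.3, "T3E-900": 2.2, "T3E-1200": 1.67}
nominal_mflops = {"T3E": 600, "T3E-900": 900, "T3E-1200": 1200}


def peak_mflops_per_proc(cycle_ns, flops_per_cycle=FLOPS_PER_CYCLE):
    """Peak Mflop/s of one processor: 1000 / cycle_ns is the clock in MHz."""
    return flops_per_cycle * 1000.0 / cycle_ns


def peak_gflops_system(mflops_per_proc, processors=MAX_PROCESSORS):
    """Peak Gflop/s of a full system built from identical processors."""
    return mflops_per_proc * processors / 1000.0


for model, cycle in clock_cycle_ns.items():
    from_clock = peak_mflops_per_proc(cycle)
    system = peak_gflops_system(nominal_mflops[model])
    print(f"{model}: {from_clock:.1f} Mflop/s from clock, "
          f"{system:.1f} Gflop/s for {MAX_PROCESSORS} processors")
```

The values derived from the rounded clock cycles differ by about 1% from the nominal 600, 900, and 1200 Mflop/s; the maximal figures in the table follow from the nominal values multiplied by 2048.
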
Remarks:
The T3E is the second generation of DM-MIMD systems from CRI. Its name follows that of its predecessor, the T3D, whose name referred to its connection structure: a 3-D torus. The T3E retains this interconnection structure, but in many other respects the two systems differ considerably. A first and important difference is that a front-end system is no longer required (although it is still possible to connect to a Cray T90). The systems with up to 128 processors are air-cooled. The larger ones, with 256-2,048 processors, are liquid-cooled.
The T3E uses the DEC Alpha 21164 RISC processor in the T3E and the 21164A processor in the T3E-900 for its computational tasks, just like the Avalon A12. Recently, the T3E-1200 was introduced, which uses 21164A processors with a clock cycle of only 1.67 ns but is identical to the T3E-900 in every other respect. Cray stresses that the processors are encapsulated in such a way that they can easily be exchanged for any other (faster) processor as soon as one becomes available, without affecting the macro-architecture of the system.
Each node in the system contains one processing element (PE) which in turn contains a CPU, memory, and a communication engine that takes care of communication between PEs. The bandwidth between nodes is quite high: 300 MB/s. Like the T3D, the T3E has hardware support for fast synchronisation. E.g., barrier synchronisation takes only one cycle per check.
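To illustrate what the 3-D torus connection structure means for communication between PEs, the sketch below computes a node's nearest neighbours and the minimal hop count between two nodes. The 8 x 8 x 8 dimensions and the function names are hypothetical; the text does not give the torus dimensions or the routing algorithm actually used.

```python
# Sketch of 3-D torus addressing. Dimensions are hypothetical (8 x 8 x 8 = 512 nodes).
DIMS = (8, 8, 8)


def torus_neighbours(coord, dims=DIMS):
    """Return the six nearest neighbours of a node, with wrap-around links."""
    neighbours = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]  # wrap around the ring
            neighbours.append(tuple(n))
    return neighbours


def torus_hops(a, b, dims=DIMS):
    """Minimal number of hops from a to b, taking the shorter way round each ring."""
    return sum(min((p - q) % n, (q - p) % n) for p, q, n in zip(a, b, dims))


print(torus_neighbours((0, 0, 0)))
print(torus_hops((0, 0, 0), (7, 0, 0)))
print(torus_hops((0, 0, 0), (4, 4, 4)))
```

Because every ring wraps around, the node at position 7 is a direct neighbour of the node at position 0, and the largest distance along a ring of length 8 is 4 hops rather than 7.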
Most of the changes in the transition from the T3D to the T3E have taken place in the microarchitecture. First, there is only one CPU per node instead of two, which removes a source of asymmetry between processors. Second, the new node processor has a 96 KB 3-way set-associative secondary cache, which may relieve some of the data-fetching problems of the T3D, where only a primary cache was present. Third, the Block Transfer Engine has been replaced by a set of E-registers and streaming registers that are much more flexible and that remove some odd restrictions on the size of shared arrays and the number of processes when using Cray-specific PVM. An interesting additional feature is the availability of 32 contexts per processor, which opens the door for multiprocessing.
In the T3D all I/O had to be handled by the front-end, a system of at least the Cray Y-MP/E class. In the T3E distributed I/O is present. In the air-cooled systems one I/O channel can be configured for every 8 PEs, and in the liquid-cooled systems one I/O channel per 16 PEs. The maximum bandwidth of a channel is about 1 GB/s; the actual speed will be in the order of 700 MB/s.
The T3E supports various programming models. Apart from PVM3 and MPI for message passing and HPF for data distribution, a Cray proprietary one-sided communication library, the so-called shmem library, can be employed for message passing. In addition, the BSP library (see [7]), also a one-sided message passing library, is available. The shmem library is implemented close to the hardware and shows a very low latency of only 1.6 µs.
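The I/O figures above translate into a rough upper bound on aggregate I/O bandwidth. The sketch below assumes that every possible channel is configured and that channels deliver their bandwidth independently, which the text does not claim; the names are illustrative.

```python
# Sketch: number of I/O channels and a rough aggregate bandwidth estimate.
# Assumption: all channels are configured and their bandwidths simply add up.
PES_PER_CHANNEL = {"air": 8, "liquid": 16}  # from the text
ACTUAL_CHANNEL_MB_S = 700                   # "in the order of 700 MB/s"


def max_io_channels(num_pes, cooling):
    """Largest number of I/O channels that can be configured."""
    return num_pes // PES_PER_CHANNEL[cooling]


def aggregate_io_gb_s(num_pes, cooling, per_channel_mb_s=ACTUAL_CHANNEL_MB_S):
    """Aggregate I/O bandwidth in GB/s, taking 1 GB = 1000 MB."""
    return max_io_channels(num_pes, cooling) * per_channel_mb_s / 1000.0


for pes, cooling in ((128, "air"), (2048, "liquid")):
    print(pes, cooling, max_io_channels(pes, cooling),
          aggregate_io_gb_s(pes, cooling))
```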
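The quoted latency and link bandwidth can be combined into a simple estimate of message transfer time. The sketch below uses the common linear model (time = latency + size / bandwidth), which is an assumption and not a measured T3E characteristic; it also assumes 1 MB = 10^6 bytes and that the 1.6 µs shmem latency and the 300 MB/s link bandwidth apply to the same transfer.

```python
# Sketch: linear latency/bandwidth model for a point-to-point transfer.
# Assumption: time = latency + size / bandwidth; real behaviour may differ.
LATENCY_S = 1.6e-6          # shmem latency quoted in the text
BANDWIDTH_B_PER_S = 300e6   # 300 MB/s, taking 1 MB = 10**6 bytes


def transfer_time(nbytes, latency=LATENCY_S, bandwidth=BANDWIDTH_B_PER_S):
    """Estimated time in seconds to move nbytes between two PEs."""
    return latency + nbytes / bandwidth


def effective_bandwidth(nbytes, latency=LATENCY_S, bandwidth=BANDWIDTH_B_PER_S):
    """Bandwidth in bytes/s actually achieved for a message of nbytes."""
    return nbytes / transfer_time(nbytes, latency, bandwidth)


def half_performance_length(latency=LATENCY_S, bandwidth=BANDWIDTH_B_PER_S):
    """Message size in bytes at which half the link bandwidth is reached."""
    return latency * bandwidth


for size in (8, 480, 65536):
    print(size, transfer_time(size), effective_bandwidth(size) / 1e6)
```

Under this model, messages of roughly 480 bytes reach half of the link bandwidth; shorter messages are dominated by latency, which is why a low-latency library matters for fine-grained communication.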
Measured Performances: In [2] a speed of 670 Gflop/s is reported for the solution of a dense linear system of order 128832 on a T3E-900 with 1320 processors. This corresponds to an efficiency of 56% relative to the theoretical peak performance of that configuration.
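The efficiency figure can be reproduced from the numbers in this entry, taking the theoretical peak as the processor count times the per-processor peak of the T3E-900. The function name is illustrative.

```python
# Sketch: efficiency of the reported run relative to theoretical peak.
MEASURED_GFLOPS = 670.0        # reported in [2]
PROCESSORS = 1320
PEAK_MFLOPS_PER_PROC = 900.0   # T3E-900, from the system parameters table


def efficiency(measured_gflops, processors, peak_mflops_per_proc):
    """Measured speed as a fraction of theoretical peak performance."""
    peak_gflops = processors * peak_mflops_per_proc / 1000.0
    return measured_gflops / peak_gflops


eff = efficiency(MEASURED_GFLOPS, PROCESSORS, PEAK_MFLOPS_PER_PROC)
print(f"Efficiency: {eff:.1%}")
```

The theoretical peak of the 1320-processor configuration is 1188 Gflop/s, so 670 Gflop/s is a little over 56% of peak.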