|Machine type||RISC-based distributed-memory multi-processor|
|Operating system||UNICOS MAX (micro-kernel Unix)|
|Connection structure||3-D Torus|
|Compilers||CFT77_M (Fortran 77 with extensions), C|
|Vendor's information Web page||http://www.cray.com/PUBLIC/product-info/T3E/|
|Clock cycle||3.3 ns|
|Theor. peak performance|
|Per proc. (64-bit)||600 Mflop/s|
|Maximal (64-bit)||1229 Gflop/s|
|Main memory||<= 4096 GB|
|Memory/node||<= 2 GB|
|Communication bandwidth||300 MB/s|
|No. of processors||16-2048|
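The aggregate figures in the table follow from the per-processor figures. The short Python sketch below reproduces that arithmetic; the function and constant names are illustrative only, and it assumes one processor per node, as described in the text.

```python
# Figures copied from the table above.
PEAK_PER_PROC_MFLOPS = 600      # theoretical peak per processor (64-bit)
MAX_PROCESSORS = 2048           # largest configuration
MAX_MEMORY_PER_NODE_GB = 2      # upper bound on memory per node


def aggregate_peak_gflops(n_procs, per_proc_mflops=PEAK_PER_PROC_MFLOPS):
    """Theoretical peak of n_procs processors in Gflop/s."""
    return n_procs * per_proc_mflops / 1000.0


def aggregate_memory_gb(n_procs, per_node_gb=MAX_MEMORY_PER_NODE_GB):
    """Upper bound on total memory in GB, assuming one processor per node."""
    return n_procs * per_node_gb


if __name__ == "__main__":
    # Smallest, largest air-cooled, and largest configuration.
    for n in (16, 128, MAX_PROCESSORS):
        print(n, aggregate_peak_gflops(n), aggregate_memory_gb(n))
```

For 2048 processors this gives 1228.8 Gflop/s, which the table rounds to 1229 Gflop/s, and 4096 GB of memory.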
The T3E is the second generation of DM-MIMD systems from CRI. Its name follows that of its predecessor, the T3D, whose name referred to its connection structure: a 3-D torus. The T3E retains this interconnection structure, but in many other respects the two systems differ considerably. A first and important difference is that a front-end system is no longer required (although it is still possible to connect to a Cray T90). Systems with up to 128 processors are air-cooled; the larger ones, from 256 to 2,048 processors, are liquid-cooled.
The T3E uses the DEC Alpha 21164 RISC processor for its computational tasks, just like the Avalon A12. Cray stresses, however, that the processors are encapsulated in such a way that they can easily be exchanged for another (faster) processor as soon as one becomes available, without affecting the macro-architecture of the system.
Each node in the system contains one processing element (PE) which in turn contains a CPU, memory, and a communication engine that takes care of communication between PEs. The bandwidth between nodes is quite high: 300 MB/s. Like the T3D, the T3E has hardware support for fast synchronisation. E.g., barrier synchronisation takes only one cycle per check.
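To illustrate what a 3-D torus connection structure means for communication between PEs, the following Python sketch computes the six direct neighbours of a node and the minimal hop count between two nodes. The coordinate scheme, the function names, and the 8 x 8 x 8 example size are assumptions made for illustration; they do not describe how Cray numbers or routes between PEs.

```python
def torus_neighbors(coord, dims):
    """Return the six direct neighbours of coord in a 3-D torus.

    coord and dims are (x, y, z) tuples. Links wrap around in every
    dimension, so node 0 is adjacent to node n-1 along each axis.
    For dimensions of size 1 or 2 some entries coincide.
    """
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimal number of links between nodes a and b in the torus."""
    # Along each axis, take the shorter of the two ways round the ring.
    return sum(min((p - q) % n, (q - p) % n) for p, q, n in zip(a, b, dims))


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical 512-node configuration
    print(torus_neighbors((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (7, 0, 0), dims))
```

Because of the wrap-around links, the largest distance along an axis of length n is n // 2 hops rather than n - 1, which is the main advantage of a torus over a plain 3-D mesh.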
Most of the changes in the transition from the T3D to the T3E have taken place in the micro-architecture. First, there is only one CPU per node instead of two, which removes a source of asymmetry between processors. Second, the new node processor has a 96 KB 3-way set-associative secondary cache, which may relieve some of the data-fetching problems of the T3D, where only a primary cache was present. Third, the Block Transfer Engine has been replaced by a set of E-registers that are believed to be much more flexible and that at least remove some odd restrictions on the size of shared arrays and on the number of processes when using Cray-specific PVM. An interesting additional feature is the availability of 32 contexts per processor, which opens the door to multiprocessing.
In the T3D all I/O had to be handled by the front-end, which had to be a system of at least the Cray Y-MPE class. In the T3E distributed I/O is present. One I/O channel can be configured for every 8 PEs in the air-cooled systems and for every 16 PEs in the liquid-cooled systems. The maximum bandwidth of a channel is about 1 GB/s; the actual speed will be of the order of 700 MB/s.
The T3E supports various programming models. Apart from PVM 3.x for message passing and HPF for data distribution, a Cray proprietary work-sharing model, called CRAFT, can be employed. Cray views HPF and Fortran 90 array syntax as subsets of the CRAFT model. Within this model data can be exchanged implicitly, so that the machine effectively looks like a shared-memory system to the user. Like several other vendors, Cray has extended/altered the implementation of PVM to enhance the communication performance. For small messages this can give an improvement of a factor of 3 (20-25 µs instead of 70-80 µs). For SPMD programs channel send/receive functions can be used, which reduces the communication time to 4-5 µs. The faster implementations are not portable, however.
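The I/O configuration rules can be turned into a rough estimate of the maximum number of channels and the aggregate I/O bandwidth. The Python sketch below does this under several assumptions that are not stated in the text: that the cooling type is fixed by system size (up to 128 PEs air-cooled, larger systems liquid-cooled), that partial groups of PEs get no channel, and that channel bandwidths simply add up. The function names are illustrative.

```python
def max_io_channels(n_pes):
    """Upper bound on configurable I/O channels for a system of n_pes PEs."""
    # Assumption: systems of up to 128 PEs are air-cooled (1 channel per
    # 8 PEs); larger systems are liquid-cooled (1 channel per 16 PEs).
    pes_per_channel = 8 if n_pes <= 128 else 16
    return n_pes // pes_per_channel


def aggregate_io_gb_s(n_pes, per_channel_gb_s=0.7):
    """Rough aggregate I/O bandwidth in GB/s.

    The default of 0.7 GB/s per channel is the expected actual speed
    quoted in the text; pass 1.0 for the quoted channel maximum.
    Assumes bandwidth scales linearly with the number of channels.
    """
    return max_io_channels(n_pes) * per_channel_gb_s


if __name__ == "__main__":
    for n in (16, 128, 256, 2048):
        print(n, max_io_channels(n), aggregate_io_gb_s(n))
```

Under these assumptions a 128-PE air-cooled system and a 256-PE liquid-cooled system both come out at 16 channels, and a full 2048-PE system at 128 channels.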
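To see why the start-up time matters so much for small messages, the following Python sketch uses a simple linear model, time = start-up time + size / bandwidth. This model is an assumption, not something Cray specifies. It takes the quoted start-up times to be in microseconds, uses the midpoints of the quoted ranges, and uses the 300 MB/s hardware bandwidth from the table as an optimistic transfer rate (with 1 MB taken as 10^6 bytes).

```python
# Midpoints of the start-up times quoted in the text, in microseconds.
STARTUP_US = {
    "standard PVM": 75.0,       # 70-80
    "Cray-enhanced PVM": 22.5,  # 20-25
    "channel send/receive": 4.5,  # 4-5
}
BANDWIDTH_MB_S = 300.0  # from the table; 1 MB/s equals 1 byte per microsecond


def transfer_time_us(nbytes, startup_us, bandwidth_mb_s=BANDWIDTH_MB_S):
    """Estimated time in microseconds to send nbytes (linear model)."""
    return startup_us + nbytes / bandwidth_mb_s


def half_performance_length(startup_us, bandwidth_mb_s=BANDWIDTH_MB_S):
    """Message size in bytes at which half the peak bandwidth is reached."""
    return startup_us * bandwidth_mb_s


if __name__ == "__main__":
    for name, t0 in STARTUP_US.items():
        print(name, transfer_time_us(1024, t0), half_performance_length(t0))
```

In this model a message must be several kilobytes long before the transfer term outweighs the start-up term of standard PVM, so for short messages the choice of communication layer dominates the cost.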
Measured Performances: The Cray T3E has only recently been announced (November 1995). At this moment no performance figures are available.