The Cray Research Inc. Cray J90 series, T90 series.

Machine type	Shared-memory multi-vector processor
Models	Cray J90, T90
Operating system	UNICOS (Cray Unix variant)
Compilers	Fortran, C, C++, Pascal, Ada
Vendor's information Web page	http://www.cray.com

System parameters:

Model	Cray J90	Cray T90
Clock cycle	10 ns	2.2 ns
Theor. peak performance
Per processor	200 Mflop/s	1.8 Gflop/s
Maximal	6.4 Gflop/s	58 Gflop/s
Main memory	<=4 GB	<=8 GB
Memory bandwidth
Single proc. bandwidth	1.6 GB/s	24 GB/s
No. of processors	4-32	1-32

Remarks:

Cray Research Inc. (CRI) currently supports three product lines (apart from the SuperSparc-based CS6400, which is targeted at the commercial market and is not discussed in this report). Two of these are multi-headed vector processors, which are discussed here. The third is the T3E, a DM-MIMD machine that is described in section 3.4.

The Cray J90 series, announced in September 1994, is the entry-level model marketed by CRI. The J90 series is based on CMOS technology, which gives low power consumption (all J90s are air cooled) and low production costs. The machine is binary compatible with the high-end systems. It has one multiply and add vector pipe set per CPU at a clock cycle of 10 ns, i.e., two floating-point results per cycle, which results in a theoretical peak performance of 200 Mflop/s per CPU. Furthermore, a cache has been added to speed up scalar processing (as in the Convex C4 series, see 3.3.3). It is interesting to note that the strategy of using more (four) multi-functional pipes, as in the predecessor, the Y-MP EL, has been abandoned in favor of a return to the classic two-pipe/CPU design.

The Cray T90 series is built in ECL logic and therefore has a much shorter clock cycle (2.2 ns) and correspondingly faster SRAM memory. Like its direct predecessor, the Cray C90, every CPU contains two sets of vector add and multiply pipes. This gives a maximum of 4 floating-point results/clock cycle/CPU, equivalent to a theoretical peak performance of 1.8 Gflop/s per CPU, or 58 Gflop/s for a maximal system of 32 CPUs.
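The peak figures quoted for both series follow from the clock cycle and the number of floating-point results per cycle. The following is a minimal sketch of that arithmetic; the function name `peak_flops` is hypothetical, and the values of 2 results/cycle for the J90 and 4 results/cycle for the T90 are taken from the pipe counts described in the text.

```python
def peak_flops(clock_ns, results_per_cycle, cpus=1):
    """Theoretical peak in flop/s: results per cycle divided by the cycle time."""
    return results_per_cycle * cpus / (clock_ns * 1e-9)

# J90: one multiply-and-add pipe set, i.e. 2 results/cycle at 10 ns.
j90_cpu = peak_flops(10.0, 2)
# T90: two pipe sets, i.e. 4 results/cycle at 2.2 ns.
t90_cpu = peak_flops(2.2, 4)
# Maximal T90 configuration with 32 CPUs.
t90_max = peak_flops(2.2, 4, cpus=32)

print(f"J90 per CPU: {j90_cpu / 1e6:.0f} Mflop/s")
print(f"T90 per CPU: {t90_cpu / 1e9:.1f} Gflop/s")
print(f"T90 maximal: {t90_max / 1e9:.0f} Gflop/s")
```

The quoted 1.8 Gflop/s and 58 Gflop/s are rounded values of 4 / 2.2 ns and 32 times that amount.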

The Cray T90 machines are at this moment the only ones with a memory bandwidth that seems optimal for vector processors: for each pipe set, two operands can be loaded and one result can be stored in one cycle. With two pipe sets and 8-byte words, this means that the bandwidth has to be 2 × 3 × 8 = 48 bytes/cycle/CPU. This has indeed been accomplished, and observed results indicate that for the T90 the performance scales with the clock cycle and the number of functional units (see measured performances below). For the J90 series the bandwidth is lower: 16 bytes/cycle/CPU, where a single pipe set would need 24 bytes/cycle by the same reasoning. The J90 figure is regrettably less than was available in its predecessors, the Y-MP EL machines, and it might adversely affect the efficiency.
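The bandwidth requirement above can be sketched as a short calculation. This is an illustration only: the helper name `required_bytes_per_cycle` is hypothetical, and the 8-byte operand size is an assumption based on 64-bit words.

```python
WORD_BYTES = 8  # assumption: 64-bit (8-byte) operands

def required_bytes_per_cycle(pipe_sets, loads=2, stores=1, word_bytes=WORD_BYTES):
    """Bytes/cycle needed so every pipe set can load two operands and store one result."""
    return pipe_sets * (loads + stores) * word_bytes

t90_required = required_bytes_per_cycle(2)  # two pipe sets per CPU
j90_required = required_bytes_per_cycle(1)  # one pipe set per CPU

j90_available = 16                           # bytes/cycle, as stated in the text
j90_fraction = j90_available / j90_required  # share of the ideal bandwidth
j90_gb_per_s = j90_available / 10.0          # bytes per ns equals GB/s at a 10 ns cycle

print(t90_required, j90_required, round(j90_fraction, 2), j90_gb_per_s)
```

Under these assumptions the J90 delivers two thirds of the bandwidth that the same reasoning would ask for, and its 16 bytes/cycle at a 10 ns cycle corresponds to the 1.6 GB/s listed under the system parameters.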

Another property that is unique to the Cray T90 systems is that they do not have a separate scalar processor: scalar and vector code have to share the same functional units. However, a small scalar cache has been added to speed up scalar calculations. The Cray J90 series has separate scalar processors. In theory, the absence of separate scalar processors might impair throughput (Hitachi (3.3.2) even adds an extra scalar processor in the S-3800 series to combat excessive context switching). In practice, however, the drawbacks seem rather limited.

Unlike earlier high-end Cray systems, the T90 now features compatibility with the IEEE 754 floating-point standard. Formerly, Cray-specific floating-point arithmetic was employed, which could cause problems in data exchange with other systems and lead to different computational results because of the difference in arithmetic.
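To make the data-exchange issue concrete, the sketch below splits a 64-bit IEEE 754 double into its sign, exponent, and fraction fields. It only illustrates the IEEE side; the function name `ieee754_fields` is hypothetical, and no conversion to or from the Cray-specific format is attempted here. A system using a different field layout would need an explicit conversion step before binary data could be shared.

```python
import struct

def ieee754_fields(x):
    """Return (sign, biased exponent, fraction) of x as an IEEE 754 binary64 value."""
    # Pack as a big-endian double, then reinterpret the same 8 bytes as an integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, bias 1023
    fraction = bits & ((1 << 52) - 1)  # 52 bits, implicit leading 1 for normal numbers
    return sign, exponent, fraction

print(ieee754_fields(1.0))
print(ieee754_fields(-2.5))
```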

Measured Performances: On the T90, a speed of 29.4 Gflop/s was found in [linpackbm] on a 32-processor machine for the solution of a dense linear system of order 1000. For a J90 series machine with 32 processors, a speed of 4.486 Gflop/s was observed for the same problem. Relative to the 32-processor theoretical peaks of 58 and 6.4 Gflop/s (32 × 200 Mflop/s), this amounts to efficiencies of 51% and 70% for the T90 and the J90, respectively.
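The efficiencies can be reproduced as the ratio of measured speed to theoretical peak. This is a minimal sketch, assuming the J90 peak for 32 processors is 32 × 200 Mflop/s = 6.4 Gflop/s and the T90 peak is the quoted 58 Gflop/s; the function name `efficiency` is hypothetical.

```python
def efficiency(measured_gflops, peak_gflops):
    """Fraction of the theoretical peak that was actually achieved."""
    return measured_gflops / peak_gflops

t90_peak = 58.0      # Gflop/s, 32 CPUs, as quoted in the text
j90_peak = 32 * 0.2  # Gflop/s, assumption: 32 CPUs at 200 Mflop/s each

t90_eff = efficiency(29.4, t90_peak)
j90_eff = efficiency(4.486, j90_peak)

print(f"T90: {t90_eff:.0%}, J90: {j90_eff:.0%}")
```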

Jack Dongarra
Sat Feb 10 15:12:38 EST 1996