| Machine type | Distributed-memory multi-processor |
|---|---|
| Models | MTAx, x = 16, 32, ..., 256 |
| Operating system | Unix BSD4.4 + proprietary micro kernel |
| Compilers | Fortran 77/90, ANSI C, C++ |
| Vendor's information Web page | http://www.cray.com/products/systems/mta/ |
| Year of introduction | 2001 |
System parameters:
| Model | MTAx |
|---|---|
| Clock cycle | --- |
| Theor. peak performance, per proc. (64-bit) | >= 750 Mflop/s |
| Theor. peak performance, maximal (64-bit) | >= 192 Gflop/s |
| Main memory | <= 1 TB |
| No. of processors | 16--256 |
Remarks:
The exact peak speed of the MTA systems cannot be given, as the clock cycle is a well-kept secret on the data sheets of the Cray MTA; only lower bounds on the peak performance are quoted. Systems with 200 MHz and 227 MHz clocks exist, but these cannot be the target clock frequency, because they are about a factor of 3 lower than the frequency that would be consistent with the peak speed implied by the Cray documentation.
Let us look at the architectural features: although the memory in the MTA is physically distributed, the system is emphatically presented as a shared-memory machine (with non-uniform access time). The latency incurred in memory references is hidden by multi-threading, i.e., many concurrent program threads (instruction streams) may be active at any time. When, for instance, a load instruction cannot be satisfied immediately because of memory latency, the requesting thread is stalled and another thread whose next operation can proceed is switched into execution. This switching between program threads takes only 1 cycle. As up to 128 instruction streams may be active and 8 memory references can be issued per stream without waiting for preceding ones, a latency of 1024 cycles can be tolerated. Stalled references are retried from a retry pool. A similar construction is found in the Stern Computing Systems SSP machines.
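To make the latency-hiding argument concrete, the toy model below (plain Python, not MTA software; the single-cycle switch and the single outstanding reference per stream are simplifying assumptions) shows how processor utilisation depends on the number of active streams for a fixed memory latency. Allowing 8 outstanding references per stream, as the MTA does, stretches the tolerated latency to 128 × 8 = 1024 cycles.

```python
# Illustrative sketch only: a round-robin model of latency hiding by
# multi-threading.  Assumptions: one instruction issued per cycle, the
# 1-cycle stream switch folded into the issue, and every instruction a
# memory reference that completes MEM_LATENCY cycles after issue.

from collections import deque

def utilization(n_streams, mem_latency, cycles=100_000):
    """Fraction of cycles in which some stream can issue an instruction."""
    ready = deque(range(n_streams))    # streams able to issue now
    pending = []                       # (completion_cycle, stream) pairs
    busy = 0
    for t in range(cycles):
        # Completed memory references make their stream ready again.
        still_pending = []
        for done, s in pending:
            if done <= t:
                ready.append(s)
            else:
                still_pending.append((done, s))
        pending = still_pending
        # Issue from the next ready stream, if any (round-robin).
        if ready:
            s = ready.popleft()
            pending.append((t + mem_latency, s))
            busy += 1
    return busy / cycles

# With 128 streams a memory latency of 128 cycles is fully hidden;
# with only 8 streams the processor idles most of the time.
for streams in (8, 32, 128):
    print(streams, round(utilization(streams, mem_latency=128), 2))
```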
The connection network connects a 3-D cube of p processors with sides of p^(1/3), in which alternately the x- or the y-axes are connected. Therefore, all nodes connect to four out of six neighbours. In a p-processor system the worst-case latency is 4.5 p^(1/3) cycles; the average latency is 2.25 p^(1/3) cycles. Furthermore, there is an I/O port at every node. Each network port is capable of sending and receiving a 64-bit word per cycle, which amounts to a bandwidth of 5.33 GB/s per port. In case of detected failures, ports in the network can be bypassed without interrupting operation of the system.
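As a quick check of these latency formulas, the following lines (plain Python, no MTA specifics) evaluate the quoted worst-case and average figures for a few machine sizes.

```python
# Evaluating the latency figures quoted above (simple arithmetic, not vendor data):
# worst case 4.5 * p**(1/3) cycles, average 2.25 * p**(1/3) cycles.

def network_latency(p):
    side = p ** (1 / 3)                 # processors per cube side
    return 4.5 * side, 2.25 * side      # (worst case, average) in cycles

for p in (16, 64, 256):
    worst, avg = network_latency(p)
    print(f"p = {p:3d}: worst ~{worst:.0f} cycles, average ~{avg:.1f} cycles")
```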
Although the MTA should be able to run "dusty-deck" Fortran programs because
parallelism is automatically exploited as soon as an opportunity is detected
for multi-threading, it may be (and often is) worthwhile to explicitly control
the parallelism in the program and to take advantage of known data locality
occurrences. MTA provides handles for this in the form of library routines,
including synchronisation, barrier, and reduction operations on defined groups
of threads. Controlled and uncontrolled parallelism approaches may be freely
mixed. Furthermore, each variable has a full/empty bit associated with it which
can be used to control parallelism and synchronisation with almost zero
overhead. An interesting use of this property is made in DNA sequence alignment
(see [4]).
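The full/empty mechanism can be pictured with the small emulation below. It is a conceptual sketch in Python only: on the MTA the behaviour is provided by the hardware at almost no cost, whereas here it is mimicked with an ordinary condition variable.

```python
# Conceptual model only: the MTA implements full/empty bits in hardware with
# essentially no overhead; this class merely emulates the semantics (a read
# blocks until the cell is "full" and leaves it "empty", a write blocks until
# it is "empty" and leaves it "full") to show how the mechanism gives
# fine-grained producer/consumer synchronisation.

import threading

class SyncCell:
    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._value = None

    def write(self, value):            # waits for "empty", then sets "full"
        with self._cv:
            self._cv.wait_for(lambda: not self._full)
            self._value, self._full = value, True
            self._cv.notify_all()

    def read(self):                    # waits for "full", then sets "empty"
        with self._cv:
            self._cv.wait_for(lambda: self._full)
            value, self._full = self._value, False
            self._cv.notify_all()
            return value

# A producer and a consumer thread synchronise on every element without any
# explicit locks in the application code.
cell = SyncCell()
consumer = threading.Thread(target=lambda: [print(cell.read()) for _ in range(3)])
consumer.start()
for x in (1, 2, 3):
    cell.write(x)
consumer.join()
```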
A first MTA system with 28 processors (instead of the normal 32) was
installed at the Naval Research Lab, USA, in 2002.
Measured Performances:
The company has also delivered a 16-processor system to the San Diego Supercomputer Center. This system runs at a clock cycle of 4.4 ns instead of the once-planned 3 ns. Consequently, the peak performance of a processor is 450 Mflop/s. Using the EuroBen Benchmark, a performance of 388 Mflop/s out of 450 Mflop/s was found for an order-800 matrix-vector multiplication, an efficiency of 86%. For 1-D FFTs of up to 1 million elements a speed of 106 Mflop/s was found on 1 processor, and about the same speed on 4 processors due to an insufficient availability of parallel threads.
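These figures are mutually consistent, as the small calculation below shows; note that the roughly 2 flop/cycle value is inferred here from the quoted clock and peak performance and is not taken from Cray documentation.

```python
# Reproducing the quoted figures (simple arithmetic, not measured data).
clock_ns    = 4.4                          # clock cycle of the SDSC system
clock_mhz   = 1e3 / clock_ns               # ~227 MHz
peak_mflops = 450.0                        # quoted peak per processor
flops_per_cycle = peak_mflops / clock_mhz  # ~2 (inferred, not a Cray figure)

matvec_mflops = 388.0                      # EuroBen order-800 matrix-vector result
efficiency = matvec_mflops / peak_mflops   # ~0.86, i.e. the 86% quoted above

print(f"{clock_mhz:.0f} MHz, ~{flops_per_cycle:.1f} flop/cycle, "
      f"efficiency {efficiency:.0%}")
```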
Limited performance data from the NAS parallel benchmarks [27] are also given in [5], in which efficiencies of around 90% were attained on 4 processors, with speeds of 700--750 Mflop/s for the CG, FT, and MG kernels.
More recent performance experiments, which especially address the scalability of application codes, are described in [5].