The nCUBE 3.

Machine type	Distributed-memory multi-processor
Models	nCUBE 3
Operating system	Internal OS transparent to the user, SunOS (Sun's Unix variant) on the front-end system
Connection structure	Hypercube
Compilers	Extended Fortran 77, ANSI C, C++

System parameters:

Model	nCUBE 3
Clock cycle	10 ns
Theor. peak performance
Per Proc. (64-bits)	100 Mflop/s
Maximal(64-bits)	1024 Gflop/s
Main memory	<=1 TB
Memory/node	<=1 GB
Communication bandwidth
Point-to-point	115 MB/s
No. of processors	8-10244

Remarks:

The nCUBE 3 is presently the only commercially available machine with a hypercube structure. The nCUBE 3 uses processors developed in-house and implemented in 0.5 µm CMOS, which have a performance of 100 Mflop/s in 64-bit precision (in contrast to the former 2S model, the new processor is entirely 64 bits wide). The node processor has 8 KB instruction and data caches, both 2-way set associative. Furthermore, each processor has miss and write buffers that are four operands deep, which allows for four cache misses (or four deferred cache writes) before the data cache is disturbed.

There are 16 outward DMA channels per node (8 send and 8 receive) for inter-processor communication, while an additional one is used for the distributed I/O system, which therefore has the nice property that it scales with the number of nodes. The speed of these I/O nodes is 20 MB/s full-duplex. The communication latency is quite low, about 3 µs, while the single-channel bandwidth is 50 MB/s. By "folding" multiple channels, higher point-to-point bandwidth can be achieved. On 1024 processors with 6 ports folded, a bisectional bandwidth of 45.5 GB/s can be realised. For the instruction cache an autoprefetch facility is implemented, while prefetch for the data cache is compiler directed.
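To give a feel for what the quoted latency and channel bandwidth mean for message passing, the following minimal sketch uses the common first-order model "time = latency + size / bandwidth". The function names are hypothetical, MB is assumed to mean 10^6 bytes, and the linear scaling of bandwidth with the number of folded channels is an idealisation, not a documented property of the machine.

```python
LATENCY_S = 3e-6        # about 3 µs start-up latency (figure quoted in the text)
CHANNEL_BW = 50e6       # 50 MB/s per channel (text); MB assumed to mean 10**6 bytes


def transfer_time(message_bytes, folded_channels=1):
    """Estimated seconds to move one message between two nodes.

    Assumes bandwidth grows linearly with the number of folded channels,
    which is an idealisation and not a documented nCUBE property.
    """
    if message_bytes < 0 or folded_channels < 1:
        raise ValueError("need message_bytes >= 0 and folded_channels >= 1")
    return LATENCY_S + message_bytes / (folded_channels * CHANNEL_BW)


def half_performance_length(folded_channels=1):
    """Message size (bytes) at which latency and transmission time are equal."""
    return LATENCY_S * folded_channels * CHANNEL_BW


if __name__ == "__main__":
    for size in (100, 10_000, 1_000_000):
        print(size, transfer_time(size), transfer_time(size, folded_channels=2))
```

Under these assumptions a message of roughly 150 bytes on a single channel spends as much time in start-up as in transit, so folding channels mainly pays off for messages well above that size.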

Apart from the fixed wormhole routing scheme that was already employed in the former nCUBE systems, a new fault-tolerant adaptive routing scheme is available. This scheme is also essentially wormhole routing, but with the additional constraint that after the first hop the distance to the target node must strictly decrease. Therefore, no cycles can occur and delivery of a message is guaranteed within a finite number of hops.
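The path constraint of the two schemes can be illustrated with a small sketch on node addresses, where two nodes are neighbours when their binary addresses differ in one bit and the distance is the Hamming distance. Everything here is an assumption made for illustration: the function names, the lowest-bit-first order of the fixed route, the representation of faulty links, and the backtracking search that stands in for the hardware's actual choice of hop. Wormhole flow control itself is not modelled.

```python
def hamming(a, b):
    """Number of hypercube hops between nodes a and b on a minimal path."""
    return bin(a ^ b).count("1")


def fixed_route(src, dst, dim):
    """Fixed route: correct differing address bits from lowest to highest."""
    path, node = [src], src
    for bit in range(dim):
        mask = 1 << bit
        if (node ^ dst) & mask:
            node ^= mask
            path.append(node)
    return path


def adaptive_route(src, dst, dim, broken=frozenset()):
    """Route avoiding broken links; `broken` holds frozenset({a, b}) pairs.

    The first hop may go to any neighbour; every later hop must strictly
    reduce the distance to dst. Returns None if no such path exists.
    """
    def usable(a, b):
        return frozenset((a, b)) not in broken

    def descend(node):
        # Only hops that flip a bit still differing from dst are allowed.
        if node == dst:
            return [node]
        for bit in range(dim):
            mask = 1 << bit
            if (node ^ dst) & mask and usable(node, node ^ mask):
                rest = descend(node ^ mask)
                if rest is not None:
                    return [node] + rest
        return None

    if src == dst:
        return [src]
    # Prefer first hops that already reduce the distance, then the others.
    first_bits = sorted(range(dim), key=lambda b: not ((src ^ dst) >> b) & 1)
    for bit in first_bits:
        nxt = src ^ (1 << bit)
        if usable(src, nxt):
            rest = descend(nxt)
            if rest is not None:
                return [src] + rest
    return None
```

Because every hop after the first flips a bit that still differs from the target, a route found this way has at most one hop more than the cube dimension, which is the finite bound the text refers to.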

Within the hypercube, sub-cubes can be allocated to accommodate more users. A queue of tasks is set up with (sub-)cubes of the required size. Programs may be written to determine the sub-cube dimensions just before execution.
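As an illustration of how a program might work out the sub-cube size it needs, the sketch below computes the smallest cube dimension that holds a given number of nodes and lists the node addresses of an aligned sub-cube. The function names and the aligned-block addressing are assumptions for the example and are not taken from the nCUBE system software.

```python
def subcube_dimension(nodes_needed):
    """Smallest d such that a d-dimensional cube (2**d nodes) is large enough."""
    if nodes_needed < 1:
        raise ValueError("at least one node is required")
    return (nodes_needed - 1).bit_length()


def subcube_nodes(base, dim):
    """Addresses of the sub-cube in which the low `dim` address bits vary.

    `base` must have its low `dim` bits clear (an aligned block); this
    addressing convention is an assumption made for the example.
    """
    if base & ((1 << dim) - 1):
        raise ValueError("base address is not aligned to the sub-cube size")
    return [base | offset for offset in range(1 << dim)]


if __name__ == "__main__":
    d = subcube_dimension(6)          # 6 nodes requested
    print(d, subcube_nodes(8, d))     # dimension and the nodes of one such sub-cube
```

A request for a node count that is not a power of two is rounded up to the next full sub-cube, since only complete cubes keep the hypercube neighbour structure intact.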

Measured Performances: The first system is expected to be realised in the 2nd quarter of 1995, so no real performance figures are available. Simulations showed a single node speed of 96 Mflop/s on a matrix-matrix multiply and of 40 Mflop/s on a matrix-vector multiply.

Jack Dongarra
Sat Feb 10 15:12:38 EST 1996