Machine type | Distributed-memory multi-processor
---|---
Models | nCUBE 3
Operating system | Internal OS transparent to the user, SunOS (Sun's Unix variant) on the front-end system
Connection structure | Hypercube
Compilers | Extended Fortran 77, ANSI C, C++
System parameters:
Model | nCUBE 3
---|---
Clock cycle | 10 ns
Theor. peak performance |
Per proc. (64-bit) | 100 Mflop/s
Maximal (64-bit) | 1024 Gflop/s
Main memory | <=1 TB
Memory/node | <=1 GB
Communication bandwidth |
Point-to-point | 115 MB/s
No. of processors | 8-10,240
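As a cross-check of these figures, the 10 ns clock cycle and the 100 Mflop/s per-processor peak are consistent with one 64-bit floating-point result per cycle, and the 1024 Gflop/s maximum then corresponds to 10,240 processors. A minimal Python sketch of this arithmetic (the one-result-per-cycle rate is an inference, not a figure from the table):

```python
# Cross-check of the nCUBE 3 parameter table (illustrative only).
# Assumption (not stated in the table): one 64-bit flop per clock cycle.

clock_cycle_ns = 10                      # clock cycle from the table
clock_rate_mhz = 1e3 / clock_cycle_ns    # 10 ns cycle -> 100 MHz

flops_per_cycle = 1                      # assumed
peak_per_proc_mflops = clock_rate_mhz * flops_per_cycle   # 100 Mflop/s

max_peak_gflops = 1024                   # maximal peak from the table
max_procs = max_peak_gflops * 1e3 / peak_per_proc_mflops  # 10,240 processors

print(f"per-processor peak: {peak_per_proc_mflops:.0f} Mflop/s")
print(f"processors needed for {max_peak_gflops} Gflop/s: {max_procs:.0f}")
```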
Remarks:
The nCUBE 3 is presently the only commercially available machine with a hypercube structure. The nCUBE 3 uses in-house developed processors, implemented in 0.5 µm CMOS, with a performance of 100 Mflop/s in 64-bit precision (in contrast to the former nCUBE 2S model, the new processor is entirely 64-bit wide). The node processor has 8 KB instruction and data caches, both 2-way set-associative. Furthermore, each processor has miss and write buffers, four operands deep, which allow for four cache misses (or four deferred cache writes) before the data cache is disturbed. For the instruction cache an autoprefetch facility is implemented, while prefetching for the data cache is compiler-directed.
There are 16 DMA channels per node for inter-processor communication (8 for sending and 8 for receiving), while an additional channel is used for the distributed I/O system, which therefore has the nice property that it scales with the number of nodes. The speed of this I/O channel is 20 MB/s full-duplex. The communication latency is quite low, about 3 µs, while the single-channel bandwidth is 50 MB/s. By ``folding'' multiple channels a higher point-to-point bandwidth can be achieved. On 1024 processors with 6 ports folded, a bisectional bandwidth of 45.5 GB/s can be realised.
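Several of the communication figures follow from the hypercube topology itself: a 1024-node hypercube has dimension 10, 10 links per node, a diameter of 10 hops, and 512 links crossing any bisection. The Python sketch below computes these quantities; it only multiplies the bisection links by the 50 MB/s single-channel bandwidth and does not attempt to reproduce the 45.5 GB/s figure, which depends on how the 6 folded ports are counted.

```python
# Illustrative hypercube topology figures (no vendor data beyond the
# 50 MB/s single-channel bandwidth quoted in the text).
import math

def hypercube_figures(nodes, channel_mb_s=50.0):
    d = int(math.log2(nodes))           # hypercube dimension
    assert 2 ** d == nodes, "node count must be a power of two"
    links_per_node = d                  # one neighbour per dimension
    diameter = d                        # worst-case number of hops
    bisection_links = nodes // 2        # links crossing a half/half cut
    naive_bisection_gb_s = bisection_links * channel_mb_s / 1000.0
    return d, links_per_node, diameter, bisection_links, naive_bisection_gb_s

d, lpn, diam, blinks, bgb = hypercube_figures(1024)
print(f"dimension {d}, {lpn} links/node, diameter {diam} hops")
print(f"{blinks} bisection links, ~{bgb:.1f} GB/s with one 50 MB/s channel each")
```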
Apart from the fixed wormhole routing scheme that was already employed in the former nCUBE systems, a new fault-tolerant adaptive routing scheme is available. This scheme is essentially also wormhole routing, but with the additional constraint that after the first hop the distance to the target node must strictly decrease. Therefore no cycles can occur, and delivery of a message is guaranteed within a finite number of hops.
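To make the routing constraint concrete, the sketch below shows minimal routing on a hypercube in the abstract (it is not the nCUBE 3 router): each hop flips one address bit in which the current node still differs from the target, so the Hamming distance decreases by exactly one per hop and the message arrives after at most d hops, whichever differing bit an adaptive router picks.

```python
# Illustrative minimal (distance-decreasing) routing on a hypercube.
# Not the actual nCUBE 3 router: it only shows why the "distance must
# strictly decrease" rule excludes cycles and bounds the hop count.
import random

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def route(source, target, dimension):
    """Return a path from source to target, flipping one differing bit per hop."""
    path = [source]
    node = source
    while node != target:
        differing = [i for i in range(dimension) if (node ^ target) >> i & 1]
        node ^= 1 << random.choice(differing)   # adaptive: any differing bit will do
        path.append(node)                       # distance drops by exactly one
    return path

path = route(0b0000000000, 0b1011001101, dimension=10)   # 1024-node cube
assert len(path) - 1 == hamming_distance(path[0], path[-1])
print(" -> ".join(format(n, "010b") for n in path))
```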
Within the hypercube, sub-cubes can be allocated to accommodate multiple users. A queue of tasks is set up with (sub-)cubes of the required size. Programs may be written such that the sub-cube dimensions are determined just before execution.
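How such sub-cubes might be carved out can be pictured with a simple buddy-style allocator; the scheme and names below are hypothetical and only sketch the idea of splitting the free hypercube into halves until a sub-cube of the requested dimension remains. They do not describe the actual nCUBE allocation software.

```python
# Illustrative buddy-style sub-cube allocation in a hypercube
# (hypothetical scheme, not the nCUBE 3 operating system's allocator).

class SubcubeAllocator:
    def __init__(self, dimension):
        self.dimension = dimension
        self.free = {dimension: [0]}        # free sub-cubes: dim -> list of base addresses

    def allocate(self, dim):
        """Return the base address of a free sub-cube of the given dimension, or None."""
        for d in range(dim, self.dimension + 1):
            if self.free.get(d):
                base = self.free[d].pop()
                # split larger free cubes in half until the requested size remains
                while d > dim:
                    d -= 1
                    self.free.setdefault(d, []).append(base | (1 << d))  # buddy half
                return base
        return None                         # no sub-cube of that size is free

    def release(self, base, dim):
        # no buddy coalescing in this sketch; the sub-cube simply returns to the free list
        self.free.setdefault(dim, []).append(base)

alloc = SubcubeAllocator(10)                # 1024-node machine
job_a = alloc.allocate(8)                   # 256-node sub-cube
job_b = alloc.allocate(6)                   # 64-node sub-cube
print(f"job A at node {job_a}, job B at node {job_b}")
```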
Measured Performances: The first system is expected to be realised in the second quarter of 1995, so no real performance figures are available yet. Simulations showed a single-node speed of 96 Mflop/s on a matrix-matrix multiplication and of 40 Mflop/s on a matrix-vector multiplication.