|Machine type||Distributed-memory multi-processor|
|Operating system||Internal OS transparent to the user; SunOS (Sun's Unix variant) on the front-end system|
|Compilers||Extended Fortran 77, ANSI C, C++|
|Clock cycle||10 ns|
|Theor. peak performance (per proc., 64-bit)||100 Mflop/s|
|Main memory||<=1 TB|
|No. of processors||8-1024|
The nCUBE 3 is presently the only commercially available machine with a hypercube structure. The nCUBE uses in-house developed processors, implemented in 0.5 µm CMOS, which have a performance of 100 Mflop/s in 64-bit precision (in contrast to the former nCUBE 2S model, the new processor is entirely 64 bits wide). The node processor has 8 KB instruction and data caches, both 2-way set-associative. Furthermore, each processor has miss and write buffers, each four operands deep, which allow four cache misses (or four deferred cache writes) before the data cache is disturbed.
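The hypercube structure mentioned above has a simple arithmetic characterisation: node labels in a d-dimensional cube are d-bit numbers, and two nodes are directly linked exactly when their labels differ in a single bit. A minimal sketch (illustrative only, not nCUBE-specific code):

```python
# Sketch of hypercube connectivity: in a d-cube, node labels are d-bit
# numbers and two nodes are linked iff their labels differ in one bit.

def neighbors(node: int, d: int) -> list[int]:
    """Return the d direct neighbors of `node` in a d-cube (XOR one bit)."""
    return [node ^ (1 << bit) for bit in range(d)]

# A 1024-node configuration corresponds to a 10-cube:
# every node has exactly 10 neighbors.
print(sorted(neighbors(0, 10)))  # powers of two: 1, 2, 4, ..., 512
```

This is why the network diameter and the per-node channel count both grow only logarithmically with the number of nodes.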
There are 16 outward DMA channels per node (8 send and 8 receive) for inter-processor communication, while an additional one is used for the distributed I/O system, which therefore has the nice property that it scales with the number of nodes. The speed of these I/O nodes is 20 MB/s full-duplex. The communication latency is quite low, about 3 µs, while the single-channel bandwidth is 50 MB/s. Higher point-to-point bandwidth can be achieved by ``folding'' multiple channels. For the instruction cache an autoprefetch facility is implemented, while prefetching for the data cache is compiler-directed. On 1024 processors with 6 ports folded, a bisectional bandwidth of 45.5 GB/s can be realised.
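The bisection figure can be put in perspective with a back-of-the-envelope model: cutting a d-cube into two halves of 2^(d-1) nodes severs exactly 2^(d-1) links. This sketch is illustrative; it shows that single channels alone would give about 25.6 GB/s for a 10-cube, so the quoted 45.5 GB/s additionally reflects the channel folding:

```python
# Back-of-the-envelope bisection bandwidth of a hypercube (illustrative).
# Cutting a d-cube into two 2^(d-1)-node halves severs exactly 2^(d-1) links.

def bisection_links(d: int) -> int:
    return 2 ** (d - 1)

def bisection_bandwidth_gb(d: int, link_mb_per_s: float) -> float:
    """Aggregate bandwidth across the cut, in GB/s (1 GB = 1000 MB here)."""
    return bisection_links(d) * link_mb_per_s / 1000.0

# For a 10-cube at 50 MB/s per channel this simple model gives 25.6 GB/s;
# the higher quoted figure comes from folding multiple channels per port.
print(bisection_bandwidth_gb(10, 50.0))  # → 25.6
```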
Apart from the fixed wormhole routing scheme that was already employed in the former nCUBE systems, a new fault-tolerant adaptive routing scheme is available. This scheme is essentially also wormhole routing, but with the additional constraint that after the first hop the distance to the target node must strictly decrease. Therefore, no cycles can occur, and delivery of a message is guaranteed within a finite number of hops.
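The distance-decreasing constraint can be sketched in a few lines. On a hypercube the distance between two nodes is the Hamming distance of their labels, so flipping any bit in which the current node still differs from the target shortens the remaining path by one; a message therefore arrives in at most d hops. This is an illustrative model, not the actual router logic:

```python
# Sketch of the distance-decreasing routing constraint (illustrative).
# On a hypercube the node-to-node distance is the Hamming distance of the
# labels, so each hop that flips a still-differing bit makes progress.

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def route(src: int, dst: int) -> list[int]:
    """One admissible path: at each step flip a bit that still differs.
    (A real adaptive router would pick among the differing bits based on
    channel availability; here we simply take the lowest one.)"""
    path, node = [src], src
    while node != dst:
        diff = node ^ dst
        node ^= diff & -diff          # flip lowest differing bit
        path.append(node)             # Hamming distance decreased by 1
    return path

print(route(0b0000, 0b1011))  # → [0, 1, 3, 11]: 3 hops for distance 3
```

Because every hop strictly reduces the Hamming distance, the path length equals the initial distance and no cycle can form, matching the guarantee described above.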
Within the hypercube, sub-cubes can be allocated to accommodate multiple users. A queue of tasks is set up with (sub-)cubes of the required size. Programs may be written such that the sub-cube dimensions are determined just before execution.
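Sub-cube allocation works because a sub-cube of dimension k is exactly the set of 2^k nodes sharing the same (d - k) high-order address bits. The following is a hypothetical buddy-style sketch (the actual nCUBE OS scheduler is not documented here):

```python
# Hypothetical sketch of sub-cube allocation inside a d-cube. A sub-cube of
# dimension k is the 2^k nodes sharing the same (d - k) high-order address
# bits, so disjoint size-aligned node ranges are valid sub-cubes.

def allocate(d: int, requests: list[int]) -> list[range]:
    """Greedily assign each requested sub-cube dimension k a node range.
    Raises MemoryError if the cube is exhausted."""
    cubes = []
    next_free = 0
    for k in sorted(requests, reverse=True):   # largest first limits fragmentation
        size = 1 << k
        # align the base address to the sub-cube size
        base = (next_free + size - 1) // size * size
        if base + size > 1 << d:
            raise MemoryError("cube exhausted")
        cubes.append(range(base, base + size))
        next_free = base + size
    return cubes

# A 3-cube and two 2-cubes fit exactly into a 16-node 4-cube:
print(allocate(4, [3, 2, 2]))  # → [range(0, 8), range(8, 12), range(12, 16)]
```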
Measured Performances: The first system is expected to be realised in the second quarter of 1995, so no real performance figures are available yet. Simulations showed a single-node speed of 96 Mflop/s on a matrix-matrix multiplication and of 40 Mflop/s on a matrix-vector multiplication.