|Machine type||RISC-based distributed-memory multi-processor|
|Operating system||HI-UX/MPP (Mach 3.0 microkernel)|
|Connection structure||Hyper crossbar (see remarks)|
|Compilers||Fortran 77, Fortran 90, Parallel Fortran, HPF, C, C++|
|Vendor's information Web page||www.hitachi.co.jp/Prod/comp/hpc/eng/sr1.html|
|Year of introduction||1996|
|Clock cycle||6.7 ns|
|Theor. peak performance, per proc. (64-bit)||300 Mflop/s|
|Memory per node||<= 256 MB|
|Memory, maximal||<= 256 GB|
|Communication bandwidth||300 MB/s (point-to-point)|
|No. of processors||32--1024|
The SR2201 is Hitachi's second generation of distributed-memory parallel systems. The node processor is again a Hitachi implementation of HP's PA-RISC architecture, running at a clock cycle of 6.7 ns. In contrast with its predecessor, the SR2001, the node processors in the SR2201 have been modified (in both hardware and instruction set) to support "pseudo vector processing". For operations on long vectors, this removes the detrimental effect of cache misses that often ruins the performance of RISC processors unless the code is carefully blocked and unrolled. First experiments have shown that this idea works quite well. The system supports distributed I/O, with the possibility to connect disks to every node.
As in the earlier SR2001, the connection structure is a hyper (3-D) crossbar which connects all nodes directly at high speed (300 MB/s point-to-point). In February 1996, two 1024-node systems were installed at the Universities of Tokyo and Tsukuba, respectively. The latter has been extended to the (non-commercial) CP-PACS system, which has 2048 processors but less memory (128 MB) per processor.
As in some other systems, such as the Cray T3E, Meiko CS-2, and NEC Cenju-3, one can directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation, this should allow distributed programs to be written with very low parallelisation overhead.
The following software products will be supported in addition to those already mentioned above: PVM, MPI, PARMACS, Linda, and FORGE90. In addition, numerical libraries such as NAG and IMSL are offered.
Measured Performances: A speed of 232.3 Gflop/s is reported for solving a dense linear system of order 138,240 on a 1,024-processor system. Results for the class A NAS parallel benchmarks on 256 processors show that the SR2201 runs at about 5.7 Gflop/s on the MG benchmark, 5.9 Gflop/s on BT, and 5.4 Gflop/s on SP.