Next: The IBM 9076 SP2 Up: Distributed-memory MIMD systems Previous: The Hitachi SR2201 series.

The HP/Convex Exemplar SPP-2000.

Machine type RISC-based distributed-memory multi-processor
Models SPP-2000K, SPP-2000S, SPP-2000X, Exemplar V-Class
Operating system 2000K/S/X: SPP-UX (Micro kernel Mach 3.0), Exemplar V: HP-UX (HP's usual Unix flavour)
Connection structure Ring
Compilers Fortran 77, Fortran 90, Parallel Fortran, HPF, C, C++
Vendor's information Web page 2000K/S/X: www.hp.com/wsg/products/servers/servpfo.html
Exemplar V-class: www.hp.com/gsy/products/vclass.html
Year of introduction 2000K/S/X: 1996, Exemplar V-Class: 1997.

System parameters:

Model                      SPP-2000K     SPP-2000S      SPP-2000X        Exemplar V
Clock cycle                5.55 ns       5.55 ns        5.55 ns          5 ns
Theor. peak performance
  Per proc. (64-bits)      720 Mflop/s   720 Mflop/s    720 Mflop/s      800 Mflop/s
  Maximal                  2.9 Gflop/s   11.5 Gflop/s   46.1 Gflop/s     12.8 Gflop/s
Main memory
  Memory/node              <= 1 GB       <= 1 GB        <= 1 GB          <= 1 GB
  Memory/maximal           <= 4 GB       <= 16 GB       <= 64 GB         <= 16 GB
Communication bandwidth
  aggregate (see remarks)  3.84 GB/s     15.4 GB/s      15.4/3.84 GB/s   15.4 GB/s
No. of processors          1-4           4-16           16-64            1-16

Remarks:

The SPP-2000 systems are the successors of the SPP-1200/1600 family and differ significantly from that preceding generation. The SPP-2000K and S are shared-memory machines that connect up to 4 and 16 PA-RISC 8000 processors, respectively, by a crossbar. Each processor has a peak performance of 720 Mflop/s, and because the processors feature out-of-order execution of instructions, memory latency can be expected to be hidden or reduced in a good many cases. This should make the impact of cache misses much less severe. The data and instruction caches are large (1 MB each), which also helps to minimise cache misses.
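The peak figures quoted for these machines follow from the clock cycle and the number of processors. The sketch below reproduces that arithmetic. The value of 4 floating-point operations per cycle is an assumption inferred from the quoted 720 Mflop/s at a 5.55 ns cycle, not a figure stated in this section, and the function names are purely illustrative.

```python
# Theoretical peak = (flops per cycle) x (clock frequency).
# ASSUMPTION: 4 flops/cycle, inferred from 720 Mflop/s at a 5.55 ns cycle.
FLOPS_PER_CYCLE = 4


def peak_mflops(cycle_ns, flops_per_cycle=FLOPS_PER_CYCLE):
    """Peak rate of one processor in Mflop/s for a clock cycle given in ns."""
    clock_mhz = 1000.0 / cycle_ns  # 1 ns cycle corresponds to 1000 MHz
    return flops_per_cycle * clock_mhz


def system_peak_gflops(cycle_ns, n_procs):
    """Peak rate of a full system in Gflop/s."""
    return n_procs * peak_mflops(cycle_ns) / 1000.0


if __name__ == "__main__":
    systems = [
        ("SPP-2000K", 5.55, 4),
        ("SPP-2000S", 5.55, 16),
        ("SPP-2000X", 5.55, 64),
        ("Exemplar V", 5.0, 16),
    ]
    for name, cycle_ns, n_procs in systems:
        print(f"{name}: {peak_mflops(cycle_ns):.0f} Mflop/s per processor, "
              f"{system_peak_gflops(cycle_ns, n_procs):.1f} Gflop/s maximal")
```

Because 5.55 ns is a rounded cycle time, the computed per-processor value comes out slightly above 720 Mflop/s; the difference is rounding only.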

One SPP-2000S can be viewed as the successor of a hypernode in the earlier SPP-1200/SPP-1600 systems. As such the number of processors within a hypernode has doubled. Also the amount of memory per system has increased 8-fold from 8×256 MB to 16×1 GB. The internal aggregate bandwidth is 15.36 GB/s for the 2000S and 3.84 GB/s for the 2000K. I/O can be done at an aggregate rate of 960 MB/s.

As in the earlier SPP-1200/1600 systems, the hypernodes are connected by uni-directional SCI rings with an aggregate bandwidth of 3.84 GB/s. This makes the SPP-2000X a NUMA machine when it is operated in a shared-memory fashion. Note that, although the speed of computation has increased significantly with respect to the SPP-1200/1600 (by more than a factor of 2), the interconnection bandwidth did not grow.
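One way to see what the unchanged ring bandwidth means is to compare the bandwidth available per unit of peak computation inside a hypernode with that between hypernodes. The following is a rough sketch using only the figures quoted in this section (15.36 GB/s inside a 16-processor hypernode, 3.84 GB/s on the rings of a 64-processor system, 720 Mflop/s per processor). The bytes-per-flop ratio is an illustrative balance measure chosen here, not a quantity reported by the vendor.

```python
PEAK_PER_PROC_GFLOPS = 0.72  # 720 Mflop/s per processor, as quoted in the text


def bytes_per_flop(bandwidth_gb_s, n_procs):
    """Aggregate bandwidth divided by aggregate peak speed (bytes per flop)."""
    return bandwidth_gb_s / (n_procs * PEAK_PER_PROC_GFLOPS)


# Inside one 16-processor hypernode (crossbar, 15.36 GB/s aggregate).
inside_hypernode = bytes_per_flop(15.36, 16)

# Between hypernodes of a 64-processor system (SCI rings, 3.84 GB/s aggregate).
between_hypernodes = bytes_per_flop(3.84, 64)

if __name__ == "__main__":
    print(f"inside hypernode:   {inside_hypernode:.3f} bytes/flop")
    print(f"between hypernodes: {between_hypernodes:.3f} bytes/flop")
    print(f"ratio:              {inside_hypernode / between_hypernodes:.0f}x")
```

Under these figures the balance between hypernodes is an order of magnitude lower than inside a hypernode, which is why data placement matters when a program spans several hypernodes.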

The introduction of the new Exemplar V-class system is a bit confusing. Technically, there is very little difference from the SPP-2000S system, except that the processor is a PA-RISC 8200 with a 5 ns clock cycle. The other difference is the operating system: this is HP-UX, the same Unix flavour that is used on all other HP Unix platforms except the SPP-2000K/S/X systems. This enables the V-class systems to run all applications that are available for the HP systems, including commercial ones. From the presentation it seems that HP is positioning this system for the commercial market rather than for technical/scientific use.

The Exemplar programming environment that was available for the SPP-1200/1600 carries over to the SPP-2000K/S/X without changes. This environment includes a message-passing programming model (PVM and MPI) and a virtual shared-memory model, which gives the user a shared-memory view of the system. Of course the shared-memory model is not surprising for a symmetric multiprocessor machine like the SPP-2000S, but it remains valid on the SPP-2000X systems, which effectively cluster four SPP-2000S systems. The (virtual) shared-memory model supported by HP/Convex differs significantly in syntax from the models supported by other vendors. However, HP has agreed to support the OpenMP initiative ([12]) to standardise the shared-memory model across vendors.
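The difference between the two programming models can be illustrated with a small analogy. The sketch below uses Python threads and is in no way the Exemplar API: it uses neither PVM, MPI, nor the HP/Convex shared-memory directives, and all names are hypothetical. In the shared-memory variant the workers write into one array that every worker can address; in the message-passing variant each worker holds a private copy of its data and returns its result as an explicit message.

```python
import queue
import threading


def shared_memory_sum(data, n_workers):
    """Workers store partial sums in a list that all threads can address."""
    partial = [0] * n_workers

    def work(rank):
        # Each worker reads the shared data and writes only its own slot.
        partial[rank] = sum(data[rank::n_workers])

    threads = [threading.Thread(target=work, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(data, n_workers):
    """Workers own private slices and send partial sums as explicit messages."""
    mailbox = queue.Queue()

    def work(rank, local_data):
        # The worker sees only its private copy and communicates via the mailbox.
        mailbox.put((rank, sum(local_data)))

    threads = [
        threading.Thread(target=work, args=(r, list(data[r::n_workers])))
        for r in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get()[1] for _ in range(n_workers))


if __name__ == "__main__":
    values = list(range(100))
    print(shared_memory_sum(values, 4), message_passing_sum(values, 4))
```

Both variants compute the same result; what differs is whether data is reached by addressing it directly or by sending it, which is the distinction that matters on a NUMA system.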

Measured Performances: In [2] a speed of 27.6 Gflop/s is reported for a 64-processor system when solving a dense linear system of order 29,956. For the EuroBen mod2a matrix-vector multiplication benchmark a speed of 417 Mflop/s is found on 16 processors. This is, however, for straight Fortran 77 code with PVM and without the use of library routines. On a 16-processor V-class machine a speed of 9.2 Gflop/s was measured in solving a dense linear system of order 14,944.
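To put these measurements in perspective, they can be expressed as a fraction of theoretical peak. The sketch below does this with the per-processor peaks quoted earlier (720 Mflop/s for the SPP-2000, 800 Mflop/s for the V-class). That the 16-processor mod2a result was obtained on 720 Mflop/s processors is an assumption made here, since the text does not name the machine.

```python
def efficiency(measured_gflops, n_procs, peak_per_proc_mflops):
    """Measured speed as a fraction of the theoretical peak of the system."""
    peak_gflops = n_procs * peak_per_proc_mflops / 1000.0
    return measured_gflops / peak_gflops


# Dense linear system on a 64-processor SPP-2000 system.
linpack_x = efficiency(27.6, 64, 720.0)

# Dense linear system on a 16-processor V-class machine.
linpack_v = efficiency(9.2, 16, 800.0)

# EuroBen mod2a on 16 processors (ASSUMPTION: 720 Mflop/s processors).
mod2a = efficiency(0.417, 16, 720.0)

if __name__ == "__main__":
    print(f"64-proc linear solve:  {100 * linpack_x:.1f}% of peak")
    print(f"V-class linear solve:  {100 * linpack_v:.1f}% of peak")
    print(f"mod2a, 16 processors:  {100 * mod2a:.1f}% of peak")
```

The gap between the linear-solve figures and the mod2a figure is consistent with the remark that the latter used plain Fortran 77 with PVM and no library routines.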






Aad van der Steen
Thu Feb 12 15:54:05 MET 1998