Machine type | Distributed-memory vector multi-processor |
---|---|
Models | VX, VPP300, VPP700 |
Operating system | UXP/V (a V5.4 based variant of Unix) |
Connection structure | Full distributed crossbar |
Compilers | Fortran 90/VP (Fortran 90 Vector compiler), Fortran 90/VPP (Fortran 90 Vector Parallel compiler), C/VP (C Vector compiler), C, C++ |
Vendor's information Web page | http://www.fujitsu.co.jp/hypertext/Products/Info_process/hpc/vx-e/ |
Year of introduction | VX, VPP300: 1995, VPP700: 1996 |
System parameters:
Model | VX | VPP300 | VPP700 |
---|---|---|---|
Clock cycle | 7 ns | 7 ns | 7 ns |
Theor. peak performance | |||
Per proc. (64-bit) | 2.28 Gflop/s | 2.28 Gflop/s | 2.28 Gflop/s |
Maximal (64-bit) | 9.2 Gflop/s | 36.5 Gflop/s | 583.6 Gflop/s |
Main memory | |||
Memory/node | <= 2 GB | <= 2 GB | <= 2 GB |
Memory/maximal | <= 8 GB | <= 32 GB | <= 512 GB |
Memory bandwidth | |||
Memory bandwidth/proc. | 18.2 GB/s | 18.2 GB/s | 18.2 GB/s |
Communication bandwidth | |||
Point-to-point | 570 MB/s | 570 MB/s | 570 MB/s |
No. of processors | 1-4 | 1-16 | 8-256 |
Remarks:
The VPP300 is a successor to the earlier VPP500. It is a much cheaper CMOS implementation of its predecessor, with some important differences. First, a VPX200 front-end system is no longer required. Second, the crossbar that connects the vector nodes is distributed. The cost of a system is therefore scalable: one does not need to buy a complete enclosure with the full crossbar for only a few nodes. The VX series is in fact a smaller version of the VPP300 with a maximum of 4 processors. Both the VX machines and the larger VPP300 systems are air-cooled.
The architecture of the VPP300 nodes is almost identical to that of the VPP500. Each node, called a Processing Element (PE), is a powerful vector processor in its own right (2.28 Gflop/s peak speed with a 7 ns clock). The vector processor is complemented by a RISC scalar processor with a peak speed of 285 Mflop/s. The scalar instruction format is 64 bits wide and may cause the execution of three operations in parallel. Each PE has a memory of up to 2 GB and communicates with its fellow PEs at a point-to-point speed of 570 MB/s. This communication is handled by separate Data Transfer Units (DTUs). To enhance the communication efficiency, the DTU has various transfer modes: contiguous, stride, sub-array, and indirect access. The DTUs also handle the translation of logical to physical PE-ids and of logical in-PE addresses to real addresses. When synchronisation is required, each PE can set its corresponding bit in the Synchronisation Register (SR). The value of the SR is broadcast to all PEs, and synchronisation has occurred when the SR has all its bits set for the relevant PEs. This method is comparable to the use of synchronisation registers in shared-memory vector processors and is much faster than synchronising via memory.
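The synchronisation scheme described above can be illustrated with a small software model. This is only a sketch: the class name `SyncRegister` and its methods are hypothetical, and the real mechanism is a hardware register whose value is broadcast to all PEs, not a Python object.

```python
class SyncRegister:
    """Toy model of a broadcast synchronisation register with one bit per PE.

    Hypothetical illustration only; names and interface are assumptions.
    """

    def __init__(self, num_pes):
        self.num_pes = num_pes
        self.bits = 0  # bit i is set once PE i has reached the sync point

    def arrive(self, pe_id):
        """PE `pe_id` sets its own bit to signal that it reached the sync point."""
        if not 0 <= pe_id < self.num_pes:
            raise ValueError("PE id out of range")
        self.bits |= 1 << pe_id

    def is_synchronised(self, relevant_pes):
        """True when every PE in `relevant_pes` has set its bit."""
        mask = 0
        for pe_id in relevant_pes:
            mask |= 1 << pe_id
        return (self.bits & mask) == mask


sr = SyncRegister(num_pes=4)
sr.arrive(0)
sr.arrive(2)
print(sr.is_synchronised([0, 2]))     # both relevant PEs have arrived
print(sr.is_synchronised([0, 1, 2]))  # PE 1 has not arrived yet
```

Because each PE only has to inspect a register value against a mask, no memory traffic is needed to decide that the barrier is complete, which is the point the text makes about speed.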
The VPP700 is a logical extension of the Fujitsu VPP300. While the processors in the latter machine are connected by a full crossbar, the maximum configuration of a VPP700 consists of 16 clusters of 16 processors connected by a level-2 crossbar. So, a fully configured VPP700 in fact consists of 16 full VPP300s. Because the diameter of the network is 2 (for the larger configurations) instead of 1 as in the VPP300, the communication time between processors will be slightly larger. At the moment the size of this worst-case increase is not exactly known to the author.
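The difference in network diameter can be made concrete with a short sketch. It assumes, purely for illustration, that PEs are numbered consecutively so that PEs 0-15 form the first cluster, 16-31 the second, and so on; the actual numbering scheme is not given in the text, and the function names are hypothetical.

```python
CLUSTER_SIZE = 16  # processors per cluster, as stated in the text
MAX_CLUSTERS = 16  # clusters in a fully configured VPP700


def cluster_of(pe_id):
    """Cluster index of a PE under the assumed consecutive numbering."""
    return pe_id // CLUSTER_SIZE


def crossbar_hops(src, dst):
    """Number of crossbar levels a message traverses between two PEs."""
    if src == dst:
        return 0
    # Same cluster: one crossbar, as in a VPP300.
    # Different clusters: the level-2 crossbar adds a second hop.
    return 1 if cluster_of(src) == cluster_of(dst) else 2


print(crossbar_hops(3, 12))    # same cluster
print(crossbar_hops(3, 200))   # different clusters
```

The maximum of `crossbar_hops` over all PE pairs is the network diameter: 1 for a single cluster and 2 once a second cluster is present.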
The Fortran compiler that comes with the VPP300/700 has extensions that enable data decomposition by compiler directives. In many cases this avoids restructuring of the code. The directives are different from those defined in the High Performance Fortran Proposal, but it should be easy to adapt them. Furthermore, it is possible to define parallel regions, barriers, etc., via directives, while there are several intrinsic functions to enquire about the number of processors and to execute POST/WAIT commands. A message-passing programming style is also possible by using the PVM or PARMACS communication libraries that are available. The software for the VPP700 and the VPP300 is exactly the same, and the systems can run each other's executables.
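As an illustration of what a data decomposition amounts to, the following sketch computes a simple block distribution of an index range over a number of PEs. It is written in Python rather than Fortran, it does not reproduce the Fujitsu directive syntax, and the function name `block_bounds` is hypothetical; block distribution is only one of several possible decompositions.

```python
def block_bounds(n, num_pes, pe_id):
    """Half-open index range [start, stop) owned by `pe_id` when `n`
    elements are split into contiguous blocks over `num_pes` PEs.

    The first `n % num_pes` PEs receive one extra element each.
    """
    base, extra = divmod(n, num_pes)
    start = pe_id * base + min(pe_id, extra)
    stop = start + base + (1 if pe_id < extra else 0)
    return start, stop


# Distribute 10 array elements over 4 PEs.
for pe in range(4):
    print(pe, block_bounds(10, 4, pe))
```

With directive-based decomposition the compiler derives such ranges itself, which is why the source code often needs no restructuring.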
Measured Performances: In [2] results for the VX, the VPP300, and the VPP700 are given. The speeds for solving dense linear systems of sizes 28,800, 59,200, and 111,360 were 8.6, 34.1, and 213 Gflop/s on a 4-proc. VX, a 16-proc. VPP300, and a 116-proc. VPP700, respectively.
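These figures can be set against the theoretical peak by a short calculation. The sketch below uses the per-processor peak of 2.28 Gflop/s from the system parameters and assumes that the aggregate peak is simply the per-processor peak times the number of processors used; the variable and function names are illustrative only.

```python
PEAK_PER_PROC = 2.28  # Gflop/s per processor, from the system parameters

# (system, processors used, measured Gflop/s) as quoted from [2]
runs = [
    ("VX", 4, 8.6),
    ("VPP300", 16, 34.1),
    ("VPP700", 116, 213.0),
]


def efficiency(procs, measured):
    """Fraction of the aggregate theoretical peak that was attained."""
    return measured / (procs * PEAK_PER_PROC)


for name, procs, measured in runs:
    print(f"{name}: {100 * efficiency(procs, measured):.1f}% of peak")
```

Under this assumption the VX and VPP300 runs reach roughly 94% of peak, and the 116-processor VPP700 run roughly 81%.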