On a distributed-memory concurrent computer built from RISC processors, such as the IBM SP2 or the Intel Paragon MP, the performance of the Level 2 PBLAS is limited by the rate at which data can be moved between the levels of the memory hierarchy within each processor. In other words, the performance of the Level 2 BLAS on each processor largely determines the performance of the equivalent distributed operation. Table 2 shows the performance results obtained by the general matrix-vector multiply PBLAS routine PDGEMV for square matrices of order N.
Table 2: Speed in Megaflop/s of the PBLAS matrix-vector multiply routine PDGEMV for matrices of order N with TRANS='N'.
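For reference, the distributed matrix-vector product benchmarked above is invoked through the calling sequence sketched below. This is a minimal sketch in C, assuming a ScaLAPACK/PBLAS installation whose Fortran entry point pdgemv_ is callable from C with trailing-underscore name mangling; exact prototypes depend on the installation. It also assumes that the BLACS process grid, the distributed matrix A, the vectors x and y, and their array descriptors have already been set up elsewhere. The wrapper name matvec and its argument list are illustrative, not part of the PBLAS.

/* Sketch of the PDGEMV calling sequence (assumed C prototype for the
 * Fortran entry point; all arguments are passed by reference).          */
extern void pdgemv_(char *trans, int *m, int *n, double *alpha,
                    double *a, int *ia, int *ja, int *desca,
                    double *x, int *ix, int *jx, int *descx, int *incx,
                    double *beta,
                    double *y, int *iy, int *jy, int *descy, int *incy);

/* y := alpha*A*x + beta*y on an N x N distributed matrix A (TRANS='N').
 * The descriptors desca, descx, descy are assumed to have been built
 * beforehand, e.g. with descinit_.                                       */
void matvec(int n, double *a, int *desca, double *x, int *descx,
            double *y, int *descy)
{
    int ione = 1;
    double alpha = 1.0, beta = 0.0;
    pdgemv_("N", &n, &n, &alpha, a, &ione, &ione, desca,
            x, &ione, &ione, descx, &ione,
            &beta, y, &ione, &ione, descy, &ione);
}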
This limitation is overcome by the Level 3 PBLAS, in which each processor performs matrix-matrix operations on its local data. Because such operations perform O(N^3) floating-point operations on O(N^2) data items, each matrix entry brought into a processor's cache can be reused many times, and the flop rate achieved by every processor for the distributed operation is therefore much higher. Table 3 shows the performance results obtained by the general matrix-matrix multiply PBLAS routine PDGEMM for square matrices of order N.
Table 3: Speed in Megaflop/s of the PBLAS matrix-matrix multiply routine PDGEMM for matrices of order N with TRANSA='N' and TRANSB='N'.
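The program below is a minimal sketch of a complete PDGEMM call of the kind benchmarked above, written in C under the following assumptions: the installation provides the usual C-callable BLACS wrappers (Cblacs_pinfo, Cblacs_get, Cblacs_gridinit, Cblacs_gridinfo, Cblacs_gridexit, Cblacs_exit) and the Fortran entry points pdgemm_, descinit_, and numroc_ with trailing-underscore mangling; the program is started on at least nprow*npcol processes; and the matrix order, block size, and process grid shape chosen here (N=1000, NB=64, 2x2) are illustrative only, not the configuration used for the measurements in Table 3.

/* Minimal PDGEMM sketch: distribute three N x N matrices block-cyclically
 * over a 2 x 2 process grid and compute C := A*B.                        */
#include <stdio.h>
#include <stdlib.h>

extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int icontxt, int what, int *val);
extern void Cblacs_gridinit(int *icontxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol,
                            int *myrow, int *mycol);
extern void Cblacs_gridexit(int icontxt);
extern void Cblacs_exit(int cont);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                      int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
extern void pdgemm_(char *transa, char *transb, int *m, int *n, int *k,
                    double *alpha, double *a, int *ia, int *ja, int *desca,
                    double *b, int *ib, int *jb, int *descb,
                    double *beta, double *c, int *ic, int *jc, int *descc);

int main(void)
{
    int iam, nprocs, ictxt, nprow = 2, npcol = 2, myrow, mycol;
    int n = 1000, nb = 64, izero = 0, ione = 1, info;
    int desca[9], descb[9], descc[9];

    Cblacs_pinfo(&iam, &nprocs);                  /* process id and count    */
    Cblacs_get(-1, 0, &ictxt);                    /* default system context  */
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol); /* 2 x 2 process grid      */
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);
    if (myrow < 0 || mycol < 0) { Cblacs_exit(0); return 0; } /* not in grid */

    /* Local dimensions of the block-cyclically distributed N x N matrices  */
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld  = mloc > 1 ? mloc : 1;

    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);
    descinit_(descb, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);
    descinit_(descc, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);

    double *a = malloc((size_t)lld * nloc * sizeof *a);
    double *b = malloc((size_t)lld * nloc * sizeof *b);
    double *c = malloc((size_t)lld * nloc * sizeof *c);
    for (long i = 0; i < (long)lld * nloc; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* C := alpha*A*B + beta*C on the distributed matrices                  */
    double alpha = 1.0, beta = 0.0;
    pdgemm_("N", "N", &n, &n, &n, &alpha, a, &ione, &ione, desca,
            b, &ione, &ione, descb, &beta, c, &ione, &ione, descc);

    if (iam == 0) printf("PDGEMM done for N = %d\n", n);

    free(a); free(b); free(c);
    Cblacs_gridexit(ictxt);
    Cblacs_exit(0);
    return 0;
}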