On a distributed-memory concurrent computer built from RISC processors, such as the IBM SP2 or the Intel Paragon MP, the performance of the Level 2 PBLAS is limited by the rate at which data can be moved between the levels of the memory hierarchy within each processor. In other words, the performance of the Level 2 BLAS on each processor largely determines the performance of the equivalent distributed operation. Table 2 shows the performance results obtained by the general matrix-vector multiply PBLAS routine PDGEMV for square matrices of order N.
Table 2: Speed in Megaflop/s of the PBLAS matrix-vector multiply routine PDGEMV for matrices of order N with TRANS='N'.
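For reference, the distributed matrix-vector product benchmarked above is invoked through the calling sequence sketched below. This is a minimal sketch in C, assuming a ScaLAPACK/PBLAS installation whose Fortran entry point pdgemv_ is callable from C with trailing-underscore name mangling; exact prototypes depend on the installation. It also assumes that the BLACS process grid, the distributed matrix A, the vectors x and y, and their array descriptors have already been set up elsewhere. The wrapper name matvec and its argument list are illustrative, not part of the PBLAS.

/* Sketch of the PDGEMV calling sequence (assumed C prototype for the
 * Fortran entry point; all arguments are passed by reference).          */
extern void pdgemv_(char *trans, int *m, int *n, double *alpha,
                    double *a, int *ia, int *ja, int *desca,
                    double *x, int *ix, int *jx, int *descx, int *incx,
                    double *beta,
                    double *y, int *iy, int *jy, int *descy, int *incy);

/* y := alpha*A*x + beta*y on an N x N distributed matrix A (TRANS='N').
 * The descriptors desca, descx, descy are assumed to have been built
 * beforehand, e.g. with descinit_.                                       */
void matvec(int n, double *a, int *desca, double *x, int *descx,
            double *y, int *descy)
{
    int ione = 1;
    double alpha = 1.0, beta = 0.0;
    pdgemv_("N", &n, &n, &alpha, a, &ione, &ione, desca,
            x, &ione, &ione, descx, &ione,
            &beta, y, &ione, &ione, descy, &ione);
}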
This limitation is overcome by the Level 3 PBLAS, in which each processor performs matrix-matrix operations on its local data. Because such operations perform O(N^3) floating-point operations on O(N^2) data items, each matrix entry brought into a processor's cache can be reused many times, and the flop rate achieved by every processor for the distributed operation is therefore much higher. Table 3 shows the performance results obtained by the general matrix-matrix multiply PBLAS routine PDGEMM for square matrices of order N.
Table 3: Speed in Megaflop/s of the PBLAS matrix-matrix multiply routine PDGEMM for matrices of order N with TRANSA='N' and TRANSB='N'.
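The program below is a minimal sketch of a complete PDGEMM call of the kind benchmarked above, written in C under the following assumptions: the installation provides the usual C-callable BLACS wrappers (Cblacs_pinfo, Cblacs_get, Cblacs_gridinit, Cblacs_gridinfo, Cblacs_gridexit, Cblacs_exit) and the Fortran entry points pdgemm_, descinit_, and numroc_ with trailing-underscore mangling; the program is started on at least nprow*npcol processes; and the matrix order, block size, and process grid shape chosen here (N=1000, NB=64, 2x2) are illustrative only, not the configuration used for the measurements in Table 3.

/* Minimal PDGEMM sketch: distribute three N x N matrices block-cyclically
 * over a 2 x 2 process grid and compute C := A*B.                        */
#include <stdio.h>
#include <stdlib.h>

extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int icontxt, int what, int *val);
extern void Cblacs_gridinit(int *icontxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol,
                            int *myrow, int *mycol);
extern void Cblacs_gridexit(int icontxt);
extern void Cblacs_exit(int cont);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                      int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
extern void pdgemm_(char *transa, char *transb, int *m, int *n, int *k,
                    double *alpha, double *a, int *ia, int *ja, int *desca,
                    double *b, int *ib, int *jb, int *descb,
                    double *beta, double *c, int *ic, int *jc, int *descc);

int main(void)
{
    int iam, nprocs, ictxt, nprow = 2, npcol = 2, myrow, mycol;
    int n = 1000, nb = 64, izero = 0, ione = 1, info;
    int desca[9], descb[9], descc[9];

    Cblacs_pinfo(&iam, &nprocs);                  /* process id and count    */
    Cblacs_get(-1, 0, &ictxt);                    /* default system context  */
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol); /* 2 x 2 process grid      */
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);
    if (myrow < 0 || mycol < 0) { Cblacs_exit(0); return 0; } /* not in grid */

    /* Local dimensions of the block-cyclically distributed N x N matrices  */
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld  = mloc > 1 ? mloc : 1;

    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);
    descinit_(descb, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);
    descinit_(descc, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);

    double *a = malloc((size_t)lld * nloc * sizeof *a);
    double *b = malloc((size_t)lld * nloc * sizeof *b);
    double *c = malloc((size_t)lld * nloc * sizeof *c);
    for (long i = 0; i < (long)lld * nloc; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* C := alpha*A*B + beta*C on the distributed matrices                  */
    double alpha = 1.0, beta = 0.0;
    pdgemm_("N", "N", &n, &n, &n, &alpha, a, &ione, &ione, desca,
            b, &ione, &ione, descb, &beta, c, &ione, &ione, descc);

    if (iam == 0) printf("PDGEMM done for N = %d\n", n);

    free(a); free(b); free(c);
    Cblacs_gridexit(ictxt);
    Cblacs_exit(0);
    return 0;
}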