In order to simplify the design of ScaLAPACK, and because the BLAS have proven to be very useful tools outside LAPACK, we chose to build a Parallel BLAS, or PBLAS, whose interface is as similar to the BLAS as possible. This decision has permitted the ScaLAPACK code to be quite similar, and sometimes nearly identical, to the analogous LAPACK code. Only one substantially new routine, matrix transposition, was added to the PBLAS, since this is a complicated operation in a distributed memory environment [3].
We hope that the PBLAS will provide a distributed memory standard, just as the BLAS have provided a shared memory standard. This would simplify and encourage the development of high performance and portable parallel numerical software, as well as providing manufacturers with a small set of routines to be optimized. The acceptance of the PBLAS requires reasonable compromises among competing goals of functionality and simplicity. These issues are discussed below.
The PBLAS operate on matrices distributed in a 2D block cyclic layout. Since such a data layout requires many parameters to fully describe the distributed matrix, we have chosen a more object-oriented approach, and encapsulated these parameters in an integer array called an array descriptor. An array descriptor includes
(1) the number of rows in the distributed matrix,
(2) the number of columns in the distributed matrix,
(3) the row block size (see section 2.5),
(4) the column block size (see section 2.5),
(5) the process row over which the first row of the matrix is distributed,
(6) the process column over which the first column of the matrix is distributed,
(7) the BLACS context, and
(8) the leading dimension of the local array storing the local blocks.
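For illustration, the following fragment sketches how such a descriptor might be assembled by hand for a matrix A whose first row and column are distributed over process row 0 and process column 0. The entry order follows the list above, and the scalar names (M, N, MB, NB, ICTXT, LLD_A) are assumptions of this sketch rather than part of the PBLAS specification; an actual implementation may provide a helper routine for this purpose.

      INTEGER            DESC_A( 8 )
      INTEGER            M, N, MB, NB, ICTXT, LLD_A
*     M, N   : number of rows and columns of the distributed matrix A
*     MB, NB : row and column block sizes of the 2D block cyclic layout
*     ICTXT  : BLACS context of the process grid over which A is spread
*     LLD_A  : leading dimension of the local array holding the local
*              blocks of A on this process
*     (the scalars above are assumed to have been set by the caller)
      DESC_A( 1 ) = M
      DESC_A( 2 ) = N
      DESC_A( 3 ) = MB
      DESC_A( 4 ) = NB
      DESC_A( 5 ) = 0
      DESC_A( 6 ) = 0
      DESC_A( 7 ) = ICTXT
      DESC_A( 8 ) = LLD_A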
As an example, here is a call to the BLAS double precision matrix multiplication routine DGEMM, and the corresponding call to the PBLAS routine PDGEMM; note how similar they are:
      CALL DGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A( IA, JA ), LDA, B( IB, JB ), LDB, BETA, C( IC, JC ), LDC )

      CALL PDGEMM( TRANSA, TRANSB, M, N, K, ALPHA, A, IA, JA, DESC_A, B, IB, JB, DESC_B, BETA, C, IC, JC, DESC_C )

DGEMM computes

C = BETA * C + ALPHA * op( A ) * op( B ),
where op(A) is either A or its transpose depending on TRANSA, op(B) is similar, op(A) is M-by-K, and op(B) is K-by-N. PDGEMM is the same, with the exception of the way in which submatrices are specified. To pass the submatrix starting at A(IA,JA) to DGEMM, for example, the actual argument corresponding to the formal argument A would simply be A(IA,JA). PDGEMM, on the other hand, needs to understand the global storage scheme of A to extract the correct submatrix, so IA and JA must be passed in separately. DESC_A is the array descriptor for A. The parameters describing the matrix operands B and C are analogous to those describing A. In a truly object-oriented environment, matrices and their descriptors would be synonymous. However, this would require language support and detract from portability.
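As a concrete, purely illustrative instance, consider updating the M-by-N submatrix of C starting at global position (IC,JC) with the product of the corresponding submatrices of A and B; the particular transpose options and the values of ALPHA and BETA below are assumptions chosen for this sketch.

*     Serial BLAS: the submatrices are addressed directly and their
*     leading dimensions LDA, LDB, LDC are passed explicitly.
      CALL DGEMM ( 'No transpose', 'No transpose', M, N, K, -1.0D0,
     $             A( IA, JA ), LDA, B( IB, JB ), LDB, 1.0D0,
     $             C( IC, JC ), LDC )
*     PBLAS: the global starting indices and the array descriptors
*     take the place of the submatrix references and leading
*     dimensions.
      CALL PDGEMM( 'No transpose', 'No transpose', M, N, K, -1.0D0,
     $             A, IA, JA, DESC_A, B, IB, JB, DESC_B, 1.0D0,
     $             C, IC, JC, DESC_C )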
Our implementation of the PBLAS emphasizes the mathematical view of a matrix over its storage. In fact, it is even possible to reuse our interface to implement the PBLAS for a different block data distribution that does not fit the block cyclic scheme.
The presence of a context associated with every distributed matrix provides the ability to have separate ``universes'' of message passing. The use of separate communication contexts by distinct libraries (or distinct library invocations) such as the PBLAS insulates communication internal to the library from external communication. When more than one descriptor array is present in the argument list of a routine in the PBLAS, the individual BLACS context entries must be equal. In other words, the PBLAS do not perform ``inter-context'' operations.
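As a sketch of how such a context might be created before any descriptors are built, the following fragment uses the standard BLACS grid initialization calls; the grid dimensions NPROW and NPCOL are assumed to have been chosen by the caller.

      INTEGER            ICTXT, NPROW, NPCOL, MYROW, MYCOL
*     Obtain a default system context and create an NPROW-by-NPCOL
*     process grid from it.  ICTXT then identifies this grid and is
*     the value stored in entry (7) of every descriptor for a matrix
*     distributed over the grid.
      CALL BLACS_GET( -1, 0, ICTXT )
      CALL BLACS_GRIDINIT( ICTXT, 'Row', NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*     Every descriptor passed in a single PBLAS call must carry this
*     same context entry; the PBLAS do not operate across contexts.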
We have not included specialized routines to take advantage of packed storage schemes for symmetric, Hermitian, or triangular matrices, nor of compact storage schemes for banded matrices.