In 1987 Dongarra, Du Croz, Duff and Hammarling wrote an article in the ACM Trans. Math. Soft. (Vol. 16, no. 1, page 1) defining and proposing a set of Level 3 Basic Linear Algebra Subprograms. That proposal logically concluded a period of reflection and discussion among the mathematical software community [24][21][12] to define a set of routines that would find wide application in software for numerical linear algebra and provide a useful tool for implementors and users. Because these subprograms and their predecessors - the Levels 1 and 2 BLAS - are an aid to clarity, portability, modularity and maintenance of software, they have been embraced by the community and have become a de facto standard for elementary linear algebra computations [11].
Many of the frequently used algorithms of numerical linear algebra can be implemented so that a majority of the computation is performed by calls to the Level 2 and Level 3 BLAS. By relying on these basic kernels, it is possible to develop portable and efficient software across a wide range of architectures, with emphasis on workstations, vector-processors and shared-memory computers, as has been done in LAPACK [2].
As opposed to shared-memory systems, distributed-memory computers differ significantly from the software point of view. The underlying interconnection network as well as the vendor supplied communication library are usually machine specific. The ongoing Message Passing Interface (MPI) standardization effort [17] will undoubtly be of great benefit to the user community. Nevertheless, a large variety of distributed-memory systems still exists and this motivated the development of a set of portable communication subprograms well suited for linear algebra computations: the Basic Linear Algebra Communication Subprograms (BLACS) [27][14]. In addition to defining a portable interface the BLACS also provide the correct level of abstraction. They allow the software writer to focus on performing message passing on subsections of matrices rather than at low level byte transfers.
There has been much interest recently in developing parallel versions of the BLAS for distributed memory computers [16][15][3][1]. Some of this research proposed parallelizing the BLAS, and some implemented a few important BLAS routines, such as matrix-matrix multiplication. Almost ten years after the very successful BLAS were proposed, we are in a position to define and implement a set of Basic Linear Algebra Subprograms for distributed-memory computers with similar functionality as their sequential predecessors. The proposed set of routines that constitute the Parallel Basic Linear Algebra Subprograms results from the adaptation to distributed memory computers and reuse of the design decisions made for the BLAS. The local computations within a process are performed by the BLAS, while the communication operations are handled by the BLACS.
In an effort to simplify the parallelization of serial codes implemented on top of the BLAS, the PBLAS proposed here are targeted at vector-vector, matrix-vector and matrix-matrix operations. The last section illustrates how some common algorithms can be implemented by calls to the proposed routines. There is certainly considerable evidence for the efficiency of such algorithms on various machines [18]. Such implementations are portable and efficient across a wide variety of distributed memory MIMD computers, ranging from a heterogeneous network of workstations to a statically connected set of identical processors, provided that efficient machine-specific BLAS and BLACS are available.
The scope of this proposal is limited. First, the set of routines described in this paper constitutes an extended proper subset of the BLAS. For instance, this proposal does not contain vector rotation routines or dedicated subprograms for banded or packed matrices. A matrix transposition routine has been added to the Level 3 subprograms since this operation is much more complex to perform and implement on distributed-memory computers. Second, this proposal does not include routines for matrix factorizations or reductions; these are covered by the ScaLAPACK (Scalable Linear Algebra PACKage) project [7][6]. A reference implementation version of the PBLAS is available on netlib (http://www.netlib.org). Vendors can then supply system optimized versions of the BLAS, the BLACS and eventually the PBLAS. It is our hope that this proposal will initiate discussions among the computer science community so that this project will best reflect its needs.
This proposal is intended primarily for software developers and to a lesser extent for experienced applications programmers. The details of this proposal are concerned with defining a set of subroutines for use in FORTRAN 77 and C programs. However, the essential features of this standard should be easily adaptable to other programming languages. We have attempted to pave the way for such a future evolution by respecting the driving concepts of the HPF [23] and MPI [17] projects.