ScaLAPACK is a library of high performance linear algebra routines for distributed memory MIMD computers. It is a continuation of the LAPACK project, which designed and produced analogous software for workstations, vector supercomputers, and shared memory parallel computers. Both libraries contain routines for solving systems of linear equations, least squares problems, and eigenvalue problems. The goals of both projects are efficiency (to run as fast as possible), scalability (as the problem size and number of processors grow), reliability (including error bounds), portability (across all important parallel machines), flexibility (so users can construct new routines from well-designed parts), and ease-of-use (by making LAPACK and ScaLAPACK look as similar as possible). Many of these goals, particularly portability, are aided by developing and promoting standards, especially for low-level communication and computation routines. We have been successful in attaining these goals, limiting most machine dependencies to two standard libraries called the BLAS, or Basic Linear Algebra Subroutines [16][14][7][6], and BLACS, or Basic Linear Algebra Communication Subroutines [10][8]. LAPACK and ScaLAPACK will run on any machine where the BLAS and the BLACS are available.
The first part of this paper presents the design of ScaLAPACK. After a brief discussion of the BLAS and LAPACK, the block cyclic data layout, the BLACS, the PBLAS (Parallel BLAS), and the algorithms used are discussed. We also outline the difficulties encountered in producing correct code for networks of heterogeneous processors; difficulties we believe are little recognized by other practitioners.
The second part of this paper presents a theoretical model of parallel computers dedicated to dense linear algebra: the Distributed Linear Algebra Machine (DLAM). This ideal model provides a convenient framework for developing parallel algorithms. Moreover, it can be applied to obtain theoretical performance bounds and to analyze the scalability and programmability of parallel algorithms.
Finally, the paper discusses the performance of ScaLAPACK. Extensive results on various platforms are presented. One of our goals is to model and predict the performance of each routine as a function of a few problem and machine parameters. We show how the DLAM can be used to express this function, identify performance bottlenecks during development, and help users to choose various implementation parameters (like the number of processors) to optimize performance. One interesting result is that for some algorithms, speed is not a monotonic increasing function of the number of processors. In other words, speed can be increased by letting some processors remain idle.