We have demonstrated that the LAPACK factorization routines can be parallelized fairly easily into the corresponding ScaLAPACK routines using a small set of low-level modules, namely the sequential BLAS, the BLACS, and the PBLAS. We have seen that the PBLAS are particularly useful for developing and implementing a parallel dense linear algebra library that relies on the block cyclic data distribution. In general, the Level 2 and 3 BLAS routines in the LAPACK code can be replaced on a one-for-one basis by the corresponding PBLAS routines. Parallel routines implemented with the PBLAS obtain good performance, since the computation performed by each process within the PBLAS routines can itself be performed using the assembly-coded sequential BLAS.
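As an illustration of the block cyclic data distribution referred to above, the following minimal sketch maps a global matrix index to the process that owns it and to the local index on that process. It assumes 0-based indices and that the first block resides on process 0. The function names `block_cyclic_owner` and `owner_2d` are hypothetical and are not part of ScaLAPACK, the PBLAS, or the BLACS.

```python
def block_cyclic_owner(g, nb, nprocs):
    """Map a 0-based global index g to (process, local_index).

    Assumes a one-dimensional block cyclic distribution with block
    size nb over nprocs processes, with the first block on process 0.
    """
    block = g // nb                 # index of the global block containing g
    proc = block % nprocs           # blocks are dealt out cyclically
    local_block = block // nprocs   # number of earlier blocks held by proc
    return proc, local_block * nb + g % nb


def owner_2d(i, j, mb, nb, prow, pcol):
    """Apply the 1-D mapping independently to rows and columns.

    mb x nb is the block size; prow x pcol is the process grid shape.
    Returns ((process_row, process_col), (local_row, local_col)).
    """
    pr, li = block_cyclic_owner(i, mb, prow)
    pc, lj = block_cyclic_owner(j, nb, pcol)
    return (pr, pc), (li, lj)
```

For example, with 2 x 2 blocks on a 2 x 2 process grid, `owner_2d(5, 2, 2, 2, 2, 2)` assigns global entry (5, 2) to process (0, 1) at local position (3, 0).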
In designing and implementing software libraries, there is a tradeoff between performance and software design considerations, such as modularity and clarity. As described in Section 4.1, it is possible in several places to combine messages to reduce the communication cost, and to replace high-level routines, such as the PBLAS, with calls to lower-level routines, such as the sequential BLAS and the BLACS. However, we have concluded that the performance gain is too small to justify the resulting loss of software modularity.
We have shown that the ScaLAPACK factorization routines have good performance and scalability on the Intel iPSC/860, Delta, and Paragon systems. Similar studies may be performed on other platforms to which the BLACS have been ported, including PVM, the TMC CM-5, the Cray T3D, and the IBM SP1 and SP2.
The ScaLAPACK routines are currently available through netlib for all numeric data types: single precision real, double precision real, single precision complex, and double precision complex. To obtain the routines and the ScaLAPACK Reference Manual [10], send the message ``send index from scalapack'' to netlib@ornl.gov.