Parallel BLAS (version 2.0)

For details on the techniques used within the software, please refer to LAPACK Working Note 128: http://www.netlib.org/lapack/lawns/lawn128.ps

By design, most of the algorithms implemented in the ScaLAPACK software library rely heavily on a much smaller set of routines called the Parallel Basic Linear Algebra Subprograms (PBLAS). First, these routines enhance the clarity of the ScaLAPACK software: they greatly simplify the expression of high-level algorithms, such as the ScaLAPACK in-core and out-of-core linear system solvers and eigensolvers. Second, by focusing the developers' attention on a smaller set of elementary operations, highly efficient algorithms have been implemented within the PBLAS. As a consequence, the ScaLAPACK software achieves high performance on a large number of distributed-memory concurrent computers, ranging from a cluster or pile of PCs to the most powerful supercomputers available today.

Because the PBLAS have become a useful tool for developing and studying new parallel algorithms both within and outside the ScaLAPACK research activities, it was decided to increase their flexibility by removing various alignment restrictions and by supporting replicated operands. For this purpose, different algorithmic blocking techniques have been studied and developed. These techniques aim at using a distributed-memory hierarchy effectively, independently of the parameters of the data decomposition; in essence, they logically partition the computation to be performed. They have been successfully implemented within the PBLAS, so that high efficiency is achieved for nearly all possible operand shapes. In other words, the PBLAS have been made more flexible and truly usable by a larger community, while their efficiency has been maintained, so that these basic building blocks remain an important piece of the ScaLAPACK software effort.

Questions/Comments? scalapack@cs.utk.edu
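
To make the PBLAS calling sequence concrete, the sketch below shows a distributed matrix-matrix multiply (PDGEMM) over a 2 x 2 process grid. This example is not part of the PBLAS distribution: the grid shape, block sizes, and matrix dimensions are illustrative only, and the external prototypes assume the common convention of underscored Fortran entry points plus the C BLACS interface, which may differ on your installation.

#include <stdio.h>
#include <stdlib.h>

/* BLACS and ScaLAPACK entry points; exact prototypes vary by installation. */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ictxt, int what, int *val);
extern void Cblacs_gridinit(int *ictxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ictxt, int *nprow, int *npcol,
                            int *myrow, int *mycol);
extern void Cblacs_gridexit(int ictxt);
extern void Cblacs_exit(int status);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                      int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
extern void pdgemm_(char *transa, char *transb, int *m, int *n, int *k,
                    double *alpha, double *a, int *ia, int *ja, int *desca,
                    double *b, int *ib, int *jb, int *descb, double *beta,
                    double *c, int *ic, int *jc, int *descc);

int main(void) {
    int mypnum, nprocs, ictxt, myrow, mycol, info;
    int nprow = 2, npcol = 2;   /* illustrative 2 x 2 process grid           */
    int m = 8, n = 8, k = 8;    /* global sizes: C(m,n) := A(m,k) * B(k,n)   */
    int mb = 2, nb = 2;         /* block-cyclic blocking factors             */
    int izero = 0, ione = 1;
    double one = 1.0, zero = 0.0;

    /* Launch with at least nprow*npcol processes, e.g. mpirun -np 4. */
    Cblacs_pinfo(&mypnum, &nprocs);
    Cblacs_get(-1, 0, &ictxt);                /* default system context      */
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    if (myrow >= 0 && mycol >= 0) {           /* this process is in the grid */
        /* Local dimensions of each block-cyclically distributed operand.   */
        int ap = numroc_(&m, &mb, &myrow, &izero, &nprow); /* rows of A, C  */
        int aq = numroc_(&k, &nb, &mycol, &izero, &npcol); /* cols of A     */
        int bp = numroc_(&k, &mb, &myrow, &izero, &nprow); /* rows of B     */
        int bq = numroc_(&n, &nb, &mycol, &izero, &npcol); /* cols of B, C  */
        int lldA = ap > 1 ? ap : 1, lldB = bp > 1 ? bp : 1;

        double *A = calloc((size_t)ap * aq, sizeof *A);
        double *B = calloc((size_t)bp * bq, sizeof *B);
        double *C = calloc((size_t)ap * bq, sizeof *C);

        /* Each descriptor records how its global matrix is distributed.    */
        int descA[9], descB[9], descC[9];
        descinit_(descA, &m, &k, &mb, &nb, &izero, &izero, &ictxt, &lldA, &info);
        descinit_(descB, &k, &n, &mb, &nb, &izero, &izero, &ictxt, &lldB, &info);
        descinit_(descC, &m, &n, &mb, &nb, &izero, &izero, &ictxt, &lldA, &info);

        /* C := 1.0 * A * B + 0.0 * C on the distributed matrices.          */
        pdgemm_("N", "N", &m, &n, &k, &one, A, &ione, &ione, descA,
                B, &ione, &ione, descB, &zero, C, &ione, &ione, descC);

        if (mypnum == 0) printf("pdgemm completed\n");
        free(A); free(B); free(C);
        Cblacs_gridexit(ictxt);
    }
    Cblacs_exit(0);
    return 0;
}

The array descriptors (descA, descB, descC) carry the block-cyclic distribution parameters of the global matrices; they are what allow a single PDGEMM call to operate on distributed operands, and they are the mechanism through which the alignment restrictions mentioned above were relaxed in version 2.0.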