Parallel BLAS (version 2.0)

For details on the techniques used within the software, please refer to LAPACK Working Note 128: http://www.netlib.org/lapack/lawns/lawn128.ps

By design, most of the algorithms implemented in the ScaLAPACK software library rely heavily on a much smaller set of routines called the Parallel Basic Linear Algebra Subprograms (PBLAS). First, these routines enhance the clarity of the ScaLAPACK software: they greatly simplify the expression of high-level algorithms, such as the ScaLAPACK in-core and out-of-core linear system solvers and eigensolvers. Second, by focusing the developers' attention on a smaller set of elementary operations, highly efficient algorithms have been implemented within the PBLAS. As a consequence, the ScaLAPACK software achieves high performance on a large number of distributed-memory concurrent computers, ranging from a cluster or pile of PCs to the most powerful supercomputers available today.

Because the PBLAS have become a useful tool for developing and studying new parallel algorithms both within and outside the ScaLAPACK research activities, it was decided to increase their flexibility by removing various alignment restrictions and by supporting replicated operands. For this purpose, different algorithmic blocking techniques have been studied and developed. These techniques aim at using a distributed-memory hierarchy effectively, independently of the parameters of the data decomposition; in essence, they logically partition the computation to be performed. They have been successfully implemented within the PBLAS, so that high efficiency is achieved for nearly all possible operand shapes. In other words, the PBLAS have been made more flexible and truly usable by a larger community, while their efficiency has been maintained, so that these basic building blocks remain an important piece of the ScaLAPACK software effort.

Questions/Comments? scalapack@cs.utk.edu
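
To make the PBLAS calling sequence concrete, the sketch below shows a distributed matrix-matrix multiply (PDGEMM) over a 2 x 2 process grid. This example is not part of the PBLAS distribution: the grid shape, block sizes, and matrix dimensions are illustrative only, and the external prototypes assume the common convention of underscored Fortran entry points plus the C BLACS interface, which may differ on your installation.

#include <stdio.h>
#include <stdlib.h>

/* BLACS and ScaLAPACK entry points; exact prototypes vary by installation. */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ictxt, int what, int *val);
extern void Cblacs_gridinit(int *ictxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ictxt, int *nprow, int *npcol,
                            int *myrow, int *mycol);
extern void Cblacs_gridexit(int ictxt);
extern void Cblacs_exit(int status);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                      int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
extern void pdgemm_(char *transa, char *transb, int *m, int *n, int *k,
                    double *alpha, double *a, int *ia, int *ja, int *desca,
                    double *b, int *ib, int *jb, int *descb, double *beta,
                    double *c, int *ic, int *jc, int *descc);

int main(void) {
    int mypnum, nprocs, ictxt, myrow, mycol, info;
    int nprow = 2, npcol = 2;   /* illustrative 2 x 2 process grid           */
    int m = 8, n = 8, k = 8;    /* global sizes: C(m,n) := A(m,k) * B(k,n)   */
    int mb = 2, nb = 2;         /* block-cyclic blocking factors             */
    int izero = 0, ione = 1;
    double one = 1.0, zero = 0.0;

    /* Launch with at least nprow*npcol processes, e.g. mpirun -np 4. */
    Cblacs_pinfo(&mypnum, &nprocs);
    Cblacs_get(-1, 0, &ictxt);                /* default system context      */
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    if (myrow >= 0 && mycol >= 0) {           /* this process is in the grid */
        /* Local dimensions of each block-cyclically distributed operand.   */
        int ap = numroc_(&m, &mb, &myrow, &izero, &nprow); /* rows of A, C  */
        int aq = numroc_(&k, &nb, &mycol, &izero, &npcol); /* cols of A     */
        int bp = numroc_(&k, &mb, &myrow, &izero, &nprow); /* rows of B     */
        int bq = numroc_(&n, &nb, &mycol, &izero, &npcol); /* cols of B, C  */
        int lldA = ap > 1 ? ap : 1, lldB = bp > 1 ? bp : 1;

        double *A = calloc((size_t)ap * aq, sizeof *A);
        double *B = calloc((size_t)bp * bq, sizeof *B);
        double *C = calloc((size_t)ap * bq, sizeof *C);

        /* Each descriptor records how its global matrix is distributed.    */
        int descA[9], descB[9], descC[9];
        descinit_(descA, &m, &k, &mb, &nb, &izero, &izero, &ictxt, &lldA, &info);
        descinit_(descB, &k, &n, &mb, &nb, &izero, &izero, &ictxt, &lldB, &info);
        descinit_(descC, &m, &n, &mb, &nb, &izero, &izero, &ictxt, &lldA, &info);

        /* C := 1.0 * A * B + 0.0 * C on the distributed matrices.          */
        pdgemm_("N", "N", &m, &n, &k, &one, A, &ione, &ione, descA,
                B, &ione, &ione, descB, &zero, C, &ione, &ione, descC);

        if (mypnum == 0) printf("pdgemm completed\n");
        free(A); free(B); free(C);
        Cblacs_gridexit(ictxt);
    }
    Cblacs_exit(0);
    return 0;
}

The array descriptors (descA, descB, descC) carry the block-cyclic distribution parameters of the global matrices; they are what allow a single PDGEMM call to operate on distributed operands, and they are the mechanism through which the alignment restrictions mentioned above were relaxed in version 2.0.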