ScaLAPACK is designed to give high efficiency on MIMD distributed memory concurrent supercomputers, such as the Intel Paragon, the IBM SP series, and the Cray T3 series. In addition, the software is designed so that it can be used on clusters of workstations connected through a network and in heterogeneous computing environments via PVM or MPI. Indeed, ScaLAPACK can run on any machine that supports either PVM or MPI. See Chapter 5 for some examples of the performance achieved by ScaLAPACK routines.
The ScaLAPACK strategy for combining efficiency with portability is to construct the software so that as much as possible of the computation is performed by calls to the Parallel Basic Linear Algebra Subprograms (PBLAS). The PBLAS [26, 104] perform global computation by relying on the Basic Linear Algebra Subprograms (BLAS) [93, 59, 57] for local computation and the Basic Linear Algebra Communication Subprograms (BLACS) [54, 113] for communication.
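To make this layering concrete, the following minimal sketch (it is not taken from the Guide) contrasts a local BLAS call with the corresponding global PBLAS call. The matrix order N = 4, the 1 x 1 process grid, and the block sizes are illustrative assumptions, chosen so that each global matrix coincides with the local array passed to the BLAS; on a larger grid the matrices would be distributed, and the descriptors created by DESCINIT would carry the distribution information.

      PROGRAM LAYERS
*     Sketch only: contrast a local BLAS call (DGEMM) with the
*     corresponding global PBLAS call (PDGEMM).  A 1 x 1 process grid
*     is assumed, so global matrices and local arrays coincide.
      INTEGER            N
      PARAMETER          ( N = 4 )
      INTEGER            I, J, INFO, ICTXT, NPROW, NPCOL, MYROW, MYCOL
      INTEGER            DESCA( 9 ), DESCB( 9 ), DESCC( 9 )
      DOUBLE PRECISION   A( N, N ), B( N, N ), C( N, N )
*
*     Obtain a BLACS context and map this (single) process onto a
*     1 x 1 grid; the BLACS handle all communication for the PBLAS.
      CALL BLACS_GET( -1, 0, ICTXT )
      CALL BLACS_GRIDINIT( ICTXT, 'Row-major', 1, 1 )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*
*     Generate some local data.
      DO 20 J = 1, N
         DO 10 I = 1, N
            A( I, J ) = DBLE( I+J )
            B( I, J ) = DBLE( I-J )
            C( I, J ) = 0.0D0
   10    CONTINUE
   20 CONTINUE
*
*     Local computation: the BLAS routine DGEMM forms C = A*B using
*     only the arrays held by this process.
      CALL DGEMM( 'N', 'N', N, N, N, 1.0D0, A, N, B, N, 0.0D0, C, N )
*
*     Global computation: the PBLAS routine PDGEMM performs the same
*     operation on distributed matrices; the descriptors created by
*     DESCINIT tell it how the matrices are laid out over the grid.
      CALL DESCINIT( DESCA, N, N, N, N, 0, 0, ICTXT, N, INFO )
      CALL DESCINIT( DESCB, N, N, N, N, 0, 0, ICTXT, N, INFO )
      CALL DESCINIT( DESCC, N, N, N, N, 0, 0, ICTXT, N, INFO )
      CALL PDGEMM( 'N', 'N', N, N, N, 1.0D0, A, 1, 1, DESCA, B, 1, 1,
     $             DESCB, 0.0D0, C, 1, 1, DESCC )
*
      CALL BLACS_GRIDEXIT( ICTXT )
      CALL BLACS_EXIT( 0 )
      END

On a larger process grid the PDGEMM call would look exactly the same; only the descriptors and the local array sizes would change, while the PBLAS internally call the BLAS for the local computation and the BLACS for the communication.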
The efficiency of ScaLAPACK software depends on the use of block-partitioned algorithms and on efficient implementations of the BLAS and the BLACS provided by computer vendors (and others) for their machines. Thus, the BLAS and the BLACS form a low-level interface between ScaLAPACK software and different machine architectures. Above this level, all of the ScaLAPACK software is portable.
The BLAS, the PBLAS, and the BLACS are not, strictly speaking, part of ScaLAPACK, although C code for the PBLAS is included in the ScaLAPACK distribution. Because the performance of the package depends on the BLAS and the BLACS being implemented efficiently, these two libraries are not included with ScaLAPACK; a machine-specific implementation of each should be used instead. If a machine-optimized version of the BLAS is not available, a Fortran 77 reference implementation of the BLAS can be obtained from netlib (see section 1.5). This code constitutes the "model implementation" [58, 56]. The model implementation of the BLAS is not expected to perform as well as a specially tuned implementation on most high-performance computers -- on some machines it may give much worse performance -- but it allows users to run ScaLAPACK codes on machines that do not offer any other implementation of the BLAS.
If a vendor-optimized version of the BLACS is not available for a specific architecture, efficiently ported versions of the BLACS are available from netlib. The BLACS have been ported to machine-specific message-passing libraries such as IBM's MPL and Intel's NX, as well as to the more generic PVM and MPI interfaces. The BLACS overhead has been shown to be negligible in [54]. Refer to the blacs directory on netlib for more details:
http://www.netlib.org/blacs/index.html
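As an illustration of how the BLACS are used, the following minimal sketch (it is not part of the BLACS or ScaLAPACK distributions) shows the grid setup and teardown calls that appear in essentially every ScaLAPACK program, together with a simple broadcast within each grid row. The roughly square NPROW x NPCOL grid shape and the broadcast of a single value are illustrative assumptions, and the sketch assumes the processes have already been started (as with the MPI BLACS); under PVM an additional BLACS_SETUP call would be needed to spawn them.

      PROGRAM GRID
*     Sketch only: the basic BLACS calls used by ScaLAPACK programs.
      INTEGER            IAM, NPROCS, ICTXT, NPROW, NPCOL, MYROW, MYCOL
      DOUBLE PRECISION   X
*
*     How many processes are there, and which one am I?
      CALL BLACS_PINFO( IAM, NPROCS )
*
*     Choose a roughly square NPROW x NPCOL grid (illustrative only).
      NPROW = INT( SQRT( DBLE( NPROCS ) ) )
      NPCOL = NPROCS / NPROW
*
*     Obtain a context, create the grid, and query my coordinates.
      CALL BLACS_GET( -1, 0, ICTXT )
      CALL BLACS_GRIDINIT( ICTXT, 'Row-major', NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*
*     Processes that did not fit on the grid are given MYROW = -1;
*     they skip the computation and simply exit.
      IF( MYROW.GE.0 .AND. MYROW.LT.NPROW .AND. MYCOL.LT.NPCOL ) THEN
*
*        The process in column 0 of each row broadcasts a value to the
*        other processes in its row; all communication goes through
*        the BLACS, independently of the underlying PVM, MPI, MPL, or
*        NX library.
         IF( MYCOL.EQ.0 ) THEN
            X = DBLE( MYROW )
            CALL DGEBS2D( ICTXT, 'Row', ' ', 1, 1, X, 1 )
         ELSE
            CALL DGEBR2D( ICTXT, 'Row', ' ', 1, 1, X, 1, MYROW, 0 )
         END IF
         WRITE( *, * ) 'Process (', MYROW, ',', MYCOL, ') received', X
*
         CALL BLACS_GRIDEXIT( ICTXT )
      END IF
      CALL BLACS_EXIT( 0 )
      END

A program of this form runs unchanged on top of PVM, MPI, MPL, or NX; only the BLACS library it is linked against changes.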