ScaLAPACK requires that all global data (vectors or matrices) be distributed across the processes prior to invoking the ScaLAPACK routines. The storage schemes of global data structures in ScaLAPACK are conceptually the same as for LAPACK.
Global data is mapped to the local memories of the processes according to a specific data distribution. The data residing on each process is referred to as the local array.
The layout of an application's data within the hierarchical memory of a concurrent computer is critical in determining the performance and scalability of the parallel code. On shared-memory concurrent computers (or multiprocessors), LAPACK seeks to make efficient use of the hierarchical memory by maximizing data reuse; for example, on a cache-based computer LAPACK avoids reloading the cache too frequently. Specifically, LAPACK casts linear algebra computations in terms of block-oriented, matrix-matrix operations through the use of the Level 3 BLAS whenever possible. This approach maximizes the ratio of floating-point operations to memory references and allows data to be reused as much as possible while it resides in the highest levels of the memory hierarchy (e.g., vector registers or high-speed cache).
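As a rough illustration of the blocking idea (this is a sketch in C, not code from LAPACK or the BLAS; the routine name and the fixed block size NB are assumptions made for the example), the loop nest below computes C <- C + A*B one NB-by-NB block at a time, so that each block of A and B is used for on the order of NB^3 floating-point operations after being brought into cache once:

    #include <stddef.h>

    #define NB 64                    /* block size; tuned to the cache in practice */

    /* C <- C + A*B for n-by-n matrices stored in column-major order.
       Working one NB-by-NB block at a time means each block of A and B
       is reused for many floating-point operations once it is loaded.  */
    static void blocked_matmul(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t jj = 0; jj < n; jj += NB)
            for (size_t kk = 0; kk < n; kk += NB)
                for (size_t ii = 0; ii < n; ii += NB) {
                    size_t jmax = jj + NB < n ? jj + NB : n;   /* handle edge blocks */
                    size_t kmax = kk + NB < n ? kk + NB : n;
                    size_t imax = ii + NB < n ? ii + NB : n;
                    for (size_t j = jj; j < jmax; ++j)
                        for (size_t k = kk; k < kmax; ++k) {
                            double bkj = B[k + j * n];          /* B(k,j) */
                            for (size_t i = ii; i < imax; ++i)
                                C[i + j * n] += A[i + k * n] * bkj;   /* C(i,j) += A(i,k)*B(k,j) */
                        }
                }
    }

An optimized Level 3 BLAS applies the same principle far more aggressively (register blocking, multiple cache levels, vectorization); the point here is only that the block structure is what creates the data reuse.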
An analogous approach has been followed in the design of ScaLAPACK for distributed-memory machines. By using block-partitioned algorithms we seek to reduce the frequency with which data must be transferred between processes, thereby reducing the fixed startup cost (or latency) incurred each time a message is communicated.
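Under the simple linear model in which communicating a message of n items costs roughly alpha + beta*n time (alpha the fixed startup or latency cost, beta the per-item transfer cost), sending the same n items as k separate messages costs k*alpha + beta*n; block-partitioned algorithms move data between processes in fewer, larger messages and therefore pay the alpha term far less often.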
The ScaLAPACK routines for solving dense linear systems and eigenvalue problems assume that all global data has been distributed to the processes with a one-dimensional or two-dimensional block-cyclic data distribution. These distributions are a natural expression of the block-partitioned algorithms used in ScaLAPACK. The ScaLAPACK routines for solving band linear systems and tridiagonal systems assume that all global data has been distributed to the processes with a one-dimensional block data distribution. Each of these distributions is supported in the High Performance Fortran standard [91]. Each distribution is explained below and accompanied by the appropriate HPF directives.
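To make the block-cyclic mapping concrete in one dimension, the C sketch below (0-based indices; the function names are ours for illustration and not part of the ScaLAPACK interface, and the first block is assumed to be assigned to process 0) computes which process owns a given global index, where that entry lands in the owner's local array, and how many entries each process stores. The counting logic mirrors ScaLAPACK's NUMROC auxiliary, and the same formulas apply independently to each dimension of a two-dimensional block-cyclic distribution:

    #include <stdio.h>

    /* One dimension of a block-cyclic distribution: global size n, block
       size nb, nprocs processes, indices 0-based, first block on process 0. */

    /* Process coordinate that owns global index i. */
    static int owner(int i, int nb, int nprocs)       { return (i / nb) % nprocs; }

    /* Position of global index i within the owner's local array. */
    static int local_index(int i, int nb, int nprocs) { return (i / (nb * nprocs)) * nb + i % nb; }

    /* Number of entries stored by process p (same counting as NUMROC). */
    static int local_count(int n, int nb, int p, int nprocs)
    {
        int nblocks = n / nb;                   /* full blocks in this dimension      */
        int count   = (nblocks / nprocs) * nb;  /* full blocks every process receives */
        int extra   = nblocks % nprocs;         /* leftover full blocks               */
        if (p < extra)       count += nb;       /* this process gets one of them      */
        else if (p == extra) count += n % nb;   /* ... or the trailing partial block  */
        return count;
    }

    int main(void)
    {
        int n = 9, nb = 2, nprocs = 2;          /* example: 9 entries, block size 2, 2 processes */
        for (int i = 0; i < n; ++i)
            printf("global %d -> process %d, local %d\n",
                   i, owner(i, nb, nprocs), local_index(i, nb, nprocs));
        for (int p = 0; p < nprocs; ++p)
            printf("process %d stores %d entries\n", p, local_count(n, nb, p, nprocs));
        return 0;
    }

For this 9-entry example with block size 2 on 2 processes, process 0 owns global indices 0, 1, 4, 5, 8 (5 entries) and process 1 owns 2, 3, 6, 7 (4 entries).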
Our implementation of ScaLAPACK emphasizes the mathematical view of a matrix over its storage. In fact, the interface could even be reused for a block data distribution that does not fit the block-cyclic scheme.