ScaLAPACK requires that all global data (vectors or matrices) be distributed across the processes prior to invoking the ScaLAPACK routines. The storage schemes of global data structures in ScaLAPACK are conceptually the same as for LAPACK.

Global data is mapped to the local memories of the processes according to a specific data distribution. The portion of the data held by a given process is referred to as its **local array**.

The layout of an application's data within the hierarchical memory of a concurrent computer is critical in determining the performance and scalability of the parallel code. On shared-memory concurrent computers (or *multiprocessors*), LAPACK seeks to make efficient use of the hierarchical memory by maximizing data reuse; for example, on a cache-based computer LAPACK avoids reloading the cache too frequently. Specifically, LAPACK casts linear algebra computations in terms of block-oriented, matrix-matrix operations through the use of the Level 3 BLAS whenever possible. This approach generally maximizes the ratio of floating-point operations to memory references, and it enables as much reuse of data as possible while that data is stored in the highest levels of the memory hierarchy (e.g., vector registers or high-speed cache).
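As an illustrative (and deliberately simplified) sketch of this blocking idea, the C fragment below accumulates a matrix product block by block, so that each NB-by-NB block is reused many times from cache before being evicted. In practice LAPACK obtains this effect by calling an optimized Level 3 BLAS routine such as DGEMM rather than hand-written loops; the order N, the block size NB, and the name `blocked_gemm` are illustrative choices here, not part of LAPACK.

```c
#include <stdio.h>

#define N  256   /* matrix order (assumed for illustration)        */
#define NB 32    /* block size, tuned to cache size in practice    */

/* C += A*B on square N x N matrices, processed in NB x NB blocks so
 * that each block of A and B stays cache-resident while it is reused
 * against an entire block of C. */
static void blocked_gemm(const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += NB)
        for (int kk = 0; kk < N; kk += NB)
            for (int jj = 0; jj < N; jj += NB)
                for (int i = ii; i < ii + NB; i++)
                    for (int k = kk; k < kk + NB; k++) {
                        double aik = A[i * N + k];
                        for (int j = jj; j < jj + NB; j++)
                            C[i * N + j] += aik * B[k * N + j];
                    }
}

int main(void)
{
    static double A[N * N], B[N * N], C[N * N];
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    blocked_gemm(A, B, C);
    printf("C[0][0] = %g (expect %g)\n", C[0], 2.0 * N);
    return 0;
}
```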

An analogous approach has been followed in the design of ScaLAPACK for distributed-memory machines. By using block-partitioned algorithms we seek to reduce the frequency with which data must be transferred between processes, thereby reducing the fixed startup cost (or latency) incurred each time a message is communicated.
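To make the saving concrete, consider the linear communication-cost model commonly used in the distributed-memory literature (an assumption here, not a statement from this section): transmitting a message of n items costs roughly α + nβ, where α is the per-message startup cost (latency) and β the per-item transfer time. Sending k separate messages of n items then costs k(α + nβ), whereas aggregating them into one message of kn items costs α + knβ, a saving of (k − 1)α in startup overhead. Block-partitioned algorithms achieve precisely this aggregation by communicating whole blocks at a time.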

The ScaLAPACK routines for solving dense linear systems and eigenvalue problems assume that all global data has been distributed to the processes with a one-dimensional or two-dimensional block-cyclic data distribution. This distribution is a natural expression of the block-partitioned algorithms present in ScaLAPACK. The ScaLAPACK routines for solving band linear systems and tridiagonal systems assume that all global data has been distributed to the processes with a one-dimensional block data distribution. Each of these distributions is supported in the High Performance Fortran standard [91]. Each distribution is explained below, accompanied by the appropriate HPF directives.
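To make the block-cyclic mapping concrete, here is a minimal C sketch; the function name and the zero-based, process-0-first conventions are assumptions for illustration (ScaLAPACK itself ships analogous index-translation utilities in its TOOLS library). A two-dimensional block-cyclic distribution applies the same one-dimensional map independently to the row index over the process-grid rows and to the column index over the grid columns.

```c
#include <stdio.h>

/* Map a 0-based global index g to (owner process, 0-based local index)
 * under a 1-D block-cyclic distribution with block size nb over p
 * processes, assuming the first block resides on process 0.  Blocks
 * are dealt out to the processes cyclically, like cards. */
static void block_cyclic_map(int g, int nb, int p, int *owner, int *local)
{
    int block = g / nb;                   /* global block containing g   */
    *owner = block % p;                   /* cyclic assignment of blocks */
    *local = (block / p) * nb + g % nb;   /* offset within that process  */
}

int main(void)
{
    /* Distribute 10 elements in blocks of 2 over 3 processes. */
    for (int g = 0; g < 10; g++) {
        int owner, local;
        block_cyclic_map(g, 2, 3, &owner, &local);
        printf("global %d -> process %d, local %d\n", g, owner, local);
    }
    return 0;
}
```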

Our implementation of ScaLAPACK emphasizes the mathematical view of a matrix over its storage. In fact, it is even possible to reuse our interface for a different block data distribution that does not fit the block-cyclic scheme.
