The total number of floating-point operations performed by most of the ScaLAPACK driver routines for dense matrices can be approximated by the quantity $C_f N^3$, where $C_f$ is a constant and $N$ is the order of the largest matrix operand. For solving linear equations or linear least squares, $C_f$ is a constant depending solely on the selected algorithm. The algorithms used to find eigenvalues and singular values are iterative; hence, for these operations, the constant $C_f$ truly depends on the input data as well. It is, however, customary or ``standard'' to consider the values of the constants $C_f$ for a fixed number of iterations. The ``standard'' constants $C_f$ range from 1/3 to 27, as shown in Table 4.
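As an illustration of the cubic operation count described above, the following minimal Python sketch evaluates the estimate $C_f N^3$ for two factorizations. The function name `estimated_flops` and the dictionary `STANDARD_CONSTANTS` are hypothetical, and the two constants are the usual textbook leading-order values for real matrices (1/3 for Cholesky, 2/3 for LU); they are assumptions chosen for illustration and are not copied from Table 4.

```python
from fractions import Fraction

# Leading-order flop constants for two real dense factorizations.
# Assumption: textbook values, not entries taken from Table 4.
STANDARD_CONSTANTS = {
    "cholesky": Fraction(1, 3),  # about N^3/3 flops
    "lu": Fraction(2, 3),        # about 2N^3/3 flops
}


def estimated_flops(constant, n):
    """Approximate flop count constant * n**3 for an order-n dense operand."""
    return constant * n ** 3


if __name__ == "__main__":
    order = 3000
    for name, constant in STANDARD_CONSTANTS.items():
        flops = estimated_flops(constant, order)
        print(f"{name}: about {float(flops):.3e} flops for N = {order}")
```

Exact rational arithmetic is used only so that the constants 1/3 and 2/3 are not rounded; lower-order terms of the true operation counts are ignored.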
The performance of the ScaLAPACK drivers is thus bounded above by the performance of a computation that could be partitioned into $p$ independent chunks of $C_f N^3 / p$ flops each. This upper bound is referred to hereafter as the peak performance and can be computed as the product of $p$ and the highest reachable local processor flop rate. Hence, for a given problem size $N$ and assuming a uniform distribution of the computational tasks, the most important factors determining the overall performance are the number $p$ of processors involved in the computation and the local processor flop rate.
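The bound can be made concrete with a short Python sketch. It assumes a perfectly uniform split of the work and no communication cost, which is exactly the idealization behind the peak figure. The function names and the numbers in the example (64 processes, 1 Gflop/s per process, an LU-like constant of 2/3) are hypothetical and chosen only for illustration.

```python
def peak_performance(p, local_flop_rate):
    """Aggregate upper bound in flop/s: p processes, each at its best local rate."""
    return p * local_flop_rate


def min_execution_time(constant, n, p, local_flop_rate):
    """Seconds needed when constant * n**3 flops split into p equal, independent chunks."""
    flops_per_process = constant * n ** 3 / p
    return flops_per_process / local_flop_rate


if __name__ == "__main__":
    p, rate = 64, 1.0e9           # hypothetical machine: 64 processes at 1 Gflop/s
    constant, n = 2.0 / 3.0, 10_000
    print(f"peak performance: {peak_performance(p, rate):.2e} flop/s")
    print(f"lower bound on time: {min_execution_time(constant, n, p, rate):.2f} s")
```

Any measured execution time can then be compared against this lower bound to judge how close a run comes to the peak.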
In a serial computational environment, transportable efficiency is the essential motivation for developing blocking strategies and block-partitioned algorithms [2, 3, 14, 27]. The linear algebra package (LAPACK) [3] is the archetype of such a strategy. The LAPACK software is constructed as much as possible out of calls to the BLAS. These kernels confine the impact of machine architecture differences within a small number of routines. The efficiency and portability of the LAPACK software are then achieved by combining native and efficient BLAS implementations with portable high-level components.
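The BLAS are subdivided into three levels, each of which offers increased scope for exploiting parallelism. This subdivision corresponds to three different kinds of basic linear algebra operations: vector-vector operations (Level 1), matrix-vector operations (Level 2), and matrix-matrix operations (Level 3).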
The performance potential of the three levels of BLAS is strongly related to the ratio of floating-point operations to memory references, as well as to the reuse of data when it is stored in the higher levels of the memory hierarchy. Consequently, the Level 1 BLAS cannot achieve high efficiency on most modern supercomputers. The Level 2 BLAS can achieve near-peak performance on many vector processors. On RISC microprocessors, however, their performance is limited by the memory access bandwidth bottleneck. The greatest scope for exploiting the highest levels of the memory hierarchy as well as other forms of parallelism is offered by the Level 3 BLAS [3].
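The ratio of floating-point operations to memory references can be tabulated with a small Python sketch. It uses one representative operation per level on operands of order n and the commonly quoted leading-order counts; the function name `blas_ratio` and the choice of representative operations are assumptions made for illustration.

```python
def blas_ratio(level, n):
    """Return (flops, memory_refs, flops / memory_refs) for an order-n operation."""
    if level == 1:
        # y <- alpha*x + y: read x, read y, write y
        flops, memory_refs = 2 * n, 3 * n
    elif level == 2:
        # y <- alpha*A*x + beta*y: dominated by reading the n*n matrix A
        flops, memory_refs = 2 * n ** 2, n ** 2
    elif level == 3:
        # C <- alpha*A*B + beta*C: read A, B, C and write C
        flops, memory_refs = 2 * n ** 3, 4 * n ** 2
    else:
        raise ValueError("level must be 1, 2, or 3")
    return flops, memory_refs, flops / memory_refs


if __name__ == "__main__":
    for level in (1, 2, 3):
        flops, refs, ratio = blas_ratio(level, 1000)
        print(f"Level {level}: {flops} flops, {refs} memory refs, ratio {ratio:.2f}")
```

Under these counts the ratio is constant for Levels 1 and 2 (2/3 and 2) but grows as n/2 for Level 3, which is why only the Level 3 BLAS leave room for substantial data reuse in the upper levels of the memory hierarchy.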
The previous reasoning applies to distributed-memory computational environments in two ways. First, in order to achieve overall high performance, it is necessary to express the bulk of the computation local to each process in terms of Level 3 BLAS operations. Second, designing and developing a set of parallel BLAS (PBLAS) for distributed-memory concurrent computers should lead to an efficient and straightforward port of the LAPACK software. This is the path followed by the ScaLAPACK project [8, 18] as well as others [1, 7, 12, 20]. As part of the ScaLAPACK project, a set of PBLAS was designed and developed early on [11, 9].