The total number
of floating-point
operations performed
by most of the ScaLAPACK
driver routines for
dense matrices
can be approximated
by the quantity
, where
is a constant
and N is the
order of the
largest matrix
operand. For
solving linear
equations or
linear least
squares,
is a constant
depending solely
on the selected
algorithm. The
algorithms used
to find eigenvalues
and singular
values are
iterative; hence,
for these operations
the constant
truly depends
on the input
data as well.
It is, however,
customary or
``standard'' to
consider the
values of the
constants
for a fixed
number of
iterations.
The ``standard''
constants
range from 1/3
to approximately
18, as shown
in Table 5.8.
The performance
of the ScaLAPACK
drivers is thus
bounded above
by the performance
of a computation
that could be
partitioned into
P independent
chunks of
floating-point
operations each.
This upper bound,
referred to
hereafter as
the peak
performance ,
can be computed
as the product
of
and the highest
reachable local
node flop rate.
Hence, for
a given problem
size N and
assuming a uniform
distribution of the
computational tasks,
the most important
factors determining
the overall performance
are the number
P of nodes
involved in the
computation and
the local node
flop rate.
In a serial computational environment, transportable efficiency is the essential motivation for developing blocking strategies and block-partitioned algorithms [2, 3, 35, 90] . The linear algebra package (LAPACK) [3] is the archetype of such a strategy. The LAPACK software is constructed as much as possible out of calls to the BLAS. These kernels confine the impact of the computer architecture differences to a small number of routines. The efficiency and portability of the LAPACK software are then achieved by combining native and efficient BLAS implementations with portable high-level components.
The BLAS are subdivided into three levels, each of which offers increased scope for exploiting parallelism. This subdivision corresponds to three different types of basic linear algebra operations:
The performance potential of the three levels of BLAS is strongly related to the ratio of floating-point operations to memory references, as well as to the reuse of data when it is stored in the higher levels of the memory hierarchy. Consequently, the Level 1 BLAS cannot achieve high efficiency on most modern supercomputers. The Level 2 BLAS can achieve near-peak performance on many vector processors; on RISC microprocessors, however, their performance is limited by the memory access bandwidth bottleneck. The greatest scope for exploiting the highest levels of the memory hierarchy as well as other forms of parallelism is offered by the Level 3 BLAS [3].
The previous reasoning applies to distributed-memory computational environments in two ways. First, in order to achieve overall high performance, it is necessary to express the bulk of the computation local to each node in terms of Level 3 BLAS operations. Second, designing and developing a set of parallel BLAS (PBLAS) for distributed-memory computers should lead to an efficient and straightforward port of the LAPACK software. This is the path followed by the ScaLAPACK initiative [25, 53] as well as others [1, 21, 30, 63]. As part of the ScaLAPACK project, a set of PBLAS has been early designed and developed [29, 26].