The total number of floating-point operations performed by most of the ScaLAPACK driver routines for dense matrices can be approximated by the quantity $C_f N^3$, where $C_f$ is a constant and *N* is the order of the largest matrix operand. For solving linear equations or linear least squares problems, $C_f$ is a constant depending solely on the selected algorithm. The algorithms used to find eigenvalues and singular values are iterative; hence, for these operations the constant $C_f$ truly depends on the input data as well. It is, however, customary, or "standard," to consider the values of these constants for a fixed number of iterations. The "standard" constants $C_f$ range from 1/3 to approximately 18, as shown in Table 5.8.
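
As a concrete illustration of the magnitude of this quantity, take $C_f = 2/3$, the well-known constant for LU factorization (the constants for the other drivers appear in Table 5.8): a linear solve of order $N = 10^4$ then performs roughly $C_f N^3 = \frac{2}{3}(10^4)^3 \approx 6.7 \times 10^{11}$ floating-point operations.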

The performance of the ScaLAPACK drivers is thus bounded above by the performance of a computation that could be partitioned into *P* independent chunks of $C_f N^3 / P$ floating-point operations each. This upper bound, referred to hereafter as the *peak performance*, can be computed as the product of *P* and the highest reachable local node flop rate. Hence, for a given problem size *N* and assuming a uniform distribution of the computational tasks, the most important factors determining the overall performance are the number *P* of nodes involved in the computation and the local node flop rate.
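
For instance, a minimal sketch of this model in C (all parameter values are illustrative assumptions, not measurements):

```c
/* Peak-performance model: a lower bound on execution time obtained by
 * assuming all C_f * N^3 flops run at the aggregate peak rate P * r.
 * Every numeric value below is an illustrative assumption. */
#include <stdio.h>

int main(void) {
    double Cf   = 2.0 / 3.0; /* flop-count constant (e.g., LU factorization) */
    double N    = 1.0e4;     /* order of the matrix                          */
    double P    = 64.0;      /* number of nodes                              */
    double rate = 1.2e9;     /* assumed local node flop rate, in flop/s      */

    double work = Cf * N * N * N; /* total floating-point operations         */
    double peak = P * rate;       /* peak performance of the machine         */
    double tmin = work / peak;    /* time lower bound at the peak rate       */

    printf("work = %.3e flops, peak = %.3e flop/s, t >= %.3f s\n",
           work, peak, tmin);
    return 0;
}
```

Achieved performance always falls below this bound, since the model ignores communication and load imbalance entirely.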

In a serial computational environment, *transportable efficiency* is the essential motivation for developing blocking strategies and block-partitioned algorithms [2, 3, 35, 90]. The Linear Algebra PACKage (LAPACK) [3] is the archetype of such a strategy. The LAPACK software is constructed as much as possible out of calls to the BLAS. These kernels confine the impact of differences in computer architecture to a small number of routines. The efficiency and portability of the LAPACK software are then achieved by combining native, efficient BLAS implementations with portable high-level components.
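
To illustrate the block-partitioned approach, the following sketch factors a symmetric positive definite matrix by a blocked right-looking Cholesky algorithm: the small diagonal block is handled by an LAPACK kernel (via LAPACKE), while the panel solve and trailing update are delegated to Level 3 BLAS routines. This is a minimal sketch, not the LAPACK implementation itself; the matrix, order, and block size are arbitrary illustrative choices.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

/* Blocked right-looking Cholesky (lower triangular), row-major storage. */
static void blocked_cholesky(double *A, int n, int nb) {
    for (int k = 0; k < n; k += nb) {
        int jb = (nb < n - k) ? nb : n - k;
        /* Factor the jb-by-jb diagonal block with an LAPACK kernel. */
        LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', jb, &A[k * n + k], n);
        if (k + jb < n) {
            int m = n - k - jb;
            /* Panel solve: L21 <- A21 * L11^{-T}  (Level 3 BLAS, TRSM). */
            cblas_dtrsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, m, jb, 1.0,
                        &A[k * n + k], n, &A[(k + jb) * n + k], n);
            /* Trailing update: A22 <- A22 - L21 * L21^T  (Level 3, SYRK). */
            cblas_dsyrk(CblasRowMajor, CblasLower, CblasNoTrans, m, jb,
                        -1.0, &A[(k + jb) * n + k], n,
                        1.0, &A[(k + jb) * n + (k + jb)], n);
        }
    }
}

int main(void) {
    int n = 8, nb = 4;
    double *A = malloc((size_t)n * n * sizeof *A);
    /* A simple, diagonally dominant symmetric positive definite matrix. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[i * n + j] = (i == j) ? n : 1.0;
    blocked_cholesky(A, n, nb);
    printf("L[0][0] = %g\n", A[0]); /* sqrt(n) for this test matrix */
    free(A);
    return 0;
}
```

For large *n*, almost all of the arithmetic lands in the TRSM and SYRK calls, which is precisely how blocking confines performance tuning to the BLAS.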

The BLAS are subdivided into three levels, each of which offers increased scope for exploiting parallelism. This subdivision corresponds to three different types of basic linear algebra operations (one representative routine from each level is sketched in the code following the list):

- Level 1 BLAS [93]: for vector operations, such as $y \leftarrow \alpha x + y$,
- Level 2 BLAS [59, 58]: for matrix-vector operations, such as $y \leftarrow \alpha A x + \beta y$,
- Level 3 BLAS [57, 56]: for matrix-matrix operations, such as $C \leftarrow \alpha A B + \beta C$.
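
A minimal sketch of one such routine per level, using the standard C interface to the BLAS (CBLAS); the operands are small arbitrary examples:

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    double A[9] = {1, 0, 0, 0, 1, 0, 0, 0, 1}; /* 3x3 identity */
    double B[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    double C[9] = {0};

    /* Level 1: y <- alpha*x + y       (O(n) flops on O(n) data)   */
    cblas_daxpy(3, 2.0, x, 1, y, 1);

    /* Level 2: y <- alpha*A*x + beta*y (O(n^2) flops on O(n^2) data) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 3, 3, 1.0, A, 3,
                x, 1, 1.0, y, 1);

    /* Level 3: C <- alpha*A*B + beta*C (O(n^3) flops on O(n^2) data) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3);

    printf("y[0] = %g, C[0] = %g\n", y[0], C[0]);
    return 0;
}
```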

The performance potential of the three levels of BLAS is strongly related to the ratio of floating-point operations to memory references, as well as to the reuse of data when it is stored in the higher levels of the memory hierarchy. Consequently, the Level 1 BLAS cannot achieve high efficiency on most modern supercomputers. The Level 2 BLAS can achieve near-peak performance on many vector processors; on RISC microprocessors, however, their performance is limited by the memory access bandwidth bottleneck. The greatest scope for exploiting the highest levels of the memory hierarchy as well as other forms of parallelism is offered by the Level 3 BLAS [3].
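
For example, counting flops and memory references for operands of order $n$ (approximate counts, dropping lower-order terms) makes the contrast between the levels explicit:

$$
\frac{\text{flops}}{\text{memory refs}} \;\approx\;
\underbrace{\frac{2n}{3n} = \frac{2}{3}}_{\text{Level 1 (AXPY)}}, \qquad
\underbrace{\frac{2n^2}{n^2 + 3n} \approx 2}_{\text{Level 2 (GEMV)}}, \qquad
\underbrace{\frac{2n^3}{4n^2} = \frac{n}{2}}_{\text{Level 3 (GEMM)}}.
$$

Only the Level 3 ratio grows with $n$, which is why matrix-matrix kernels can reuse data in cache and registers long enough to approach peak speed.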

The previous reasoning applies to distributed-memory computational environments in two ways. First, in order to achieve overall high performance, it is necessary to express the bulk of the computation local to each node in terms of Level 3 BLAS operations. Second, designing and developing a set of parallel BLAS (PBLAS) for distributed-memory computers should lead to an efficient and straightforward port of the LAPACK software. This is the path followed by the ScaLAPACK initiative [25, 53] as well as others [1, 21, 30, 63]. As part of the ScaLAPACK project, a set of PBLAS was designed and developed early on [29, 26].
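
To make the analogy concrete, here is a hedged sketch of a call to PDGEMM, the parallel analogue of the Level 3 BLAS routine DGEMM, from C. In the PBLAS calling sequence, each (array, leading dimension) pair of the BLAS becomes an (array, row index, column index, descriptor) quadruple describing a block-cyclically distributed global matrix. The process-grid shape, matrix order, and block size below are illustrative assumptions; error checking and the filling of the local matrix pieces are omitted.

```c
/* Sketch: C <- A * B with the PBLAS, assuming a ScaLAPACK installation.
 * Run with exactly nprow*npcol MPI processes, e.g. mpirun -np 4. */
#include <stdlib.h>

/* Entry points from the BLACS, ScaLAPACK TOOLS, and the PBLAS. */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int icontxt, int what, int *val);
extern void Cblacs_gridinit(int *icontxt, const char *order,
                            int nprow, int npcol);
extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol,
                            int *myrow, int *mycol);
extern void Cblacs_gridexit(int icontxt);
extern void Cblacs_exit(int cont);
extern int  numroc_(const int *n, const int *nb, const int *iproc,
                    const int *isrcproc, const int *nprocs);
extern void descinit_(int *desc, const int *m, const int *n, const int *mb,
                      const int *nb, const int *irsrc, const int *icsrc,
                      const int *ictxt, const int *lld, int *info);
extern void pdgemm_(const char *transa, const char *transb, const int *m,
                    const int *n, const int *k, const double *alpha,
                    const double *a, const int *ia, const int *ja,
                    const int *desca, const double *b, const int *ib,
                    const int *jb, const int *descb, const double *beta,
                    double *c, const int *ic, const int *jc,
                    const int *descc);

int main(void) {
    int iam, nprocs, ictxt, myrow, mycol, info;
    int nprow = 2, npcol = 2;  /* 2x2 process grid (illustrative)  */
    int n = 1000, nb = 64;     /* global order, distribution block */
    int izero = 0, ione = 1;
    double one = 1.0, zero = 0.0;

    Cblacs_pinfo(&iam, &nprocs);
    Cblacs_get(-1, 0, &ictxt); /* default system context */
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    /* Local dimensions of the block-cyclically distributed matrices. */
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld = mloc > 1 ? mloc : 1;

    int desca[9], descb[9], descc[9];
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);
    descinit_(descb, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);
    descinit_(descc, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);

    double *a = calloc((size_t)mloc * nloc, sizeof *a);
    double *b = calloc((size_t)mloc * nloc, sizeof *b);
    double *c = calloc((size_t)mloc * nloc, sizeof *c);
    /* ... fill the local pieces of A and B here ... */

    /* Same calling sequence as DGEMM, with (array, lda) replaced by
     * (array, row index, column index, descriptor) for each matrix. */
    pdgemm_("N", "N", &n, &n, &n, &one, a, &ione, &ione, desca,
            b, &ione, &ione, descb, &zero, c, &ione, &ione, descc);

    free(a); free(b); free(c);
    Cblacs_gridexit(ictxt);
    Cblacs_exit(0);
    return 0;
}
```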
