The total volume of data communicated by most of the ScaLAPACK driver routines for dense matrices can be approximated by the quantity $C_v N^2/\sqrt{P}$, where N is the order of the largest matrix operand and P is the number of processes. The number of messages, however, is proportional to N and can be approximated by the quantity $C_m N/NB$, where NB is the logical blocking factor used in the computation. Similar to the situation described above, the ``standard'' constants $C_v$ and $C_m$ for the communication volume and message count depend upon the performed computation and are of the same order as the floating-point operation constants $C_f$. The values of the ``standard'' constants $C_f$, $C_v$, and $C_m$ for a few selected ScaLAPACK drivers are presented in Table 5.8. Communication is therefore a central concern, and a significant fraction of the ScaLAPACK software is devoted to exchanging messages between processes.
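Taking the communication model to be a volume of $C_v N^2/\sqrt{P}$ and a message count of $C_m N/NB$ (P being the number of processes), the sketch below simply evaluates both estimates so their scaling can be compared. The function name `comm_estimates` is hypothetical, and the constants default to 1.0 as placeholders; the real per-driver values are those listed in Table 5.8 and are not reproduced here.

```python
import math


def comm_estimates(n, p, nb, c_v=1.0, c_m=1.0):
    """Hypothetical helper: return (volume, messages) from the model
    volume ~ C_v * N^2 / sqrt(P) and messages ~ C_m * N / NB.

    c_v and c_m default to 1.0 as placeholders only; they are NOT the
    per-driver constants of Table 5.8.
    """
    volume = c_v * n**2 / math.sqrt(p)  # data volume shrinks as sqrt(P) grows
    messages = c_m * n / nb             # message count does not depend on P
    return volume, messages


if __name__ == "__main__":
    for p in (4, 16, 64):
        v, m = comm_estimates(n=4000, p=p, nb=64)
        print(f"P={p:3d}  volume~{v:12.0f}  messages~{m:6.1f}")
```

With N = 4000 and NB = 64, going from P = 4 to P = 16 halves the estimated volume (from 8,000,000 to 4,000,000 in units of $C_v$), while the message count stays at 62.5 in units of $C_m$: adding processes reduces the data volume per the model, but not the number of messages.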
Developing an adequate message-passing interface specialized for linear algebra operations was one of the first achievements of the ScaLAPACK project. The Basic Linear Algebra Communications Subprograms (BLACS) [50, 54] were specifically designed to facilitate the expression of the relevant communication operations. The simplicity of the BLACS interface, as well as the rigor of its specification, makes the entire ScaLAPACK software easy to port. The BLACS have been efficiently ported to machine-specific message-passing libraries such as the IBM (MPL) and Intel (NX) libraries, as well as to more generic interfaces such as PVM and MPI. The BLACS overhead has been shown to be negligible [54].
The BLACS interface provides the user and library designer with an appropriate level of notation for linear algebra communication: the BLACS operate on typed two-dimensional arrays. The computational model consists of a one- or two-dimensional grid of processes, where each process stores matrices and vectors. The BLACS include synchronous send/receive routines to send a matrix or submatrix from one process to another, routines to broadcast submatrices, and routines to perform global reductions (sums, maxima, and minima). Other routines establish, change, or query the process grid.
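The process-grid model and the global reductions described above can be simulated sequentially in a few lines. This is a sketch, not the BLACS API: `make_grid` and `global_sum` are hypothetical names, ranks are assumed to be mapped onto the grid in row-major order, each process contributes a single scalar instead of a matrix for brevity, and every process in the chosen scope receives the result (an all-reduce).

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (process row, process column)


def make_grid(nprow: int, npcol: int) -> Dict[int, Coord]:
    """Hypothetical: map ranks 0..nprow*npcol-1 to grid coordinates, row-major."""
    return {r: divmod(r, npcol) for r in range(nprow * npcol)}


def global_sum(values: Dict[Coord, float], scope: str,
               nprow: int, npcol: int) -> Dict[Coord, float]:
    """Hypothetical: sum `values` over a process row, a process column, or the
    whole grid; every process in the scope receives the same result."""
    out = {}
    for (i, j) in values:
        if scope == "row":
            members = [(i, c) for c in range(npcol)]
        elif scope == "column":
            members = [(r, j) for r in range(nprow)]
        elif scope == "all":
            members = [(r, c) for r in range(nprow) for c in range(npcol)]
        else:
            raise ValueError(f"unknown scope {scope!r}")
        out[(i, j)] = sum(values[m] for m in members)
    return out


if __name__ == "__main__":
    grid = make_grid(2, 3)                       # 2 x 3 process grid
    vals = {coord: float(rank) for rank, coord in grid.items()}
    print(global_sum(vals, "row", 2, 3))
```

On a 2 x 3 grid where each process contributes its rank, a row-scoped sum gives 3.0 on every process of row 0 and 12.0 on every process of row 1; the same call with `"column"` or `"all"` restricts or widens the set of participants, mirroring the scoped reductions the text describes.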
For ease of use and flexibility, the BLACS send operation is locally blocking; that is, the return from the send operation indicates only that the resources may be reused. Since this depends only on local information, the sender does not know whether the matching receive operation has been called, so buffering is necessary on the sending or the receiving process. The BLACS receive operation, in contrast, is globally blocking: the return from the receive operation indicates that the message has been (sent and) received. On a system natively supporting globally blocking sends, such as the IBM SP2 computer, nonblocking sends coupled with buffering are used to simulate locally blocking sends. This extra buffering may cause a slight performance degradation on those systems.
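The difference between a locally blocking send and a globally blocking receive can be modeled with a thread-safe buffer. In the sketch below (the `Channel` class and `demo` function are hypothetical, built on Python's `queue.Queue`), `send` copies the matrix into an unbounded buffer and returns at once, so the sender may overwrite its array immediately; `recv` blocks until a message has arrived.

```python
import queue
import threading


class Channel:
    """Hypothetical model of a BLACS-style point-to-point link.

    send(): copies the data into an internal buffer and returns immediately
            (locally blocking: the caller's array may be reused at once).
    recv(): blocks until a message is available (globally blocking).
    """

    def __init__(self):
        self._buf = queue.Queue()  # unbounded, so put() never blocks

    def send(self, matrix):
        # Buffer a copy so later writes by the sender cannot alter the message.
        self._buf.put([row[:] for row in matrix])

    def recv(self):
        # Waits until the matching send has deposited a message.
        return self._buf.get()


def demo():
    ch = Channel()
    a = [[1.0, 2.0], [3.0, 4.0]]
    received = []

    receiver = threading.Thread(target=lambda: received.append(ch.recv()))
    receiver.start()      # receiver blocks in recv() until data arrives
    ch.send(a)            # returns as soon as the copy is buffered
    a[0][0] = -99.0       # sender reuses its storage; the message is unaffected
    receiver.join()
    return a, received[0]


if __name__ == "__main__":
    print(demo())
```

The copy inside `send` plays the role of the buffering that the text says is required somewhere on the sending or receiving side; without it, the sender's later write could corrupt a message the receiver has not yet consumed.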
The BLACS broadcast and combine operations can select among different virtual network topologies. This easy-to-use built-in facility allows the expression of various message-scheduling approaches, such as a communication pipeline. This distinctive BLACS characteristic is necessary for achieving the highest performance levels on distributed-memory platforms.
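To see why the choice of topology matters, the sketch below lists, step by step, the (sender, receiver) pairs of a broadcast over an increasing ring and over a binomial tree. These two schedules are illustrative choices; the function names `ring_broadcast`, `tree_broadcast`, and `covered` are hypothetical and do not correspond to BLACS routines or topology codes.

```python
from typing import List, Set, Tuple

Step = List[Tuple[int, int]]  # (sender, receiver) pairs issued in one step


def ring_broadcast(p: int, root: int = 0) -> List[Step]:
    """Hypothetical increasing ring: each process forwards to its successor."""
    steps = []
    for k in range(p - 1):
        src = (root + k) % p
        steps.append([(src, (src + 1) % p)])
    return steps


def tree_broadcast(p: int, root: int = 0) -> List[Step]:
    """Hypothetical binomial tree: the number of holders doubles every step."""
    steps = []
    have = 1  # processes (counted relative to root) that already hold the data
    while have < p:
        step = [((root + rel) % p, (root + rel + have) % p)
                for rel in range(have) if rel + have < p]
        steps.append(step)
        have *= 2
    return steps


def covered(steps: List[Step], root: int = 0) -> Set[int]:
    """Return the processes holding the data after all steps, checking that
    every sender already had the data before the step in which it sends."""
    have = {root}
    for step in steps:
        if not {s for s, _ in step} <= have:
            raise RuntimeError("a process forwarded data it had not received")
        have |= {d for _, d in step}
    return have


if __name__ == "__main__":
    print(len(ring_broadcast(8)), len(tree_broadcast(8)))  # steps per broadcast
```

For a single broadcast over 8 processes, the tree finishes in 3 steps while the ring needs 7, so the tree has lower latency. The ring's advantage appears when many broadcasts are issued in sequence, as in a panel factorization: each process only forwards to its neighbor and is then free, so successive messages can flow through the ring in a pipelined fashion.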