The total volume of data communicated by most of the ScaLAPACK driver routines for dense matrices can be approximated by the quantity $C_v N^2$, where $N$ is the order of the largest matrix operand. The number of messages, however, is proportional to $N$ and can be approximated by the quantity $C_m N / NB$, where $NB$ is the logical blocking factor used in the computation. As in the situation described above, the ``standard'' constants for the communication volume depend upon the performed computation and are of the same order as the floating-point operation constants shown in Table 5.8. The values of the ``standard'' constants for a few selected ScaLAPACK drivers are presented in Table 5.8. As a result, a significant percentage of the ScaLAPACK software is devoted to exchanging messages between processes.
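To make the interplay between message count and data volume concrete, the following minimal sketch plugs both estimates into a simple latency/bandwidth cost model. It assumes that the volume scales as a constant times N^2 and the message count as a constant times N/NB. The constants `c_v` and `c_m`, the per-message latency `alpha`, and the per-item transfer time `beta` are hypothetical placeholders, not values taken from Table 5.8.

```python
def message_count(n, nb, c_m=1.0):
    """Estimated number of messages: c_m * N / NB (c_m is a placeholder constant)."""
    return c_m * n / nb


def comm_volume(n, c_v=1.0):
    """Estimated number of data items communicated: c_v * N^2 (assumed scaling)."""
    return c_v * n * n


def comm_time(n, nb, alpha, beta, c_v=1.0, c_m=1.0):
    """Illustrative cost model: latency per message plus transfer time per item."""
    return message_count(n, nb, c_m) * alpha + comm_volume(n, c_v) * beta


def sweep_blocking_factors(n, factors, alpha, beta):
    """Return (NB, message count, estimated time) for each blocking factor."""
    return [(nb, message_count(n, nb), comm_time(n, nb, alpha, beta))
            for nb in factors]
```

Under this model, a larger blocking factor lowers the number of messages and hence the latency term, while the volume term is unaffected by NB.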
Developing an adequate message-passing interface specialized for linear algebra operations has been one of the first achievements of the ScaLAPACK project. The Basic Linear Algebra Communications Subprograms (BLACS) [50, 54] were thus specifically designed to facilitate the expression of the relevant communication operations. The simplicity of the BLACS interface, as well as the rigor of its specification, makes the entire ScaLAPACK software easy to port. Currently, the BLACS have been efficiently ported to machine-specific message-passing libraries such as the IBM (MPL) and Intel (NX) message-passing libraries, as well as to more generic interfaces such as PVM and MPI. The BLACS overhead has been shown to be negligible.
The BLACS interface provides the user and library designer with an appropriate level of notation. Indeed, the BLACS operate on typed two-dimensional arrays. The computational model consists of a one- or two-dimensional grid of processes, where each process stores matrices and vectors. The BLACS include synchronous send/receive routines to send a matrix or submatrix from one process to another, to broadcast submatrices, or to perform global reductions (sums, maxima and minima). Other routines establish, change, or query the process grid. The BLACS provide an adequate interface level for linear algebra communication operations.
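The computational model can be illustrated without any message-passing library. The sketch below is a plain Python simulation, not BLACS code: it assumes a row-major numbering of processes on the grid (an illustrative choice) and models a global sum as an element-wise combination of equally shaped two-dimensional blocks, one per process. The function names are hypothetical.

```python
def grid_coords(rank, nprow, npcol):
    """Map a process rank to (row, col) on an nprow x npcol grid, row-major."""
    if not 0 <= rank < nprow * npcol:
        raise ValueError("rank outside the process grid")
    return divmod(rank, npcol)


def grid_rank(row, col, npcol):
    """Inverse mapping: grid coordinates back to a row-major rank."""
    return row * npcol + col


def global_sum(local_blocks):
    """Element-wise sum of the 2D blocks held by each simulated process."""
    rows = len(local_blocks[0])
    cols = len(local_blocks[0][0])
    result = [[0] * cols for _ in range(rows)]
    for block in local_blocks:
        for i in range(rows):
            for j in range(cols):
                result[i][j] += block[i][j]
    return result
```

For example, on a 2 x 3 grid, rank 5 sits at coordinates (1, 2), and summing the blocks `[[1, 2]]` and `[[3, 4]]` yields `[[4, 6]]`.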
For ease of use and flexibility, the BLACS send operation is locally blocking; that is, the return from the send operation indicates that the resources may be reused. However, since this depends only on local information, it is unknown whether the receive operation has been called. Buffering is therefore necessary on the sending or the receiving process. The BLACS receive operation is globally blocking. The return from the receive operation indicates that the message has been (sent and) received. On a system natively supporting globally blocking sends, such as the IBM SP2 computer, nonblocking sends coupled with buffering are used to simulate locally blocking sends. This extra buffering operation may cause a slight performance degradation on those systems.
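The difference between the two blocking semantics can be mimicked with threads and a queue. The following sketch is only an analogy, not the BLACS implementation: the hypothetical `Channel` class copies the data on `send` (the buffering step) and returns immediately, while `recv` waits until a message has actually arrived.

```python
import queue
import threading


class Channel:
    """Toy point-to-point channel between two simulated processes."""

    def __init__(self):
        self._messages = queue.Queue()

    def send(self, buf):
        # Locally blocking: copy the data into a buffer and return at once,
        # without knowing whether the receiver has called recv() yet.
        self._messages.put(list(buf))

    def recv(self):
        # Globally blocking: wait until a message has been sent and delivered.
        return self._messages.get()


def demo():
    channel = Channel()
    received = []
    receiver = threading.Thread(target=lambda: received.append(channel.recv()))
    receiver.start()          # the receiver waits: nothing has been sent yet
    buf = [1, 2, 3]
    channel.send(buf)         # returns as soon as the data has been buffered
    buf[0] = 99               # reusing the send buffer is safe after send()
    receiver.join()
    return buf, received[0]
```

Calling `demo()` shows that the receiver obtains the data as it was at the time of the send, even though the sender overwrites its buffer right afterwards.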
The BLACS broadcast and combine operations feature the ability to select different virtual network topologies. This easy-to-use built-in facility allows for the expression of various message scheduling approaches, such as a communication pipeline. This distinctive BLACS characteristic is necessary for achieving the highest performance levels on distributed-memory platforms.
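To illustrate why the choice of topology matters, the sketch below generates the message schedule of a broadcast over two example topologies, a ring and a binomial tree. It is a simplified simulation with hypothetical function names; it does not reproduce the topologies or the scheduling actually implemented in the BLACS.

```python
def ring_broadcast(p, root=0):
    """Ring schedule: one (src, dst) message per step, passed neighbor to neighbor."""
    return [[((root + i) % p, (root + i + 1) % p)] for i in range(p - 1)]


def tree_broadcast(p, root=0):
    """Binomial-tree schedule: every process holding the data sends in each step."""
    steps = []
    have = 1  # processes with relative ranks 0..have-1 already hold the data
    while have < p:
        step = []
        for r in range(have):
            dst = r + have
            if dst < p:
                step.append(((root + r) % p, (root + dst) % p))
        steps.append(step)
        have *= 2
    return steps
```

With eight processes, the tree completes a single broadcast in three steps and the ring in seven. In the ring, however, each process forwards the data only once, so the root can start the next broadcast as soon as its first message has left. This is what makes a ring attractive for pipelining a sequence of broadcasts.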