For the past 15 years or so, there has been a great deal of activity in algorithms and software for solving linear algebra problems. The linear algebra community has long recognized the need for help in developing algorithms into software libraries, and several years ago, as a community effort, put together a de facto standard identifying the basic operations required in linear algebra algorithms and software. The hope was that the routines making up this standard, known collectively as the Basic Linear Algebra Subprograms (BLAS), would be implemented efficiently by many manufacturers of advanced-architecture computers, so that software built on top of them would be both portable and efficient across a wide range of machines. This goal has been largely realized.
The key insight of this approach to designing linear algebra algorithms for advanced-architecture computers is that, to attain high performance, the frequency with which data are moved between different levels of the memory hierarchy must be minimized. Thus, our main algorithmic approach for exploiting both vectorization and parallelism in our implementations is the use of block-partitioned algorithms, particularly in conjunction with highly tuned kernels for matrix-vector and matrix-matrix operations (the Level 2 and 3 BLAS). In general, block-partitioned algorithms move data as blocks rather than as vectors or scalars, so that although the total amount of data moved is unchanged, the latency (or startup cost) associated with the movement is greatly reduced because fewer messages are needed.
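To make this concrete, here is a minimal sketch in C of a block-partitioned matrix-matrix multiply. The matrix order N, the block size NB, and the routine name blocked_gemm are illustrative assumptions, not code from any of the libraries discussed here. Each (ib, jb, kb) iteration works on three NB × NB blocks, performing O(NB^3) arithmetic on O(NB^2) data, so the blocks can be reused while they are resident in a fast level of the memory hierarchy; in a production code the innermost three loops would be replaced by a call to a tuned Level 3 BLAS kernel such as DGEMM.

    #include <stdio.h>
    #include <stdlib.h>

    #define N   512          /* matrix order (illustrative)         */
    #define NB  64           /* block size: the tuning parameter    */

    /* C += A * B for N x N matrices stored in row-major order,
     * computed block by block so that each NB x NB block is
     * reused many times while it is held in fast memory.          */
    static void blocked_gemm(const double *A, const double *B, double *C)
    {
        for (int ib = 0; ib < N; ib += NB)
            for (int jb = 0; jb < N; jb += NB)
                for (int kb = 0; kb < N; kb += NB)
                    /* block-block product: in practice this is where a
                     * tuned Level 3 BLAS kernel (e.g. DGEMM) is called */
                    for (int i = ib; i < ib + NB; i++)
                        for (int k = kb; k < kb + NB; k++) {
                            double a = A[i * N + k];
                            for (int j = jb; j < jb + NB; j++)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }

    int main(void)
    {
        double *A = calloc((size_t)N * N, sizeof *A);
        double *B = calloc((size_t)N * N, sizeof *B);
        double *C = calloc((size_t)N * N, sizeof *C);
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 1.0; }

        blocked_gemm(A, B, C);
        printf("C[0][0] = %g (expected %d)\n", C[0], N);

        free(A); free(B); free(C);
        return 0;
    }

In this sketch N is assumed to be a multiple of NB for simplicity, and NB is the tuning parameter discussed next: it is chosen so that a few NB × NB blocks fit in the memory level being targeted.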
A second key idea is that the user can tune the performance of an algorithm by varying the parameters that specify the data layout. On shared-memory machines, the relevant parameter is the block size; on distributed-memory machines, it is the block size together with the configuration of the logical process mesh.
The way in which an algorithm's data are distributed over the processors of a parallel computer has a major impact on the load balance and communication characteristics of the parallel algorithm, and hence largely determines its performance and scalability. The block scattered (or block cyclic) decomposition provides a simple, yet general-purpose, way of distributing a block-partitioned matrix on distributed memory parallel computers. In the block scattered decomposition, described in detail in [26], a matrix is partitioned into blocks of size r × s, and blocks separated by a fixed stride in the column and row directions are assigned to the same processor. If the stride in the column and row directions is P and Q blocks respectively, then we require that P × Q equals the number of processors, N_p. Thus, it is useful to imagine the processors arranged as a P × Q mesh, or template. Then the processor at position (p,q) (0 ≤ p < P, 0 ≤ q < Q) in the template is assigned the blocks indexed by

    (p + iP, q + jQ),

where i = 0, ..., ⌊(M_b − p − 1)/P⌋, j = 0, ..., ⌊(N_b − q − 1)/Q⌋, and M_b × N_b is the size of the matrix in blocks.
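As a small illustration of this mapping (a sketch with arbitrary example sizes, not code from [26]), the following C program enumerates, for every processor (p,q) of a P × Q template, the blocks it owns under the block scattered decomposition, and conversely computes the owner of block (I,J) as (I mod P, J mod Q).

    #include <stdio.h>

    /* Illustrative template and matrix sizes (in blocks). */
    #define P      2      /* process template rows                  */
    #define Q      3      /* process template columns               */
    #define MBLKS  7      /* M_b: matrix rows, counted in blocks    */
    #define NBLKS  5      /* N_b: matrix columns, counted in blocks */

    int main(void)
    {
        /* Blocks owned by processor (p,q): (p + i*P, q + j*Q). */
        for (int p = 0; p < P; p++)
            for (int q = 0; q < Q; q++) {
                printf("processor (%d,%d):", p, q);
                for (int I = p; I < MBLKS; I += P)        /* I = p + i*P */
                    for (int J = q; J < NBLKS; J += Q)    /* J = q + j*Q */
                        printf(" (%d,%d)", I, J);
                printf("\n");
            }

        /* Conversely, block (I,J) lives on processor (I mod P, J mod Q). */
        int I = 4, J = 2;
        printf("block (%d,%d) is owned by processor (%d,%d)\n",
               I, J, I % P, J % Q);
        return 0;
    }

The template shape P × Q and the block size r × s are precisely the layout parameters referred to above as the user's tuning knobs; setting P = 1 or Q = 1 recovers a purely column- or row-scattered decomposition.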
Blocks are scattered in this way so that good load balance can be maintained in algorithms, such as LU factorization [27,28], in which rows and/or columns of blocks of a matrix are eliminated as the algorithm progresses. However, for some of the distributed Level 3 BLAS routines a scattered decomposition does not improve load balance, and may result in higher concurrent overhead. The general matrix-matrix multiplication routine xGEMM is an example of a routine for which a pure block (i.e., nonscattered) decomposition is optimal when the routine is considered in isolation. However, xGEMM may be used in an application for which, overall, a scattered decomposition is best.
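The load-balance effect can be seen with a small counting experiment (a sketch; the 8 × 8 block matrix, the 2 × 2 template, and the contiguous layout used for comparison are assumptions made for illustration, not taken from [27,28]). After the first k block rows and columns have been eliminated, only blocks with both indices at least k remain active; the program below reports how many active blocks the most and least loaded processors hold under a block scattered layout and under a pure block (contiguous) layout.

    #include <stdio.h>

    #define P      2     /* process template rows (illustrative)   */
    #define Q      2     /* process template columns               */
    #define MBLKS  8     /* matrix size in blocks (MBLKS x MBLKS)  */

    /* Count active blocks per processor after k block rows/columns
     * have been eliminated, under two layouts, and report the spread. */
    static void report(int k, int cyclic)
    {
        int count[P][Q] = {{0}};
        for (int I = k; I < MBLKS; I++)
            for (int J = k; J < MBLKS; J++) {
                int p = cyclic ? I % P : I / (MBLKS / P);
                int q = cyclic ? J % Q : J / (MBLKS / Q);
                count[p][q]++;
            }
        int max = 0, min = count[0][0];
        for (int p = 0; p < P; p++)
            for (int q = 0; q < Q; q++) {
                if (count[p][q] > max) max = count[p][q];
                if (count[p][q] < min) min = count[p][q];
            }
        printf("%-10s k=%d: max=%2d min=%2d blocks per processor\n",
               cyclic ? "scattered" : "contiguous", k, max, min);
    }

    int main(void)
    {
        for (int k = 0; k <= 6; k += 2) {
            report(k, 1);   /* block scattered (cyclic) layout   */
            report(k, 0);   /* pure block (contiguous) layout    */
        }
        return 0;
    }

With the contiguous layout, processors that own only eliminated rows and columns fall idle as k grows, whereas the scattered layout keeps the remaining work spread nearly evenly over the template.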
The underlying concept of the implementations we have chosen for dense matrix computations is the use of block-partitioned algorithms to minimize data movement between different levels of a hierarchical memory. The ideas discussed here for dense linear algebra computations are applicable to any computer with a hierarchical memory that (1) imposes a sufficiently large startup cost on the movement of data between different levels in the hierarchy, and for which (2) the cost of a context switch is too great to make fine-grained multithreading worthwhile. These ideas have been exploited by the software packages LAPACK [7] and ScaLAPACK [29]. The PARKBENCH suite includes five matrix kernels.