Figure 9.12: Process-Grid Data Distribution of Ax=b. Representation of a
concurrent matrix and of distributed-replicated concurrent vectors on a
logical process grid. The solution of Ax=b first appears in
x, a column-distributed vector, and is then normally ``transposed'' via a
global combine to the row-distributed vector y.
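To make the combine concrete, the following sketch (in C with MPI, which postdates the original work and is used here only for illustration) redistributes a column-distributed local piece of x into the row-distributed local piece of y by zero-extending it to full length and summing over a communicator. The cyclic ownership maps, the function name, and the use of MPI are all assumptions for illustration, not the solver's actual implementation.

\begin{verbatim}
/* Sketch of a ``transpose'' by global combine, as in Figure 9.12: the
 * column-distributed solution x is redistributed into the row-distributed
 * vector y.  MPI and the cyclic ownership maps are assumptions. */
#include <mpi.h>
#include <stdlib.h>

/* comm must contain exactly one owner of each entry of x (for example, a
 * grid-row communicator built with MPI_Comm_split), so the sum below
 * reconstructs x without double counting. */
void transpose_by_combine(int N, int my_col, int Q, int my_row, int P,
                          const double *x_local, double *y_local,
                          MPI_Comm comm)
{
    double *buf = calloc(N, sizeof(double));   /* zero-extended local piece */
    double *all = malloc(N * sizeof(double));

    /* Scatter this process's entries of x into their global slots
       (assumed cyclic ownership: entry j belongs to grid column j % Q). */
    for (int j = 0, k = 0; j < N; ++j)
        if (j % Q == my_col)
            buf[j] = x_local[k++];

    /* Global combine: every participant obtains the full-length vector. */
    MPI_Allreduce(buf, all, N, MPI_DOUBLE, MPI_SUM, comm);

    /* Keep the entries this grid row owns (assumed cyclic row ownership). */
    for (int i = 0, k = 0; i < N; ++i)
        if (i % P == my_row)
            y_local[k++] = all[i];

    free(buf);
    free(all);
}
\end{verbatim}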
We solve the problem Ax=b, where A is large and includes many zero entries. We assume that A is unsymmetric both in sparsity pattern and in numerical values. In general, the matrix A will be computed in a distributed fashion, so we inherit a distribution of the coefficients of A (cf., Figures 9.12 and 9.13). Following the style of Harwell's MA28 code for unsymmetric sparse matrices, we use a two-phase approach to the solution. A first LU factorization, called A-mode or ``analyze,'' builds the data structures dynamically and employs a user-defined pivoting function. The repeated B-mode factorization uses the existing data structures statically to factor a new, similarly structured matrix with the previous pivoting pattern; B-mode monitors stability with a simple growth-factor estimate. In practice, A-mode is repeated whenever instability is detected. The two key contributions of this sparse concurrent solver are reduced-communication pivoting and new data distributions for better overall performance.
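The control flow implied by this two-phase scheme can be sketched as follows. Every name, type, and the growth-factor threshold below are hypothetical stand-ins for illustration, not the interface of the concurrent solver or of MA28.

\begin{verbatim}
/* A minimal sketch of the A-mode / B-mode control flow described above.
 * All names, types, and the growth threshold are hypothetical stand-ins. */
#include <stdio.h>

typedef struct { int analyzed; } Factors;  /* stand-in for the LU data structures */

/* User-defined pivoting function: given a candidate entry, return nonzero
 * to accept it as pivot.  Treated as a parameter of the algorithm. */
typedef int (*pivot_fn)(int row, int col, double value);

/* A-mode ("analyze"): build data structures dynamically, choose pivots. */
static void a_mode(Factors *f, const double *A, int n, pivot_fn pivot) {
    (void)A; (void)n; (void)pivot;
    f->analyzed = 1;                       /* stub: real code factors A here */
}

/* B-mode: refactor a same-structured matrix with the stored pivot
 * sequence; return a growth-factor estimate for stability monitoring. */
static double b_mode(Factors *f, const double *A, int n) {
    (void)f; (void)A; (void)n;
    return 1.0;                            /* stub: real code returns max growth */
}

static int threshold_pivot(int row, int col, double value) {
    (void)row; (void)col;
    return value > 1e-8 || value < -1e-8;  /* toy acceptance rule */
}

int main(void) {
    enum { N = 4, STEPS = 3 };
    double A[N * N] = { /* ... entries supplied by the application ... */ 0 };
    Factors f = { 0 };
    const double growth_limit = 1e8;       /* assumed stability threshold */

    a_mode(&f, A, N, threshold_pivot);     /* first factorization */
    for (int step = 0; step < STEPS; ++step) {
        /* ... application updates the numerical values of A ... */
        double growth = b_mode(&f, A, N);
        if (growth > growth_limit)         /* instability detected: re-analyze */
            a_mode(&f, A, N, threshold_pivot);
    }
    printf("done (analyzed=%d)\n", f.analyzed);
    return 0;
}
\end{verbatim}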
Figure 9.13: Example of Process-Grid Data Distribution. An array
with block-linear rows (B=2) and scattered columns on a
logical process grid. The local arrays shown at left are labelled by the
grid position of the owning process; subscripts give the global (I,J) indices.
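Under one plausible reading of these distributions (blocks of B consecutive rows handed out in contiguous ranges over the P grid rows, columns dealt cyclically over the Q grid columns; both conventions, and the array sizes below, are assumptions for illustration), the global-to-process index maps look like the following sketch.

\begin{verbatim}
/* Index maps for a distribution in the style of Figure 9.13, under the
 * assumed conventions stated in the text. */
#include <stdio.h>

#define B 2   /* row block size, as in the figure */

/* Grid row owning global row I, for M rows on P grid rows. */
static int row_owner(int I, int M, int P) {
    int blocks = (M + B - 1) / B;            /* total row blocks    */
    int per_p  = (blocks + P - 1) / P;       /* blocks per grid row */
    return (I / B) / per_p;
}

/* Grid column owning global column J, for Q grid columns (cyclic). */
static int col_owner(int J, int Q) { return J % Q; }

/* Local column index of global column J on its owner. */
static int col_local(int J, int Q) { return J / Q; }

int main(void) {
    int M = 8, N = 9, P = 2, Q = 3;          /* example sizes (assumed) */
    for (int I = 0; I < M; ++I)
        for (int J = 0; J < N; ++J)
            printf("A(%d,%d) -> process (%d,%d), local column %d\n",
                   I, J, row_owner(I, M, P), col_owner(J, Q), col_local(J, Q));
    return 0;
}
\end{verbatim}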
Following Van de Velde [Velde:90a], we consider the LU factorization of a real matrix A, $A \in \mathbb{R}^{N \times N}$. It is well known (e.g., [Golub:89a], pp. 117-118) that for any such matrix A, an LU factorization of the form
\[ P A Q = L U \]
exists, where $P$ and $Q$ are square, (orthogonal) permutation matrices, and $L$ and $U$ are the unit lower- and upper-triangular factors, respectively. Whereas the pivot sequence is stored (two N-length integer vectors), the permutation matrices themselves are neither stored nor applied explicitly. Rearranging, based on the orthogonality of the permutation matrices,
\[ A = P^{T} L U Q^{T}. \]
We factor A with implicit pivoting (no rows or columns are exchanged explicitly as a result of pivoting). Therefore, we do not store $L$ and $U$ directly, but instead the implicitly permuted factors $\tilde{L}$ and $\tilde{U}$. Consequently, $\tilde{L} = P^{T} L$, $\tilde{U} = U Q^{T}$, and $A = \tilde{L}\tilde{U}$. The ``unravelling'' of the permutation matrices is accomplished readily (without incurring additional interprocess communication) during the triangular solves.
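The following serial, dense sketch shows how the stored pivot sequences can be used to unravel the permutations during the two triangular solves. The storage convention (both factors kept at the original row and column positions of A) and all names are illustrative assumptions; the real solver operates on the distributed sparse structure.

\begin{verbatim}
/* Serial, dense sketch of the triangular solves with implicit pivoting.
 * F holds both factors at the ORIGINAL (unpermuted) positions of A:
 * entry (k,m) of L (m<k, unit diagonal) or of U (m>=k) is stored at
 * F[pr[k]][pc[m]], where pr/pc are the stored pivot-row/column sequences. */
#include <stdio.h>

void lu_solve_implicit(int n, const double *F, const int *pr, const int *pc,
                       const double *b, double *x, double *work)
{
    double *z = work;                               /* length-n scratch */

    /* Forward solve  L z = P b  (L has unit diagonal). */
    for (int k = 0; k < n; ++k) {
        double s = b[pr[k]];
        for (int m = 0; m < k; ++m)
            s -= F[pr[k] * n + pc[m]] * z[m];
        z[k] = s;
    }

    /* Back solve  U w = z, then unravel the column permutation: x = Q w. */
    for (int k = n - 1; k >= 0; --k) {
        double s = z[k];
        for (int m = k + 1; m < n; ++m)
            s -= F[pr[k] * n + pc[m]] * x[pc[m]];   /* w[m] already in x[pc[m]] */
        x[pc[k]] = s / F[pr[k] * n + pc[k]];
    }
}

int main(void)                                      /* tiny 2 x 2 example */
{
    /* A = [0 2; 3 4] factored with pivot order pr = {0,1}, pc = {1,0};
       the combined factors, stored at original positions, are F below. */
    double F[4]  = { 0.0, 2.0, 3.0, 2.0 };
    int    pr[2] = { 0, 1 }, pc[2] = { 1, 0 };
    double b[2]  = { 2.0, 7.0 }, x[2], work[2];
    lu_solve_implicit(2, F, pr, pc, b, x, work);
    printf("x = (%g, %g)\n", x[0], x[1]);           /* expect (1, 1) */
    return 0;
}
\end{verbatim}

Storing each solved component directly into x[pc[k]] applies the column permutation on the fly, which is the sense in which the permutations are ``unravelled'' during the solves.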
For the sparse case, performance is more difficult to quantify than for the dense case, but, for example, banded matrices with bandwidth $b$ can be factored with $O(Nb^{2})$ work; we expect subcubic complexity in N for reasonably sparse matrices, and strive for subquadratic complexity for very sparse matrices. The triangular solves can be accomplished in work proportional to the number of entries in the respective triangular matrix L or U. The pivoting strategy is treated as a parameter of the algorithm and is not predetermined. We can consequently treat the pivoting function as an application-dependent function and sometimes tailor it to special problem structures (cf., Section 7 of [Velde:88a]) for higher performance. As for all sparse solvers, we also seek subquadratic memory requirements in N, attained by storing matrix entries in linked-list fashion, as illustrated in Figure 9.14.
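As one hypothetical example of such an application-supplied pivoting function, a threshold test combined with a Markowitz-style fill-in count (a common choice for unsymmetric sparse LU) might be passed to the factorization as a callback; the signature and names below are assumptions, not the solver's actual interface.

\begin{verbatim}
/* Hypothetical pivot-selection callback, illustrating how a pivoting
 * strategy can be supplied as a parameter of the factorization. */
#include <math.h>

typedef struct {
    int    row, col;      /* candidate position (local indices)         */
    double value;         /* candidate numerical value                  */
    int    row_count;     /* nonzeros remaining in the candidate row    */
    int    col_count;     /* nonzeros remaining in the candidate column */
    double col_max;       /* largest magnitude in the candidate column  */
} PivotCandidate;

/* Return a score; the factorization picks the best-scoring stable
 * candidate.  The threshold test (relative to the column maximum)
 * guards stability, the Markowitz count (r-1)(c-1) limits fill-in. */
double threshold_markowitz(const PivotCandidate *c, double u /* e.g., 0.1 */)
{
    if (fabs(c->value) < u * c->col_max)
        return -1.0;                               /* reject: too small  */
    double fill = (double)(c->row_count - 1) * (double)(c->col_count - 1);
    return 1.0 / (1.0 + fill);                     /* prefer low fill-in */
}
\end{verbatim}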
Figure 9.14: Linked-list Entry Structure of Sparse Matrix. A single entry
consists of a double-precision value (8 bytes), the local row (i) and column
(j) indices (2 bytes each), a ``Next Column Pointer'' to the next entry in the
current column (fixed j), and a ``Next Row Pointer'' to the next entry in the
current row (fixed i), at 4 bytes each. Total: 24 bytes per entry.
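This layout maps naturally onto a C structure along the following lines. The field names are ours; with 2-byte indices and 4-byte pointers the fields occupy 20 bytes, so the 24 bytes quoted above presumably include alignment padding for the 8-byte value.

\begin{verbatim}
/* A C rendering of the entry layout in Figure 9.14 (field names ours). */
#include <stdint.h>

typedef struct Entry {
    double        value;     /* numerical value (8 bytes)                   */
    int16_t       i, j;      /* local row and column indices (2 bytes each) */
    struct Entry *next_col;  /* next entry in this column (same j)          */
    struct Entry *next_row;  /* next entry in this row    (same i)          */
} Entry;

/* Walk all entries of one local row, given the row's head pointer. */
static double row_sum(const Entry *head)
{
    double s = 0.0;
    for (const Entry *e = head; e != NULL; e = e->next_row)
        s += e->value;
    return s;
}
\end{verbatim}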
For further discussion of LU factorizations and sparse matrices, see [Golub:89a] and [Duff:86a].