In this section the implementation of the sequential, left-looking,
out-of-core *LU* factorization routine will be discussed.
As we shall see in Section 6, once
the sequential version has been implemented it is a relatively easy task
to parallelize it using the BLACS, PBLAS, and ScaLAPACK, and the
parallel out-of-core routines described in Section 5.

In the out-of-core algorithm only two block columns of width *nb* (the
block size) may be in-core at any
time. One of these is the block column being updated and factored, which
we shall refer to as the *active block*. The other is one of the
block columns lying to the left of the active block column, which we shall
refer to as a *temporary block*.
As we saw in Section 2.2, the three main computational
tasks in a step of the left-looking algorithm are a triangular
solve (Eq. 10), a matrix-matrix multiplication
(Eq. 11), and an *LU* factorization (Eq. 12).
In the out-of-core algorithm the triangular solve and matrix-matrix
multiplication steps are intermingled so that a temporary block can play its
part in both of these operations but be read only once. To clarify this,
consider the role that block column *i* plays in the factorization of
block column *k* (where *i*<*k*). In Figure 3, the first *i*·*nb* rows of
block column *i* play no role in factoring block column *k*. The lower
triangular portion of the next *nb* rows of
block column *i* is labeled *L*, and the next (*k* - *i* - 1)·*nb* rows are
labeled *B*.
The last *M* - *k*·*nb* rows are labeled *D*. The corresponding portions of
block column *k* are labeled *F*, *C*, and *E*. Then the part played by
block column
*i* in factoring block column *k* can be expressed in the following three
operations,

    F := L^(-1) F                                 (18)
    C := C - B F                                  (19)
    E := E - D F                                  (20)

where in Eqs. 19 and 20 we use the *F* given by Eq. 18. It should be noted that Eqs. 19 and 20 can be combined in a single matrix-matrix multiplication operation,

    [ C ]    [ C ]   [ B ]
    [   ] := [   ] - [   ] F                      (21)
    [ E ]    [ E ]   [ D ]

**Figure:** Partitioning of temporary block *i* and active block *k*.
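The combined solve-and-update of Eqs. 18-21 can be sketched in NumPy. This is an illustrative translation, not the paper's code; *L*, *B*, *D* denote the three used portions of the temporary block and *F*, *C*, *E* the corresponding portions of the active block, following the labeling above.

```python
import numpy as np
from scipy.linalg import solve_triangular


def apply_temp_block(L, B, D, F, C, E):
    """Update active-block parts F, C, E in place using temporary-block
    parts L (unit lower triangular), B, and D.

    Eq. 18: F := L^(-1) F                 (a _TRSM-style triangular solve)
    Eq. 21: [C; E] := [C; E] - [B; D] F   (one _GEMM-style multiply)
    """
    F[:] = solve_triangular(L, F, lower=True, unit_diagonal=True)
    BD = np.vstack([B, D])            # stack so Eqs. 19-20 become one multiply
    CE = np.vstack([C, E]) - BD @ F   # Eq. 21
    C[:] = CE[:C.shape[0]]
    E[:] = CE[C.shape[0]:]
```

In the actual algorithm no stacking copy is needed: *B* and *D* (and likewise *C* and *E*) are contiguous rows of the same block column, so a single _GEMM call operates on them directly.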

In updating block column *k*, the out-of-core algorithm sweeps over
all block columns to the left of block column *k* and performs for each the
triangular solve in Eq. 18 and the matrix-matrix
multiplication in Eq. 21. After all the block columns to the
left of the active block have been processed in this way using the Level 3
BLAS routines _TRSM and _GEMM
[6], the matrix *E* is factored using the LAPACK
routine _GETRF
[1].
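As a concrete illustration of the final step, SciPy's `lu_factor` and `lu_solve` wrap the same LAPACK routines (_GETRF and _GETRS); the 3×3 matrix here is only an example.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Once every temporary block to the left has been applied, the trailing
# part E of the active block is factored in-core. lu_factor wraps the
# LAPACK routine dgetrf; piv records the row interchanges, which the
# out-of-core driver keeps in its in-core pivot vector.
E = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
lu, piv = lu_factor(E)
x = lu_solve((lu, piv), np.array([4.0, 10.0, 24.0]))
```

`lu` holds the packed *L* and *U* factors in a single array, exactly as _GETRF leaves them.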

If the matrix is stored on disk without applying the pivots to it,
then whenever a block column is read in, the pivots found up to that point
must be applied to it using the LAPACK auxiliary routine _LASWP. Also,
after updating and factoring the
active block, the pivots must be applied to it in reverse order to
undo the effect of pivoting before the block column is stored to disk. In this
version of the
left-looking algorithm complete block columns are always read or written. In the
version of the algorithm in which the matrix is stored on disk in
pivoted form it is necessary to read in only those parts of the
temporary blocks that play a role in the computation. When a partial
temporary block is read in, the pivots found when factoring *E* in the
previous step must be applied before it is used, and it must then be
written back out to disk.
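The pivot bookkeeping can be sketched as follows; `laswp` is a hypothetical pure-Python stand-in for the LAPACK auxiliary routine _LASWP, using 0-based indices.

```python
import numpy as np


def laswp(A, ipiv, reverse=False):
    """Apply (or, with reverse=True, undo) a sequence of row interchanges.

    ipiv[j] = p means rows j and p were swapped at step j, as recorded by
    an LU factorization with partial pivoting.
    """
    steps = range(len(ipiv))
    if reverse:                      # undo: replay the swaps backwards
        steps = reversed(steps)
    for j in steps:
        p = ipiv[j]
        if p != j:
            A[[j, p]] = A[[p, j]]
    return A
```

Applying the swaps forward and then in reverse restores the block exactly, which is the trick used to return the active block to unpivoted form before it is written back to disk.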

In Figure 4 the pseudocode is presented for the version of
the left-looking algorithm in which the matrix is stored in unpivoted
form. Since a vector of pivot information is maintained in-core, the
factored matrix can always be read in later to be pivoted.
It has been assumed in Figure 4 that the matrix is *M* × *M*
and that *M* is divisible by the block size *nb*. However,
the general case is scarcely more complicated.
It should be noted that it is necessary to position the file pointer
(at the start of the file) only once in each pass through the outer loop.

**Figure:** Pseudocode for the out-of-core, left-looking *LU* factorization algorithm
that leaves the matrix in unpivoted form.
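The structure of this algorithm can be sketched in NumPy, with a Python dict standing in for the disk file. This is a minimal model, not the paper's code: the function names, the unblocked `getrf` helper (standing in for _GETRF), and `swap_rows` (standing in for _LASWP) are illustrative assumptions.

```python
import numpy as np


def getrf(E):
    """Unblocked LU with partial pivoting on the panel E, in place.
    Returns 0-based pivot rows relative to E's first row (cf. _GETRF)."""
    m, nb = E.shape
    piv = np.zeros(nb, dtype=int)
    for j in range(nb):
        piv[j] = j + np.argmax(np.abs(E[j:, j]))
        if piv[j] != j:
            E[[j, piv[j]]] = E[[piv[j], j]]
        if E[j, j] != 0.0:
            E[j + 1:, j] /= E[j, j]
            E[j + 1:, j + 1:] -= np.outer(E[j + 1:, j], E[j, j + 1:])
    return piv


def swap_rows(A, ipiv, lo, hi, reverse=False):
    """Apply pivots lo..hi-1 to A, in reverse order if reverse=True."""
    steps = range(lo, hi)
    for j in (reversed(steps) if reverse else steps):
        if ipiv[j] != j:
            A[[j, ipiv[j]]] = A[[ipiv[j], j]]


def left_looking_oc_lu(disk, M, nb):
    """Left-looking out-of-core LU; `disk` maps block index -> M x nb array.
    The matrix is left unpivoted on disk; ipiv stays in-core."""
    ipiv = np.arange(M)
    for k in range(M // nb):
        active = disk[k].copy()                 # read the active block
        swap_rows(active, ipiv, 0, k * nb)      # apply pivots found so far
        for i in range(k):                      # sweep over temporary blocks
            temp = disk[i].copy()               # one read per temporary block
            swap_rows(temp, ipiv, 0, k * nb)    # re-pivot the temporary block
            r0, r1 = i * nb, (i + 1) * nb
            L = np.tril(temp[r0:r1], -1) + np.eye(nb)
            F = np.linalg.solve(L, active[r0:r1])     # Eq. 18 (_TRSM)
            active[r0:r1] = F
            active[r1:] -= temp[r1:] @ F              # Eq. 21 (_GEMM)
        piv = getrf(active[k * nb:])            # factor E in-core
        ipiv[k * nb:(k + 1) * nb] = piv + k * nb
        # undo all pivoting before writing the block back unpivoted
        swap_rows(active, ipiv, 0, (k + 1) * nb, reverse=True)
        disk[k] = active                        # write the active block
    return ipiv
```

Reading back any stored block column and applying the full pivot sequence with `swap_rows` recovers the corresponding column of the pivoted *LU* factors.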
