Suppose that a partial factorization of A has been obtained so that the first k columns of L and the first k rows of U have been evaluated. Then we may write the partial factorization in block partitioned form, with square blocks along the leading diagonal, as
where and
are
matrices, and
P is a permutation matrix representing the effects of pivoting.
Pivoting is performed to improve the numerical stability of the algorithm
and involves the interchange of matrix rows. The
blocks labeled
in Eq. 1 are the updated
portion of A that has not yet been factored, and will be referred to as
the active submatrix.
We next advance the factorization by evaluating the next block column of L and the next block row of U, so that
where is a permutation matrix of order M-k. Comparing
Eqs. 1 and
2 we see that the factorization is advanced by first factoring the
first block column of the active submatrix which will be referred to as
the current column,
This gives the next block column of L. We then pivot the active submatrix to the right of the current column and the partial L matrix to the left of the current column,
and solve the triangular system
to complete the next block row of U. Finally, a matrix-matrix product is
performed to update ,
Now, one simply needs to relabel the blocks to advance to the next block step.
The main advantage of the block partitioned form of the
LU factorization algorithm is that the updating of
(see Eq. 6)
involves a matrix-matrix operation if the block size is greater than 1.
Matrix-matrix operations generally perform more efficiently than matrix-vector
operations on high performance computers.
However, if the block size is equal to 1, then a matrix-vector operation is
used to perform an outer product -- generally the least efficient of
the Level 2 BLAS
[7] since it updates the whole submatrix.
Note that the original array A may be used to store the factorization, since the L is unit lower triangular and U is upper triangular. Of course, in this and all of the other versions of LU factorization, the additional zeros and ones appearing in the representation do not need to be stored explicitly.
We now derive the cost for performing I/O to and from disk for the
block-partitioned, right-looking LU factorization of an
matrix A with a block size of
. For clarity assume M is exactly
divisible by
. The factorization proceeds in
steps which we shall index
. For some general step k, the active submatrix
is the
matrix in the lower right corner of A, where
. In step k it is necessary to both read and write all of the
active submatrix, so the total I/O cost for the right-looking algorithm is
where R and W are the times to read and write one matrix element, respectively, and we assume there is no startup cost when doing I/O.