Although in Section 2 we saw that the left-looking LU factorization routine has a lower I/O cost than the right-looking variant, the left-looking algorithm has less inherent parallelism since it acts on only a single block column at a time. We therefore propose a hybrid parallel algorithm in which a single block actually spans several widths of the process grid, say $t$. In effect, the matrix is now blocked at two levels. It is divided into blocks of size $n_b \times n_b$ elements, which are distributed cyclically over the process grid, but we apply the left-looking algorithm to ``superblocks'' of width $t Q n_b$ columns, where the process grid is assumed to be of size $P \times Q$. If $t$ is chosen large enough we have a pure right-looking algorithm, and if $t$ and $Q$ are both 1 we essentially recover the pure left-looking algorithm. Within a superblock we use a right-looking LU factorization algorithm (P_GETRF) to get good parallelism, but at the superblock level we employ a left-looking algorithm to control I/O costs. The parameter $t$ can thus be used to trade off parallelism against I/O cost.
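As a concrete illustration of the two-level blocking, the following in-core sketch performs a left-looking sweep over superblocks of width $t Q n_b$, with a right-looking blocked factorization inside each superblock. It is a simplified NumPy model, not the paper's distributed implementation: the names hybrid_lu and right_looking_lu are illustrative, and pivoting is omitted for brevity.

\begin{verbatim}
import numpy as np

def right_looking_lu(P, nb):
    # Right-looking blocked LU (no pivoting) of the panel P, in place.
    m, w = P.shape
    for i in range(0, w, nb):
        ib = min(i + nb, w)
        # Unblocked factorization of the current nb-wide column block.
        for c in range(i, ib):
            P[c+1:, c] /= P[c, c]
            P[c+1:, c+1:ib] -= np.outer(P[c+1:, c], P[c, c+1:ib])
        # Right-looking update of the remaining columns of the panel.
        if ib < w:
            L_ii = np.tril(P[i:ib, i:ib], -1) + np.eye(ib - i)
            P[i:ib, ib:] = np.linalg.solve(L_ii, P[i:ib, ib:])
            P[ib:, ib:] -= P[ib:, i:ib] @ P[i:ib, ib:]

def hybrid_lu(A, nb, Q, t):
    # Left-looking over superblocks of width w = t*Q*nb; right-looking
    # within each superblock.  Overwrites A with its L and U factors.
    n = A.shape[0]
    w = t * Q * nb
    for j in range(0, n, w):
        jw = min(j + w, n)
        # Apply the updates from every previously factored superblock k.
        for k in range(0, j, w):
            kw = k + w
            L_kk = np.tril(A[k:kw, k:kw], -1) + np.eye(w)
            A[k:kw, j:jw] = np.linalg.solve(L_kk, A[k:kw, j:jw])
            A[kw:, j:jw] -= A[kw:, k:kw] @ A[k:kw, j:jw]
        # Factor the active superblock.
        right_looking_lu(A[j:, j:jw], nb)
\end{verbatim}

On a small diagonally dominant test matrix (so that skipping pivoting is safe), hybrid_lu(A, nb=2, Q=2, t=2) reproduces $A = LU$ with $L$ the unit lower triangle and $U$ the upper triangle of the overwritten array; setting $t Q n_b = n$ degenerates to a single right-looking factorization, and $n_b = Q = t = 1$ to the pointwise left-looking algorithm, matching the limiting cases described above.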
In Figure 7 we show an example for a $P \times Q$ process grid. For clarity we consider here a matrix consisting of only four column superblocks, though in a ``real'' application we would expect the number to be much larger. In Figure 7 the first two superblocks have been factored, while the third and fourth superblocks have not yet been changed. We now consider the next stage of the algorithm, in which the third superblock, for which the data distribution is shown explicitly, is factored. Note that each of the small numbered squares is actually an $n_b \times n_b$ block, with the numbering indicating the position in the process grid to which it is assigned. At the end of this stage of the algorithm the first three superblocks will have been factored, and the fourth will still be unchanged. In the following we shall refer to the superblock being factored as the active superblock.
Figure: Schematic view of the parallel hybrid out-of-core algorithm.
The parallel implementation closely follows the sequential implementation presented in Section 3. Block columns are read and written using the routines P_READ and P_WRITE. The file pointer is positioned with P_GSEEK. These routines are part of the BLAPIO library introduced in Section 5. The triangular solve and matrix multiplication are done using PBLAS routines. Pivoting is performed by the ScaLAPACK auxiliary routine P_LAPIV, while the factorization is done by the ScaLAPACK routine P_GETRF. Since all these routines reference matrices as global data structures, parallelization of the sequential algorithm is almost trivial. Pseudocode for the parallel version is given in Figure 8.
Figure: Pseudocode for the parallel, out-of-core, left-looking LU factorization
algorithm that leaves the matrix in unpivoted form.
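Since Figure 8 itself is not reproduced here, the following runnable sketch mirrors its loop structure under simplifying assumptions: a Python dict stands in for the matrix file, plain NumPy operations stand in for the BLAPIO routines (P_READ, P_WRITE), the PBLAS triangular solve and matrix multiply, and P_GETRF, the process grid is ignored, and pivoting is omitted so that, as in Figure 8, the matrix is left in unpivoted form. The names lu_nopiv and factor_out_of_core are illustrative, not from the library.

\begin{verbatim}
import numpy as np

def lu_nopiv(P):
    # Unpivoted, unblocked LU of the panel P; stands in for P_GETRF.
    m, w = P.shape
    for c in range(min(m, w)):
        P[c+1:, c] /= P[c, c]
        P[c+1:, c+1:] -= np.outer(P[c+1:, c], P[c, c+1:])

def factor_out_of_core(file, n, w):
    # file[j] holds superblock j (all n rows, w columns) "on disk".
    # At most two superblocks are in core at any time.
    nsb = (n + w - 1) // w
    for j in range(nsb):
        X = file[j]                      # stand-in for P_READ
        for k in range(j):               # previously factored superblocks
            F = file[k]                  # stand-in for P_READ
            r0, r1 = k * w, (k + 1) * w
            # Triangular solve (a PBLAS call in the paper).
            L_kk = np.tril(F[r0:r1, :], -1) + np.eye(w)
            X[r0:r1, :] = np.linalg.solve(L_kk, X[r0:r1, :])
            # Matrix multiply (a PBLAS call in the paper).
            X[r1:, :] -= F[r1:, :] @ X[r0:r1, :]
        lu_nopiv(X[j*w:, :])             # stand-in for P_GETRF
        file[j] = X                      # stand-in for P_WRITE

# Example: 12-by-12 diagonally dominant matrix, superblock width 4.
n, w = 12, 4
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n)) + n * np.eye(n)
file = {j: A[:, j*w:(j+1)*w].copy() for j in range(n // w)}
factor_out_of_core(file, n, w)
LU = np.hstack([file[j] for j in range(n // w)])
assert np.allclose((np.tril(LU, -1) + np.eye(n)) @ np.triu(LU), A)
\end{verbatim}

The inner loop reads each previously factored superblock exactly once per active superblock, which is the source of the left-looking variant's lower I/O cost; only the active superblock is both read and written.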