next up previous
Next: Performance Results Up: Key Concepts For Parallel Previous: Implementation Issues

A Parallel Algorithm

  Although in section 2 we saw that the left-looking LU factorization routine has a lower I/O cost that the right-looking variant, the left-looking algorithm has less inherent parallelism since it acts only on single blocks. We therefore propose a hybrid parallel algorithm in which a single block actually spans several widths of the process grid, say tex2html_wrap_inline1665 . In effect, the matrix is now blocked at two levels. It is divided into blocks of size tex2html_wrap_inline1485 elements, which are distributed cyclicly over the process grid, but we apply the left-looking algorithm to ``superblocks'' of width tex2html_wrap_inline1669 columns where the process grid is assumed to be of size tex2html_wrap_inline1611 . If tex2html_wrap_inline1665 is chosen large enough we have a pure right-looking algorithm, and if tex2html_wrap_inline1665 and Q are both 1 we essentially recover the pure left-looking algorithm. Within a superblock we use a right-looking LU factorization algorithm (P_GETRF) to get good parallelism, but at the superblock level we employ a left-looking algorithm to control I/O costs. The parameter tex2html_wrap_inline1665 can be used to trade off parallelism and I/O cost.

In Figure 7 we show an example for a tex2html_wrap_inline1215 process grid, and tex2html_wrap_inline1221 . For clarity we consider here a matrix consisting of only four column superblocks, though in a ``real'' application we would expect the number to be much larger. In Figure 7 the first two superblocks have been factored, while the third and fourth superblocks have not yet been changed. We now consider the next stage of the algorithm in which the third superblock, for which the data distribution is shown explicitly, is factored. Note that each of the small numbered squares is actually an tex2html_wrap_inline1217 block, with the numbering indicating the position in the process grid to which it is assigned. At the end of this stage of the algorithm the first three superblocks will have been factored, and the fourth will still be unchanged. In the following we shall refer to the superblock being factored as the active superblock.

   figure505
Figure: Schematic view of the parallel hybrid out-of-core algorithm for the case tex2html_wrap_inline1219 and tex2html_wrap_inline1221 .

The parallel implementation closely follows the sequential implementation presented in Section 3. Block columns are read and written using the routines P_READ and P_WRITE. The file pointer is positioned with P_GSEEK. These routines are part of the BLAPIO library introduced in Section 5. The triangular solve and matrix multiplication are done using PBLAS routines. Pivoting is performed by the ScaLAPACK auxiliary routine P_LAPIV, while the factorization is done by the ScaLAPACK routine P_GETRF. Since all these routines reference matrices as global data structures, parallelization of the sequential algorithm is almost trivial. Pseudocode for the parallel version is given in Figure 8.

   figure513
Figure: Pseudocode for parallel, out-of-core, left-looking LU factorization algorithm that leaves matrix in unpivoted form.


next up previous
Next: Performance Results Up: Key Concepts For Parallel Previous: Implementation Issues

Jack Dongarra
Thu Apr 18 21:51:24 EDT 1996