As described for his chemistry application in Section 8.2, Hipes has studied the use of the Gauss-Jordan (GJ) algorithm as a means of solving systems of linear equations [Hipes:89b]. On a sequential computer, LU factorization followed by forward reduction and back substitution is preferable over GJ for solving linear systems since the former has a lower operation count. Another apparent drawback of GJ is that it has generally been believed that the right hand sides must be available a priori, which in applications requiring the solution for multiple right-hand sides is a handicap. Hipes' work has shown that this is not the case, and that a well-written, parallel GJ solver is significantly more efficient than using LU factorization with triangular solvers on hypercubes.

As noted by Gerasoulis, et al. [Gerasoulis:88a], GJ does not
require the solution of triangular systems. The solution of such systems by
LU factorization features an outer loop of fixed length and two inner loops
of decreasing length, whereas GJ has two outer fixed-length loops and only
one inner loop of decreasing length. GJ is, therefore, intrinsically more
parallel than the LU solver, and its better load balance compensates for its
higher operation count. Hipes has pointed out that the multipliers generated
in the GJ algorithm can be saved where zeros are produced in the coefficient
matrix. The entries in the coefficient matrix are, therefore, overwritten by
the GJ multipliers, and we shall call this the GJ factorization (although
we are not actually expressing the original matrix **A** as the product of two
matrices). It is now apparent that the right-hand side matrix does
*not* have to be known in advance, since a solution can be obtained
using the previously computed multipliers. Another factor, noted by Hipes,
favoring the use of the GJ solver on a multiprocessor, is the larger grain
size maintained throughout the GJ factorization and solution phases, and the
lower communication cost in the GJ solution phase.

Hipes has implemented his GJ solver on the Caltech/JPL Mark III and nCUBE-1 hypercubes, and compared the performance with the usual LU solver [Hipes:89d]. In the GJ factorization, a scattered column decomposition is used, similar to that shown in Figure 8.1(d). This ensures good load balance as columns become eliminated in the course of the algorithm. In the LU factorization, both rows and columns are eliminated so a scattered block decomposition is used. On both machines, it was found that the GJ approach is faster for sufficiently many right-hand sides.

Wed Mar 1 10:19:35 EST 1995