Now, we turn to a timing example of an order 2500 sparse, random matrix. The matrix has a random diagonal, plus 2 percent random fill of the off-diagonals; entries have a dynamic range of 0-10,000. Normally, data is averaged over random matrices for each grid shape (as noted), and over four repetitive runs for each random matrix. Partial row pivoting was used exclusively. Table 9.5 compiles timings for various grid shapes of row-scatter/column-scatter, and row-scatter/column-(S=10) distributions, for as few as nine nodes and as many as 128. Memory limitations set the lower bound on the number of nodes.
This example demonstrates that speedups are possible for this reasonably small sparse example with this general-purpose solver, and that the one-parameter distribution is critical to achieving overall better performance even for this random, essentially unstructured example. Without the one-parameter distribution, triangular-solve performance is poor, except in grid configurations where the factorization is itself degraded (e.g., ). Furthermore, the choice of S=10 is universally reasonable for the Q > 1 grid shapes illustrated here, so the distribution proves easy to tune for this type of matrix. We are able to maintain an almost constant speed for the triangular solves while increasing speed for both the A-mode and B-mode factorizations. We presume, based on experience, that triangular-solve times are comparable to the sequential solution times-further study is needed in this area to see if and how performance can be improved. The consistent A-mode to B-mode ratio of approximately two is attributed primarily to reduced communication costs in B-mode, realized through the elimination of essentially all combine operations in B-mode.
Table 9.5: Order 2500 Matrix Performance. Performance is a function of grid
shape and size, and S-parameter. ``Best'' performance is for the
grid with S=10.
While triangular-solve performance exemplifies sequentialism in the algorithm, it should be noted that we do achieve significant overall performance improvements between 6 nodes and 96 ( grid) nodes, and that the repeatedly used B-mode factorization remains dominant compared to the triangular solves even for 128 nodes. Consequently, efforts aimed at increasing performance of the B-mode factorization (at the expense of additional A-mode work) are interesting to consider. For the factorizations, we also expect that we are achieving nontrivial speedups relative to one node, but we are unable to quantify this at present because of the memory limitations alluded to above.