Consider a fixed logical process grid of $R = P \times Q$ processes, with $P$ process rows and $Q$ process columns. For the sake of argument, assume partial row pivoting during LU factorization to retain numerical stability. For the LU factorization, it is well known that a scatter distribution is ``good'' for the matrix rows, and optimal if no off-diagonal pivots are chosen. The optimal column distribution is also scatter, because columns are eliminated in order under partial row pivoting. Compatibly, a scatter distribution of the matrix rows is also ``good'' for the triangular solves. For the triangular solves, however, the best column distribution is linear, because it implies less intercolumn communication, as we detail below. In short, the optimal configurations conflict, and because explicit redistribution is expensive, a static compromise must be chosen. We address this need to compromise through the one-parameter distribution function described in the previous section, which offers a variable degree of scattering via the parameter $S$. To first order, changing $S$ does not affect the cost of computing the Jacobian (assuming columnwise finite-difference computation), because each process column works independently.
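To make the compromise concrete, the following sketch assumes the one-parameter distribution takes the common block-scatter (block-cyclic) form with block size $S$; the function name `owner_column` and the specific form are illustrative assumptions, not the paper's notation.

```python
def owner_column(j, N, Q, S):
    """Process column owning global column j under an assumed block-scatter
    distribution with block size S: consecutive blocks of S columns are
    dealt out to the Q process columns round-robin."""
    return (j // S) % Q

N, Q = 16, 4
# S = 1 recovers the pure scatter distribution: 0,1,2,3,0,1,2,3,...
print([owner_column(j, N, Q, 1) for j in range(N)])
# S = N // Q recovers the linear (block) distribution: 0,0,0,0,1,1,1,1,...
print([owner_column(j, N, Q, N // Q) for j in range(N)])
```

Varying $S$ between these two extremes gives the variable degree of scattering discussed above.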
It is important to note that the triangular solves derive no benefit from $Q > 1$: the standard column-oriented solve keeps only one process column active at any given time. For any column distribution, the updated right-hand-side vectors are retransmitted $W$ times (process column to process column) during the triangular solve, once each time the active process column changes. There are at least $Q - 1$ such transmissions (linear distribution), and at most $N - 1$ transmissions (scatter distribution). The complexity of this retransmission is $O(WN)$, representing quadratic work in $N$ for $W \propto N$.
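The count $W$ is simply the number of ownership changes between consecutive columns, which the following sketch computes for the two extreme distributions (a small $N = 16$, $Q = 4$ example; the helper name is illustrative):

```python
def retransmissions(owners):
    """Number of times the active process column changes during a
    column-oriented triangular solve; each change costs one
    retransmission of the updated right-hand-side vectors."""
    return sum(1 for a, b in zip(owners, owners[1:]) if a != b)

N, Q = 16, 4
linear  = [(j * Q) // N for j in range(N)]  # linear distribution
scatter = [j % Q for j in range(N)]         # scatter distribution
print(retransmissions(linear))   # Q - 1 = 3
print(retransmissions(scatter))  # N - 1 = 15
```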
Calculation complexity for a sparse triangular solve is proportional to the number of elements in the triangular matrix, with a low leading coefficient. Often, there are $O(N^{1+x})$ elements, with $x < 1$, in the triangular matrices, including fill. The solve is then $O(N^{1+x})$ work, which is less than quadratic in $N$. Consequently, for large $W$, the retransmission step is likely of greater cost than the original calculation. This retransmission effect constrains the amount of scattering and the size of $Q$ if there is to be any chance of concurrent speedup in the triangular solves.
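A rough arithmetic check makes the imbalance vivid; the exponent $x = 0.5$ and the unit costs are purely illustrative assumptions:

```python
# Illustrative comparison for a fully scattered column distribution
# (W = N - 1), assuming x = 0.5 and unit cost per element/word:
N = 10_000
solve_work = N ** 1.5            # O(N^{1+x}) solve operations, ~1e6
retransmit_work = (N - 1) * N    # O(WN) words retransmitted, ~1e8
print(retransmit_work / solve_work)  # retransmission dominates, ~100x
```

Even with generous constants on the compute side, the quadratic retransmission term swamps the subquadratic solve for large $N$.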
Using the one-parameter distribution with parameter $S$ implies that $W \le \lceil N/S \rceil - 1$, so that the retransmission complexity is $O(N^2/S)$. Consequently, we can bound the amount of retransmission work by making $S$ sufficiently large. Clearly, $S = \lceil N/Q \rceil$ is a hard upper bound, because we reach the linear-distribution limit at that value of the parameter. We suggest picking $S = 10$ as a first guess, and $S = 100$, more optimistically. The former choice basically reduces the retransmission effort of the pure scatter distribution by an order of magnitude. Both examples in the following section illustrate the effectiveness of choosing $S$ by these heuristics.
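Under the block-scatter form assumed earlier, the bound on $W$ can be checked directly by counting ownership changes for increasing $S$ (the function name and the particular $N$, $Q$ values are illustrative):

```python
def est_retransmissions(N, Q, S):
    """W under an assumed block-scatter distribution with block size S:
    the active process column changes at each block boundary, so
    W <= ceil(N/S) - 1, reaching Q - 1 at the linear limit S = ceil(N/Q)."""
    owners = [(j // S) % Q for j in range(N)]
    return sum(1 for a, b in zip(owners, owners[1:]) if a != b)

N, Q = 10_000, 8
for S in (1, 10, 100, (N + Q - 1) // Q):
    print(S, est_retransmissions(N, Q, S))
```

Each factor-of-ten increase in $S$ cuts the retransmission count by a factor of ten, until the hard linear-distribution floor of $Q - 1$ is reached.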
The two-parameter distribution can be used on the matrix rows to trade off load balance in the factorizations and triangular solves against the amount of (communication) effort needed to compute the Jacobian. In particular, a greater degree of scattering can dramatically increase the time required for a Jacobian computation (depending heavily on the underlying equation structure and problem), but significantly reduce load imbalance during the linear algebra steps. The communication overhead caused by multiple process rows suggests shifting toward smaller $P$ and larger $Q$ (a squatter grid), in which case greater concurrency is attained in the Jacobian computation, and the additional communication previously induced is somewhat mitigated. The one-parameter distribution used on the matrix columns then proves effective in controlling the cost of the triangular solves, by choosing the minimal allowable amount of column scattering.
Let us make explicit the performance objectives we consider when tuning $S$ and, more generally, when tuning the grid shape $(P, Q)$. In the modified Newton iteration, for instance, a Jacobian factorization is reused until convergence slows unacceptably. An ``LU Factorization + Backsolve'' step is followed by $B$ ``Forward + Backsolves,'' with $B \ge 1$ typically (and varying dynamically throughout the calculation). Assuming an averaged $B$, say $B = 2$ (perhaps as large as five [Brenan:89a]), our first-level performance goal is a heuristic minimization of
\[ T_{\mathrm{LU+back}} + B \, T_{\mathrm{fwd+back}} \]
over $S$ for fixed $P$, $Q$. Taking $B > 1$ weights the reduction of triangular-solve costs versus the LU factorization more heavily than we might at first have assumed, placing a greater potential gain on the one-parameter distribution for higher overall performance. More generally, we want heuristically to optimize
\[ T_{\mathrm{Jacobian}} + T_{\mathrm{LU+back}} + B \, T_{\mathrm{fwd+back}} \]
over $S$, $P$, $Q$, $R$. The possibility of fine-tuning the row and column distributions is then important, as is the use of non-power-of-two grid shapes.
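The first-level tuning, minimization over $S$ for a fixed grid, can be sketched with a toy cost model; every constant below (the imbalance penalty, the communication coefficient, $B = 2$) is an illustrative assumption, not a measured value:

```python
# Toy first-level objective: factorization imbalance grows with S
# (less scattering), while triangular-solve retransmission shrinks
# as ~N^2/S. All coefficients are illustrative assumptions.
def objective(S, N, Q, B=2, c_fact=1.0, c_comm=0.1):
    t_fact = c_fact * N**2 * (1 + S * Q / N)  # assumed imbalance penalty
    t_solve = N**1.5 + c_comm * N**2 / S      # solve flops + retransmission
    return t_fact + B * t_solve

N, Q = 10_000, 8
# Search S from pure scatter (S = 1) to the linear limit (S = N // Q).
best = min(range(1, N // Q + 1), key=lambda S: objective(S, N, Q))
print(best)
```

The optimizer lands at an intermediate $S$, well away from both the scatter and linear extremes, which is precisely the compromise the one-parameter distribution is designed to express.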