We have discussed the mapping of the logical hierarchical memory to
physical memory. In addition, we have pointed out the importance of
maintaining long inner loops to get good sequential performance for each
process, and the desirability of sending a few large messages rather than
many smaller ones. We next consider load balance issues. Assuming that
equal numbers of processes have been assigned to each processor, load imbalance
arises in two phases of the parallel LU factorization algorithm; namely, in
factoring each column block, which involves only P processes, and in
solving the lower triangular system to evaluate each row block of U, which
involves only Q processes. If the time for data movement is negligible, the
aspect ratio of the template that minimizes load imbalance in step k of
the algorithm is,
where is the matrix size in blocks, and r the block size.
Thus, the optimal aspect ratio of the template should be the same as the aspect
ratio of the matrix, i.e., in blocks, or M/N in elements.
If the effect of communication time is included then we must take into
account the relative times taken to locate and broadcast the pivot
information, and the time to broadcast the lower triangular matrix, ,
along a row of the template. For both tasks the communication time increases
with the number of processes involved, and since the communication time
associated with the pivoting is greater than that associated with the triangular
solve, we would expect the optimum aspect ratio of the template to be
less than M/N. In fact, for our runs on the Intel Delta system we found an
aspect ratio, P/Q, of between 1/4 and 1/8 to be optimal for most problems with
square matrices, and that performance depends rather weakly on the aspect
ratio, particularly for large grain sizes. Some typical results are shown in
Figure 19 for 256 processors, which show a variation of less
than 20% in performance as P/Q varies between 1/16 and 1 for the largest
problem.
Figure 19: Performance of LU factorization on the Intel Delta as a function
of square matrix size for different processor templates containing
approximately 256 processors. The best performance is for an aspect ratio
of 1/4, though the dependence on aspect ratio is rather weak.
The block size, r, also affects load balance. Here the tradeoff is between the load imbalance that arises as rows and columns of the matrix are eliminated as the algorithm progresses, and communication startup costs. The block cyclic decomposition seeks to maintain good load balance by cyclically assigning blocks to processes, and the load balance is best if the blocks are small. On the other hand, cumulative communication startup costs are less if the block size is large since, in this case, fewer messages must be sent (although the total volume of data sent is independent of the block size). Thus, there is a block size that optimally balances the load imbalance and communication startup costs.