next up previous contents
Next: Future Directions Up: Performance Previous: Choice of Block Size

Choice of Grid Size


The best grid shape is determined by the algorithm implemented in the library and the underlying physical network. A one link physical network will favor tex2html_wrap_inline1527 or tex2html_wrap_inline1529 , where tex2html_wrap_inline1531 is the process grid. This affects the scalability of the algorithm, but reduces the overhead due to message collisions. It is possible to predict the best grid shape given the number of processes available. The current algorithms for the factorization or reduction routines can be split into two categories.

If at every step of the algorithm a block of columns and/or rows needs to be broadcast, as in the LU or QR factorizations, it is possible to pipeline this communication phase and overlap it with some computation. The direction of the pipeline determines the shape of the grid. For example, the LU, QR and QL factorizations perform better for ``flat'' process grids ( tex2html_wrap_inline1543 ). These factorizations share a common bottleneck of performing a reduction operation along each column (for pivoting in LU, and for computing a norm in QR and QL). The first implication of this observation is that large latency message passing perform better on a ``flat'' grid than on a square grid. Secondly, after this reduction has been performed, it is important to update the next block of columns as fast as possible. This is done by broadcasting the current block of columns using a ring topology, i.e, feeding the ongoing communication pipe. Similarly, the performance of the LQ and RQ factorizations take advantage of ``tall'' grids ( tex2html_wrap_inline1555 ) for the same reasons, but transposed.

The theoretical efficiency of the LU factorization can be estimated by (3):


For large n, the last term in the denominator dominates, and it is minimized by choosing a tex2html_wrap_inline1561 slightly smaller than tex2html_wrap_inline1563 . tex2html_wrap_inline1565 works well on Intel machines. For smaller n, the middle term dominates, and it becomes more important to choose a small tex2html_wrap_inline1561 . Suppose that we keep the ratio tex2html_wrap_inline1571 constant as P increases, thus we have tex2html_wrap_inline1575 and tex2html_wrap_inline1577 , where u and v are constant [10]. Moreover, let us ignore the tex2html_wrap_inline1583 factor for a moment. In this case, tex2html_wrap_inline1585 and tex2html_wrap_inline1587 are proportional to tex2html_wrap_inline1589 and tex2html_wrap_inline1523 must grow with P to maintain efficiency. For sufficiently large tex2html_wrap_inline1561 , the tex2html_wrap_inline1583 factor cannot be ignored, and the performance will slowly degrade with the number of processors P. This phenomenon is observed in practice as shown in Figure 10 for the efficiency of the LU factorization on the Intel Paragon.

The second group of routines are two-sided algorithms as opposed to one-sided algorithms. In these cases, it is not usually possible to maintain a communication pipeline, and thus square or near square grids are more optimal. This is the case for the algorithms used for implementing the Cholesky factorization, the matrix inversion and the reduction to bidiagonal form (BRD), Hessenberg form (HRD), tridiagonal form (TRD) and the nonsymmetric QR eigenvalue algorithm (HQR). For example, the update phase of the Cholesky factorization of a lower symmetric matrix physically transposes the current block of columns of the lower triangular factor.

Assume now that at most P processes are available. A natural question arising is: could we decide what process grid tex2html_wrap_inline1607 should be used? Similarly, depending on P, it is not always possible to factor tex2html_wrap_inline1611 to create the appropriate grid. For example, if P is prime, the only possible grids are tex2html_wrap_inline1615 and tex2html_wrap_inline1617 . If such grids are particularly bad for performance, it may be beneficial to let some processors remain idle, so the remainder can be formed into a ``squarer'' grid [24]. These problems can be analyzed by a complicated function of the machine and problem parameters. It is possible to develop models depending on the machine and problem parameters which accurately estimate the impact of modifying the shape of the grid on the total execution time, as well as predicting the necessary amount of extra memory required for each routine.


Figure: Intel Paragon LU and QR Performance


Figure: IBM SP-2 and Intel Paragon Cholesky Factorization

Figure: Execution speed in Mflop/s for the band algorithm on the IBM SP-2.

Figure: Performance of the Parallel QR Algorithm on the Intel MP Paragon for a 2x2 up to a 9x9 grid of processes.


Figure: Performance with different blocksize NB (MB=NB) and prediction of percentage of time to performance operations.


Figure: IBM SP-2 and Intel Paragon Performance

next up previous contents
Next: Future Directions Up: Performance Previous: Choice of Block Size

Susan Blackford
Thu Jul 25 15:38:00 EDT 1996