The best grid shape is determined by the algorithm implemented in the library and the underlying physical network. A one link physical network will favor or . This affects the scalabilty of the algorithm, but reduces the overhead due to message collisions. It is possible to predict the best grid shape given the number of processes available. The current algorithms for the factorization or reduction routines can be split into two categories.
If at every step of the algorithm a block of columns and/or rows needs to be broadcast, as in the LU or QR factorizations, it is possible to pipeline this communication phase and overlap it with some computation. The direction of the pipeline determines the shape of the grid. For example, the LU, QR and QL factorizations perform better for ``flat'' process grids (). These factorizations share a common bottleneck of performing a reduction operation along each column (for pivoting in LU, and for computing a norm in QR and QL). The first implication of this observation is that large latency message passing perform better on a ``flat'' grid than on a square grid. Secondly, after this reduction has been performed, it is important to update the next block of columns as fast as possible. This is done by broadcasting the current block of columns using a ring topology, i.e, feeding the ongoing communication pipe. Similarly, the performance of the LQ and RQ factorizations take advantage of ``tall'' grids () for the same reasons, but transposed.
The theoretical efficiency of the LU factorization can be estimated by (see (1), (2)):
For large , the last term in the denominator dominates, and it is minimized by choosing a slightly smaller than . works well on Intel machines. For smaller , the middle term dominates, and it becomes more important to choose a small . Suppose that we keep the ratio constant as increases, thus we have and , where and are constant . Moreover, let ignore the factor for a moment. In this case, and are proportional to and must grow with to maintain efficiency. For sufficient large , the factor cannot be ignored, and the performance will slowly degrade with the number of processors . This phenomenon is observed in practice as shown in the plot above showing the efficiency of the LU factorization on the Intel Paragon.
Click here to view the left figure
Click here to view the right figure
The second group of routines physically transpose a block of columns and/or rows at every step of the algorithm. In these cases, it is not usually possible to maintain a communication pipeline, and thus square or near square grids are more optimal. This is the case for the algorithms used for implementing the Cholesky factorization, the matrix inversion and the reduction to bidiagonal form (BRD), Hessenberg form (HRD) and tridiagonal form (TRD). For example, the update phase of the Cholesky factorization of a lower symmetric matrix physically transposes the current block of columns of the lower triangular factor.
Assume now that at most processes are available. A natural question arising is: could we decide what process grid should be used? Similarly, depending on , it is not always possible to factor to create the appropriate grid. For example, if is prime, the only possible grids are and . If such grids are particularly bad for performance, it may be beneficial to let some processors remain idle, so the remainder can be formed into a ``squarer'' grid . These problems can be analyzed by a complicated function of the machine and problem parameters. It is possible to develop models depending on the machine and problem parameters which accurately estimate the impact of modifying the shape of the grid on the total execution time, as well as predicting the necessary amount of extra memory required for each routine.