In the factorization or reduction routines, the work distribution becomes uneven as the computation progresses. A larger block size results in greater load imbalance, but reduces the frequency of communication between processes. There is, therefore, a tradeoff between load imbalance and communication startup cost, which can be controlled by varying the block size. This is illustrated in Figure 9 with a model of the computational time.
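This tradeoff can be made concrete with a toy cost model (an illustration in the spirit of the figure, not its actual data). The sketch below, in Python, sums a compute term, a per-step communication startup term that falls as the block size grows, and an idle term that grows with it; all machine constants (`flop_rate`, `latency`, `imbalance`) are hypothetical.

```python
# Toy cost model: total time for a blocked LU-style factorization of an
# n x n matrix on p processes, as a function of block size nb.
# All machine parameters are hypothetical, chosen only for illustration.
def model_time(n, nb, p, flop_rate=1e9, latency=1e-5, imbalance=0.5):
    flops = (2.0 / 3.0) * n**3            # LU flop count
    compute = flops / (p * flop_rate)     # perfectly parallel compute time
    n_steps = n // nb                     # one panel broadcast per block column
    startup = n_steps * latency           # startup cost: falls as nb grows
    # idle time: roughly one nb x n panel's worth of work wasted per step,
    # so it grows with nb
    idle = imbalance * n_steps * (2.0 * nb * n * nb) / (p * flop_rate)
    return compute + startup + idle

# Sweeping nb exposes the tradeoff: startup dominates for tiny nb,
# imbalance for huge nb, with a minimum in between.
times = {nb: model_time(4096, nb, p=16) for nb in (1, 8, 32, 64, 256, 1024)}
best = min(times, key=times.get)
```

With these (made-up) constants the minimum falls at an intermediate block size, neither 1 nor the largest candidate, which is the qualitative behavior the text describes.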
Most of the computation of the ScaLAPACK routines is performed in a blocked fashion using Level 3 BLAS, as is done in LAPACK. The computational blocking factor is chosen to be the same as the distribution block size. Therefore, smaller distribution block sizes increase the loop and index computation overhead. However, because the computation cost ultimately dominates, the influence of the block size on the overall communication startup cost and on the loop and index computation overhead decreases very rapidly with the problem size for a given grid of processes. Consequently, the performance of the ScaLAPACK library is not very sensitive to the block size, as long as the extreme cases are avoided: a very small block size leads to Level 2 BLAS operations and poorer performance, while a very large block size leads to computational imbalance.
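The block-cyclic layout underlying this discussion can be sketched in a few lines. The following Python helper (an illustration, not a ScaLAPACK routine) maps a global row or column index to the process that owns it in a one-dimensional block-cyclic distribution with block size `nb` over `p` processes.

```python
def owner(global_index, nb, p):
    """Process (0-based) owning global index `global_index` under a 1-D
    block-cyclic distribution with block size nb over p processes:
    consecutive blocks of nb indices are dealt out round-robin."""
    return (global_index // nb) % p
```

For example, with `nb = 2` and `p = 3`, indices 0-1 land on process 0, indices 2-3 on process 1, indices 4-5 on process 2, and index 6 wraps back to process 0.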
One exception is PDLAHQR (the nonsymmetric QR eigenvalue algorithm). Here, the computational blocking factor is bounded above by the distribution block size. Because the distribution block size also determines the border communications, the algorithm favors block sizes larger than one would typically choose for Level 3 BLAS based algorithms. A wise choice is therefore critical to performance, and redistribution may ultimately be necessary when the block size is chosen poorly.
The chosen block size also affects the amount of workspace needed on
each process. This workspace is typically large enough to contain a
block of columns or a block of rows of the matrix operands. Therefore,
the larger the block size, the greater the necessary workspace, i.e.,
the smaller the largest problem that can be solved on a given grid of
processes.
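The workspace growth is easy to see from the local dimensions of the distributed matrix. The sketch below mirrors the role of ScaLAPACK's NUMROC in Python (an illustration written from the block-cyclic layout described above, not the library routine itself) and uses it to size a panel of `nb` columns.

```python
def numroc_like(n, nb, iproc, nprocs):
    """Number of entries of an n-long dimension stored on process iproc
    (0-based) under block-cyclic distribution with block size nb,
    assuming the first block starts on process 0."""
    nblocks = n // nb                     # number of full blocks
    local = (nblocks // nprocs) * nb      # full rounds of blocks
    extra = nblocks % nprocs              # leftover full blocks
    if iproc < extra:
        local += nb                       # one extra full block
    elif iproc == extra:
        local += n % nb                   # the final partial block
    return local

# Workspace for a panel of nb columns on process row 0 of a 4-process
# column: it grows linearly with nb, shrinking the largest problem that
# fits in a fixed amount of memory.
panel_elems = {nb: numroc_like(4096, nb, 0, 4) * nb for nb in (8, 64, 256)}
```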
For Level 3 BLAS blocked algorithms, the smallest possible block
operands are of size nb x nb, where nb is the block size. Therefore,
it is good practice to choose the block size to be the problem size
for which the BLAS matrix-multiply routine GEMM achieves 90% of its
achievable peak.
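One way to apply this rule of thumb in practice is to benchmark the local matrix multiply directly. The Python sketch below uses NumPy's matrix multiply as a stand-in for the BLAS GEMM, estimates the peak rate from a large multiply, and picks the smallest candidate size reaching 90% of it; the candidate list and sizes are hypothetical.

```python
import time
import numpy as np

def gemm_rate(n, reps=3):
    """Measured flop rate (flop/s) for an n x n matrix multiply,
    taking the best of `reps` timed runs."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    elapsed = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        a @ b
        elapsed = min(elapsed, time.perf_counter() - t0)
    return 2.0 * n**3 / elapsed          # ~2n^3 flops per multiply

# Estimate the attainable peak from a large multiply, then take the
# smallest candidate block size whose rate reaches 90% of that peak
# (falling back to the largest candidate if none does).
peak = gemm_rate(1024)
nb = next((n for n in (16, 32, 64, 128, 256)
           if gemm_rate(n) >= 0.9 * peak), 256)
```

The chosen value is machine-dependent, which is precisely why the text recommends measuring rather than fixing a universal block size.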
Determining optimal, or near-optimal, block sizes for different
environments is a difficult task because it depends on many factors,
including the machine architecture, the speeds of the different BLAS
levels, the latency and bandwidth of message passing, the number of
processes available, the dimensions of the process grid, the
dimensions of the problem, and so on.
However, there is enough evidence and expertise to determine optimal,
or near-optimal, block sizes automatically and accurately.
Furthermore, for small problem sizes it is also possible to determine
whether redistributing the data items is an acceptable cost in terms
of both performance and memory usage. In the future, we hope to
compute the optimal block size via an enquiry routine.