In this section, we consider the data layout of dense matrices on distributed-memory machines, with the goal of making dense matrix computations as efficient as possible. We shall discuss a sequence of data layouts, starting with the simplest, most obvious, and least efficient one and working up to the more complicated but efficient one that ScaLAPACK ultimately uses. Even though our justification is based on Gaussian elimination, analysis of many other algorithms has led to the same set of layouts. As a result, these layouts have been standardized as part of the High Performance Fortran standard [91], with corresponding data declarations as part of that language.
The two main issues in choosing a data layout for dense matrix computations are (1) load balance, or splitting the work reasonably evenly among the processes throughout the algorithm, and (2) the ability to use the Level 3 BLAS during computations on a single process, in order to exploit the local memory hierarchy.
It will help to remember the pictorial representation of Gaussian elimination below. As the algorithm proceeds, it works on successively smaller square southeast corners of the matrix. In addition, there is extra Level 2 BLAS work to factorize the current panel of columns, the submatrix A(ib:n, ib:end).
Figure 4.2: Gaussian elimination using Level 3 BLAS
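To make this structure concrete, the following sketch shows a right-looking blocked LU factorization in C. It is an illustration only, not ScaLAPACK code; for simplicity it assumes a fixed problem size, 0-based C indexing, and no pivoting. The panel loop corresponds to the Level 2 BLAS work on A(ib:n, ib:end), and the trailing update to the Level 3 BLAS work on A(end+1:n, end+1:n).

#include <stdio.h>

#define N  8   /* matrix dimension (illustrative choice) */
#define NB 2   /* block size */

/* Right-looking blocked LU without pivoting. The panel loop is the
   Level 2 BLAS-style work; the trailing-matrix update is the
   Level 3 BLAS-style work (TRSM + GEMM in a real code). */
static void blocked_lu(double A[N][N]) {
    for (int ib = 0; ib < N; ib += NB) {
        int end = (ib + NB < N) ? ib + NB : N;
        /* Panel factorization: unblocked LU on columns ib..end-1. */
        for (int j = ib; j < end; j++)
            for (int i = j + 1; i < N; i++) {
                A[i][j] /= A[j][j];                 /* multiplier (entry of L) */
                for (int k = j + 1; k < end; k++)
                    A[i][k] -= A[i][j] * A[j][k];   /* update within the panel */
            }
        /* Triangular solve for the U block row, then rank-NB update of
           the trailing square southeast corner. */
        for (int j = end; j < N; j++) {
            for (int i = ib + 1; i < end; i++)
                for (int k = ib; k < i; k++)
                    A[i][j] -= A[i][k] * A[k][j];   /* forward substitution */
            for (int i = end; i < N; i++)
                for (int k = ib; k < end; k++)
                    A[i][j] -= A[i][k] * A[k][j];   /* GEMM-style update */
        }
    }
}

int main(void) {
    double A[N][N];
    /* Diagonally dominant test matrix, so no pivoting is needed. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? N : 1.0 / (1 + i + j);
    blocked_lu(A);
    printf("A[0][0] after factorization: %g\n", A[0][0]);
    return 0;
}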
For convenience we will number the processes from 0 to P-1 and matrix columns (or rows) from 1 to N. The following two figures show the sequence of data layouts we will consider. In all cases, each submatrix is labeled with the number of the process (from 0 to 3) that contains it. Process 0 owns the shaded submatrices.
Consider the layout illustrated on the left of figure 4.3, the one-dimensional block column distribution. This distribution
Figure 4.3: The one-dimensional block and cyclic column distributions
assigns a block of contiguous columns of a matrix to successive processes. Each process receives only one block of columns of the matrix. Column k is stored on process $\lfloor (k-1)/t_c \rfloor$, where $t_c = \lceil N/P \rceil$ is the maximum number of columns stored per process. In the figure, N=16 and P=4. This layout does not permit good load balancing for the above Gaussian elimination algorithm because as soon as the first $t_c$ columns are complete, process 0 is idle for the rest of the computation. The transpose of this layout, the one-dimensional block row distribution, has a similar drawback for dense computations.
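As a quick illustration, the owner computation just described can be written as follows (block_owner is a hypothetical helper for illustration, not a ScaLAPACK routine):

#include <stdio.h>

/* Owner of column k (1-based) under the one-dimensional block column
   layout: process floor((k-1)/tc), with tc = ceil(N/P). */
static int block_owner(int k, int N, int P) {
    int tc = (N + P - 1) / P;   /* max columns per process */
    return (k - 1) / tc;
}

int main(void) {
    /* The figure's example: N = 16, P = 4. Columns 1-4 land on
       process 0, columns 5-8 on process 1, and so on. */
    for (int k = 1; k <= 16; k++)
        printf("column %2d -> process %d\n", k, block_owner(k, 16, 4));
    return 0;
}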
The second layout, illustrated on the right of figure 4.3, the one-dimensional cyclic column distribution, addresses this problem by assigning column k to process $(k-1) \bmod P$. In the figure, N=16 and P=4. With this layout, each process owns approximately $1/P$ of each square southeast corner of the matrix, so the load balance is good. However, since single columns (rather than blocks) are stored, we cannot use the Level 2 BLAS to factorize A(ib:n, ib:end) and may not be able to use the Level 3 BLAS to update A(end+1:n, end+1:n). The transpose of this layout, the one-dimensional cyclic row distribution, has a similar drawback.
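The corresponding owner computation (again a hypothetical helper, not a ScaLAPACK routine):

#include <stdio.h>

/* Owner of column k (1-based) under the one-dimensional cyclic
   column layout: process (k-1) mod P. */
static int cyclic_owner(int k, int P) {
    return (k - 1) % P;
}

int main(void) {
    /* N = 16, P = 4 as in the figure: owners cycle 0,1,2,3,0,1,... */
    for (int k = 1; k <= 16; k++)
        printf("column %2d -> process %d\n", k, cyclic_owner(k, 4));
    return 0;
}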
The third layout, shown on the left of figure 4.4, the one-dimensional block-cyclic column distribution, is a compromise between the distribution schemes shown in figure 4.3. We choose a block size NB, divide the columns into groups of size NB, and distribute these groups in a cyclic manner. This means column k is stored on process $\lfloor (k-1)/NB \rfloor \bmod P$. In fact, this layout includes the first two as the special cases $NB = t_c$ and NB=1, respectively. In the figure, N=16, P=4, and NB=2. For NB larger than 1, this layout has a slightly worse load balance than the one-dimensional cyclic column distribution but can use the Level 2 BLAS and Level 3 BLAS for the local computations. For NB less than $t_c$, it has a better load balance than the one-dimensional block column distribution but can call the BLAS only on smaller subproblems, and hence takes less advantage of the local memory hierarchy. Moreover, this layout has the disadvantage that the factorization of A(ib:n, ib:end) will take place on one process (in the natural situation where column blocks in the layout correspond to column blocks in Gaussian elimination), thereby representing a serial bottleneck.
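The owner computation, which reduces to the previous two for $NB = t_c$ and NB = 1, can be sketched the same way (hypothetical helper as before):

#include <stdio.h>

/* Owner of column k (1-based) under the one-dimensional block-cyclic
   column layout: process floor((k-1)/NB) mod P. NB = ceil(N/P)
   recovers the block layout; NB = 1 recovers the cyclic layout. */
static int block_cyclic_owner(int k, int NB, int P) {
    return ((k - 1) / NB) % P;
}

int main(void) {
    /* The figure's example: N = 16, P = 4, NB = 2. Columns 1-2 go to
       process 0, 3-4 to process 1, ..., 9-10 back to process 0. */
    for (int k = 1; k <= 16; k++)
        printf("column %2d -> process %d\n",
               k, block_cyclic_owner(k, 2, 4));
    return 0;
}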
Figure 4.4: The one-dimensional block-cyclic column and the two-dimensional block-cyclic distributions
This serial bottleneck is eased by the fourth layout, shown on the right of figure 4.4, the two-dimensional block-cyclic distribution. Here, we think of our P processes arranged in a $P_r \times P_c$ rectangular array of processes, indexed in a two-dimensional fashion by $(p_r, p_c)$, with $0 \le p_r < P_r$ and $0 \le p_c < P_c$. All the processes $(p_r, p_c)$ with a fixed $p_c$ are referred to as process column $p_c$. All the processes $(p_r, p_c)$ with a fixed $p_r$ are referred to as process row $p_r$. Matrix entry (i,j) is stored in process row $\lfloor (i-1)/MB \rfloor \bmod P_r$ and process column $\lfloor (j-1)/NB \rfloor \bmod P_c$, where MB and NB are the row and column block sizes. Thus, this layout includes all the previous layouts, and their transposes, as special cases. In the figure, N=16, P=4, $P_r = P_c = 2$, and MB=NB=2. This layout permits $P_r$-fold parallelism in any column and calls to the Level 2 BLAS and Level 3 BLAS on local subarrays. Finally, this layout also features the good scalability properties shown in [61].
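The owner computation now maps each matrix entry to a pair of process grid coordinates. A sketch, written for 1-based matrix indices as in the text (bc2d_owner is a hypothetical helper, not ScaLAPACK's indexing API):

#include <stdio.h>

/* Owner (pr, pc) of matrix entry (i, j), 1-based, under the
   two-dimensional block-cyclic layout with row block size MB and
   column block size NB on a Pr x Pc process grid:
     pr = floor((i-1)/MB) mod Pr,  pc = floor((j-1)/NB) mod Pc. */
static void bc2d_owner(int i, int j, int MB, int NB, int Pr, int Pc,
                       int *pr, int *pc) {
    *pr = ((i - 1) / MB) % Pr;
    *pc = ((j - 1) / NB) % Pc;
}

int main(void) {
    /* The figure's example: N = 16, Pr = Pc = 2, MB = NB = 2.
       Print the grid coordinates owning each 2x2 block. */
    for (int i = 1; i <= 16; i += 2) {
        for (int j = 1; j <= 16; j += 2) {
            int pr, pc;
            bc2d_owner(i, j, 2, 2, 2, 2, &pr, &pc);
            printf("(%d,%d) ", pr, pc);
        }
        printf("\n");
    }
    return 0;
}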
The two-dimensional block-cyclic distribution is the data layout used in the ScaLAPACK library for dense matrix computations.