The way data is distributed over the memory hierarchy of a computer is of fundamental importance to load balancing and software reuse. The block cyclic data distribution reduces the overhead due to load imbalance and data movement, while block-partitioned algorithms are used to maximize performance on each node.
Since the data decomposition largely determines the performance and scalability of a concurrent algorithm, a great deal of research [27, 65, 69, 78] has focused on different data decompositions [10, 20, 85]. In particular, the two-dimensional block cyclic distribution [92] has been suggested as a possible general-purpose basic decomposition for parallel dense linear algebra libraries [31, 76, 97, 17] such as ScaLAPACK.
The block cyclic distribution is beneficial because of its scalability [51], load balance, and communication [76] properties. The block-partitioned computation then proceeds over the blocks in the same order as a conventional serial algorithm. This essential property of the block cyclic data distribution explains why the ScaLAPACK design has been able to reuse the numerical and software expertise of the sequential LAPACK library.
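To make the mapping concrete, the sketch below (a hypothetical C illustration, not ScaLAPACK code; the matrix size, block size, and process grid shape are assumed for the example) shows how each global entry of a matrix is assigned to a process and given local coordinates under a two-dimensional block cyclic distribution with a zero source offset.

```c
#include <stdio.h>

/* Illustrative sketch: map each global entry (i, j) of an M x N matrix
 * onto a Pr x Pc process grid under a 2D block cyclic distribution with
 * block size MB x NB. All sizes below are assumptions for the example. */
int main(void)
{
    const int M = 8, N = 8;      /* global matrix dimensions (assumed) */
    const int MB = 2, NB = 2;    /* distribution block size (assumed)  */
    const int Pr = 2, Pc = 2;    /* process grid dimensions (assumed)  */

    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int gbi = i / MB, gbj = j / NB;     /* global block indices  */
            int p = gbi % Pr,  q = gbj % Pc;    /* owning process (p, q) */
            int li = (gbi / Pr) * MB + i % MB;  /* local row index       */
            int lj = (gbj / Pc) * NB + j % NB;  /* local column index    */
            printf("A(%d,%d) -> process (%d,%d), local (%d,%d)\n",
                   i, j, p, q, li, lj);
        }
    }
    return 0;
}
```

Because blocks are dealt out cyclically in both dimensions, every process receives a nearly equal share of the matrix, and each process can sweep through its local blocks in the same order a serial block-partitioned algorithm would visit them.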