The way in which a matrix is distributed over the processes has a major impact on the load balance and communication characteristics of the concurrent algorithm, and hence largely determines its performance and scalability. The block cyclic distribution provides a simple, yet general-purpose way of distributing a block-partitioned matrix on distributed memory concurrent computers. It has been incorporated in the High Performance Fortran standard [11].
The block cyclic data distribution is parameterized by the four
numbers ,
,
, and
, where
is the
process template and
is the block size.
Suppose first that we have objects, indexed by an integer
, to map onto
processes, using block size
.
The
-th item will be stored in the
-th location of block
on process
, where
In the special case where
and
are powers of two, this mapping
is really just bit extraction, with
equal to the rightmost
bits of
,
equal to the next
bits of
,
and
equal to the remaining leftmost bits of
. The distribution
of a block-partitioned matrix can be regarded as the tensor product of
two such mappings: one that distributes the rows of the matrix over
processes, and another that distributes the columns over
processes. That is, the matrix element indexed globally by
is stored in location
The nonscattered decomposition (or pure block distribution) is just the
special case and
.
Similarly a purely scattered decomposition (or two dimensional wrapped
distribution) is the special case
.