The data decomposition (or distribution) is a major factor in determining the efficiency of a concurrent matrix algorithm, so before detailing the research into concurrent linear algebra done at Caltech, we shall first introduce some basic decomposition strategies.
The processors of a concurrent computer can be uniquely labelled by
, where
is the number of
processors. A vector of length M may be decomposed over the
processors by assigning the vector entry with global index m (where
) to processor p, where it is stored as the ientry
in a local array. Thus, the decomposition of a vector can be regarded
as a mapping of the global index, m, to an index pair,
,
specifying the processor number and local index.
For matrix problems, the processors are usually arranged as a
grid. Thus, the grid consists of P rows of processors and Q
columns of processors, and
. Each processor can be uniquely
identified by its position,
, on the processor grid. The
decomposition of an
matrix can be regarded as the
Cartesian product of two vector decompositions,
and
. The mapping
decomposes the M rows of the matrix
over the P processor rows, and
decomposes the N columns of
the matrix over the Q processor columns. Thus, if
and
, then the matrix entry with global index
is
assigned to the processor at position
on the processor grid,
where it is stored in a local array with index
.
Two common decompositions are the linear and scattered
decompositions. The linear decomposition, , assigns
contiguous entries in the global vector to the processors in blocks,
where
and and
. The scattered decomposition,
, assigns consecutive entries in the global vector to different
processors,
Figure 8.1 shows examples of these two types of decomposition
for a matrix.
Figure 8.1: These Eight Figures Show Different Ways of Decomposing a
Matrix. Each cell represents a matrix entry, and is
labelled by the position,
, in the processor grid of the
processor to which it is assigned. To emphasize the pattern of
decomposition, the matrix entries assigned to the processor in the
first row and column of the processor grid are shown shaded. Figures
(a) and (b) show linear and scattered row-oriented decompositions,
respectively, for four processors arranged as a
grid
(P=4, Q=1). In Figures (c) and (d), the corresponding
column-oriented decompositions are shown (P=1, Q=4). Figures (e)
through (h) show linear and scattered block-oriented decompositions for
16 processors arranged as a
grid (P=Q=4).
The mapping of processors onto the processor grid is determined by the programming methodology, which in turn depends closely on the concurrent hardware. For machines such as the nCUBE-1 hypercube, it is advantageous to exploit any locality properties in the algorithm in order to reduce communication costs. In such cases, processors may be mapped onto the processor grid by a binary Gray code scheme [Fox:88a], [Saad:88a], which ensures that adjacent processors on the processor grid are directly connected by a communication channel. For machines such as the Symult 2010, for which the time to send a message between any two processors is almost independent of their separation in the hardware topology, locality of communication is not an issue, and the processors can be mapped arbitrarily onto the processor grid.