Next: Solvers.
Up: Parallelism J. Dongarra and
Previous: Vector Updates.
  Contents
  Index
Matrix-Vector Products.
The matrix-vector products are often easily parallelized on shared memory
machines by splitting the matrix in strips of rows corresponding to the vector
segments. Each processor then computes the matrix-vector product of one
strip.
For distributed memory machines, there may be a problem if each processor
has only a segment of the vector in its memory. Depending on the bandwidth
of the matrix, we may need communication for other elements of the vector,
which may lead to communication bottlenecks. However, many sparse
matrix problems arise from a network in which only nearby nodes are
connected. For example, matrices stemming
from finite difference or finite element problems typically involve
only local connections: matrix element
is nonzero
only if variables
and
are physically close.
In such a case, it seems natural to subdivide the network, or
grid, into suitable blocks and to distribute them over the processors.
When computing
, each processor requires the values of
at
some nodes in neighboring blocks. If the number of connections to these
neighboring blocks is small compared to the number of internal nodes,
then the communication time can be overlapped with computational work.
More recently, graph partitioning has been used as a powerful tool
to deal with the general problems with nonlocal connections.
Consider
and the undirected graph
of the (symmetric)
matrix
. Assume vertex
stores
, and the nonzero
for
all
. Let vertex
represent the job of computing
. We can assign weight of vertex
to be
the number of operations for computing
, and the weight of
edge
to be 1, which represents the cost of sending
to
vertex
if vertices
and
were mapped on different processors.
A good graph partitioning heuristic would partition
into
subsets
of vertices corresponding to
processors, such that:
- the sums of vertex weights are approximately equal among the subsets,
- the sum of edge cuts crossing the subsets is minimized.
This ensures a good load balance and minimizes communication.
Good graph partitioning software includes Chaco [223]
and METIS [259] (also ParMETIS, the parallel version).
After partitioning, further optimizations can be performed to reduce
communication. For example, if a subset of vertices contains
several edges to another subset, the corresponding elements of the
vector can be packed into one message, so that each processor sends no
more than one message to another processor.
For more detailed discussions on implementation aspects for distributed
memory systems, see De Sturler and van der Vorst [111,112] and
Pommerell [369].
High-quality parallel algorithms for distributed memory machines are
implemented in software packages such as Aztec [236]
and PETSc [38].
The software can be obtained via the book's homepage, ETHOME.
Next: Solvers.
Up: Parallelism J. Dongarra and
Previous: Vector Updates.
  Contents
  Index
Susan Blackford
2000-11-20