Banded Matrix-Vector Multiplication
First, we consider the parallelization of the operation $y = Ax$ on a linear array of $P$ processors when $A$ is a banded $N \times N$ matrix with $\beta_u$ and $\beta_l$ upper and lower bandwidths, and we assume that matrices are stored using a sparse scheme [Rice:85a]. For simplicity, we describe the case $P = N$. The proposed implementation is based on a decomposition of the matrix $A$ into an upper triangular matrix $A_u$ (including the diagonal of $A$) and a lower triangular matrix $A_l$, such that $A = A_u + A_l$. Furthermore, we assume that row $i$ of $A$ and $x_i$ are stored in processor $i$. Without loss of generality, we can assume $\beta_u < N$ and $\beta_l < N$. The vector $y$ can then be expressed as $y = A_u x + A_l x$. The products $A_u x$ and $A_l x$ are computed within $\beta_u + 1$ and $\beta_l$ iterations, respectively.
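To make the splitting concrete, the following sketch (NumPy, with hypothetical sizes and random data) checks that a banded matrix decomposes as $A = A_u + A_l$ and that $y = A_u x + A_l x$ reproduces $Ax$:

```python
import numpy as np

# Hypothetical example: N = 6, upper bandwidth 1, lower bandwidth 2.
N, bu, bl = 6, 1, 2
rng = np.random.default_rng(0)
A = np.triu(np.tril(rng.random((N, N)), bu), -bl)  # banded test matrix

A_u = np.triu(A)        # upper triangle, including the diagonal of A
A_l = A - A_u           # strictly lower triangle
x = rng.random(N)

y = A_u @ x + A_l @ x   # the two partial products
assert np.allclose(y, A @ x)
```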
The computation involved is described in Figure 8.3. In order to compute the complexity of the above algorithm, we assume, without any loss of generality, that $A$ has $K$ non-zero elements and that $\beta = \beta_u = \beta_l$. Then it can be shown that the time complexity is $O(\beta)$ and the memory space required for each subdomain is $O(K/N)$.
Figure 8.3: The Pseudo Code for Banded Matrix-Vector Multiplication
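The pseudo-code of Figure 8.3 is not reproduced here; the following is a minimal serial emulation of the two-phase schedule, assuming one row of $A$ and one entry of $x$ per processor, with `np.roll` standing in for the nearest-neighbor shifts along the array:

```python
import numpy as np

def banded_matvec_linear_array(A, x, bu, bl):
    """Emulate the two-phase algorithm: processor i holds row i of A and
    x[i]; the x values are shifted along the array so that each processor
    sees every x entry inside its band."""
    N = len(x)
    y = np.zeros(N)
    # Phase 1 (A_u x): in step k, processor i holds x[i+k] and accumulates
    # A[i, i+k] * x[i+k]; it then passes its x value to processor i-1.
    buf = x.copy()
    for k in range(bu + 1):
        for i in range(N):
            if i + k < N:
                y[i] += A[i, i + k] * buf[i]
        buf = np.roll(buf, -1)  # shift toward lower-numbered processors
    # Phase 2 (A_l x): x is restored and shifted the other way.
    buf = np.roll(x, 1)
    for k in range(1, bl + 1):
        for i in range(N):
            if i - k >= 0:
                y[i] += A[i, i - k] * buf[i]
        buf = np.roll(buf, 1)   # shift toward higher-numbered processors
    return y

# Quick check against dense multiplication (hypothetical sizes).
rng = np.random.default_rng(1)
N, bu, bl = 8, 2, 3
A = np.triu(np.tril(rng.random((N, N)), bu), -bl)
x = rng.random(N)
assert np.allclose(banded_matvec_linear_array(A, x, bu, bl), A @ x)
```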
Banded Matrix-Matrix Multiplication
Second, we consider the implementation of $C = AB$ on a ring of $P$ processors when $A$ and $B$ are banded $N \times N$ matrices with $\beta_u^A$, $\beta_u^B$ upper and $\beta_l^A$, $\beta_l^B$ lower bandwidths, respectively. Again, we describe the realization for $P = N$. The case $P < N$ is a straightforward generalization. Processor $i$ computes column $i$ of matrix $C$ and holds one row of matrix $A$ (denoted by $a_i$) and a column of matrix $B$ (denoted by $b_i$).
The algorithm consists of two phases, as in banded matrix-vector multiplication. Without loss of generality, we can assume $\beta_u^A + \beta_u^B < N$ and $\beta_l^A + \beta_l^B < N$. In the first phase, each node $i$ starts by calculating the inner product $a_i b_i$, then passes $b_i$ to node $i-1$; this phase is repeated $\beta_u^A + \beta_u^B$ times. In the second phase, each node restores $b_i$ and passes it to node $i+1$; this phase is repeated $\beta_l^A + \beta_l^B$ times. The implementation proposed for this operation is described in Figure 8.4.
Figure 8.4: The Pseudo Code for Banded Matrix-Matrix Multiplication
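Again, Figure 8.4 is not reproduced here; the sketch below is a serial emulation of the ring schedule under an equal-bandwidth assumption for $A$ and $B$, with the columns of $B$ rotated first toward node $i-1$ and then back toward node $i+1$ (the rotation counts and the global array $C$ are simplifications of the message-passing version):

```python
import numpy as np

def banded_matmul_ring(A, B, bu, bl):
    """Emulate the ring algorithm: node i holds row A[i] and column B[:, i]
    and is responsible for column i of C; columns of B are rotated upward
    for the upper band of C, restored, then rotated downward for the lower
    band.  Results are written into a global C for simplicity."""
    N = A.shape[0]
    C = np.zeros((N, N))
    home = np.arange(N)            # home[p]: index of B-column held by node p
    # Phase 1: the upper band of C has bandwidth bu(A) + bu(B) = 2 * bu.
    for _ in range(2 * bu + 1):
        for p in range(N):
            i = home[p]            # node p currently holds b_i
            if p <= i:             # C[p, i] lies on or above the diagonal
                C[p, i] = A[p] @ B[:, i]
        home = np.roll(home, -1)   # pass each column to node p-1
    # Phase 2: restore the columns, then rotate toward node p+1.
    home = np.arange(N)
    for _ in range(2 * bl):
        home = np.roll(home, 1)
        for p in range(N):
            i = home[p]
            if p > i:              # C[p, i] lies below the diagonal
                C[p, i] = A[p] @ B[:, i]
    return C

# Quick check (hypothetical sizes; equal bandwidths for A and B).
rng = np.random.default_rng(2)
N, bu, bl = 8, 1, 2
A = np.triu(np.tril(rng.random((N, N)), bu), -bl)
B = np.triu(np.tril(rng.random((N, N)), bu), -bl)
assert np.allclose(banded_matmul_ring(A, B, bu, bl), A @ B)
```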
Without loss of generality, we assume that $K_A$ and $K_B$ are the numbers of non-zero elements of the matrices $A$ and $B$, respectively, and denote $\beta_u = \beta_u^A + \beta_u^B$ and $\beta_l = \beta_l^A + \beta_l^B$. Then we can show that the parallel execution time is $O\big((\beta_u + \beta_l)\,(K_A + K_B)/N\big)$.
The above realization has been implemented on the nCUBE-1 [Chrisochoides:90a]. Tables 8.1 and 8.2 indicate the performance of the two computations for a block tridiagonal matrix in which each block is dense. In these experiments, each processor performs the same amount of computation. The results indicate very satisfactory performance for this type of data.
Table 8.1: Measured maximum total elapsed time (in seconds) for
multiplication of a block tridiagonal matrix with a vector.
Table 8.2: Measured maximum elapsed time (in seconds) for multiplication
of a block tridiagonal matrix by a block tridiagonal matrix.