If the matrix is
stored in CDS format, it is still possible to
perform a matrix-vector product by either rows or columns, but this
does not take advantage of the CDS format. The idea
is to make a change in coordinates in the doubly nested loop. Replacing
we get
The algorithm will now have a doubly nested loop with the outer loop
enumerating the diagonals diag=-p,q with and the
(nonnegative) numbers of diagonals to the left and right of the main
diagonal. The bounds for the inner loop follow from the requirement
that
for i = 1, n y(i) = 0 end; for diag = -diag_left, diag_right for loc = max(1,1-diag), min(n,n-diag) y(loc) = y(loc) + val(loc,diag) * x(loc+diag) end; end;
The transpose matrix-vector product is a minor variation of the algorithm above. Using the update formula
for i = 1, n y(i) = 0 end; for diag = -diag_right, diag_left for loc = max(1,1-diag), min(n,n-diag) y(loc) = y(loc) + val(loc+diag, -diag) * x(loc+diag) end; end;The memory access for the CDS-based matrix-vector product (or ) is three vectors per inner iteration. On the other hand, there is no indirect addressing, and the algorithm is vectorizable with vector lengths of essentially the matrix order . Because of the regular data access, most machines can perform this algorithm efficiently by keeping three base registers and using simple offset addressing.