Our first example is an algorithm that uses plane rotations
for the QR factorization of real matrix *X*. In the description of these example algorithms we suppress
all irrelevant detail. To that end, we use the notation *f*(*x*, *y*, *z*) to mean a generic function of the arguments
*x*, *y*, and *z* which may be a different function at every occurrence.

mmmmmmmmmmmmmmmm¯

fork= 1 tondo

fori=mtok+1 step -1 do(1) (

c,s) =f(x(i,k),x(i-1,k));

forj=k+1 tondo(2) ;

od

od

od

There are two distinct true dependences here. Statement (2) at iteration (*i*,*j*,*k*) depends on statement (2)
at iterations
(*i*+1, *j*, *k*) (because x(i,j) is changed there)
(*i*-1, *j*, *k*-1) (because x(i-1,j) is changed there).
Each iteration (*i*,*j*,*k*) of statment (2) depends on statement (1) at iteration (*i*,*k*), so that is a column
of *C*. Furthermore, statement (1) depends on statement (2) at iterations (*i*, *k*, *k*-1) and (*i*-1,*k*,*k*-1). Therefore,
through the uses of *c* and *s*, statement (2) depends on itself at iterations (*i*,*k*,*k*-1) and (*i*-1,*k*,*k*-1); this
dependence is weaker that a dependence on iteration (*i*, *j*-1, *k*-1) and (*i*-1,*j*-1,*k*-1), so if we take these to be the
actual dependences we are going to be safe.
There are also antidependences and output dependences, but these can be ignored for the moment.
Thus,

and

In this case, there are only three rays of the cone, namely,

After improvement we arrive at the basis

Thus, the new indices are *j*, *k*, and *k*-*i*.

After replacing the index *i* by we have the following program:

Strip mining producesmmmmmmmmmmmmmmmm¯

fork= 1 tondo

forr=k-mto -1 do(

c(r,k),s(r,k)) =f(x(k-r,k),x(k-r-1,k));

forj=k+1 tondo;

od

od

od

Then loop interchanging producesmmmmmmmmmmmmmmmm¯

fork0 = 1 tonstepbdo

fork=k0 to do

forr0 =k0-mto -1 stepbdo

for to do(

c(r,k),s(r,k)) =f(x(k-r,k),x(k-r-1,k));

forj0 =k0+1 tonstepbdo

for to do;

od

od

od

od

od

od

mmmmmmmmmmmmmmmm¯

fork0 = 1 tonstepbdo

forr0 =k0-mto -1 stepbdo

forj0 =k0 tonstepbdo

fork=k0 to do

for to do

ifj0 =k0 then (c(r,k),s(r,k)) =f(x(k-r,k),x(k-r-1,k));

for to do;

od

od

od

od

od

od

**Figure 6:** Blocking of the *QR* factorization of an matrix with .

This rather complicated blocked algorithm works as follows.
We illustrate with *m* = 20, *n* = 15, *b* = (5, 5, 5).
Elements of *X* are eliminated by plane rotations in patches, as shown in Figure 6. The values of *k*0 and
*r*0 at which elements are eliminated is shown in each patch.
The rotations used to do the elimination
are applied only to columns in the current patch (during the block operation with *j*0 = *k*0).
These rotations are stored and later applied to columns to the right of the patch (when *j*0 > *k*0).

For another example, consider the following program for the *QR* factorization which uses
Householder transforms rather than plane rotations. In this pseudo-code we use
the notation *x*(*k*:*m*,*j*) to refer to the vector of elements .
We include it as an example of a program that can
be blocked without using linear loop index transformation.

mmmmmmmmmmmmmmmm¯

fork= 1 tondo;

x(k,k) =f(x(k,k),s(k));

forj=k+1 tonstep

do;

x(k:m,j) =x(k:m,j) +s'(k,j) *x(k:m,k);od

od

In Fortran, loops would be triply nested. The compiler, on detecting a dependence of some subsequent statement on the whole of an inner loop implementing a reduction operation, such as the norm and the inner product in the example, should choose to view those loops as atomic and therefore work with an index space of reduced dimension.

The dependences in (*j*,*k*) space are

The basis chosen is the obvious one: *A* = *I*. Thus, no skewing is done.

Now, we choose the order of the block loops. The measure of data flux given in Section 5 is the same for and for ; so neither order is preferred.

Note, however, that the two dependences are different in character. The dependence is a true dependence at every
point of the index space. The other, expresses the dependence of iteration (*j*,*k*) on ``iteration''
(*k*,*k*)
(the task performed outside the inner loop for given *k*);
the data that are required are used in common by all the iterations with fixed *k*. Thus, for the purpose of
determining data flux, this dependence direction should be given weight 1 (as are columns of *C*), not 2. If we
make this change, the flux is greater for , so we make the *k* block loop innermost.
This procedure yields a left-looking method in which all groups of Householder transforms
are applied to a block of columns just before that block is triangularized. This allows the block to be held in
local memory during the application of these transforms.

**Acknowledgement**
We would like to thank Ilse Ipsen for her help at the beginning of our work on this problem and Mike
Wolfe for his at the end.

**Appendix. Proof of Lemma 3.**

Let the matrices
and .
The face is subtended by the columns of .
Let the matrix be factored

where *Q* is an orthogonal matrix, *R* is an upper triangular matrix, is ,
and is ; thus , and is a unit vector in the direction
normal to the range of , which is the span of . The matrix is the
orthogonal projector on .
The factorization (9) always exists and is unique up to
signs on the diagonal of *R* [10].

The columns of are the coordinates of the columns of with respect to the orthonormal basis
(for the subspace of in which lies)
consisting of the columns of . Thus

We must therefore show that .

From the identity it follows that ; thus

The proof is complete if we can show that .
But since ,

The result now follows from Lemma 1.

Tue Feb 18 15:39:11 EST 1997