To achieve the necessary reuse of data in local memory, researchers have developed many new methods
for computation involving matrices and other data arrays [6, 7, 16]. Typically an algorithm that
refers to individual elements is replaced by one that operates on subarrays of data, which are
called *blocks* in the matrix computing field. The operations on subarrays can be
expressed in the usual way. The advantage of this approach is that the small blocks can be moved
into the fast local memory and their elements can then be repeatedly used.

The standard example is matrix multiplication. The usual program is

    for i = 1 to n do
      for j = 1 to n do
        for k = 1 to n do
          c[i,j] = c[i,j] + a[i,k] * b[k,j] ;
        od
      od
    od

The entire computation involves 2n^3 arithmetic operations (counting additions and multiplications separately), but produces and consumes only 3n^2 data values. As a whole, the computation exhibits admirable reuse of data. In general, however, an entire matrix will not fit in a small local memory. The work must therefore be broken into small chunks of computation, each of which uses a small enough piece of the data. Note that for each iteration of the outer loop, all n^2 elements of b are referenced, so this reuse cannot be exploited unless b itself remains in local memory.
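As a minimal illustrative sketch (a Python translation of the pseudocode, not taken from the paper), the point algorithm and its operation count look like this:

```python
# Point (unblocked) matrix multiply, following the triple loop above.

def matmul_point(a, b):
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

# Each of the n^3 innermost iterations does one multiply and one add,
# for 2*n^3 operations, yet only the 3*n^2 elements of a, b, and c are
# ever touched -- reuse of order n.
```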

Now consider a blocked matrix-multiply algorithm:

    for i0 = 1 to n step b do
      for j0 = 1 to n step b do
        for k0 = 1 to n step b do
          for i = i0 to min(i0+b-1, n) do
            for j = j0 to min(j0+b-1, n) do
              for k = k0 to min(k0+b-1, n) do
                c[i,j] = c[i,j] + a[i,k] * b[k,j] ;
              od
            od
          od
        od
      od
    od

First, note that in this program exactly the same operations are done on the same data; even round-off error is identical. Only the sequence in which independent operations are performed is different from the unblocked program. There is still reuse in the whole program of order n; in addition, each iteration of the three outer loops performs 2b^3 operations on 3b^2 data values, so reuse of order b is obtained while only three b-by-b blocks need reside in local memory.
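The same transformation can be sketched in Python (the block-size parameter `bs` and the min-capped loop bounds mirror the structure above; the names are illustrative). Because the blocked routine performs the point routine's operations in a reordered but equivalent sequence, the two results agree:

```python
def matmul_blocked(a, b, bs):
    """Blocked matrix multiply; bs is the block size parameter."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                # Each (i0, j0, k0) step multiplies a bs-by-bs block of a
                # by a bs-by-bs block of b, accumulating into a block of c;
                # min(...) caps the bounds when bs does not divide n.
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, n)):
                        for k in range(k0, min(k0 + bs, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

For a fixed element c[i][j], the index k still ascends monotonically across the k0 blocks, so the additions occur in the same order as in the point code and the floating-point result is bitwise identical.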

The subject of this paper is the automatic conversion of ordinary programs to blocked form.
Our motivation for seeking such a capability is as follows. Many algorithms can be blocked. Indeed,
recent work by numerical analysts has shown that the most important computations for dense
matrices are blockable. A major software development of blocked algorithms for linear algebra has been
conducted as a result [5]. Further examples, in particular in the solution of partial
differential equations by difference and variational methods, are abundant. Many such codes
have also been rewritten as block methods to better use the small main memory and large solid-state
disk on Cray supercomputers [14]. All experience with these techniques has shown them to be
enormously effective at squeezing the best possible performance out of advanced architectures.
On the other hand,
blocked code is much more difficult to write and to understand than point code;
producing it by hand is a tedious and error-prone job.
Blocking also introduces
block size parameters that have nothing to do with the problem being solved and that must be
tuned for each computer and each algorithm if good performance is to be achieved.
Unfortunately, the alternative to blocked code is worse:
poor performance on important computations on the most
powerful computers. For these reasons,
Kennedy has stated that compiler management of memory hierarchies is the most important
and most difficult task facing the writers of compilers for high-performance machines [12].

Tue Feb 18 15:39:11 EST 1997