We can view the reorganized matrix-multiply program in two ways. First, we can consider the matrices A, B, and C as matrices whose elements are themselves matrices. In this case, the inner three loops simply perform a multiply-add of one such block element. This is the view taken by most numerical analysts. Second, we can derive the blocked program from the original, unblocked program by a sequence of standard program transformations. First, the individual loops are strip mined. For example, the loop
    for i = 1 to n do
        ...
    od

is replaced by

    for i0 = 1 to n step b do
        for i = i0 to min(i0 + b - 1, n) do
            ...
        od
    od

(Strip mining is a standard tool for dealing with vector registers. One may apply it "legally" to any loop. By legally, we mean that the transformed program computes the same result as before.) Strip mining, by itself, yields a six-loop program, but the order of the loops is not what is needed for a blocked algorithm. The second transformation we use is loop interchange. In general, this means changing the order of loops and hence the order in which computation is done. To block a program, we endeavor to move the strip loops (the i0, j0, and k0 loops above) to the outside and the point loops (the i, j, and k loops above) to the inside. This interchange is what causes repeated references to the elements of small blocks. In the matrix-multiply example, the interchange is legal, but there are many interesting programs for which it is not, including LU and QR decompositions and methods for partial differential equations.
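To make the combined effect of strip mining and interchange concrete, the following is a minimal C sketch (ours, not the paper's pseudocode) of the resulting blocked matrix multiply. It assumes square n-by-n row-major arrays, a block size b, zero-based indexing rather than the one-based pseudocode above, and that the caller has zeroed C; the function and variable names are illustrative only.

    #include <stddef.h>

    #define MIN(x, y) ((x) < (y) ? (x) : (y))

    /* Blocked matrix multiply obtained by strip mining each of the i, j, k
     * loops with block size b and then interchanging loops so that the strip
     * (block) loops i0, j0, k0 are outermost and the point loops i, j, k are
     * innermost.  The inner three loops perform the multiply-add of one block
     * element: C[i0.., j0..] += A[i0.., k0..] * B[k0.., j0..]. */
    void matmul_blocked(size_t n, size_t b,
                        const double *A, const double *B, double *C)
    {
        for (size_t i0 = 0; i0 < n; i0 += b)
            for (size_t j0 = 0; j0 < n; j0 += b)
                for (size_t k0 = 0; k0 < n; k0 += b)
                    /* one tile: the point loops sweep one b-by-b-by-b chunk of work */
                    for (size_t i = i0; i < MIN(i0 + b, n); i++)
                        for (size_t j = j0; j < MIN(j0 + b, n); j++)
                            for (size_t k = k0; k < MIN(k0 + b, n); k++)
                                C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

With b chosen so that three b-by-b blocks fit in the fast memory, each element of A and B brought into a block is reused about b times before it is displaced, which is the point of moving the strip loops outward.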
This approach to automatic blocking, through loop strip mining and interchange, was first advocated by Wolfe [18]. It is derived from earlier work of Abu-Sufah, Kuck, and Lawrie on optimization in a paged virtual-memory system [1]. Wolfe introduced the term tiling. A tile is the collection of work to be done, i.e., the set of values of the point loop indices, for a fixed set of values of the block or outer loop indices. We like this terminology since it allows us to distinguish what we are doing -- which is to decompose the work to be done into small subtasks (the tiles) -- from the quite different task of decomposing the data a priori into small subarrays (the blocks), even though each tile does, in fact, act on blocks. Following Wolfe, we call the problem of decomposing the work of a loop nest index space tiling.
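In this notation (our paraphrase, not a formula from the paper), the tile of the blocked matrix multiply associated with a fixed triple of block-loop values (i0, j0, k0) is the index set

    T(i0, j0, k0) = { (i, j, k) :  i0 <= i <= min(i0 + b - 1, n),
                                   j0 <= j <= min(j0 + b - 1, n),
                                   k0 <= k <= min(k0 + b - 1, n) },

and the work in this tile reads the blocks of A and B and updates the block of C determined by those index ranges; the tile is a unit of work, while the blocks are the subarrays it touches.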
Other authors have treated the issue of management of the memory hierarchy [8]. Some other treatments of the problem of automatic blocking have recently appeared [11], [4], [17], [18], [19]; none, however, gives the quantitative statements of the problem and the solution that we provide here.