Program Transformation and Blocking; Previous Work

Next: Strip Mining and Loop Up: Introduction Previous: Block algorithms: Matrix Multiplication

Program Transformation and Blocking; Previous Work

We can view the reorganized matrix-multiply program in two ways. First, we can consider the matrices A, B, and C as matrices whose elements are matrices. In this case, the inner three loops simply perform a multiply-add of one such block-element. This is the view taken by most numerical analysts. Second, we can derive the blocked program form the original, unblocked program by a sequence of standard program transformations. First, the individual loops are strip mined. For example, the loop

mmmmmmmmmm¯ for i = 1 to n do od

is replaced by

mmmmmmmmmm¯ for i0 = 1 to n step b do for i = i0 to do od od

(Strip mining is a standard tool for dealing with vector registers. One may apply it ``legally'' to any loop. By legally, we mean that the transformed program computes the same result as before.) Strip mining, by itself, yields a six-loop program, but the order of the loops is not what is needed for a blocked algorithm. The second transformation we use is . In general, this means changing the order of loops and hence the order in which computation is done. To block a program, we endeavor to move the strip loops (the i0, j0, and k0 loops above) to the outside and the point loops (the i, j, and k loops above) to the inside. This interchange is what causes repeated references to the elements of small blocks. In the matrix-multiply example, the interchange is legal, but there are many interesting programs for which it is not, including LU and QR decompositions and methods for partial differential equations.

This approach to automatic blocking, through loop strip mining and interchange, was first advocated by Wolfe [18]. It is derived from earlier work of Abu-Sufah, Kuck, and Lawrie on optimization in a paged virtual-memory system [1]. Wolfe introduced the term tiling. A tile is the collection of work to be done, i.e., the set of values of the point loop indices, for a fixed set of values of the block or outer loop indices. We like this terminology since it allows us to distinguish what we are doing -- which is to decompose the work to be done into small subtasks (the tiles) -- from the quite different task of decomposing the data a priori into small subarrays (the blocks), even though each tile does, in fact, act on blocks. Following Wolfe, we call the problem of decomposing the work of a loop nest index space tiling.

Other authors have treated the issue of management of the memory hierarchy [8]. Some other treatments of the problem of automatic blocking have recently appeared [11], [4], [17], [18], [19]; none, however, gives the quantitative statments of the problem and the solution that we provide here.

Next: Strip Mining and Loop Up: Introduction Previous: Block algorithms: Matrix Multiplication

Jack Dongarra
Tue Feb 18 15:39:11 EST 1997