next up previous
Next: Blocking Examples Up: Automatic Blocking of Nested Previous: Local Memory Requirements

Optimal Choice of Block Size

In this section we present solutions to two important instances of the general problem of optimal choice of the block size parameters tex2html_wrap_inline2458. We assume that a set of tiling hyperplane normals A has been chosen and that the data fluxes E and C are known, as are the dependences D. The choice of the outermost point loop index -- tex2html_wrap_inline2412, will also play a role.

Here our viewpoint is somewhat more realistic than in Section 3. We take into account the fact that not all the data required by the execution of a tile must be read a priori. Instead, we consider the order in which the tiles iterations are processed and assume that the needed data are read (or written) at the time they are needed.

We need some constants to make our estimates precise. Let the amount of work per grid point be tex2html_wrap_inline1582 (the appropriate units for tex2html_wrap_inline1582 and the constants tex2html_wrap_inline1584 that follow are seconds, so that the machine characteristics are included through these constants). Let the flux of data per unit surface area across the face normal to tex2html_wrap_inline2368 be tex2html_wrap_inline1584. The way that tex2html_wrap_inline1584 depends on E, C, and tex2html_wrap_inline2368 was explained in Section 5.

First we consider the case k = 2 with the assumption that tex2html_wrap_inline2412 is one of the two tiling vectors, say tex2html_wrap_inline2492. Then the amount of local memory required is proportional to tex2html_wrap_inline1890 and is independent of tex2html_wrap_inline1886. The total work done is tex2html_wrap_inline2498 and the amount of data referred to is tex2html_wrap_inline2500. Thus the ratio of computation time to memory access time is
displaymath2450
as tex2html_wrap_inline2502. (We have redefined the dimensionless parameter tex2html_wrap_inline1572 here). See Figure 5.

  figure670
Figure 5: Reuse ratio tex2html_wrap_inline1572 vs. tile length. Note: tex2html_wrap_inline2508

In this case, therefore, we always take tex2html_wrap_inline1886 as large as possible (subject only to the size of the problem being solved) and obtain the ratio shown. This ratio is the product of a ratio of work per iteration tex2html_wrap_inline1582 to data per iteration tex2html_wrap_inline2514 and the number of grid points tex2html_wrap_inline1890 that fit in the local memory. Note in particular that for large problems, for which tex2html_wrap_inline1886 can be taken so large that the asymptote is approached, neither the data per unit surface in the direction of tex2html_wrap_inline2520 (that is, tex2html_wrap_inline2522) nor the particular choice of tile length in the tex2html_wrap_inline2396 direction (that is, tex2html_wrap_inline1886) plays a role. Similar conclusions are reached if we model execution time rather than the computation to communication ratio tex2html_wrap_inline1572. Note also that if the ratio tex2html_wrap_inline2530 is larger than tex2html_wrap_inline2532, then we choose tex2html_wrap_inline2534 instead of tex2html_wrap_inline2396.

The discussion above is little changed if we allow arbitrary tex2html_wrap_inline2412. What matters is that we fix all but one of the block size parameters and allow the other to grow, provided that with the given choice of tex2html_wrap_inline2412 the memory requirement is independent of this one parameter. For that to be true, all we need is that tex2html_wrap_inline2412 should not be close to tex2html_wrap_inline2520 rather than the much stronger requirement tex2html_wrap_inline2492.

Next let us take tex2html_wrap_inline2492 and k > 2. Again, we fix all but one of the block size parameters, in this case tex2html_wrap_inline2552 and allow the other one to grow, limited only by problem size.

Let tex2html_wrap_inline2554. Memory size places some upper limit on B. Let the memory required per unit surface in the hyperplane normal to tex2html_wrap_inline2396 be M. Thus, for the given choice of the block size parameters, the local memory required is tex2html_wrap_inline2562. If the available local memory has room for tex2html_wrap_inline1580 data, then B is constrained by
 equation685
The amount of work per unit distance in the tex2html_wrap_inline2396 direction is tex2html_wrap_inline2572.

Finally, the data required per unit distance in the tex2html_wrap_inline2396 direction is
displaymath2451

Thus
displaymath2452
With the constraint on b given by (7), the maximizing choice of b is
displaymath2453
where tex2html_wrap_inline2584.


next up previous
Next: Blocking Examples Up: Automatic Blocking of Nested Previous: Local Memory Requirements

Jack Dongarra
Tue Feb 18 15:39:11 EST 1997