In this section we present solutions to two important instances of the general problem of optimal choice of the block size parameters . We assume that a set of tiling hyperplane normals A has been chosen and that the data fluxes E and C are known, as are the dependences D. The choice of the outermost point loop index -- , will also play a role.
Here our viewpoint is somewhat more realistic than in Section 3. We take into account the fact that not all the data required by the execution of a tile must be read a priori. Instead, we consider the order in which the tiles iterations are processed and assume that the needed data are read (or written) at the time they are needed.
We need some constants to make our estimates precise. Let the amount of work per grid point be (the appropriate units for and the constants that follow are seconds, so that the machine characteristics are included through these constants). Let the flux of data per unit surface area across the face normal to be . The way that depends on E, C, and was explained in Section 5.
First we consider the case k = 2 with the assumption that is one of the two tiling vectors,
say . Then the amount of local memory required is proportional to
and is independent of . The total work done is and the amount of
data referred to is .
Thus the ratio of computation time to memory access time is
as . (We have redefined the dimensionless parameter here).
See Figure 5.
Figure 5: Reuse ratio vs. tile length. Note:
In this case, therefore, we always take as large as possible (subject only to the size of the problem being solved) and obtain the ratio shown. This ratio is the product of a ratio of work per iteration to data per iteration and the number of grid points that fit in the local memory. Note in particular that for large problems, for which can be taken so large that the asymptote is approached, neither the data per unit surface in the direction of (that is, ) nor the particular choice of tile length in the direction (that is, ) plays a role. Similar conclusions are reached if we model execution time rather than the computation to communication ratio . Note also that if the ratio is larger than , then we choose instead of .
The discussion above is little changed if we allow arbitrary . What matters is that we fix all but one of the block size parameters and allow the other to grow, provided that with the given choice of the memory requirement is independent of this one parameter. For that to be true, all we need is that should not be close to rather than the much stronger requirement .
Next let us take and k > 2. Again, we fix all but one of the block size parameters, in this case and allow the other one to grow, limited only by problem size.
Let . Memory size places some upper limit on B.
Let the memory required per unit surface in the hyperplane normal to be M. Thus, for the given choice
of the block size parameters, the local memory required is .
If the available local memory has room for data, then B is constrained by
The amount of work per unit distance in the direction is .
Finally, the data required per unit distance in the direction is
Thus
With the constraint on b given by (7), the maximizing choice of b is
where .