In this paper, we have studied tiling techniques aimed at adapting the granularity of uniform loop nest algorithms towards execution on distributed-memory machines. We view tiling as a two-step process: the first step amounts to determining the best shape and size of the tiles (assuming an infinite grid of virtual processors), while the second step consists in mapping and scheduling the tiles to physical processors. We have concentrated on the second step, assuming a realistic model where (independent) communication and computation may overlap. We have obtained several new results, including a strong result on the optimal mapping and scheduling. However, much remains to be done to extend these results to arbitrary dimensions and domain shapes.
More generally, the relationship between tiling, scheduling and mapping is not yet well understood, and the two-step approach may not prove too complicated for practical problems. Yet, such a two-step approach is typical in the field of parallelizing compilers (other examples are general task graph scheduling, software pipelining and loop parallelization algorithms).
Finally, the recent development of heterogeneous computing platforms may well lead to using tiles whose size and shape will depend upon the characteristics of the processors they are assigned to ... a truly challenging problem!