The global decomposition problem is of a much more serious nature than the dependence analysis and local stencil methods described previously because, while many clever compiler-related tricks are known to help with the local problems, there is little theoretical analysis of the more global ones. Only very recently, for example, do we find compilers that perform any kind of interprocedural analysis at all.
As a result, resolving this problem is really a question of the parallel programming model available to the parallelization tools. Again, ASPAR is unique in this respect.
To understand some of the possibilities, it is again useful to create a classification scheme for global decomposition strategies. It is interesting to note that, in some sense, the complexity of these strategies is closely related to our initial comments about the ``degree of difficulty'' of parallel processing.
The first style is the simplest of all. We have a situation in which there are no data dependences among functional units other than the initial input and final output. Furthermore, each ``function'' can proceed independently of the others. The global decomposition problem is solved by virtue of never having appeared at all.
In this type of situation, the run-time requirements of the parallel processing system are quite small: typically, a ``send'' and ``receive'' paradigm is adequate to implement a ``master-slave'' processing scenario. This is the approach used by systems such as Linda [Padua:86a] and Strand [Foster:90a].
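The sketch below gives a minimal picture of such a ``master-slave'' scenario. It is a sketch only, written in C with MPI as a present-day stand-in for the send/receive primitives these systems supply; the task numbering and the worker routine are purely hypothetical.

\begin{verbatim}
/* Minimal master-slave sketch using MPI send/receive as a stand-in
 * for the primitives described in the text; do_work() and the tag
 * values are hypothetical placeholders.                             */
#include <mpi.h>
#include <stdio.h>

static double do_work(int task_id)      /* hypothetical independent task */
{
    return (double)task_id * task_id;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                    /* master: hand out tasks, gather results */
        double result;
        for (int slave = 1; slave < size; slave++)
            MPI_Send(&slave, 1, MPI_INT, slave, 0, MPI_COMM_WORLD);
        for (int slave = 1; slave < size; slave++) {
            MPI_Recv(&result, 1, MPI_DOUBLE, slave, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("result from node %d: %g\n", slave, result);
        }
    } else {                            /* slave: receive a task, compute, reply */
        int task;
        double result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = do_work(task);
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
\end{verbatim}

Each slave performs its task with no reference to the others, which is precisely why no global decomposition problem arises.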
Of course, there are occasional complexities involved in this style of programming, such as the use of ``broadcast'' or data-reduction techniques to simplify common operations. For this reason, higher-level systems such as Express are often easier to use than their ``simpler'' contemporaries, since they include standard mechanisms for performing commonly occurring operations.
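Under the same MPI stand-in assumption, the following sketch shows the kind of ``broadcast'' and data-reduction operations that such higher-level systems package as single calls; the broadcast parameter and the per-node contribution are invented purely for illustration.

\begin{verbatim}
/* Sketch of a broadcast followed by a data reduction, expressed with
 * MPI collectives as an analogue of the higher-level primitives a
 * system like Express supplies.                                      */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double parameter = 0.0, local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        parameter = 3.14;               /* value known only to the master */

    /* broadcast: every node receives the master's copy */
    MPI_Bcast(&parameter, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    local = parameter * (rank + 1);     /* hypothetical per-node contribution */

    /* data reduction: combine per-node values into one result on the master */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
\end{verbatim}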
The second category is typified by applications such as numerical integration or convolution operations similar to those previously described.
Their characteristic is that, while there are data dependences among program elements, these can be analyzed symbolically at compile time and catered for by suitably inserting calls to a message-passing (for distributed-memory) or locking/unlocking (for shared-memory) library.
In the convolution case, for example, we provide calls that arrange for the distribution of data values among processors and the communication of the ``boundary strip'' required for updates to the local elements.
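A sketch of such a ``boundary strip'' exchange follows, for a one-dimensional, row-wise decomposition with one ghost row at each end of the local block; it is again expressed with MPI rather than the library calls discussed here, and the array dimensions are arbitrary.

\begin{verbatim}
/* Sketch of a boundary-strip (ghost-row) exchange for a row-wise
 * decomposition; LOCAL_ROWS and WIDTH are assumed sizes, and rows 0
 * and LOCAL_ROWS+1 of the local array are the ghost rows.            */
#include <mpi.h>

#define LOCAL_ROWS 64
#define WIDTH      256

int main(int argc, char **argv)
{
    double a[LOCAL_ROWS + 2][WIDTH];
    int rank, size, up, down;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < LOCAL_ROWS + 2; i++)   /* placeholder initial data */
        for (int j = 0; j < WIDTH; j++)
            a[i][j] = 0.0;

    up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first real row up, receive my lower ghost row from below */
    MPI_Sendrecv(a[1],              WIDTH, MPI_DOUBLE, up,   0,
                 a[LOCAL_ROWS + 1], WIDTH, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* send my last real row down, receive my upper ghost row from above */
    MPI_Sendrecv(a[LOCAL_ROWS], WIDTH, MPI_DOUBLE, down, 1,
                 a[0],          WIDTH, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... local convolution update may now read rows 0..LOCAL_ROWS+1 ... */

    MPI_Finalize();
    return 0;
}
\end{verbatim}

Once the ghost rows are filled, each node can apply the convolution stencil to its local rows without further communication.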
In the integration example, we would require routines to sum the contributions to the overall integral computed in each node. For this type of application, only simple run-time primitives are required.
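The integration case is sketched below: each node accumulates its share of the strips for the standard integrand $4/(1+x^2)$ on $[0,1]$, whose integral is $\pi$, and a single reduction sums the per-node contributions. The strip count and the interleaved assignment of strips to nodes are illustrative choices, and MPI again stands in for the primitives described in the text.

\begin{verbatim}
/* Sketch of parallel numerical integration: each node integrates its
 * own strips and one reduction forms the global sum on node 0.        */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1000000;                  /* total number of strips (assumed) */
    int rank, size;
    double h, local_sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    h = 1.0 / n;
    for (int i = rank; i < n; i += size) {  /* interleaved strip assignment */
        double x = (i + 0.5) * h;
        local_sum += 4.0 / (1.0 + x * x) * h;
    }

    /* sum the per-node contributions into the final integral on node 0 */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("integral = %.12f\n", total);

    MPI_Finalize();
    return 0;
}
\end{verbatim}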
Problems such as those encountered in large-scale scientific applications typically exhibit behavior in which the global decomposition of data objects varies in some standard manner throughout the execution of the program, but in a deterministic way that can be analyzed at compile time: One routine might require an array to be distributed row by row, for example, while another might require the same array to be partitioned by columns, or perhaps not at all.
These issues can be dealt with during the parallelization process, but they require much more sophisticated run-time support than that described previously. In particular, if the resulting programs are to scale well to larger numbers of nodes, it is essential that run-time routines be supplied to perform matrix transposition, boundary-cell exchange, and global concatenation operations efficiently. For ASPAR, these operations are provided by the Express run-time system.
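To illustrate what such a redistribution involves, the sketch below converts a row-block distribution of an $N \times N$ matrix into a column-block one with an all-to-all exchange. It uses MPI rather than the Express routines themselves, and the matrix size, the row-major packing step, and the assumption that the matrix dimension divides evenly among the nodes are all ours.

\begin{verbatim}
/* Sketch of a row-block to column-block redistribution using an
 * all-to-all exchange; N is assumed divisible by the node count.     */
#include <mpi.h>
#include <stdlib.h>

#define N 512                               /* global matrix is N x N (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nloc = N / size;                    /* rows (or columns) per node */
    double *rows    = calloc((size_t)nloc * N, sizeof(double)); /* my row block */
    double *sendbuf = malloc((size_t)nloc * N * sizeof(double));
    double *recvbuf = malloc((size_t)nloc * N * sizeof(double));

    /* pack: for each destination node d, gather the nloc x nloc sub-block
     * of my rows that falls in d's columns into one contiguous chunk      */
    for (int d = 0; d < size; d++)
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < nloc; j++)
                sendbuf[(d * nloc + i) * nloc + j] = rows[i * N + d * nloc + j];

    /* every node exchanges one sub-block with every other node */
    MPI_Alltoall(sendbuf, nloc * nloc, MPI_DOUBLE,
                 recvbuf, nloc * nloc, MPI_DOUBLE, MPI_COMM_WORLD);

    /* recvbuf now holds my column block as one nloc x nloc sub-block per
     * source node; a final unpack would lay it out as N rows of nloc
     * columns                                                            */

    free(rows); free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
\end{verbatim}

The point is not this particular exchange pattern, but that the run-time system must supply such operations efficiently if the redistribution is not to dominate the execution time on larger numbers of nodes.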
The three categories of decomposition described so far can deal with a significant part of a large majority of ``real'' applications. By this we mean that good implementations of the various dependence analysis, dependence ``breaking'' and run-time support systems can correctly parallelize 90% of the code in any application that is amenable to automatic parallelization. Unfortunately, this is not really good enough.
Our real goal in setting out to produce automatic parallelization tools is to relieve the user of the burden of performing tricky manipulations by hand. Almost by definition, the 10% of each application left untouched by the techniques described so far is the most complex part and probably represents about 95% of the complexity of parallelizing the original code by hand! So, at this point, all we have achieved is the automatic conversion of the really easy parts of the code, probably at the expense of introducing messy computer-generated code that makes the remaining 10% very difficult to understand.
The solution to this problem comes from the adoption of a much more sophisticated picture of the run-time environment.