The Laplace solver that we used for the test run embodies the typical operation that is done with finite-element meshes. This operation is matrix-vector multiply. Thus, we are not testing load-balancing strategies just for a Laplace solver but for a general class of applications, namely, those which use matrix-vector multiply as the heart of a scheme which iterates to convergence on a fixed mesh, then refines the mesh and repeats the convergence.

The case of the Laplace solver has a high ratio of communication to calculation, as may be seen from the discussion of Section 11.1.1, and thus brings out differences in load-balancing algorithms particularly well.

Each load-balancing algorithm may be measured by three criteria:\

- the quality of the solution it produces, measured by the time per
iteration in the solver;
- the time it takes to do the load balancing, measured by the time it
takes to solve the graph-coloring problem and by the number of elements which
must then be migrated; and
- the portability of the method for different kinds of applications with different kinds of meshes, and the number of parameters that must be set to obtain optimal performance from the method.

Orthogonal recursive bisection is certainly cheap, both in terms of the time it takes to solve the graph-coloring problem and the number of elements which must be migrated. It is also portable to different applications, the only required information being the dimensionality of the mesh. And it is easy to program. Our tests indicate, however, that more expensive methods can improve performance by over 20%. Because ORB pays no attention to the connectivity of the element graph, one suspects that as the geometry of the underlying domain and solution becomes more complex, this gap will widen.

Simulated annealing is actually a family of methods for solving optimization problems. Even when run sequentially, care must be taken in choosing the correct set of changes that may be made to the state space, and in choosing a temperature schedule to ensure a good optimum. We have tried a ``brute force'' parallelization of simulated annealing, essentially ignoring the parallelism. For sufficiently slow cooling, this method produces the best solution to the load-balancing problem when measured either against the load-balance cost function, or by timings on a real parallel computer. Unfortunately, it takes a long time to produce this high-quality solution, perhaps because some of the numerous input parameters are not set optimally. A more sensitive treatment is probably required to reduce or eliminate parallel collisions [Baiardi:89a]. Clearly, further work is required to make SA a portable and efficient parallel load balancer for parallel finite-element meshes. True portability may be difficult to achieve for SA, because the problem being solved is graph coloring, and graphs are extremely diverse; perhaps something approaching an expert system may be required to decide the optimal annealing strategy for a particular graph.

Eigenvalue recursive bisection seems to be a good compromise between the other methods, providing a solution of quality near that of SA at a price little more than that of ORB. There are few parameters to be set, which are concerned with the Lanczos algorithm for finding the second eigenvector. Mathematical analysis of the ERB method takes place in the familiar territory of linear algebra, in contrast to analysis of SA in the jungles of nonequilibrium thermodynamics. A major point in favor of ERB for balancing finite-element meshes is that the software for load balancing with ERB is shared to a large extent with the body of finite-element software: The heart of the eigenvector calculation is a matrix-vector multiply, which has already been efficiently coded elsewhere in the finite-element library. Recursive spectral bisection [Barnard:93a] has been developed as a production load balancer and very successfully applied to a variety of finite-element problems.

The CP research described in this section has been continued by Mansour in Fox's new group at Syracuse [Mansour:91a;92a-e].

He has considered simulating annealing, genetic algorithms, neural networks, and spectral bisection producing in each parallel implementation. Further, he introduced a multiscale or graph contraction approach where large problems to be decomposed are not directly tackled but are first ``clumped'' or contracted to a smaller problem [Mansour:93b], [Ponnusamy:93a]. The latter can be decomposed using the basic techniques discussed above and this solution of the small problem used to initialize a fast refinement algorithm for the original large problem. This strategy has an identical philosophy to the multigrid approach (Section 9.7) for partial differential equations. We are currently collaborating with Saltz in integrating these data decomposers into the high-level data-parallel languages reviewed in Section 13.1.

Wed Mar 1 10:19:35 EST 1995