Overlapping communication and computation

Previous: Inner products
Up: Inner products
Next: Fewer synchronization points
Previous Page: Inner products
Next Page: Fewer synchronization points

Overlapping communication and computation

Clearly, in the usual formulation of conjugate gradient-type methods the inner products induce a synchronization of the processors, since they cannot progress until the final result has been computed: updating and can only begin after completing the inner product for . Since on a distributed-memory machine communication is needed for the inner product, we cannot overlap this communication with useful computation. The same observation applies to updating , which can only begin after completing the inner product for .

Figure shows a variant of CG, in which all communication time may be overlapped with useful computations. This is just a reorganized version of the original CG scheme, and is therefore precisely as stable. Another advantage over other approaches (see below) is that no additional operations are required.

This rearrangement is based on two tricks. The first is that updating the iterate is delayed to mask the communication stage of the inner product. The second trick relies on splitting the (symmetric) preconditioner as , so one first computes , after which the inner product can be computed as where . The computation of will then mask the communication stage of the inner product.

Under the assumptions that we have made, CG can be efficiently parallelized as follows:

The communication required for the reduction of the inner product for can be overlapped with the update for , (which could in fact have been done in the previous iteration step).
The reduction of the inner product for can be overlapped with the remaining part of the preconditioning operation at the beginning of the next iteration.
The computation of a segment of can be followed immediately by the computation of a segment of , and this can be followed by the computation of a part of the inner product. This saves on load operations for segments of and .

For a more detailed discussion see Demmel, Heath and Van der Vorst [64]. This algorithm can be extended trivially to preconditioners of

form, and nonsymmetric preconditioners in the Biconjugate Gradient Method.

Previous: Inner products
Up: Inner products
Next: Fewer synchronization points
Previous Page: Inner products
Next Page: Fewer synchronization points