Clearly, in the usual formulation of conjugate gradient-type methods the inner products induce a synchronization of the processors, since they cannot progress until the final result has been computed: updating $x^{(i)}$ and $r^{(i)}$ can only begin after completing the inner product for $\alpha_i$. Since on a distributed-memory machine communication is needed for the inner product, we cannot overlap this communication with useful computation. The same observation applies to updating $p^{(i)}$, which can only begin after completing the inner product for $\beta_{i-1}$.
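To make the synchronization concrete, here is a minimal sketch of a distributed inner product, assuming MPI and block-distributed vectors (the text does not prescribe a particular message-passing library). The global sum is a blocking collective, so no process can form $\alpha_i$, and hence start updating $x^{(i)}$ and $r^{(i)}$, before every process has contributed:

```c
#include <mpi.h>

/* Distributed inner product: each process holds n_local entries of p and q. */
double dot(const double *p, const double *q, int n_local) {
    double local = 0.0, global;
    for (int i = 0; i < n_local; i++)
        local += p[i] * q[i];
    /* Blocking collective: this is the synchronization point -- every
     * process waits here until the global sum is complete. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}
```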
The figure below shows a variant of CG in which all communication time may be overlapped with useful computation. This is just a reorganized version of the original CG scheme, and is therefore precisely as stable. Another advantage over other approaches (see below) is that no additional operations are required.
This rearrangement is based on two tricks. The first is that updating the iterate is delayed to mask the communication stage of the $p^{(i)\,T} q^{(i)}$ inner product. The second trick relies on splitting the (symmetric) preconditioner as $M = LL^T$, so one first computes $s = L^{-1} r^{(i)}$, after which the inner product $r^{(i)\,T} M^{-1} r^{(i)}$ can be computed as $s^T s$. The computation of $L^{-T} s$ will then mask the communication stage of the inner product.
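The second trick can be illustrated with a nonblocking reduction (MPI_Iallreduce, available since MPI-3): once $s = L^{-1} r^{(i)}$ is in hand, the global sum for $s^T s$ is started and the local triangular solve for $L^{-T} s$ runs while the reduction is in flight. In this sketch, solve_L and solve_LT are assumed to be purely local triangular solves (as with a block preconditioner); they are placeholders, not part of the original text:

```c
#include <mpi.h>

void solve_L (double *s, const double *r, int n);  /* s = L^{-1} r (assumed) */
void solve_LT(double *w, const double *s, int n);  /* w = L^{-T} s (assumed) */

/* Compute rho = r^T M^{-1} r = s^T s, overlapping the reduction with the
 * backward solve w = L^{-T} s. */
double overlapped_rho(double *s, double *w, const double *r, int n_local) {
    double rho_local = 0.0, rho;
    MPI_Request req;

    solve_L(s, r, n_local);                 /* local forward solve          */
    for (int i = 0; i < n_local; i++)
        rho_local += s[i] * s[i];           /* local part of s^T s          */
    MPI_Iallreduce(&rho_local, &rho, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);   /* start the global sum         */
    solve_LT(w, s, n_local);                /* local work masks the comm    */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return rho;
}
```

In the rearranged scheme the backward solve actually belongs to the start of the next iteration; it is shown here next to the reduction only to make the overlap explicit.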
Figure: A rearrangement of Conjugate Gradient for parallelism
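The figure itself is not reproduced here. As a stand-in, the following is a minimal C sketch of the rearranged loop, under these assumptions: matvec, solve_L, solve_LT, and the distributed dot from the first sketch are placeholder helpers; before the loop, $r = b - Ax$, $s = L^{-1} r$, $\rho = s^T s$, $p = 0$, and $\alpha = \beta = 0$; blocking dot products are used for brevity (an actual implementation would start each reduction nonblocking, as above); the convergence test is omitted:

```c
for (int it = 0; it < maxit; it++) {
    solve_LT(w, s, n);                      /* w = L^{-T} s                   */
    for (int j = 0; j < n; j++) {
        p_prev[j] = p[j];                   /* keep p^{(i-1)} for the delayed */
        p[j]      = w[j] + beta * p[j];     /* x update; p = w + beta p       */
    }
    matvec(q, A, p, n);                     /* q = A p                        */
    gamma = dot(p, q, n);                   /* reduction for gamma: overlap   */
    for (int j = 0; j < n; j++)             /* it with the delayed update of  */
        x[j] += alpha * p_prev[j];          /* x (alpha is still alpha_{i-1}) */
    alpha = rho / gamma;
    for (int j = 0; j < n; j++)
        r[j] -= alpha * q[j];               /* r = r - alpha q                */
    solve_L(s, r, n);                       /* s = L^{-1} r                   */
    rho_prev = rho;
    rho = dot(s, s, n);                     /* reduction for rho: masked by   */
    beta = rho / rho_prev;                  /* L^{-T} s in the next pass      */
}
for (int j = 0; j < n; j++)
    x[j] += alpha * p[j];                   /* final delayed update of x      */
```

Note that the iterate lags one step behind, which is why one last update of $x$ is needed on exit.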
Under the assumptions that we have made, CG can be efficiently parallelized as follows: