An important difference between ScaLAPACK and LAPACK is that a parallel computing environment, possibly consisting of a heterogeneous collection of processors, introduces new sources of possible errors not found in the serial environment in which LAPACK runs. These errors could indeed afflict any parallel algorithm that uses floating-point arithmetic. For example, consider the following pseudocode, executed in parallel by several processors:
s= global_sum(x) ... each processor receives the sum s of global array xif s < thresh then
return my part of answer 1
else
do more computations
return my part of answer 2
end if
It is possible for the value of s to differ from processor to processor; we call this incoherence. This can happen if the floating-point arithmetic varies from processor to processor (we call this heterogeneity), since processors may not even share the same set of floating-point numbers. The value of s can also vary if global_sum accumulates the sum in different orders on different processors, since floating-point addition is not associative. In either case, the test s< thresh may be true on one processor but not another, so that the program may inconsistently return answer 1 on some processors and answer 2 on others. If the ``more computations'' include communication with synchronization, even deadlock could result.
Deadlock can also result if the floating-point numbers communicated from one processor to another cause fatal floating-point errors on the receiving processor. For example, if an IBM RS/6000, running in its default mode, sends a message containing a denormalized number [7, 8] to a DEC Alpha running in its default mode, then the DEC Alpha aborts [19].
It is also possible for global_sum to compute the same s on all processors but compute a different s from run to run of the program, for example, if global_sum computes the sum in a nondeterministic order on one processor and broadcasts the result to all processors. We call this nonrepeatability. If this happens, debugging the overall code can be more difficult.
Coherence and repeatability are independent properties of an algorithm. It is possible in principle for an algorithm running on a particular platform to be incoherent and repeatable, coherent and nonrepeatable, or any other combination. On a different platform, the same algorithm may have different properties.
Reference [19] contains a more extensive discussion of these possible errors.
One run of a ScaLAPACK routine is designed to be as reliable as LAPACK, so that errors due to incoherence cannot occur as long as ScaLAPACK is executed on a homogeneous network of processors. The following conditions apply:
The above conditions guarantee that a single ScaLAPACK call is as reliable as its LAPACK counterpart. If, in addition, identical answers from one run to another are desired (i.e., repeatability), this can be guaranteed at runtime by calling BLACS_SET to enforce repeatability of the BLACS, and the ScaLAPACK routines that use them, by using an appropriate topology (see the BLACS users guide [54] for details).
Maintaining coherence on a heterogeneous network is harder, and not always possible. If floating-point formats differ (say, on a Cray C90 and IBM RS/6000, which uses IEEE arithmetic), there is no cost-effective way to guarantee coherence. If floating-point formats are the same, however, operations such as global sums can accumulate the result on one processor and broadcast it to guarantee coherence (except for the problem of DEC Alphas and denormalized numbers mentioned above). The BLACS do this, except when using the ``bidirectional exchange'' topology. One can avoid using ``bidirectional exchange'' and so guarantee coherence whenever possible, by calling BLACS_SET to enforce coherence (see the BLACS users guide [54] for details).
Still other ScaLAPACK routines are guaranteed to work only on homogeneous networks (PxGESVD and PxSYEV). These routines do large numbers of redundant calculations on all processors and depend on the results of these calculations being the same. There are too many of these calculations to cost-effectively compute them all on one processor and broadcast the results.
The user may wonder why ScaLAPACK and the BLACS are not designed to guarantee coherence and repeatability in the most general possible situations, so that calling BLACS_SET would not be necessary. The reason is that the possible bugs described above are quite rare, and so ScaLAPACK and the BLACS were designed to maximize performance instead. Provided the mere sending of floating-point numbers does not cause a fatal error, these bugs cannot occur at all in most ScaLAPACK routines, because branches depending on a supposedly identical floating-point value like s do not occur. For most other ScaLAPACK routines where such branches do occur, we have not seen these bugs despite extensive testing, including attempts to cause them to occur. Complete understanding and cost-effective elimination of such possible bugs are future work.
In the meantime, to get repeatability when running on a homogeneous network, we recommend calling BLACS_SET as described above when using the following ScaLAPACK drivers: PxGESVX, PxPOSVX, PxSYEV, PxSYEVX, PxGESVD, and PxSYGVX.