There are special challenges associated with writing reliable
numerical software on networks containing heterogeneous
processors, i.e., processors which may do floating point
arithmetic differently. This includes not just machines
with completely different floating point formats and
semantics (e.g. Cray versus workstations running IEEE
standard floating point arithmetic), but even supposedly
identical machines running with different compilers or
even just different compiler options. The basic problem
lies in making *data dependent branches* on different
processors: the same comparison may branch one way on one
processor and the other way on another, leaving processors
executing completely different sections of code than their
neighbors expect. We give three examples of this below.

The simplest example is an iteration where the stopping
criterion depends on the machine precision. If the precision
varies from processor to processor, the stopping criteria
will differ significantly from processor to processor as well.
In particular, the criterion for the most accurate processor
may never be satisfied if it depends on data computed less
accurately by other processors. Many problems like this can
be eliminated by using the *largest* machine epsilon among
all participating processes. Routine `PDLAMCH` returns
this largest value, replacing the uniprocessor `DLAMCH`.
Similarly, one would use the smallest overflow threshold and
largest underflow threshold for other calculations. But this
is not a panacea, as subsequent examples show.
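The idea can be illustrated with a small sketch (in Python with NumPy
rather than the Fortran of the actual routines; the plain `max` below
stands in for the parallel reduction that `PDLAMCH` performs, and the
names `local_eps` and `pdlamch_eps` are hypothetical):

```python
import numpy as np

def local_eps(dtype):
    # Machine epsilon for this "processor's" floating point format.
    return float(np.finfo(dtype).eps)

def pdlamch_eps(local_values):
    # Sketch of the idea behind PDLAMCH: agree on the *largest*
    # epsilon among all participating processes.  Here a plain max
    # over a list stands in for the parallel reduction.
    return max(local_values)

# Two heterogeneous "processors": one single precision, one double.
eps_list = [local_eps(np.float32), local_eps(np.float64)]
safe_eps = pdlamch_eps(eps_list)

# A stopping test such as |x_new - x_old| <= safe_eps * |x_new| is now
# satisfiable on every processor, because no processor computes more
# accurately than safe_eps assumes.
```

In this sketch `safe_eps` is the single-precision epsilon, so the
double-precision process uses a coarser tolerance than its hardware
would allow, which is exactly the price paid for agreement.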

Next, consider the situation where processors sharing a distributed
vector compute its two-norm, and depending on that either scale
by a constant much different from 1, or do not. This happens in
the inner loop of the QR decomposition, for example. The two-norm
is computed by the ScaLAPACK routine `PDNRM2`, which computes
two-norms locally and does a reduction. If the participating processors
have different floating point formats, they may receive different values
of the two-norm on return, just because the same floating point numbers
cannot be represented on all machines. This two-norm is then compared
to a threshold, and if it exceeds the threshold scaling takes place.
Since the two-norm may be different, and the threshold may be different,
the result of the comparison could differ on different processors, so
that one process would scale the sub-vector it owns, and another would
not. This would very likely lead to erroneous results. This could in
principle be corrected by extending the reduction operation `PDNRM2`
to broadcast a discrete value (like the boolean value of a comparison);
then all participating processors would be able to agree with the
processor at the root of the reduction tree.
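A sketch of this proposed fix, with hypothetical names (a list of
sub-vectors stands in for the distributed vector, and a summed
reduction followed by `math.sqrt` stands in for `PDNRM2`): the
comparison against the threshold is made once, as if at the root of
the reduction tree, and its boolean result is what every process acts
on.

```python
import math
import numpy as np

def scale_consistently(local_pieces, threshold, scale=0.5):
    # local_pieces: one sub-vector per simulated process.
    # The "root" computes the global two-norm from the reduced sum of
    # squares (stand-in for PDNRM2's reduction).
    norm = math.sqrt(sum(float(np.dot(p, p)) for p in local_pieces))
    # The data dependent branch is decided ONCE, at the root ...
    must_scale = norm > threshold
    # ... and the discrete decision is "broadcast": every process
    # receives the same boolean, so either all sub-vectors are
    # scaled or none are.
    return [p * scale if must_scale else p for p in local_pieces], must_scale

pieces = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
scaled, decision = scale_consistently(pieces, threshold=10.0)
```

Had each process instead compared its own (possibly differently
rounded) copy of the norm against its own threshold, one process could
scale while another did not, corrupting the vector.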

However, there are still harder problems. Consider bisection for
finding eigenvalues of symmetric matrices. In this algorithm, the
real axis is broken into disjoint intervals to be searched by different
processors for the eigenvalues contained in each. Disjoint intervals
are searched in parallel. The algorithm depends on a function, call
it `count(a,b)`, that counts the number of eigenvalues in the half
open interval [`a`, `b` ). Using `count`, intervals can be
subdivided into smaller intervals containing eigenvalues until the
intervals are narrow enough to declare the eigenvalues they contain
as ``found''. One problem is that two processors with different
floating point formats cannot even agree on the boundary between
their intervals, because they cannot store the same floating point
number. This could result in multiple copies of eigenvalues if
intervals overlap, or missing eigenvalues if there are gaps
between intervals. Furthermore, the `count` function may
count differently on different processors, so an interval
[`a`, `b` ) may be considered to contain 1 eigenvalue
by processor A, but 0 eigenvalues by processor B, which has been
given the interval by processor A during load balancing. This can
happen even when processors A and B are identical, if their
compilers generate slightly different code sequences for `count`.
We have not yet decided what to do about all of these problems, so we
currently only guarantee correctness of `PDSYEVX` for networks
of processors with identical floating point formats (slightly
different floating point operations are acceptable). See
[4] for details.
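For concreteness, a `count` function of the kind bisection relies on
can be sketched as a classical Sturm-sequence pivot count for a
symmetric tridiagonal matrix (a textbook version, not the actual
`PDSYEVX` internals, and it omits the usual safeguards against zero
pivots). The division `e[i-1]**2 / t` is exactly the sort of operation
whose rounding can differ between compiled code sequences, flipping a
pivot's sign when an eigenvalue lies very close to `x`:

```python
def sturm_count(d, e, x):
    # Number of eigenvalues of the symmetric tridiagonal matrix with
    # diagonal d and off-diagonal e that are less than x, counted as
    # the negative pivots in the LDL^T factorization of T - x*I
    # (Sylvester's law of inertia).  No safeguard for t == 0.
    count, t = 0, d[0] - x
    if t < 0:
        count += 1
    for i in range(1, len(d)):
        t = d[i] - x - e[i - 1] ** 2 / t
        if t < 0:
            count += 1
    return count

def count_interval(d, e, a, b):
    # Eigenvalues in the half-open interval [a, b).
    return sturm_count(d, e, b) - sturm_count(d, e, a)

# 2x2 example: the eigenvalues of [[2, 1], [1, 2]] are 1 and 3.
d, e = [2.0, 2.0], [1.0]
n_in_interval = count_interval(d, e, 0.0, 2.5)   # only eigenvalue 1 lies in [0, 2.5)
```

If two processors evaluate the loop with even slightly different
rounding near a shift `x` close to an eigenvalue, their values of
`sturm_count` at that `x` can disagree by one, which is precisely how
an interval can appear to contain an eigenvalue on processor A but not
on processor B.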

Fri Mar 31 13:01:26 EST 1995