The suggestions we have made so far certainly do not solve all of the problems. We are still left with major concerns about the problems associated with varying floating point representations and arithmetic operations on different processors, with different compilers, and with different compiler options. We gave one example at the end of Section 3, and we now illustrate the difficulties with three further examples from ScaLAPACK, the second example giving rather more severe difficulties than the first and third.
Many routines in LAPACK, and hence also in ScaLAPACK, scale vectors and matrices. The scaling is done to equilibrate or balance a matrix in order to improve its condition, to avoid harmful underflow or overflow, or even to improve accuracy by scaling away from subnormal numbers. When scaling occurs we naturally have to ensure that all processes containing elements of the vector or matrix to be scaled take part in the scaling. Consider the case of a four element vector distributed over two processors, where each processor tests the vector to decide whether scaling is needed. If we let each processor make the decision independently then we risk the danger of one processor scaling while the other does not. If this situation occurred, the computation would proceed with a meaningless vector in which only half of the elements had been scaled.
One way to ensure correct computation is to put one process in control of whether or not scaling should take place, and for that process to communicate the decision to the other processes. Having a controlling process is a common way to solve such problems on heterogeneous networks.
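The hazard and the fix can be sketched in a few lines of plain Python, simulating two processes; the threshold and norm values below are illustrative assumptions, not ScaLAPACK quantities:

```python
# Simulate two processes that each hold half of a four-element vector
# and must agree on whether to scale it.  Slightly different arithmetic
# puts their locally computed norms on opposite sides of the threshold.

THRESHOLD = 1e-300  # illustrative scaling threshold, not ScaLAPACK's sfmin

local_norms = [0.9e-300, 1.1e-300]  # hypothetical values on process 0 and 1

# Independent decisions: the processes disagree, so half the vector
# would be scaled and half would not -- a meaningless result.
independent = [norm < THRESHOLD for norm in local_norms]
assert independent == [True, False]

# Fix: a controlling process (here, process 0) makes the decision and
# "communicates" it to everyone, so all processes act consistently.
controlled = [local_norms[0] < THRESHOLD] * len(local_norms)
assert controlled == [True, True]
```

The same effect is achieved more cheaply in practice by a global reduction of the local decisions, so that every process sees the identical combined result.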
An example of a routine that scales to improve accuracy is the LAPACK routine DLARFG, which computes an elementary reflector (Householder transformation matrix) $H$ such that

$Hx = \beta e_1$,

where $\beta$ is a scalar, $x$ is an $n$ element vector and $e_1$ is the first column of the unit matrix. $H$ is represented in the form

$H = I - \tau v v^T$,

where $\tau$ is a scalar and $v$ is an $n$ element vector. Since $H$ is orthogonal we see that $|\beta| = \|x\|_2$. If $\|x\|_2$ is very small (subnormal or close to being subnormal), DLARFG scales $x$ and recomputes $\|x\|_2$. This computation is at the heart of the LAPACK QR, and other, factorizations (see for example [10]).
In the case of the equivalent ScaLAPACK routine PDLARFG, $x$ will typically be distributed over several processors, each of which participates in the computation of $\|x\|_2$ and, if necessary, scales its portion of the vector $x$ and recomputes $\|x\|_2$. From the previous discussion we can see that we clearly need to take care here, or else, in close cases, some processors may attempt to scale their portion and recompute $\|x\|_2$ while others do not, leading to completely erroneous results, or even deadlock. This care will be exercised when ScaLAPACK is able to call the version of the BLACS that supports caching, as discussed at the end of Section 4. The hope is that this will occur in Version 2.0 of ScaLAPACK. We could of course solve the problem now by using the idea mentioned above of a controlling process, but this would involve a rather heavy communication burden, and we prefer to wait until we can use the more efficient solution based upon the BLACS. Although failure is very unlikely, and indeed we have not yet been able to find an example that fails without artificially altering the PDLARFG code, the possibility of failure exists.
Whilst we could not find an example that failed without altering the code, we were able to simulate such a heterogeneous failure experimentally, using the current version of ScaLAPACK, by performing the QR factorization of a 6 by 6 matrix $A$ whose column norms were chosen close to the scaling threshold. The value of sfmin is used in PDLARFG to determine whether or not to scale the vector, and we artificially adjusted this value on one of the processes involved in the scaling decision, so that the processes disagreed on whether to scale. As expected, the execution of the factorization hung.
As the second, and somewhat harder, problem, consider the method of bisection for finding the eigenvalues of symmetric matrices, performed by the ScaLAPACK routine PDSYEVX. In this algorithm, the real axis is broken into disjoint intervals to be searched by different processes for the eigenvalues contained in each interval. Disjoint intervals are searched in parallel. The algorithm depends on a function, say count(a,b), that counts the number of eigenvalues in the half-open interval $[a, b)$. Using count, intervals can be subdivided into smaller intervals containing eigenvalues until the intervals are narrow enough to declare the eigenvalues they contain as being found.

The problem here is that two processors may not agree on the boundary between their intervals. This could result in multiple copies of eigenvalues if intervals overlap, or in missing eigenvalues if there are gaps between the intervals. Furthermore, the count function may count differently on different processors, so an interval $[a, b)$ may be considered to contain one eigenvalue by processor A, but zero eigenvalues by processor B, to which processor A has passed the interval during load balancing. This can happen even if processors A and B have identical hardware, if the compilers on each generate slightly different code sequences for count.

In this example we have not yet decided what to do about all these problems, so we currently only guarantee correctness of PDSYEVX for networks of processors with identical floating point formats (slightly different floating point operations turn out to be acceptable). See [4] for further discussion. Assigning the work by index rather than by range, and sorting all the eigenvalues at the end, may give the desired result with modest overhead. Of course, if floating point formats differ across processors, sorting is a problem in itself. This requires further investigation.
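For a symmetric tridiagonal matrix, a count function of this kind can be sketched via the classical Sturm sequence (inertia) computation; the small matrix and shifts below are illustrative, and the last two assertions show how moving an interval endpoint by a single ulp across an eigenvalue changes the count, which is exactly the boundary disagreement at issue:

```python
import math

def neg_pivots(d, e, sigma):
    """Number of eigenvalues less than sigma for the symmetric
    tridiagonal matrix with diagonal d and off-diagonal e, counted
    as the negative pivots of the LDL^T factorization of T - sigma*I."""
    count = 0
    piv = d[0] - sigma
    if piv < 0:
        count += 1
    for i in range(1, len(d)):
        piv = d[i] - sigma - e[i - 1] ** 2 / piv
        if piv < 0:
            count += 1
    return count

def count(d, e, a, b):
    """Eigenvalues in the half-open interval [a, b)."""
    return neg_pivots(d, e, b) - neg_pivots(d, e, a)

# 3x3 tridiagonal with eigenvalues 2 - sqrt(2), 2, and 2 + sqrt(2).
d, e = [2.0, 2.0, 2.0], [-1.0, -1.0]
assert count(d, e, 0.5, 2.5) == 2
assert count(d, e, 0.5, 3.5) == 3

# A one-ulp shift of an endpoint across the eigenvalue at 2.0
# changes the count: two processes evaluating "the same" boundary
# with slightly different arithmetic can disagree.
assert neg_pivots(d, e, math.nextafter(2.0, -math.inf)) == 1
assert neg_pivots(d, e, math.nextafter(2.0, math.inf)) == 2
```

This sensitivity is inherent in any count function, which is why consistency across processes, rather than accuracy alone, is the requirement here.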
The symmetric eigensolvers, PDSYEVX and PZHEEVX, may also have trouble on heterogeneous networks when a subset of eigenvalues is chosen by value (i.e. RANGE='V') and one of the limits of that range (VL or VU) is within a couple of units in the last place (ulps) of an actual eigenvalue. Two processes may then disagree on the number of eigenvalues specified by the range VL and VU, and the code breaks with each process returning a nonzero value of INFO (the LAPACK and ScaLAPACK failure indicator). This situation can happen when running the test code and should again be corrected in the next release. In every case that we have seen, the answer is correct despite the spurious error message. This is not a problem on homogeneous systems.
The third example is based upon the idea that some algorithms can perform redundant work in order to gain parallelism. While redundant work on different processors is intended to yield identical results, this may not be the case in a heterogeneous environment. For instance, one approach to parallelizing the symmetric eigenproblem is to run the tridiagonal QR algorithm, which reduces the tridiagonal matrix to diagonal form, redundantly on all processors, save the plane rotations, and then accumulate the resulting Givens rotations in parallel into the relevant columns of the unit matrix. The QR algorithm itself is duplicated on every processor, but the accumulation is done in parallel and requires no communication. Since the QR algorithm is not in general forward stable, slight differences in the underlying arithmetic can lead to completely different rotations, and hence to the danger of obtaining quite inconsistent eigenvectors.
This problem can be solved by having a controlling process that runs the QR algorithm and then broadcasts the plane rotations to the other processes, but the communication cost of broadcasting all of the rotations is substantial.
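The plane rotations in question can be sketched as follows; this is a textbook-style Givens rotation generator in the spirit of LAPACK's DLARTG, not the ScaLAPACK code itself:

```python
import math

def givens(a, b):
    """Return (c, s, r) such that [[c, s], [-s, c]] @ [a, b] = [r, 0].
    Textbook construction; LAPACK's DLARTG adds further safeguards."""
    if b == 0.0:
        return 1.0, 0.0, a
    r = math.hypot(a, b)
    return a / r, b / r, r

c, s, r = givens(3.0, 4.0)
assert abs(c * 3.0 + s * 4.0 - r) < 1e-12  # first component becomes r
assert abs(-s * 3.0 + c * 4.0) < 1e-12     # second component eliminated
assert abs(c * c + s * s - 1.0) < 1e-12    # the rotation is orthogonal
```

Each QR sweep produces a sequence of such $(c, s)$ pairs, and it is these sequences that can diverge between processors when the underlying arithmetic differs slightly.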