The suggestions we have made so far certainly do not solve all of the problems. We are still left with major concerns about the problems associated with varying floating point representations and arithmetic operations on different processors, with different compilers, and with different compiler options. We have given one example at the end of Section 3, and we now illustrate the difficulties with three further examples from ScaLAPACK, the second of which presents rather more severe difficulties than the first and third.
Many routines in LAPACK, and hence also in ScaLAPACK, scale vectors and matrices. The scaling is done to equilibrate or balance a matrix in order to improve its condition, to avoid harmful underflow or overflow, or even to improve accuracy by scaling away from subnormal numbers. When scaling occurs we naturally have to ensure that all processes containing elements of the vector or matrix to be scaled take part in the scaling. Consider the case of a four element vector $x = (x_1, x_2, x_3, x_4)^T$ distributed over two processors, say with $(x_1, x_2)$ on the first and $(x_3, x_4)$ on the second, with a test for scaling of the form: if $\|x\| < tol$, replace x by $x/\|x\|$.
If we let each processor make the decision independently, using its own computed value of $\|x\|$, then in close cases we risk one processor scaling while the other does not. If this situation occurred, the computation would proceed with a meaningless vector of the form $(x_1/\|x\|,\ x_2/\|x\|,\ x_3,\ x_4)^T$, in which only the elements held by the first processor have been scaled.
One way to ensure correct computation is to put one process in control of whether or not scaling should take place, and for that process to communicate the decision to the other processes. Having a controlling process is a common way to solve such problems on heterogeneous networks.
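As a concrete sketch of both the hazard and the controlling-process remedy, the following plain Python fragment simulates the two processes within one program; the vector, the tolerance tol, the use of the 2-norm and the scale factor are illustrative assumptions, not the actual ScaLAPACK test.

    import numpy as np

    tol = 1e-3                                    # hypothetical scaling threshold
    x = np.array([2.0e-4, 3.0e-4, 1.0e-4, 4.0e-4])

    # Hazard: each simulated process evaluates the test itself.  Summing in a
    # different order stands in for differing hardware or compilers; the two
    # norms can differ in the last bit, and in close cases the comparisons
    # against tol may then disagree, so one process scales while the other
    # does not.
    norm_p0 = np.sqrt(np.sum(x**2))
    norm_p1 = np.sqrt(np.sum(x[::-1]**2))
    decision_p0 = norm_p0 < tol
    decision_p1 = norm_p1 < tol                   # may differ from decision_p0

    # Remedy: a single controlling process evaluates the test once, and that
    # one boolean is (conceptually) communicated to everybody, so either all
    # elements of x are scaled or none are.
    do_scale = decision_p0                        # decision taken in one place
    x_scaled = x / norm_p0 if do_scale else x     # applied identically by all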
An example of a routine that scales to improve accuracy is the LAPACK routine DLARFG, which computes an elementary reflector (Householder transformation matrix) H such that

$$H x = \beta e_1,$$

where $\beta$ is a scalar, x is an n element vector and $e_1$ is the first column of the unit matrix. H is represented in the form

$$H = I - \tau v v^T,$$

where $\tau$ is a scalar and v is an n element vector. Since H is orthogonal we see that

$$|\beta| = \|x\|_2.$$

If $\beta$ is very small (subnormal, or close to being subnormal), DLARFG scales x and recomputes $\beta$. This computation is at the heart of the LAPACK QR, and other, factorizations (see for example [10]).
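The shape of such a computation can be sketched as follows (plain Python/NumPy; this mimics the rescaling branch in the spirit of DLARFG but is not the LAPACK code, and SFMIN is only a stand-in for the safe-minimum threshold that DLARFG obtains from DLAMCH).

    import numpy as np

    SFMIN = np.finfo(float).tiny / np.finfo(float).eps  # stand-in safe minimum

    def householder(x):
        """Return (beta, tau, v) with (I - tau*v*v^T) @ x = beta*e_1 and
        |beta| = ||x||_2, rescaling x when beta would be dangerously small."""
        x = np.array(x, dtype=float)
        beta = -np.copysign(np.linalg.norm(x), x[0])
        if beta == 0.0:                        # x is the zero vector: H = I
            return 0.0, 0.0, np.zeros_like(x)
        knt = 0
        while abs(beta) < SFMIN and knt < 20:  # (nearly) subnormal: scale x up
            x /= SFMIN
            knt += 1
            beta = -np.copysign(np.linalg.norm(x), x[0])
        alpha = x[0]
        tau = (beta - alpha) / beta
        v = np.concatenate(([1.0], x[1:] / (alpha - beta)))
        return beta * SFMIN**knt, tau, v       # undo the scaling in beta

In the distributed routine each process holds only a portion of x, and it is the consistency of this rescaling decision across processes that the next paragraph is concerned with.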
In the case of the equivalent ScaLAPACK routine PDLARFG, x will typically be distributed over several processors, each of which participates in the computation of $\beta$ and, if necessary, scales its portion of the vector x and recomputes $\beta$. From the previous discussion we can see that we clearly need to take care here, or else, in close cases, some processors may attempt to recompute $\beta$ while others do not, leading to completely erroneous results, or even deadlock. This care will be exercised when ScaLAPACK is able to call the version of the BLACS that supports caching, as discussed at the end of Section 4. The hope is that this will occur for Version 2.0 of ScaLAPACK. We could of course solve the problem now by using the idea mentioned above of a controlling process, but this would involve a rather heavy communication burden, and we prefer to wait until we can use the more efficient solution based upon the BLACS. Although failure is very unlikely, and indeed we have not yet been able to find an example that fails without artificially altering the PDLARFG code, the possibility of failure exists.
Whilst we could not find an example that failed without altering the code, we were able to simulate such a heterogeneous failure experimentally, using the current version of ScaLAPACK, by performing the QR factorization of a suitably chosen 6 by 6 matrix A. The value of sfmin is used in PDLARFG to determine whether or not to scale the vector, and we artificially adjusted this value on one of the processes involved in the scaling decision so that the processes made different scaling decisions. As expected, the execution of the factorization hung.
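The essence of that simulation can be captured in a few lines (plain Python; the value of beta and the perturbed threshold are invented for illustration and do not reproduce the actual experiment).

    import numpy as np

    SFMIN = np.finfo(float).tiny / np.finfo(float).eps  # nominal threshold

    beta = 1.5 * SFMIN     # a value falling between the two thresholds below

    # Simulated heterogeneity: one process "sees" an artificially enlarged
    # threshold, standing in for the adjusted sfmin described above.
    thresholds = {"process 0": SFMIN, "process 1": 2.0 * SFMIN}

    for proc, sfmin in thresholds.items():
        print(proc, "enters the rescaling branch:", abs(beta) < sfmin)

    # Process 0 does not enter the branch but process 1 does; because the
    # branch involves recomputing beta collectively, the mismatch is what
    # causes the real factorization to hang.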
As the second, and somewhat harder, problem, consider the method of bisection for finding the eigenvalues of symmetric matrices, performed by the ScaLAPACK routine PDSYEVX. In this algorithm, the real axis is broken into disjoint intervals to be searched by different processes for the eigenvalues contained in each interval. Disjoint intervals are searched in parallel. The algorithm depends on a function, say count(a,b), that counts the number of eigenvalues in the half open interval [a, b). Using count, intervals can be subdivided into smaller intervals containing eigenvalues until the intervals are narrow enough to declare the eigenvalues they contain as being found. The problem here is that two processors may not agree on the boundary between their intervals. This could result in multiple copies of eigenvalues if intervals overlap, or missing eigenvalues if there are gaps between intervals. Furthermore, the count function may count differently on different processors, so an interval [a, b) may be considered to contain 1 eigenvalue by processor A, but 0 eigenvalues by processor B, which has been given the interval by processor A during load balancing. This can happen even if processors A and B have identical hardware, if the compilers on the two machines generate slightly different code sequences for count. In this example we have not yet decided what to do about all these problems, so we currently only guarantee correctness of PDSYEVX for networks of processors with identical floating point formats (slightly different floating point operations turn out to be acceptable). See [4] for further discussion. Assigning the work by index rather than by range and sorting all the eigenvalues at the end may give the desired result with modest overhead. Of course, if floating point formats differ across processors, sorting is a problem in itself. This requires further investigation.
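For reference, a count function of the kind the algorithm relies on can be sketched for a symmetric tridiagonal matrix as follows (plain Python; a textbook Sturm/LDL^T count in the spirit of LAPACK's DSTEBZ, not the ScaLAPACK code, and the tiny-pivot guard is a simplification of DSTEBZ's pivmin).

    import numpy as np

    def count_below(d, e, sigma):
        """Number of eigenvalues less than sigma of the symmetric tridiagonal
        matrix with diagonal d (length n) and off-diagonal e (length n-1),
        obtained by counting negative pivots in the LDL^T factorization of
        T - sigma*I (a Sturm sequence count)."""
        t = d[0] - sigma
        if t == 0.0:
            t = -np.finfo(float).tiny           # guard against a zero pivot
        neg = int(t < 0.0)
        for i in range(1, len(d)):
            t = d[i] - sigma - e[i - 1] ** 2 / t
            if t == 0.0:
                t = -np.finfo(float).tiny
            neg += t < 0.0
        return neg

    def count(d, e, a, b):
        """Eigenvalues in the half open interval [a, b)."""
        return count_below(d, e, b) - count_below(d, e, a)

Because the pivots t are computed in floating point, two arithmetics (or two code sequences generated by different compilers) can legitimately return different integers when a boundary lies within rounding error of an eigenvalue, which is exactly the disagreement described above.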
The symmetric eigensolvers PDSYEVX and PZHEEVX may also have trouble on heterogeneous networks when a subset of eigenvalues is chosen by value (i.e. RANGE='V') and one of the limits of that range (VL or VU) is within a couple of units in the last place (ulps) of an actual eigenvalue. Two processors may then disagree on the number of eigenvalues specified by the range VL to VU, and the code breaks with each process returning a nonzero value of INFO (the LAPACK and ScaLAPACK failure indicator). This situation can happen when running the test code, and should again be corrected in the next release. In every case that we have seen, the answer is correct despite the spurious error message. This is not a problem on homogeneous systems.
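The sensitivity is easy to see even on a single machine: if an eigenvalue sits, to working accuracy, exactly on the boundary VU, then moving the boundary by a single ulp changes the count. A deliberately trivial illustration (plain Python; the eigenvalues are invented):

    import numpy as np

    eigvals = np.array([1.0, 3.0])      # suppose 3.0 is an exact eigenvalue
    vu = 3.0                            # upper limit of the range (VL, VU]
    vu_one_ulp_lower = np.nextafter(vu, -np.inf)

    print(int(np.sum(eigvals <= vu)))                # -> 2
    print(int(np.sum(eigvals <= vu_one_ulp_lower)))  # -> 1

Two processes whose arithmetic places the boundary one ulp apart would therefore disagree on how many eigenvalues the range contains.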
The third example is based upon the idea that some algorithms can perform redundant work in order to gain parallelism. While redundant work on different processors is intended to yield identical results, this may not be the case in a heterogeneous environment. For instance, one approach to parallelizing the symmetric eigenproblem is to perform the tridiagonal QR algorithm, which reduces the tridiagonal matrix to diagonal form, redundantly on all processors, save the plane rotations, and then accumulate the resulting Givens rotations in parallel into the relevant columns of the unit matrix. This results in redundant, but parallel, work and requires no communication. Since the QR algorithm is not in general forward stable, slight differences in the underlying arithmetic can lead to completely different rotations and hence the danger of obtaining quite inconsistent eigenvectors. This problem can be solved by having a controlling process that runs the QR algorithm and then broadcasts the plane rotations to the other processes, but the communication cost is substantial: the $O(n^2)$ rotations generated by the QR algorithm must be sent to every process.
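To make the structure of the redundant-work scheme concrete, the following sketch accumulates a saved sequence of plane rotations into the columns of the unit matrix owned by one process (plain Python/NumPy; the column layout and the rotation format (i, j, c, s) are illustrative assumptions, not the ScaLAPACK interface).

    import numpy as np

    def accumulate_rotations(local_cols, rotations):
        """Apply the saved rotations, in order, to this process's block of
        columns; rotation (i, j, c, s) combines rows i and j."""
        V = np.array(local_cols, dtype=float)
        for i, j, c, s in rotations:
            vi = V[i].copy()
            V[i] = c * vi + s * V[j]
            V[j] = -s * vi + c * V[j]
        return V

    # Every process runs the tridiagonal QR redundantly to obtain the same
    # rotation list, then accumulates it into its own columns of the identity.
    n = 4
    rotations = [(0, 1, np.cos(0.3), np.sin(0.3)),
                 (1, 2, np.cos(1.1), np.sin(1.1))]
    V_left = accumulate_rotations(np.eye(n)[:, :2], rotations)   # "process 0"
    V_right = accumulate_rotations(np.eye(n)[:, 2:], rotations)  # "process 1"
    V = np.hstack([V_left, V_right])

The assembled V is orthogonal only because both calls used identical rotations; if the redundant QR runs had produced even slightly different (c, s) pairs, the glued-together columns would no longer form a consistent orthogonal matrix, which is precisely the danger described above.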