DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT

        XBLAS - a Proposal to the BLAST Forum for Extra Precise BLAS

                    James Demmel    Greg Henry    W. Kahan

                                4 Sept. 1997

Section 1: Introduction
-----------------------

We propose two kinds of extensions to the current BLAS: one that allows mixing
of operand matrices of different precisions, and a second that allows
arithmetic inside the BLAS to be carried out to higher precision than the
operands, when doing so confers some advantage. These extensions are motivated
by the following facts.

A number of important linear algebra algorithms can become more accurate and
sometimes faster if internal computations carry range and/or precision wider
than that of the input and output arguments. These include linear system
solving, least squares problems, and eigenvalue problems. Often the benefits
of wider arithmetic cost only a small fractional addition to the total work.

For single precision input, the computer's native double precision is a way to
achieve these benefits easily on all commercially significant computers, at
least when only a few extra-precision operations are needed. Cray and Convex
machines and their emulators implement 64-bit single precision in hardware and
much slower 128-bit double precision in software, so if many double precision
operations are needed, these machines will slow down significantly.

The overwhelming majority of desktop computers, containing Intel processors or
their AMD and Cyrix clones, are designed to run fastest when performing
arithmetic to the full width of their internal registers, which is wider than
double precision. These computers confer some benefits of wider arithmetic at
little or no performance penalty. Some BLAS on these computers already perform
wider arithmetic internally but, without knowing this for sure, programmers
cannot exploit it.

All computers can simulate quadruple precision or something like it in
software at the cost of arithmetic slower than double precision by at worst an
order of magnitude. Less slowdown is incurred for a rough doubled-double
precision on machines (IBM RS/6000, PowerPC/Mac, SGI/MIPS R8000, HP PA-RISC
2.0) with special fused multiply-accumulate instructions. The slowdown is
practically negligible for a few algorithms that require very little of this
kind of arithmetic to surpass the speeds of competitive algorithms that avoid
extra-precise arithmetic entirely.

In this proposal we shall explain our motivation in more detail and present
detailed specifications for our proposed extra precise BLAS, which we call
XBLAS. The greatest challenge to our design is dealing with too many possible
combinations of input types (including precisions), output type, and internal
precision, not all of which are useful. We must name all useful possibilities,
provide reasonable defaults so they are easier to use, make them callable from
different languages, and arrange these numerous routines into a smaller number
of groups, each of which may be automatically generated, tested and even
optimized from a few prototypes.

Here are three examples to illustrate the opportunities and challenges.

First, extra-precisely accumulated matrix-vector products provide an
inexpensive way to refine solutions of linear systems so well that the error
is small independently of the system's condition number, unless the condition
number is extremely big. The cost of the extra-precise arithmetic is
negligible for moderate-sized and larger matrices.
Second, multiplying a complex matrix by a real matrix is so much faster than
multiplying two complex matrices that a special LAPACK auxiliary routine
CLACRM was invented to do this instead of promoting real to complex and
calling CGEMM. This routine is usually the bottleneck in what is currently the
fastest Hermitian tridiagonal eigensolver, CSTEDC. Thus allowing a mixture of
input types enhances speed. Accuracy can be enhanced too if extra-precise
accumulation is allowed.

Third, the algorithm in Example 2.1 below needs extra precision in just a few
operations, and could obtain it at negligible cost from the 80-bit extended
format on Intel/AMD/Cyrix platforms. For portability, it should permit other
platforms to use doubled-double instead of extended. The extra precision need
only be used inside the BLAS, so no user variables need to be declared in the
high precision format. On the other hand, if the user's input data is double,
the algorithms in Examples 2.4, 2.5 and 2.7 would require running a high level
LAPACK subroutine with variables declared to be in the high precision format;
not all compilers support such declarations. What is the right tradeoff
between portability and functionality for libraries like LAPACK?

To accommodate all these possibilities, we have chosen a simple naming scheme
that permits ALL possible combinations of types and precisions to be
specified, even uninteresting ones. We do not expect all combinations to be
implemented. The naming scheme would be no simpler if we were to explicitly
limit the combinations. Our specification prioritizes the most important
routines to be implemented, and leaves open the possibility of implementing
others as well. In addition, we specify an environmental inquiry function to
determine the properties of extra precise BLAS, and whether they exist at all.
(For example, the user might be told that no doubled-double BLAS exist on his
Intel platform.) Optional arguments are used judiciously so that the user can
specify as little or as much detail about the precisions of input arguments
and the internal precision as desired. For example, a call to matrix
multiplication could appear as

   XDGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC, PREC )

where D indicates that the precision of the output floating point variable C
is double. The only difference between this and the standard BLAS calling
sequence is the optional string PREC. If PREC is left blank, this indicates
that all input variables are also double, and that the widest indigenous
floating point format should be used internally (extended on Pentiums, and
double on most other machines). If PREC='E', it indicates that the input
variables are also double, and that a format wider than double should be used;
on Pentiums this would be extended, and on other machines probably
doubled-double. If PREC='ES', it indicates in addition that all the input
variables are in fact single precision.
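To make these conventions concrete, the sketch below shows how the three calls
just described might look through a hypothetical C binding of XDGEMM. The
prototype, the lowercase name xdgemm, the use of void* operands, and the empty
string standing in for an omitted PREC are our own illustrative assumptions,
not part of this proposal's specification; scalar arguments are left as double
for simplicity.

   /* Hypothetical C binding of the proposed XDGEMM; "" plays the role of
      the omitted optional PREC argument.  Not an existing library routine. */
   extern void xdgemm(const char *transa, const char *transb,
                      int m, int n, int k,
                      double alpha, const void *A, int lda,
                      const void *B, int ldb,
                      double beta, void *C, int ldc,
                      const char *prec);

   void prec_examples(int n, const double *Ad, const double *Bd, double *Cd,
                      const float *As, const float *Bs)
   {
       /* PREC omitted: inputs are double like C; the widest indigenous
          precision is used internally. */
       xdgemm("N", "N", n, n, n, 1.0, Ad, n, Bd, n, 0.0, Cd, n, "");

       /* PREC = "E": inputs are double; something wider than double
          (80-bit extended or doubled-double) is used internally. */
       xdgemm("N", "N", n, n, n, 1.0, Ad, n, Bd, n, 0.0, Cd, n, "E");

       /* PREC = "ES": as above, but the input matrices are single
          precision while the output C is still double. */
       xdgemm("N", "N", n, n, n, 1.0, As, n, Bs, n, 0.0, Cd, n, "ES");
   }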
If it is simpler or cheaper to carry more precision internally than the user
requested, the BLAS implementor is free to do so, provided the environmental
inquiry routine returns the actual precision used. In this spirit we invite
implementors of the standard BLAS library to carry extra precision internally
whenever this is insignificantly slower than not carrying extra precision,
since this is permitted by the BLAS specification now and usually gives users
better results.

We also propose that our environmental inquiry report the precision actually
used in the standard BLAS, along with other information like the use of
Strassen's matrix multiply. Left unsolved is the problem of obtaining from one
platform exactly the same result as was generated by another, since this
problem is already intractable in the face of BLAS optimized for speed on
platforms with diverse memory management, opportunities for concurrency
(pipelines), and heterogeneous collections of processors.

Though we concentrate on extending the dense BLAS to accommodate extra
precision, we believe our general scheme will accommodate other extensions,
for example to sparse or parallel BLAS, to interval BLAS, and to other
routines we discuss below.

The rest of this proposal is organized as follows. Section 2 lists compelling
examples to motivate our proposal; most involve extra-precise or mixed-type
versions of existing BLAS, and some involve new routines. Section 3 surveys
implementation technicalities for the higher-than-double precisions supported
by current hardware. Section 4 describes the naming scheme and calling
sequences, including optional arguments and defaults. Section 5 describes the
environmental inquiries used to describe the high-precision BLAS, including
the accuracy model they must satisfy. Section 6 presents the prioritized list
of extra precision BLAS that we propose all developers implement. Section 7
describes how one can automatically generate these routines. Section 8
describes how one can automatically test these routines. Section 9 gives
sample implementations of some basic routines. Section 10 will eventually give
sample applications showing how useful these routines are; it currently just
discusses how to pick these applications, and what we should expect from
compiler writers. Section 11 compares this proposal to a previous one, which
was part of the BLAS 2 specification [1].

Section 2: Example Applications
-------------------------------

We present a variety of examples that illustrate different advantages of our
proposal, including improved accuracy, improved speed, and improved
reliability. Not all features of our proposal are used by all examples:
Examples 2.1, 2.2, 2.4, 2.5, 2.6, 2.7 and 2.9 require higher precision.
Examples 2.3, 2.4, and 2.5 require wider range; they can also take additional
advantage of IEEE exception handling internally, but do not require it.
Examples 2.7 and 2.8 require mixed-type versions of existing BLAS; 2.8
requires no extra precision or range. Most examples have well-understood
benefits, but examples 2.6, 2.9 and the last part of example 2.2 are research
questions. Most examples use mixed or higher precision versions of existing
BLAS, but examples 2.4, 2.5 and 2.7 would require new BLAS (actually the
promotion of certain LAPACK or ScaLAPACK routines to BLAS status). They are
easy to implement when the input precision is single and the extra precision
is double. We include them to illustrate the broader utility of
mixed-precision computations.

Example 2.1: Iterative Refinement of Linear Systems
---------------------------------------------------

Given an LU or other stable factorization of a matrix A, we can solve Ax=b by
forward and back substitution.
The relative error of this algorithm is bounded by

   norm(x_computed - x_true) / norm(x_true) <= O( N * cond(A) * macheps )

where cond(A) >= 1 is the condition number of A and N its dimension, and
macheps is the machine epsilon for the input and output data, which we call
the input/output precision. This error bound is not excessively pessimistic;
errors about that big are encountered occasionally.

The accuracy of the computed solution x may be improved by iterative
refinement, roughly as follows:

   repeat
      Compute the residual r = A*x - b
      Solve A*d = r for d using the existing factorization
      Update the solution x = x - d
   until { the residual is small enough or stops decreasing,
           or a maximum iteration count is exceeded }

LAPACK currently implements this algorithm entirely in the input/output
precision, in routine xGERFS. What it accomplishes, according to a well
understood error analysis, is essentially to replace the condition number
cond(A) in the error bound by another possibly smaller one, which can be as
small as min cond(D*A), where the minimum is over all diagonal matrices D. In
effect, this algorithm compensates, up to a point, for bad row-scaling. The
residual is never worsened, but the solution x, though usually improved,
frequently gets worse if this last condition number is very big.

An improved version of iterative refinement differs from LAPACK's in two
places. First, the residual r = A*x - b is accumulated to at least twice the
input/output precision and then, after massive cancellation has taken place,
rounded to input/output precision. Second, refinement is stopped after x
appears to have settled down about as much as it ever will. Now another
classical error analysis says that the ultimate relative error will be bounded
by O(macheps), essentially independent of the condition number and dimension
of A. This assumes that N * cond(A) is not comparable with or bigger than
1/macheps (in which case A could be made exactly singular by a few rounding
errors in its components). In short, the improved version of iterative
refinement almost always produces the correct solution x, and almost always
gives fair warning otherwise. This is a much simpler and stronger result than
LAPACK provides now.

If the improved algorithm's residual r is accumulated to sufficiently more
than input/output precision but less than twice as much, its improvement over
what LAPACK obtains currently becomes proportionately weakened. In extensive
experiments on machines with 80-bit registers, i.e. 11 more bits of precision
than double, iterative refinement either returned the solution correct to all
but the last few bits, or with at least 10 more correct bits than without
refinement [2].

The current LAPACK algorithm does not have to change much to benefit from a
more accurate residual: macheps has to be replaced in a few places by a
roundoff threshold that reflects the possibly extra-precise accumulation of r;
and if this latter threshold is sufficiently smaller than macheps, then the
iteration's stopping criterion should be changed from testing r to testing x
and d, and an estimate of the error in x should be obtained not from the
current condition estimator but instead (at lower cost) from the last few
vectors d. The indispensable requirement is an extra-precise version of GEMV
in the Level 2 BLAS.

One iteration of refinement costs only 4*N^2 flops per right-hand side column
b. If, as usual, there are only a few columns b and at most a few iterations
suffice, their cost is asymptotically negligible compared to the O(N^3)
operations required to compute the LU or other factorization in the first
place, even if extra-precise operations run somewhat slowly. Therefore we want
to insist that GEMV use extra precision, even if it is expensive.
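To make the role of the extra-precise GEMV concrete, here is a minimal C
sketch of one refinement step. The routine name xdgemv_x, its PREC argument,
and the factored-solve callback are hypothetical placeholders rather than
existing library interfaces, and convergence tests are omitted.

   #include <stdlib.h>

   /* Hypothetical extra-precise GEMV: y = alpha*op(A)*x + beta*y, with the
      products accumulated to extra precision (PREC = "E") before rounding
      the result back to double.  Assumed for this sketch only. */
   void xdgemv_x(const char *trans, int m, int n, double alpha,
                 const double *A, int lda, const double *x, int incx,
                 double beta, double *y, int incy, const char *prec);

   /* One step of iterative refinement for A*x = b.  lu_solve() stands for an
      existing factored solve (e.g. in the style of LAPACK's xGETRS). */
   void refine_once(int n, const double *A, int lda, const double *b,
                    double *x, const void *lu,
                    void (*lu_solve)(const void *lu, int n, double *rhs))
   {
       double *r = malloc((size_t)n * sizeof *r);
       for (int i = 0; i < n; i++) r[i] = b[i];
       /* r = A*x - b: with r initially b, compute r <- 1.0*A*x + (-1.0)*r,
          accumulated extra-precisely inside the (hypothetical) XBLAS call. */
       xdgemv_x("N", n, n, 1.0, A, lda, x, 1, -1.0, r, 1, "E");
       lu_solve(lu, n, r);                        /* d = A\r, overwrites r */
       for (int i = 0; i < n; i++) x[i] -= r[i];  /* x = x - d             */
       free(r);
   }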
Example 2.2: Iterative refinement of other linear algebra problems
------------------------------------------------------------------

The availability of extra-precise residuals would make it possible to solve
many other linear algebra problems more accurately as well. The simplest
example is the overdetermined least squares problem:

   choose x to minimize norm(A*x - b).

This can be formulated as the linear system

   [ I   A ] [ -r ]   [ b ]
   [ A'  0 ] [  x ] = [ 0 ]

and refined using the QR decomposition of A from which the initial solution x
was obtained. Again, the final error depends much less upon the condition
number of A (though the situation is a bit more complicated than for a simple
linear system). Again, an extra-precise GEMV is indispensable, and its cost is
asymptotically small provided that A is not too far from square. The foregoing
is most appropriate for double precision data A and b, since it only requires
extra precision hidden within the BLAS. For single precision data Example 2.7
provides an attractive alternative, where we use more mixed precision.

Iterative refinement for eigenproblems is still a topic for research,
encouraged by better results from better mathematical algorithms than have
been published in previous years. The indispensable requirement is a matrix
residual

   R = A*X - Y*B = [ A  -Y ] * [ X ]
                               [ B ]

accumulated extra-precisely and rounded back to input/output precision after
massive cancellation has taken place. Sometimes Y = X; sometimes B is diagonal
or triangular. Such residuals can be computed from one call upon a GEMM that
accumulates extra-precisely (sketched at the end of this example), but a lot
of waste motion can be avoided if versions of GEMM that accept and produce
extra-precise matrices are available too. Initially these versions of GEMM
could be simulated by calls upon extra-precise versions of GEMV instead, just
to get the ball rolling sooner.

Iterative refinement of nonsymmetric eigenproblems, other than Schur
decompositions, depends crucially upon how well iterative refinement works for
extremely nearly singular linear systems. For best results, the LU
factorization should ideally be performed by Crout factorization with
extra-precisely accumulated scalar products, but how much good this would do
is not yet clear. Exception handling is important too, because infinity and
0/0 turn up in the better mathematical algorithms mentioned above unless
entirely algebraic computations are replaced in spots by invocations of
transcendental functions elementwise upon arrays. The last question remaining
is how much precision would be needed to perform an entire eigencalculation by
conventional means, without refinement, to obtain results as good and as soon
as are obtained from refinement with a little extra-precise arithmetic in the
program; attempts to answer this question empirically have occasionally
generated astonishment.
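Returning to the matrix residual above, the following is a minimal C sketch,
under our own assumptions, of forming R = A*X - Y*B with a single
extra-precisely accumulated GEMM call by packing [ A  -Y ] and [ X ; B ]. The
routine xdgemm_x and its PREC argument are hypothetical; all matrices are
column-major with the leading dimensions shown.

   #include <stdlib.h>

   /* Hypothetical extra-precise GEMM (PREC string as in Section 4). */
   void xdgemm_x(const char *transa, const char *transb,
                 int m, int n, int k, double alpha,
                 const double *A, int lda, const double *B, int ldb,
                 double beta, double *C, int ldc, const char *prec);

   /* R = A*X - Y*B computed as [ A  -Y ] * [ X ; B ] in ONE extra-precise
      GEMM call, so no intermediate product is rounded back to double.
      A is n-by-n, X and Y are n-by-k, B is k-by-k, R is n-by-k. */
   void eig_residual(int n, int k, const double *A, const double *X,
                     const double *Y, const double *B, double *R)
   {
       int m = n + k;
       double *AY = malloc((size_t)n * m * sizeof *AY); /* [ A  -Y ] */
       double *XB = malloc((size_t)m * k * sizeof *XB); /* [ X ; B ] */

       for (int j = 0; j < n; j++)          /* A into the first n columns  */
           for (int i = 0; i < n; i++) AY[i + j*n] = A[i + j*n];
       for (int j = 0; j < k; j++)          /* -Y into the last k columns  */
           for (int i = 0; i < n; i++) AY[i + (n+j)*n] = -Y[i + j*n];

       for (int j = 0; j < k; j++) {        /* stack X on top of B         */
           for (int i = 0; i < n; i++) XB[i + j*m]     = X[i + j*n];
           for (int i = 0; i < k; i++) XB[n + i + j*m] = B[i + j*k];
       }

       xdgemm_x("N", "N", n, k, m, 1.0, AY, n, XB, m, 0.0, R, n, "E");

       free(AY);
       free(XB);
   }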
Example 2.3: Solving Ill-conditioned Triangular Systems
-------------------------------------------------------

The LAPACK subroutine SLATRS solves a triangular linear system T*x = scale*b
with careful scaling to prevent over/underflow. It is used in place of the
Level 2 BLAS routine STRSV when we expect very ill-conditioned triangular
systems, and want to get the solution vector (with a scale factor) despite the
fact that it may be very large or very small. This is the case with condition
estimation (SGECON, SPOCON, STRCON) and inverse iteration to find eigenvectors
of nonsymmetric matrices (SLAEIN). SLATRS does careful scaling in the
innermost loops to prevent over/underflow, and is significantly slower than an
optimized STRSV. Also, it contains about 320 executable statements (versus 160
for the unoptimized version of STRSV), a measure of the complexity of
producing such code. There are three ways to accelerate SLATRS.

The first way is to use extra exponent range to guarantee that over/underflow
cannot occur for a given number of iterations of the outermost loop of the
triangular solve:

   Solve L*x = b   ... L is lower triangular
   for i = 1 to n
      x(i) = [ b(i) - ( sum_{j=1}^{i-1} l(i,j)*x(j) ) ] / l(i,i)

For example, if the entries of L and b are represented in IEEE single, and no
diagonal entry of L is denormalized, and native 80-bit Pentium arithmetic is
used to implement this algorithm, testing the scaling is needed only once
every 64 values of i. In particular, this could be implemented by calling an
optimized SGEMV on blocks of size 64, which would get much of the benefit of
the optimized SGEMV. Double is not so good, since we could only do blocks of
8. Alternatively, we could use scaling code like that in SLATRS to compute the
largest block size we could get away with. This requires information about the
norms of the columns of L. In the case of SLAEIN, where we call SLATRS many
times, the cost of this information is small. In the case of condition
estimation, the cost is larger.

The second way to accelerate SLATRS was explored in the paper "Faster
Numerical Algorithms via Exception Handling" [3]. The IEEE exception flags can
be used to hide over/underflow problems as follows: we first run using the
optimized STRSV, and then check the exception flags. Since over/underflow is
rare, this will almost always indicate that the solution is fine. Only in the
rare cases that an exception occurs does one need to recompute more carefully
(a sketch of this flag-checking pattern appears at the end of this example).
This is definitely the way to go on the Pentium. Speedups reported in [3]
ranged from 1.4 to 6.5.

The third way to accelerate SLATRS, which could complement the above ways, is
to more cleverly estimate how big a block we can call STRSV on, based on some
data about the sizes of off-diagonal entries of the matrix. SLATRS currently
does this in a naive way, and we have explored better alternatives as well.

The complex analog CLATRS is used in the above routines as well as in some
other LAPACK routines like CTREVC, which computes eigenvectors of triangular
matrices. Speedups of CTREVC using exception handling reported in [3] ranged
from 1.38 to 1.65. The overall routine CGEEV for the complex nonsymmetric
eigenproblem sped up by 8%, since CTREVC is just part of the calculation.
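Here is a minimal sketch of the flag-checking pattern just described, written
with C99's <fenv.h> interface (which postdates this 1997 proposal) purely to
show the control flow. The routines fast_strsv and careful_slatrs are
hypothetical stand-ins for an optimized STRSV and a SLATRS-style scaled solve;
which flags to test is a policy choice.

   #include <fenv.h>
   #include <string.h>
   #pragma STDC FENV_ACCESS ON

   /* Hypothetical prototypes; neither signature is normative. */
   void fast_strsv(int n, const float *T, int ldt, float *x);
   void careful_slatrs(int n, const float *T, int ldt, float *x, float *scale);

   /* Solve T*x = scale*b: run the fast kernel first and fall back to the
      careful scaled solve only if an exception flag was raised. */
   void solve_with_fallback(int n, const float *T, int ldt,
                            const float *b, float *x, float *scale)
   {
       *scale = 1.0f;
       memcpy(x, b, (size_t)n * sizeof *x);
       feclearexcept(FE_OVERFLOW | FE_DIVBYZERO | FE_INVALID);
       fast_strsv(n, T, ldt, x);                    /* optimized solve */
       if (fetestexcept(FE_OVERFLOW | FE_DIVBYZERO | FE_INVALID)) {
           /* Rare case: restore b and redo the solve with explicit scaling. */
           memcpy(x, b, (size_t)n * sizeof *x);
           careful_slatrs(n, T, ldt, x, scale);
       }
   }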
Example 2.4: Eigenvalue Calculation Using Bisection
---------------------------------------------------

This is an inner loop in the ScaLAPACK routine PDLAIECT to compute eigenvalues
of symmetric matrices. The simplest version counts the number of eigenvalues
less than x of the symmetric tridiagonal matrix T with diagonal entries a(1:n)
and off-diagonals b(1:n-1):

   count = 0
   d(0) = 1
   for i = 1 to n
      d(i) = a(i) - x - b(i-1)^2/d(i-1)   ... with b(0) = 0
      if d(i) < 0, count = count + 1

In practice d(i) overwrites d(i-1), and we replace the inner loop by
d = a(i) - x - b(i-1)^2/d. The trouble with this is that if d is small or
zero, overflow occurs. The standard fix is to test whether d is too small, and
explicitly set it to a tiny safe value if it is:

   for i = 1 to n
      d = a(i) - x - b(i-1)^2/d
      if (abs(d) < too_small) d = -too_small
      if d < 0, count = count + 1

A faster way, used by PDLAIECT if the user is on a machine implementing IEEE
infinity arithmetic (i.e. most of the time), is to go ahead and compute an
infinite d, since the next time through the loop b(i-1)^2/d will be zero, and
the recurrence will continue uneventfully. This also requires updating count
slightly differently, to account for -0s (a C version appears at the end of
this example). Speedups reported in [3] range from 1.18 to 1.47.

   for i = 1 to n
      d = a(i) - x - b(i-1)^2/d
      count = count + signbit(d)

One can further accelerate this recurrence by reorganizing it to remove the
division, which can cost 6 times as much as the other floating point
operations, or more. This loop looks like

   for i = 1 to n
      p(i) = (a(i)-x)*p(i-1) - b(i-1)^2*p(i-2)
      if (p(i) and p(i-1) have opposite signs) count = count + 1

The relationship between these loops is that d(i) = p(i)/p(i-1). Again we can
remove the subscript on p(i) if we use two temporaries:

   for i = 1 to n step 2
      p = (a(i)-x)*p1 - b(i-1)^2*p2
      if (p and p1 have opposite signs) count = count + 1
      p1 = (a(i+1)-x)*p - b(i)^2*p1
      if (p and p1 have opposite signs) count = count + 1
      p2 = p

This loop has no division, and so may be much faster than the first one. Not
all machines have such slow division, so this will be machine dependent.
However, this loop is much more susceptible to over/underflow, because p(i) is
the determinant of the leading i-by-i submatrix of T-x*I, and determinants
easily over/underflow. For example det(x*eye(n)) = x^n in Matlab notation, and
neither x nor n has to be very large for overflow to occur. We again have two
approaches as in Example 2.3: use extra exponent range in the inner loop
(which can reduce the number of scaling tests in the inner loop to one every
128 iterations in single or 16 in double), or IEEE exception handling.

There are various ways to accelerate bisection by using Newton's method or
related ideas, but all involve inner loops similar to the one above.
[many refs] This computation could not be done with extra precise versions of
existing BLAS alone, but would require new code with a small number of
explicitly declared extra precise variables. If PDLAIECT is "promoted" to a
BLAS, these could be hidden from the user.
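For concreteness, here is a short C sketch of the IEEE version of the counting
loop, using signbit() from C99's <math.h> so that negative zeros and
infinities are counted as the text describes. The function name and 0-based
indexing are ours.

   #include <math.h>

   /* Count of eigenvalues of the symmetric tridiagonal T (diagonal a[0..n-1],
      off-diagonal b[0..n-2]) that are less than x.  IEEE infinity arithmetic
      absorbs a zero pivot: d becomes infinite, the next b^2/d is then zero,
      and signbit() still counts a pivot of -0 or -inf as negative. */
   int tridiag_count(int n, const double *a, const double *b, double x)
   {
       int count = 0;
       double d = 1.0;
       for (int i = 0; i < n; i++) {
           double bb = (i == 0) ? 0.0 : b[i-1] * b[i-1];
           d = a[i] - x - bb / d;
           count += signbit(d) ? 1 : 0;  /* counts negative d, including -0 */
       }
       return count;
   }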
Example 2.5: Orthogonal Eigenvector Calculation with Inverse Iteration
----------------------------------------------------------------------

Given the eigenvalues computed from bisection, the standard algorithm to find
their corresponding eigenvectors is inverse iteration, which involves solving
a tridiagonal system of linear equations. In exact arithmetic, this algorithm
would cost O(n) work per eigenvector, i.e. an optimal O(1) work per solution
component, and be embarrassingly parallel. Until very recently, roundoff (from
reorthogonalization) has made this algorithm cost as much as O(n^2) per
eigenvector and be serial, depending on the eigenvalue distribution. Recent
work by Parlett and Dhillon promises to attain the promised complexity of the
infinitely precise algorithm, using only input/output precision. Currently
their algorithm works better than they can explain, but a complete proof may
remain elusive for a while.

In the meantime, it is straightforward to drastically reduce the incidence of
the "hard cases" (when eigenvalues are tightly clustered) by using higher
intermediate precision. Briefly, if computations are done in b bits, and an
eigenvalue shares s

Section 4: Naming Scheme and Calling Sequences
----------------------------------------------

From the last section, we can see the possibility of the following 6
additional types (which may or may not be supported by a particular compiler):

Table 3: Possible High Precision Data Types
   X (extended precision real)
   Y (extended precision complex)
   W (doubled-double precision real)
   U (doubled-double precision complex)
   Q (quadruple precision real)
   P (quadruple precision complex)

As mentioned in Section 3, one can imagine that type X takes 10 bytes to
store, but that it would be more efficient to store it on 12-byte or 16-byte
boundaries because of the memory system. We assume there is only one way to
store it per machine, and do not distinguish among these, though they could
differ from machine to machine.

In principle, the type of each input and output parameter could be specified
independently. Independent of the input and output types, we could specify the
internal precision to be used during the matrix multiplication itself. Clearly
the following are all possibilities:

Table 4: Possible Internal Precisions - Preliminary List
   S (single precision)
   D (double precision)
   X (extended precision)
   W (doubled-double precision)
   Q (quadruple precision)

However, there are several other possibilities we should consider. First, for
most calculations we would be happiest if the internal precision were the
widest one done in hardware (i.e. fast) by the machine, which we call "native
arithmetic". On the Pentium and its clones, this means precision X (80-bit
arithmetic). On other machines (from SUN, HP, IBM and DEC) it means precision
D (double precision). So we might want to designate internal precision I for
indigenous arithmetic, which would be either X or D depending on the machine.

Second, for iterative refinement and some other applications where extra
precision is essential and its cost low, we want to use whatever extra-precise
arithmetic is available, and do not care whether it is X (on Pentiums) or W
(on most other machines). So we might want to designate E for Extra precision,
which would be either X or W, depending on the machine.

But these are still not the only possibilities that one might want. The
expression tree for xGEMM, ALPHA*(op_1(A)*op_2(B)) + BETA*C, is

                         +
                       /   \
                     /       \
                   *           *
                  / \         / \
             ALPHA   *     BETA   C
                    / \
             op_1(A)   op_2(B)

The traditional way to evaluate this or any other expression in Fortran or C
is to use the wider precision of either operand of any binary operator. So if
ALPHA, A, B, and C are single precision real, but BETA is double precision,
then BETA*C and the addition will be done in double precision, and the other
two multiplications in single. We denote this so-called "strict" evaluation
scheme by T for traditional (since S is already taken).

Finally, a perhaps more natural way to evaluate such a mixed precision
expression is to do all operations in the widest precision of any type in the
expression. In the example of the last paragraph, this means all operations
would be done in double precision. This presumes the existence of a total
ordering on all the precisions in Table 4. A possible difficulty here is that
if a machine has both X and W arithmetic, it is conceivable that X might have
less precision but a wider exponent range than W, making them hard to order.
However, we doubt that any machine would have 3 distinct precisions D, X and W
that were in addition not totally ordered, and in any case we leave it to the
implementation to impose a total order. The natural letter to designate this
"evaluate to widest" scheme is W, which is already taken by doubled-double, so
we use M for "evaluate using the Most precise type appearing."

Finally, we note that a few machines like the IBM 4361 actually support exact
inner products. This entails simulating a register wide enough (hundreds or
thousands of bits) that the square of the smallest floating point number can
be added to the square of the largest floating point number without roundoff.
We mention this possibility not because it is common or because we recommend
it, but to list all the possibilities that we might want to accommodate. We
indicate this internal precision by L for Long, since E (for Exact) is already
taken.

Altogether, this yields the following table of possible "internal precisions":

Table 5: Possible Internal Precisions - Final List
   S (single precision)
   D (double precision)
   I (indigenous precision, either D or X)
   X (extended precision)
   E (extra precision, either X or W)
   W (doubled-double precision)
   Q (quadruple precision)
   T (traditional)
   M (most precise type appearing in expression)
   L (long)

Finally, we can present our proposed calling sequence for extended precision
matrix multiplication:

   XxGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC, PREC )

The differences between this calling sequence and the one for xGEMM above are
as follows:

1) The first letter is X, and the prefix x can be taken from either Table 2 or
   Table 3. x specifies the precision of the output variable (in this case the
   matrix C). Not all precisions from Table 3 may be implemented.

2) The final (optional) parameter PREC is a character string specifying the
   internal precision and input variable types. It is interpreted as follows
   (a parsing sketch in C follows this list).

   2.1) If PREC is a null string (or missing), then the types of all the input
        variables default to be the same as the output variable as specified
        by x. The internal precision must be AT LEAST as high as the precision
        associated with x. The actual precision is available from the
        environmental enquiry specified below. Note that this assumes a total
        ordering on the precisions available in Table 5 (except for T and M,
        which are equivalent to one of the other precisions in this case,
        since all input and output types are identical to x).

   2.2) If PREC contains a single character, it must be from Table 5
        (otherwise it is an error) and specifies the internal precision.
        Again, the implementation is free to use at least as much precision as
        specified, as reported by the environmental enquiry routine. Since the
        output and input precisions are all the same (x), choosing either
        PREC=T or PREC=M is equivalent to choosing PREC=x.

   2.3) If PREC contains more characters, these specify the input types,
        associating with the input variables from left to right. For example
        PREC = "XSDCZQ" means that the internal precision is (at least) X, and
        the types of ALPHA, A, B, BETA and C are S, D, C, Z and Q,
        respectively. If there are fewer characters than input variables, then
        the rightmost character applies to the remaining input variables. Thus
        XDGEMM( ... , 'DS' ) returns a double precision product from single
        precision inputs, using (at least) double precision internally.
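The rules in 2.1-2.3 are simple enough to state as code. Below is a minimal C
sketch, under our own assumptions about representation (one character per
precision, an array of per-argument type letters), of how an implementation
might expand a PREC string; the function name and signature are illustrative
only.

   #include <stddef.h>

   /* Expand PREC for a routine with ninputs floating-point input arguments
      (e.g. 5 for GEMM: ALPHA, A, B, BETA, C).  out_default is the output
      type letter 'x' from the routine name.  Returns the internal-precision
      letter and fills types[0..ninputs-1] with one letter per input,
      applying rule 2.3: when the string runs out, the rightmost letter
      repeats. */
   char expand_prec(const char *prec, char out_default,
                    int ninputs, char *types)
   {
       if (prec == NULL || prec[0] == '\0') {        /* rule 2.1 */
           for (int i = 0; i < ninputs; i++) types[i] = out_default;
           return out_default;   /* "at least" this much internally */
       }
       char internal = prec[0];                      /* rule 2.2 */
       const char *p = prec + 1;
       char last = out_default;
       for (int i = 0; i < ninputs; i++) {           /* rule 2.3 */
           if (*p != '\0') last = *p++;
           types[i] = last;
       }
       return internal;
   }

For instance, with out_default = 'D' and ninputs = 5, the string "XSDCZQ"
yields internal precision 'X' and input types S, D, C, Z, Q, while "DS" yields
internal precision 'D' and all five input types 'S', matching the examples in
the text.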
We believe that this scheme requires the least change (a single extra
character) to get most of the desired functionality, but permits detailed
control over precision if desired. If a system does not permit the user to
declare variables of type Q, say, then this will be obvious at load time
because XQGEMM will not exist. If the user asks for internal precision W but
the system better supports the wider precision Q, then the implementor is free
to use Q instead. One possible exception to the rule that the system may use
precision at least as high as requested by the user might be T: we cannot
think of a situation where the user would explicitly specify traditional
evaluation, together with differing individual input precisions, without
wanting this specific evaluation scheme.

What about C++, Java, F90?

Section 5: Environmental Enquiries
----------------------------------

We propose to extend xLAMCH to handle enquiries about the internal precision
used. Ideally, one would use the same routine name with extra arguments which,
if present, would have the meaning specified below. If this is not possible or
desirable, then we would use XxLAMCH instead. The calling sequence is as
follows:

   xLAMCH( CMACH, PREC, ROUTINE )

CMACH has the same meaning as in the current xLAMCH, except that the arguments

   S (safe minimum, smallest number such that 1/safmin does not overflow)
   U (underflow threshold)
   O (overflow threshold)

are not permitted, because their values may not be representable in precision
x (for example, suppose the internal precision is D and x=S). Information
about these values is available from the arguments M (exponent of U) and L
(exponent of O). This way, all data about the widest precision will be
representable in the narrowest precision.

PREC has the same meaning as in the last section. If the precision is not
implemented, xLAMCH returns -1 for any CMACH.

ROUTINE, if present, is a string with the name of the routine being asked
about (e.g. 'XSGEMM'). This provides a way to ask about the implementation of
a particular routine. If the routine is not implemented with the precision
PREC, xLAMCH returns -1 for any CMACH. If ROUTINE is not present, only the
first character of PREC is relevant. If ROUTINE is present, the other
characters of PREC are interpreted as in the last section.

Here is what the values returned by xLAMCH mean. Suppose
xLAMCH( CMACH, PREC, 'XxGEMM' ) returns EPSz, BASEz, EMINz and EMAXz for
internal precision PREC, and xLAMCH( CMACH, 'x' ) returns EPSx, BASEx, EMINx
and EMAXx for output type x. Let

   Uz = BASEz**(EMINz-1)           ... underflow threshold for internal precision
   Ux = BASEx**(EMINx-1)           ... underflow threshold for input/output type
   Oz = BASEz**(EMAXz-1)*(1-EPSz)  ... overflow threshold for internal precision
   Ox = BASEx**(EMAXx-1)*(1-EPSx)  ... overflow threshold for input/output type

Then this means that XxGEMM satisfies the following error bound:

   | computed C(i,j) - true C(i,j) |
      <= (K+2)*EPSz * sum_{k=1}^K |ALPHA*op_1(A)(i,k)*op_2(B)(k,j)|
       + (K+1)*EPSz * |BETA*initial C(i,j)|  ... roundoff from internal precision
       + EPSx * |computed C(i,j)|            ... roundoff from rounding the result
                                                 to output precision
       + (2*K*|ALPHA|+3)*Uz                  ... underflow from internal precision
       + Ux                                  ... underflow from rounding the result
                                                 to output precision

This bound reflects the following assumptions:

1) Strassen's or other fast methods may not be used, but the terms in the dot
   products may be accumulated in any order.
2) The parentheses in ALPHA*(op_1(A)*op_2(B)) are respected.
3) Gradual Underflow or Flush-to-Zero is permitted.
4) No intermediate value exceeded Oz in magnitude, and the final result does
   not exceed Ox in magnitude.
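As an illustration, the bound can be evaluated mechanically from the enquiry
values. The C sketch below is only a transcription of the formula above, with
hypothetical variable names for the quantities that would come from the
proposed enquiry (epsz, uz for the internal precision; epsx, ux for the output
type) and for the data-dependent quantities.

   #include <math.h>

   /* Bound on |computed C(i,j) - true C(i,j)| for one entry of XxGEMM.
      sum_abs = sum_k |ALPHA*op_1(A)(i,k)*op_2(B)(k,j)|, c0 = initial C(i,j),
      c = computed C(i,j).  All names are illustrative. */
   double xgemm_entry_bound(int K, double alpha, double beta,
                            double sum_abs, double c0, double c,
                            double epsz, double epsx, double uz, double ux)
   {
       return (K + 2) * epsz * sum_abs           /* internal roundoff, products */
            + (K + 1) * epsz * fabs(beta * c0)   /* internal roundoff, BETA*C   */
            + epsx * fabs(c)                     /* rounding to output precision */
            + (2.0 * K * fabs(alpha) + 3.0) * uz /* underflow, internal          */
            + ux;                                /* underflow, output rounding   */
   }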
Analogous formulas exist for other BLAS.

Since the standard BLAS library is not forbidden from using extra precision,
we propose that this enquiry also apply to the standard BLAS. In other words,
one could equally well call xLAMCH( CMACH, 'S', 'SGEMM' ) as
xLAMCH( CMACH, 'X', 'XSGEMM' ), say. Furthermore, we encourage developers to
use higher precision if the performance penalty is insignificant.

This gives us added flexibility in enquiring about the internals of the BLAS.
For example, we could have an argument CMACH='A' and return data about the
algorithm used. The return argument might be

   1          if standard matrix multiplication is used, with the order of
              summation guaranteed to be some particular order
   2, 3, ...  same as 1 with other orders, or with no order guaranteed
   4          Strassen's method

Section 6: Prioritized List of Extra Precision BLAS
---------------------------------------------------

In this section we describe the most important extended precision BLAS to
implement.

XxDOT( N, X, INCX, Y, INCY, PREC )

   The dot product is the basis of most other high precision operations.
   Initially just a single character PREC would be supported, but eventually
   it should be possible to compute and return the result to higher precision
   than the inputs, to support using the normal equations for least squares
   problems. Also, permitting one input to be real and the other complex would
   support Hermitian eigenvalue algorithms based on divide-and-conquer; a
   simple implementation would simply consist of two calls to the existing
   BLAS xDOT.

XxDOTG( TRANS, N, ALPHA, X, INCX, Y, INCY, BETA, C, PREC )

   This implements C <- ALPHA*sum_i op(X(1+i*INCX))*Y(1+i*INCY) + BETA*C and
   is intended to be the natural building block for XxGEMV and XxGEMM below.
   The G in DOTG stands for "general".

XxGEMV( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY, PREC )

   Matrix-vector multiplication is the most natural routine with which to
   implement iterative refinement as described in Section 2. It cannot be
   implemented directly using XxDOT for general ALPHA and BETA, which is why
   we proposed XxDOTG. A sparse version of this routine (or of DOT), where A
   is lower precision than X and Y, would support Lanczos.

XxTRSV( UPLO, TRANS, DIAG, N, A, LDA, X, INCX, SCALE, PREC )

   Solution of triangular systems can also be based on XxDOT, but can take
   special advantage of exception handling as described above. The routine
   solves op(A)*Y = SCALE*X (where the solution Y overwrites X), and SCALE is
   chosen to avoid over/underflow. Alternatively, the name and calling
   sequence of the LAPACK auxiliary routine xLATRS could be used instead.

Section 7: Automatic Generation of These Routines
-------------------------------------------------

We assume that C is our implementation language. In this case, all the
floating point variables would be passed in as void*. There would be a big
case statement at the beginning, based on PREC, to branch to the right set of
nested loops. Each set of loops would differ only in the casts of the void*
variables, and could be generated automatically (optimization would require
the use of PHIPAC or a similar system); a skeleton of this dispatch appears
below.

What about C++, Java, F90?
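The following is a minimal sketch of that pattern for a DOT-like routine with
a double precision result. The name xddot, the two PREC cases handled, and the
use of long double for the indigenous precision are our own illustrative
assumptions; a real generator would emit many more branches, one per supported
combination.

   /* DOT-like kernel with void* operands and a dispatch on PREC; only the
      casts differ between branches.  Strides are assumed positive. */
   double xddot(int n, const void *x, int incx, const void *y, int incy,
                const char *prec)
   {
       if (prec != 0 && prec[0] == 'D' && prec[1] == 'S') {
           /* single precision inputs, (at least) double internally */
           const float *xs = x, *ys = y;
           double s = 0.0;
           for (int i = 0; i < n; i++)
               s += (double)xs[i*incx] * ys[i*incy];
           return s;
       } else {
           /* default: double inputs, widest indigenous precision internally
              (long double is 80-bit extended on x86, plain double elsewhere) */
           const double *xd = x, *yd = y;
           long double s = 0.0L;
           for (int i = 0; i < n; i++)
               s += (long double)xd[i*incx] * yd[i*incy];
           return (double)s;
       }
   }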
Section 8: Automatic Testing of These Routines
----------------------------------------------

We could generate test cases whose answers we know exactly as a function of
the intermediate precision. These matrices would be chosen to give different
answers as a function of the input precision, and so serve as a signature of
the precision. The same examples could be used to implement xLAMCH.

Section 9: Sample Implementations
---------------------------------

We could just do XxDOTG in C, assuming single precision input, as an
illustration. This should be quite short.

Section 10: Sample Applications
-------------------------------

The first example to do is iterative refinement for linear systems and least
squares, since the effort is so small and the payoff so large. This is
particularly true for single precision input. We would code these without
respecting the detailed specifications above, just quick-and-dirty, to measure
the speed and accuracy.

Some of the greatest benefits described above for least squares and the
symmetric eigenvalue problem will require new routines not currently part of
the BLAS. Writing them would require a few variables of higher precision
and/or wider range than the input/output variables. When the input variables
are single, we can simply use double, and this should be the next thing we do.
When the input is double, we have a problem, because most compilers do not
support the declaration of higher precision variables. This is even true on
Pentiums, which have hardware extended precision. How should we write such
code?

One way is to propose new BLAS like XXMUL(A,B,C), which would compute the
product C=A*B of extended precision variables, where A, B and C are arrays of
appropriate length. Given similar routines for add, subtract, divide, sqrt and
comparison, we could essentially write all our routines in assembly language.
This approach has two big drawbacks. First, it encourages us to produce
inscrutable, machine-dependent, hard-to-maintain and hard-to-port code, i.e.
life before IEEE 754. Second, it largely removes pressure from compiler
writers to accommodate us.

The second way is to just write relatively simple code using double as the
extended precision when the input is single, and perhaps one impressive (if
painful) benchmark with double precision input, to hold out a carrot to
compiler writers to provide appropriate support for extended variables. So we
might write XxMUL, XxADD, XxSUB, XxDIV, XxSQRT, XxCMP and XxCOPY for our own
use, but not propose them as a standard.

Section 11: Comparison to a Previous Proposal
---------------------------------------------

Appendix B of the BLAS2 paper [1] specifies a set of extra-precise BLAS, which
the authors called EBLAS. This proposal naturally has a lot in common with
that earlier proposal, but there are some important differences: the earlier
proposal made several more limiting assumptions about uses of the BLAS,
machine architectures, and compilers than we have here. They assumed that it
was necessary to have at least one extended precision argument if the internal
precision were to be extended. In other words, not all uses of extra precision
could be hidden in the BLAS, so the ability to declare REAL*16 and COMPLEX*32
variables was required. In contrast, we can get a lot of use out of extra
precision BLAS in cases where no such declarations need to be made (iterative
refinement). Furthermore, they did not specify environmental enquiries, which
are important for obtaining error bounds. We believe our current updated
proposal better matches current understanding of algorithms and of the
limitations and opportunities of modern architectures and compilers.

Bibliography
------------
[1] J. Dongarra, J. Du Croz, S. Hammarling and R. J. Hanson, "An Extended Set
    of FORTRAN Basic Linear Algebra Subprograms", ACM TOMS, March 1988.

[2] W. Kahan and M. Ivory, personal communication.

[3] J. Demmel and X. Li, "Faster Numerical Algorithms via Exception Handling",
    IEEE Trans. Comput., vol. 43, no. 8, 1994.