30 September 1994
This work is dedicated to Jim Wilkinson whose ideas and spirit have given us inspiration and influenced the project at every turn.
The printed version of LAPACK Users' Guide, Second Edition will be available from SIAM in February 1995. The list price is $28.50 and the SIAM Member Price is $22.80. Contact SIAM for additional information.
The royalties from the sales of this book are being placed in a fund
to help students attend SIAM meetings and other SIAM related activities.
This fund is administered by SIAM and qualified individuals are encouraged to
write directly to SIAM for guidelines.
LAPACK has been designed to supersede LINPACK [26] and EISPACK [44], [70], principally by restructuring the software to achieve much greater efficiency, where possible, on modern high-performance computers; also by adding extra functionality, by using some new or improved algorithms, and by integrating the two sets of algorithms into a unified package.
Appendix D lists the LAPACK counterparts of LINPACK and EISPACK routines. Not all the facilities of LINPACK and EISPACK are covered by Release 2.0 of LAPACK.
The argument lists of all LAPACK routines conform to a single set of conventions for their design and documentation.
Specifications of all LAPACK driver and computational routines are given in Part 2. These are derived from the specifications given in the leading comments in the code, but in Part 2 the specifications for real and complex versions of each routine have been merged, in order to save space.
The documentation of each LAPACK routine includes:
Arguments of an LAPACK routine appear in the following order:
The style of the argument descriptions is illustrated by the following example:
The description of each argument gives:
Arguments specifying options are usually of type CHARACTER*1. The meaning of each valid value is given, as in this example:
The corresponding lower-case characters may be supplied (with the same meaning), but any other value is illegal (see subsection 5.1.8).
A longer character string can be passed as the actual argument, making the calling program more readable, but only the first character is significant; this is a standard feature of Fortran 77. For example:
CALL SPOTRS('upper', . . . )
It is permissible for the problem dimensions to be passed as zero, in which case the computation (or part of it) is skipped. Negative dimensions are regarded as erroneous.
Each two-dimensional array argument is immediately followed in the argument list by its leading dimension, whose name has the form LD<array-name>. For example:
It should be assumed, unless stated otherwise, that vectors and matrices are stored in one- and two-dimensional arrays in the conventional manner. That is, if an array X of dimension (N) holds a vector x, then X(i) holds x_i for i = 1,..., n. If a two-dimensional array A of dimension (LDA,N) holds an m-by-n matrix A, then A(i,j) holds a_ij for i = 1,..., m and j = 1,..., n (LDA must be at least m). See Section 5.3 for more about storage of matrices.
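As a sketch of this convention, the index arithmetic can be checked in a few lines of Python (NumPy stands in for a column-major Fortran array here; the sizes m, n and LDA are invented for illustration):

```python
import numpy as np

# Illustrative sketch (not LAPACK itself): how an m-by-n matrix sits inside
# a Fortran array declared with leading dimension LDA >= m.
m, n, lda = 3, 4, 5                      # hypothetical sizes
storage = np.zeros((lda, n), order='F')  # column-major, like Fortran
A = np.arange(1, m * n + 1, dtype=float).reshape(m, n)
storage[:m, :] = A                       # rows m+1..LDA are never referenced

# Element a_ij lives at flat (column-major) offset (j-1)*LDA + (i-1):
i, j = 2, 3                              # 1-based indices, as in Fortran
flat = storage.ravel(order='F')
assert flat[(j - 1) * lda + (i - 1)] == A[i - 1, j - 1]
```

This is why LDA must be passed separately: the stride between consecutive columns in memory is LDA, not m.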
Note that array arguments are usually declared in the software as assumed-size arrays (last dimension *), for example:
REAL A( LDA, * )
although the documentation gives the dimensions as (LDA,N). The latter form is more informative, since it specifies the required minimum value of the last dimension. However, an assumed-size array declaration has been used in the software in order to overcome some limitations in the Fortran 77 standard. In particular, it allows the routine to be called when the relevant dimension (N, in this case) is zero. However, actual array dimensions in the calling program must be at least 1 (LDA in this example).
Many LAPACK routines require one or more work arrays to be passed as arguments. The name of a work array is usually WORK - sometimes IWORK, RWORK or BWORK to distinguish work arrays of integer, real or logical (Boolean) type.
Occasionally the first element of a work array is used to return some useful information: in such cases, the argument is described as (workspace/output) instead of simply (workspace).
A number of routines implementing block algorithms require workspace sufficient to hold one block of rows or columns of the matrix, for example, workspace of size n-by-nb, where nb is the block size. In such cases, the actual declared length of the work array must be passed as a separate argument LWORK, which immediately follows WORK in the argument list.
See Section 5.2 for further explanation.
All documented routines have a diagnostic argument INFO that indicates the success or failure of the computation, as follows:
All driver and auxiliary routines check that input arguments such as N or LDA or option arguments of type character have permitted values. If an illegal value of the i-th argument is detected, the routine sets INFO = -i, and then calls an error-handling routine XERBLA.
The standard version of XERBLA issues an error message and halts execution, so that no LAPACK routine would ever return to the calling program with INFO < 0. However, this might occur if a non-standard version of XERBLA is used.
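The argument-checking convention just described can be mimicked in a short, purely illustrative Python sketch (the argument names and the particular checks are hypothetical, not taken from any specific LAPACK routine):

```python
# A schematic Python analogue (not LAPACK code) of the convention: the
# routine validates arguments in order and reports the (1-based) position
# of the first illegal one as a negative INFO.
def check_args_like_lapack(uplo, n, lda):
    """Return 0 if the arguments are valid, or -i for the i-th bad one."""
    if uplo not in ('U', 'L', 'u', 'l'):   # option argument, case-insensitive
        return -1
    if n < 0:                              # negative dimensions are erroneous
        return -2
    if lda < max(1, n):                    # leading dimension must be >= max(1,N)
        return -3
    return 0

info = check_args_like_lapack('U', 4, 2)   # LDA < N, so argument 3 is illegal
assert info == -3
```

In Fortran, the routine would call XERBLA with the value 3 at this point; with the standard XERBLA, execution would halt there.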
LAPACK routines that implement block algorithms need to determine what block size to use. The intention behind the design of LAPACK is that the choice of block size should be hidden from users as much as possible, but at the same time easily accessible to installers of the package when tuning LAPACK for a particular machine.
LAPACK routines call an auxiliary enquiry function ILAENV, which returns the optimal block size to be used, as well as other parameters. The version of ILAENV supplied with the package contains default values that led to good behavior over a reasonable number of our test machines, but to achieve optimal performance, it may be beneficial to tune ILAENV for your particular machine environment. Ideally a distinct implementation of ILAENV is needed for each machine environment (see also Chapter 6). The optimal block size may also depend on the routine, the combination of option arguments (if any), and the problem dimensions.
If ILAENV returns a block size of 1, then the routine performs the unblocked algorithm, calling Level 2 BLAS, and makes no calls to Level 3 BLAS.
Some LAPACK routines require a work array whose size is proportional to the block size (see subsection 5.1.7). The actual length of the work array is supplied as an argument LWORK. The description of the arguments WORK and LWORK typically goes as follows:
The routine determines the block size to be used by the following steps:
The minimum value of LWORK that would be needed to use the optimal block size is returned in WORK(1). Thus, the routine uses the largest block size allowed by the amount of workspace supplied, as long as this is likely to give better performance than the unblocked algorithm. Note that the value returned in WORK(1) is not always given by a simple formula in terms of N and NB.
The specification of LWORK gives the minimum value for the routine to return correct results. If the supplied value is less than the minimum - indicating that there is insufficient workspace to perform the unblocked algorithm - the value of LWORK is regarded as an illegal value, and is treated like any other illegal argument value (see subsection 5.1.8).
If in doubt about how much workspace to supply, users should supply a generous amount (assume a block size of 64, say), and then examine the value of WORK(1) on exit.
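The workspace-driven choice described above might be sketched as follows. The constants nb_opt = 64 and nb_min = 2, and the n*nb workspace formula, are assumptions for illustration only, not the logic of any particular LAPACK routine:

```python
# Hypothetical sketch of how a block routine might pick its block size
# from the workspace supplied (all constants and formulas are invented
# for illustration).
def choose_block_size(n, lwork, nb_opt=64, nb_min=2):
    if lwork >= n * nb_opt:
        return nb_opt                 # enough room for the optimal block size
    nb = lwork // n if n > 0 else nb_opt
    if nb >= nb_min:
        return nb                     # largest block size the workspace allows
    return 1                          # fall back to the unblocked algorithm

assert choose_block_size(100, lwork=100 * 64) == 64   # optimal block size
assert choose_block_size(100, lwork=100 * 10) == 10   # reduced block size
assert choose_block_size(100, lwork=100) == 1         # unblocked algorithm
```

This mirrors the advice above: supplying a generous LWORK lets the routine run with its optimal block size, while a smaller LWORK degrades gracefully toward the unblocked algorithm.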
LAPACK routines are written so that as much as possible of the computation is performed by calls to the Basic Linear Algebra Subprograms (BLAS) [28] [30] [58] . Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. The BLAS enable LAPACK routines to achieve high performance with portable code. The methodology for constructing LAPACK routines in terms of calls to the BLAS is described in Chapter 3.
The BLAS are not strictly speaking part of LAPACK, but Fortran 77 code for the BLAS is distributed with LAPACK, or can be obtained separately from netlib (see below). This code constitutes the ``model implementation'' [27] [29].
The model implementation is not expected to perform as well as a specially tuned implementation on most high-performance computers - on some machines it may give much worse performance - but it allows users to run LAPACK codes on machines that do not offer any other implementation of the BLAS.
LAPACK allows the following different storage schemes for matrices:
These storage schemes are compatible with those used in LINPACK and the BLAS, but EISPACK uses incompatible schemes for band and tridiagonal matrices.
In the examples below, * indicates an array element that need not be set and is not referenced by LAPACK routines. Elements that ``need not be set'' are never read, written to, or otherwise accessed by the LAPACK routines. The examples illustrate only the relevant part of the arrays; array arguments may of course have additional rows or columns, according to the usual rules for passing array arguments in Fortran 77.
The default scheme for storing matrices is the obvious one described in subsection 5.1.6: a matrix A is stored in a two-dimensional array A, with matrix element a_ij stored in array element A(i,j).
If a matrix is triangular (upper or lower, as specified by the argument UPLO), only the elements of the relevant triangle are accessed. The remaining elements of the array need not be set. Such elements are indicated by * in the examples below. For example, when n = 4:
Similarly, if the matrix is upper Hessenberg, elements below the first subdiagonal need not be set.
Routines that handle symmetric or Hermitian matrices allow for either the upper or lower triangle of the matrix (as specified by UPLO) to be stored in the corresponding elements of the array; the remaining elements of the array need not be set. For example, when n = 4:
Symmetric, Hermitian or triangular matrices may be stored more compactly, if the relevant triangle (again as specified by UPLO) is packed by columns in a one-dimensional array. In LAPACK, arrays that hold matrices in packed storage have names ending in `P'. So: if UPLO = 'U', a_ij is stored in AP(i + j(j-1)/2) for i <= j; if UPLO = 'L', a_ij is stored in AP(i + (2n-j)(j-1)/2) for j <= i.
For example, when n = 4:

UPLO = 'U':  AP = ( a11, a12, a22, a13, a23, a33, a14, a24, a34, a44 )
UPLO = 'L':  AP = ( a11, a21, a31, a41, a22, a32, a42, a33, a43, a44 )
Note that for real or complex symmetric matrices, packing the upper triangle by columns is equivalent to packing the lower triangle by rows; packing the lower triangle by columns is equivalent to packing the upper triangle by rows. For complex Hermitian matrices, packing the upper triangle by columns is equivalent to packing the conjugate of the lower triangle by rows; packing the lower triangle by columns is equivalent to packing the conjugate of the upper triangle by rows.
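A small Python sketch can make the packed layout and this by-columns/by-rows equivalence concrete (for UPLO = 'U', a_ij with i <= j is stored in AP(i + j(j-1)/2), 1-based; NumPy is used here purely for illustration):

```python
import numpy as np

# Sketch of packed storage for the upper triangle (UPLO = 'U').
n = 4
A = np.arange(1, n * n + 1, dtype=float).reshape(n, n)
A = np.triu(A) + np.triu(A, 1).T          # make A symmetric

ap = np.empty(n * (n + 1) // 2)
for j in range(1, n + 1):                  # pack upper triangle by columns
    for i in range(1, j + 1):
        ap[(i - 1) + j * (j - 1) // 2] = A[i - 1, j - 1]   # AP(i + j(j-1)/2)

# For a symmetric matrix, packing the upper triangle by columns yields the
# same sequence as packing the lower triangle by rows:
by_rows = [A[i, j] for i in range(n) for j in range(i + 1)]
assert np.array_equal(ap, np.array(by_rows))
```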
An m-by-n band matrix with kl subdiagonals and ku superdiagonals may be stored compactly in a two-dimensional array with kl + ku + 1 rows and n columns. Columns of the matrix are stored in corresponding columns of the array, and diagonals of the matrix are stored in rows of the array. This storage scheme should be used in practice only if kl, ku << min(m, n), although LAPACK routines work correctly for all values of kl and ku. In LAPACK, arrays that hold matrices in band storage have names ending in `B'.
To be precise, a_ij is stored in AB(ku+1+i-j, j) for max(1, j-ku) <= i <= min(m, j+kl). For example, when m = n = 5, kl = 2 and ku = 1:
The elements marked * in the upper left and lower right corners of the array AB need not be set, and are not referenced by LAPACK routines.
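The mapping can be verified with a short Python sketch using the example's dimensions (the entry values are invented markers, chosen so that a_ij reads as the number 10i + j):

```python
import numpy as np

# Sketch of the band storage mapping: a_ij goes to AB(ku+1+i-j, j),
# 1-based, with the example's sizes m = n = 5, kl = 2, ku = 1.
m = n = 5
kl, ku = 2, 1
A = np.zeros((m, n))
for i in range(1, m + 1):                  # build a dense band matrix
    for j in range(1, n + 1):
        if -kl <= j - i <= ku:
            A[i - 1, j - 1] = 10 * i + j   # e.g. a_23 holds the value 23

AB = np.zeros((kl + ku + 1, n))
for j in range(1, n + 1):                  # pack the band, column by column
    for i in range(max(1, j - ku), min(m, j + kl) + 1):
        AB[ku + i - j, j - 1] = A[i - 1, j - 1]   # row ku+1+i-j, 1-based

assert AB[ku + 2 - 3, 3 - 1] == A[1, 2]    # a_23 sits in AB(ku+1+2-3, 3)
```

Note how each diagonal of A becomes a row of AB: the main diagonal occupies row ku+1 of AB.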
Note: when a band matrix is supplied for LU factorization, space must be allowed to store an additional kl superdiagonals, generated by fill-in as a result of row interchanges. This means that the matrix is stored according to the above scheme, but with kl + ku superdiagonals.
Triangular band matrices are stored in the same format, with either kl = 0 if upper triangular, or ku = 0 if lower triangular.
For symmetric or Hermitian band matrices with kd subdiagonals or superdiagonals, only the upper or lower triangle (as specified by UPLO) need be stored:
For example, when n = 5 and kd = 2:
EISPACK routines use a different storage scheme for band matrices, in which rows of the matrix are stored in corresponding rows of the array, and diagonals of the matrix are stored in columns of the array (see Appendix D).
An unsymmetric tridiagonal matrix of order n is stored in three one-dimensional arrays, one of length n containing the diagonal elements, and two of length n - 1 containing the subdiagonal and superdiagonal elements in elements 1 : n - 1.
A symmetric tridiagonal or bidiagonal matrix is stored in two one-dimensional arrays, one of length n containing the diagonal elements, and one of length n containing the off-diagonal elements. (EISPACK routines store the off-diagonal elements in elements 2 : n of a vector of length n.)
Some LAPACK routines have an option to handle unit triangular matrices (that is, triangular matrices with diagonal elements = 1). This option is specified by an argument DIAG. If DIAG = 'U' (Unit triangular), the diagonal elements of the matrix need not be stored, and the corresponding array elements are not referenced by the LAPACK routines. The storage scheme for the rest of the matrix (whether conventional, packed or band) remains unchanged, as described in subsections 5.3.1, 5.3.2 and 5.3.3.
Complex Hermitian matrices have diagonal elements that are by definition purely real. In addition, some complex triangular matrices computed by LAPACK routines are defined by the algorithm to have real diagonal elements - in Cholesky or QR factorization, for example.
If such matrices are supplied as input to LAPACK routines, the imaginary parts of the diagonal elements are not referenced, but are assumed to be zero. If such matrices are returned as output by LAPACK routines, the computed imaginary parts are explicitly set to zero.
A real orthogonal or complex unitary matrix (usually denoted Q) is often represented in LAPACK as a product of elementary reflectors - also referred to as elementary Householder matrices (usually denoted H_i). For example, Q = H_1 H_2 ... H_k.
Most users need not be aware of the details, because LAPACK routines are provided to work with this representation:
The following further details may occasionally be useful.
An elementary reflector (or elementary Householder matrix) H of order n is a unitary matrix of the form

H = I - tau v v^H     (5.1)

where tau is a scalar and v is an n-vector, with |tau|^2 ||v||^2 = 2 Re(tau); v is often referred to as the Householder vector. Often v has several leading or trailing zero elements, but for the purpose of this discussion assume that H has no such special structure.

There is some redundancy in the representation (5.1), which can be removed in various ways. The representation used in LAPACK (which differs from those used in LINPACK or EISPACK) sets v_1 = 1; hence v_1 need not be stored. In real arithmetic, 1 <= tau <= 2, except that tau = 0 implies H = I.

In complex arithmetic, tau may be complex, and satisfies 1 <= Re(tau) <= 2 and |tau - 1| <= 1. Thus a complex H is not Hermitian (as it is in other representations), but it is unitary, which is the important property. The advantage of allowing tau to be complex is that, given an arbitrary complex vector x, H can be computed so that

H^H x = beta (1, 0, ..., 0)^T

with real beta. This is useful, for example, when reducing a complex Hermitian matrix to real symmetric tridiagonal form, or a complex rectangular matrix to real bidiagonal form.
For further details, see Lehoucq [59].
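In the real case the construction can be sketched numerically in Python; this is an illustrative analogue of what a routine like SLARFG computes, not LAPACK code, and the input vector is invented:

```python
import numpy as np

# Numerical sketch (real case) of a LAPACK-style elementary reflector:
# given x, construct v (with v_1 = 1) and tau so that H = I - tau*v*v^T
# maps x onto a multiple of e_1.
x = np.array([3.0, 4.0, 0.0])
alpha, norm_x = x[0], np.linalg.norm(x)
beta = -np.sign(alpha) * norm_x            # sign choice avoids cancellation
v = x / (alpha - beta)
v[0] = 1.0                                 # v_1 = 1, so it need not be stored
tau = (beta - alpha) / beta                # in real arithmetic, 1 <= tau <= 2

H = np.eye(3) - tau * np.outer(v, v)
assert np.allclose(H @ x, [beta, 0.0, 0.0])    # H x = beta * e_1
assert np.allclose(H @ H.T, np.eye(3))         # H is orthogonal
assert 1.0 <= tau <= 2.0
```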
For anyone who obtains the complete LAPACK package from netlib or NAG (see Chapter 1), a comprehensive installation guide is provided. We recommend installation of the complete package as the most convenient and reliable way to make LAPACK available.
People who obtain copies of a few LAPACK routines from netlib need to be aware of the following points:
Some compilers provide DOUBLE COMPLEX as an alternative to COMPLEX*16, and an intrinsic function DREAL instead of DBLE to return the real part of a COMPLEX*16 argument. If the compiler does not accept the constructs used in LAPACK, the installer will have to modify the code: for example, globally change COMPLEX*16 to DOUBLE COMPLEX, or selectively change DBLE to DREAL.
This Users' Guide gives an informal introduction to the design of the package, and a detailed description of its contents. Chapter 5 explains the conventions used in the software and documentation. Part 2 contains complete specifications of all the driver routines and computational routines. These specifications have been derived from the leading comments in the source text.
On-line manpages (troff files) for LAPACK routines, as well as for most of the BLAS routines, are available on netlib. These files are automatically generated at the time of each release. For more information, see the manpages.tar.z entry on the lapack index on netlib.
Machine-dependent parameters such as the block size are set by calls to an inquiry function, which may return different values on each machine. The declaration of the environment inquiry function is
INTEGER FUNCTION ILAENV( ISPEC, NAME, OPTS, N1, N2, N3, N4 )
where ISPEC, N1, N2, N3, and N4 are integer variables and NAME and OPTS are CHARACTER*(*). NAME specifies the subroutine name; OPTS is a character string of options to the subroutine; and N1-N4 are the problem dimensions. ISPEC specifies the parameter to be returned; the following values are currently used in LAPACK:
ISPEC = 1: NB, optimal block size
      = 2: NBMIN, minimum block size for the block routine to be used
      = 3: NX, crossover point (in a block routine, for N < NX, an unblocked routine should be used)
      = 4: NS, number of shifts
      = 6: NXSVD, the threshold point for which the QR factorization is performed prior to reduction to bidiagonal form (if M > NXSVD * N, then a QR factorization is performed)
      = 8: MAXB, crossover point for block multishift QR
The three block size parameters, NB, NBMIN, and NX, are used in many different subroutines (see Table 6.1). NS and MAXB are used in the block multishift QR algorithm, xHSEQR. NXSVD is used in the driver routines xGELSS and xGESVD.
Table 6.1: Use of the block parameters NB, NBMIN, and NX in LAPACK
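Conceptually, ILAENV is just a lookup from (parameter id, routine name, problem size) to a tuning constant. A deliberately simplified Python analogue may make this clearer; every value in the table below is an invented placeholder, not one of LAPACK's defaults:

```python
# Hypothetical sketch of the role ILAENV plays (all values invented).
def ilaenv_sketch(ispec, name, n):
    defaults = {1: 32, 2: 2, 3: 128}       # NB, NBMIN, NX (assumed values)
    if ispec == 1 and n < 64:
        return 1                           # small problems: run unblocked
    return defaults.get(ispec, 1)

assert ilaenv_sketch(1, 'SGETRF', 1000) == 32   # block size NB
assert ilaenv_sketch(1, 'SGETRF', 10) == 1      # too small to block
assert ilaenv_sketch(3, 'SGEQRF', 1000) == 128  # crossover point NX
```

A tuned installation would, in effect, replace the table entries (and the per-routine, per-size logic) with values measured on the target machine.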
The LAPACK testing and timing programs use a special version of ILAENV where the parameters are set via a COMMON block interface. This is convenient for experimenting with different values of, say, the block size in order to exercise different parts of the code and to compare the relative performance of different parameter values.
The LAPACK timing programs were designed to collect data for all the routines in Table 6.1. The range of problem sizes needed to determine the optimal block size or crossover point is machine-dependent, but the input files provided with the LAPACK test and timing package can be used as a starting point. For subroutines that require a crossover point, it is best to start by finding the best block size with the crossover point set to 0, and then to locate the point at which the performance of the unblocked algorithm is beaten by the block algorithm. The best crossover point will be somewhat smaller than the point where the curves for the unblocked and blocked methods cross.
For example, for SGEQRF on a single processor of a CRAY-2, NB = 32 was observed to be a good block size, and the performance of the block algorithm with this block size surpasses the unblocked algorithm for square matrices between N = 176 and N = 192. Experiments with crossover points from 64 to 192 found that NX = 128 was a good choice, although the results for NX from 3*NB to 5*NB are broadly similar. This means that matrices with N <= 128 should use the unblocked algorithm, and for N > 128 block updates should be used until the remaining submatrix has order less than 128. The performance of the unblocked (NB = 1) and blocked (NB = 32) algorithms for SGEQRF and for the blocked algorithm with a crossover point of 128 are compared in Figure 6.1.
Figure 6.1: QR factorization on CRAY-2 (1 processor)
By experimenting with small values of the block size, it should be straightforward to choose NBMIN, the smallest block size that gives a performance improvement over the unblocked algorithm. Note that on some machines, the optimal block size may be 1 (the unblocked algorithm gives the best performance); in this case, the choice of NBMIN is arbitrary. The prototype version of ILAENV sets NBMIN to 2, so that blocking is always done, even though this could lead to poor performance from a block routine if insufficient workspace is supplied (see Chapter 7).
Complicating the determination of optimal parameters is the fact that the orthogonal factorization routines and SGEBRD accept non-square matrices as input. The LAPACK timing program allows M and N to be varied independently. We have found the optimal block size to be generally insensitive to the shape of the matrix, but the crossover point is more dependent on the matrix shape. For example, if M >> N in the QR factorization, block updates may always be faster than unblocked updates on the remaining submatrix, so one might set NX = NB if M >= 2N.
Parameter values for the number of shifts, etc., used to tune the block multishift QR algorithm can be varied from the input files to the eigenvalue timing program. The performance of xHSEQR is particularly sensitive to the correct choice of block parameters. Setting NS = 2 will give essentially the same performance as EISPACK. Interested users should consult [3] for a description of the timing program input files.
For the benefit of less experienced programmers, we give here a list of common programming errors in calling an LAPACK routine. These errors may cause the LAPACK routine to report a failure, as described in Section 7.2; they may cause an error to be reported by the system; or they may lead to wrong results - see also Section 7.3.
Some modern compilation systems, as well as software tools such as the portability checker in Toolpack [66], can check that arguments agree in number and type; and many compilation systems offer run-time detection of errors such as an array element out-of-bounds or use of an unassigned variable.
There are two ways in which an LAPACK routine may report a failure to complete a computation successfully.
If an illegal value is supplied for one of the input arguments to an LAPACK routine, it will call the error handler XERBLA to write a message to the standard output unit of the form:
** On entry to SGESV parameter number 4 had an illegal value
This particular message would be caused by passing to SGESV a value of LDA which was less than the value of the argument N. The documentation for SGESV in Part 2 states the set of acceptable input values: ``LDA >= max(1,N).'' This is required in order that the array A with leading dimension LDA can store an n-by-n matrix. The arguments are checked in order, beginning with the first. In the above example, it may - from the user's point of view - be the value of N which is in fact wrong. Invalid arguments are often caused by the kind of error listed in Section 7.1.
In the model implementation of XERBLA which is supplied with LAPACK, execution stops after the message; but the call to XERBLA is followed by a RETURN statement in the LAPACK routine, so that if the installer removes the STOP statement in XERBLA, the result will be an immediate exit from the LAPACK routine with a negative value of INFO. It is good practice always to check for a non-zero value of INFO on return from an LAPACK routine. (We recommend however that XERBLA should not be modified to return control to the calling routine, unless absolutely necessary, since this would remove one of the built-in safety-features of LAPACK.)
A positive value of INFO on return from an LAPACK routine indicates a failure in the course of the algorithm. Common causes are:
When a failure with INFO > 0 occurs, control is always returned to the calling program; XERBLA is not called, and no error message is written. It is worth repeating that it is good practice always to check for a non-zero value of INFO on return from an LAPACK routine.
A failure with INFO > 0 may indicate any of the following:
Wrong results from LAPACK routines are most often caused by incorrect usage.
It is also possible that wrong results are caused by a bug outside of LAPACK, in the compiler or in one of the library routines, such as the BLAS, that are linked with LAPACK. Test procedures are available for both LAPACK and the BLAS, and the LAPACK installation guide [3] should be consulted for descriptions of the tests and for advice on resolving problems.
A list of known problems, compiler errors, and bugs in LAPACK routines is maintained on netlib; see Chapter 1.
Users who suspect they have found a new bug in an LAPACK routine are encouraged to report it promptly to the developers as directed in Chapter 1. The bug report should include a test case, a description of the problem and expected results, and the actions, if any, that the user has already taken to fix the bug.
We have tried to make the performance of LAPACK ``transportable'' by performing most of the computation within the Level 1, 2, and 3 BLAS, and by isolating all of the machine-dependent tuning parameters in a single integer function ILAENV. To avoid poor performance from LAPACK routines, note the following recommendations:
XXXXXX = SLAMCH( 'P' )
or in double precision:
XXXXXX = DLAMCH( 'P' )
A cleaner but less portable solution is for the installer to save the values computed by xLAMCH for a specific machine and create a new version of xLAMCH with these constants set in DATA statements, taking care that no accuracy is lost in the translation.
The complete LAPACK package or individual routines from LAPACK are most easily obtained through netlib [32]. At the time of this writing, the e-mail addresses for netlib are

netlib@ornl.gov
netlib@research.att.com

Both repositories provide electronic mail and anonymous ftp service (the netlib@ornl.gov site is available via anonymous ftp to netlib2.cs.utk.edu), and the netlib@ornl.gov site additionally provides xnetlib. Xnetlib uses an X Windows graphical user interface and a socket-based connection between the user's machine and the xnetlib server machine to process software requests. For more information on xnetlib, echo ``send index from xnetlib'' | mail netlib@ornl.gov.
General information about LAPACK can be obtained by sending mail to one of the above addresses with the message
send index from lapack
The package is also available on the World Wide Web. It can be accessed through the URL address:
http://www.netlib.org/lapack/index.html
The complete package, including test code and timing programs in four different Fortran data types, constitutes some 735,000 lines of Fortran source and comments.
Alternatively, if a user does not have internet access, the complete package can be obtained on magnetic media from NAG for a cost-covering handling charge.
For further details contact NAG at one of the following addresses:
NAG Inc., 1400 Opus Place, Suite 200, Downers Grove, IL 60515-5702, USA. Tel: +1 708 971 2337; Fax: +1 708 971 2706.
NAG Ltd., Wilkinson House, Jordan Hill Road, Oxford OX2 8DR, England. Tel: +44 865 511245; Fax: +44 865 310139.
NAG GmbH, Schleissheimerstrasse 5, W-8046 Garching bei Munchen, Germany. Tel: +49 89 3207395; Fax: +49 89 3207396.
Level 1 BLAS
(the prefixes at the right indicate the available data types)

SUBROUTINE _ROTG ( A, B, C, S )                          S, D
SUBROUTINE _ROTMG( D1, D2, A, B, PARAM )                 S, D
SUBROUTINE _ROT  ( N, X, INCX, Y, INCY, C, S )           S, D
SUBROUTINE _ROTM ( N, X, INCX, Y, INCY, PARAM )          S, D
SUBROUTINE _SWAP ( N, X, INCX, Y, INCY )                 S, D, C, Z
SUBROUTINE _SCAL ( N, ALPHA, X, INCX )                   S, D, C, Z, CS, ZD
SUBROUTINE _COPY ( N, X, INCX, Y, INCY )                 S, D, C, Z
SUBROUTINE _AXPY ( N, ALPHA, X, INCX, Y, INCY )          S, D, C, Z
FUNCTION   _DOT  ( N, X, INCX, Y, INCY )                 S, D, DS
FUNCTION   _DOTU ( N, X, INCX, Y, INCY )                 C, Z
FUNCTION   _DOTC ( N, X, INCX, Y, INCY )                 C, Z
FUNCTION   __DOT ( N, ALPHA, X, INCX, Y, INCY )          SDS
FUNCTION   _NRM2 ( N, X, INCX )                          S, D, SC, DZ
FUNCTION   _ASUM ( N, X, INCX )                          S, D, SC, DZ
FUNCTION   I_AMAX( N, X, INCX )                          S, D, C, Z
Level 2 BLAS
(matrix-vector operations; the prefixes at the right indicate the available data types)

_GEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )          S, D, C, Z
_GBMV ( TRANS, M, N, KL, KU, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )  S, D, C, Z
_HEMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )              C, Z
_HBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )           C, Z
_HPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY )                  C, Z
_SYMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )              S, D
_SBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )           S, D
_SPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY )                  S, D
_TRMV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX )                       S, D, C, Z
_TBMV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX )                    S, D, C, Z
_TPMV ( UPLO, TRANS, DIAG, N, AP, X, INCX )                           S, D, C, Z
_TRSV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX )                       S, D, C, Z
_TBSV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX )                    S, D, C, Z
_TPSV ( UPLO, TRANS, DIAG, N, AP, X, INCX )                           S, D, C, Z

(rank-1 and rank-2 updates)

_GER  ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )                       S, D
_GERU ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )                       C, Z
_GERC ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )                       C, Z
_HER  ( UPLO, N, ALPHA, X, INCX, A, LDA )                             C, Z
_HPR  ( UPLO, N, ALPHA, X, INCX, AP )                                 C, Z
_HER2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )                    C, Z
_HPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )                        C, Z
_SYR  ( UPLO, N, ALPHA, X, INCX, A, LDA )                             S, D
_SPR  ( UPLO, N, ALPHA, X, INCX, AP )                                 S, D
_SYR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )                    S, D
_SPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )                        S, D
Level 3 BLAS
(matrix-matrix operations; the prefixes at the right indicate the available data types)

_GEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )  S, D, C, Z
_SYMM ( SIDE, UPLO, M, N, ALPHA, A, LDA, B, LDB, BETA, C, LDC )         S, D, C, Z
_HEMM ( SIDE, UPLO, M, N, ALPHA, A, LDA, B, LDB, BETA, C, LDC )         C, Z
_SYRK ( UPLO, TRANS, N, K, ALPHA, A, LDA, BETA, C, LDC )                S, D, C, Z
_HERK ( UPLO, TRANS, N, K, ALPHA, A, LDA, BETA, C, LDC )                C, Z
_SYR2K( UPLO, TRANS, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )        S, D, C, Z
_HER2K( UPLO, TRANS, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )        C, Z
_TRMM ( SIDE, UPLO, TRANSA, DIAG, M, N, ALPHA, A, LDA, B, LDB )         S, D, C, Z
_TRSM ( SIDE, UPLO, TRANSA, DIAG, M, N, ALPHA, A, LDA, B, LDB )         S, D, C, Z
Notes
Meaning of prefixes
S - REAL
C - COMPLEX
D - DOUBLE PRECISION
Z - COMPLEX*16 (this may not be supported by all machines)
For the Level 2 BLAS a set of extended-precision routines with the prefixes ES, ED, EC, EZ may also be available.
Level 1 BLAS
In addition to the listed routines there are two further extended-precision dot product routines DQDOTI and DQDOTA.
Level 2 and Level 3 BLAS
Matrix types
GE - GEneral       GB - General Band
SY - SYmmetric     SB - Symmetric Band     SP - Symmetric Packed
HE - HErmitian     HB - Hermitian Band     HP - Hermitian Packed
TR - TRiangular    TB - Triangular Band    TP - Triangular Packed
Options
Arguments describing options are declared as CHARACTER*1 and may be passed as character strings.
TRANS = 'No transpose', 'Transpose', 'Conjugate transpose' (X, X^T, X^H)
UPLO  = 'Upper triangular', 'Lower triangular'
DIAG  = 'Non-unit triangular', 'Unit triangular'
SIDE  = 'Left', 'Right' (A or op(A) on the left, or A or op(A) on the right)
For real matrices, TRANS = `T' and TRANS = `C' have the same meaning.
For Hermitian matrices, TRANS = `T' is not allowed.
For complex symmetric matrices, TRANS = `H' is not allowed.
This appendix is designed to assist people to convert programs that currently call LINPACK or EISPACK routines, to call LAPACK routines instead.
LAPACK equivalents of LINPACK routines for real matrices
----------------------------------------------------------------
LINPACK  LAPACK   Function of LINPACK routine
----------------------------------------------------------------
SCHDC             Cholesky factorization with diagonal
                  pivoting option
----------------------------------------------------------------
SCHDD             rank-1 downdate of a Cholesky factorization
                  or the triangular factor of a QR factorization
----------------------------------------------------------------
SCHEX             modifies a Cholesky factorization under
                  permutations of the original matrix
----------------------------------------------------------------
SCHUD             rank-1 update of a Cholesky factorization or
                  the triangular factor of a QR factorization
----------------------------------------------------------------
SGBCO    SLANGB   LU factorization and condition estimation
         SGBTRF   of a general band matrix
         SGBCON
----------------------------------------------------------------
SGBDI             determinant of a general band matrix, after
                  factorization by SGBCO or SGBFA
----------------------------------------------------------------
SGBFA    SGBTRF   LU factorization of a general band matrix
----------------------------------------------------------------
SGBSL    SGBTRS   solves a general band system of linear
                  equations, after factorization by SGBCO
                  or SGBFA
----------------------------------------------------------------
SGECO    SLANGE   LU factorization and condition estimation
         SGETRF   of a general matrix
         SGECON
----------------------------------------------------------------
SGEDI    SGETRI   determinant and inverse of a general matrix,
                  after factorization by SGECO or SGEFA
----------------------------------------------------------------
SGEFA    SGETRF   LU factorization of a general matrix
----------------------------------------------------------------
SGESL    SGETRS   solves a general system of linear equations,
                  after factorization by SGECO or SGEFA
----------------------------------------------------------------
SGTSL    SGTSV    solves a general tridiagonal system of
                  linear equations
----------------------------------------------------------------
SPBCO    SLANSB   Cholesky factorization and condition
         SPBTRF   estimation of a symmetric positive definite
         SPBCON   band matrix
----------------------------------------------------------------
SPBDI             determinant of a symmetric positive definite
                  band matrix, after factorization by SPBCO
                  or SPBFA
----------------------------------------------------------------
SPBFA    SPBTRF   Cholesky factorization of a symmetric
                  positive definite band matrix
----------------------------------------------------------------
SPBSL    SPBTRS   solves a symmetric positive definite band
                  system of linear equations, after
                  factorization by SPBCO or SPBFA
----------------------------------------------------------------
SPOCO    SLANSY   Cholesky factorization and condition
         SPOTRF   estimation of a symmetric positive definite
         SPOCON   matrix
----------------------------------------------------------------
SPODI    SPOTRI   determinant and inverse of a symmetric
                  positive definite matrix, after
                  factorization by SPOCO or SPOFA
----------------------------------------------------------------
SPOFA    SPOTRF   Cholesky factorization of a symmetric
                  positive definite matrix
----------------------------------------------------------------
SPOSL    SPOTRS   solves a symmetric positive definite system
                  of linear equations, after factorization by
                  SPOCO or SPOFA
----------------------------------------------------------------
SPPCO    SLANSY   Cholesky factorization and condition
         SPPTRF   estimation of a symmetric positive definite
         SPPCON   matrix (packed storage)
----------------------------------------------------------------

LAPACK equivalents of LINPACK routines for real matrices (continued)
----------------------------------------------------------------
LINPACK  LAPACK   Function of LINPACK routine
----------------------------------------------------------------
SPPDI    SPPTRI   determinant and inverse of a symmetric
                  positive definite matrix, after factorization
                  by SPPCO or SPPFA (packed storage)
----------------------------------------------------------------
SPPFA    SPPTRF   Cholesky factorization of a symmetric
                  positive definite matrix (packed storage)
----------------------------------------------------------------
SPPSL    SPPTRS   solves a symmetric positive definite system
                  of linear equations, after factorization by
                  SPPCO or SPPFA (packed storage)
----------------------------------------------------------------
SPTSL    SPTSV    solves a symmetric positive definite
                  tridiagonal system of linear equations
----------------------------------------------------------------
SQRDC    SGEQPF   QR factorization with optional column
         SGEQRF   pivoting
----------------------------------------------------------------
SQRSL    SORMQR   solves linear least squares problems after
         STRSV    factorization by SQRDC
----------------------------------------------------------------
SSICO    SLANSY   symmetric indefinite factorization and
         SSYTRF   condition estimation of a symmetric
         SSYCON   indefinite matrix
----------------------------------------------------------------
SSIDI    SSYTRI   determinant, inertia and inverse of a
                  symmetric indefinite matrix, after
                  factorization by SSICO or SSIFA
----------------------------------------------------------------
SSIFA    SSYTRF   symmetric indefinite factorization of a
                  symmetric indefinite matrix
----------------------------------------------------------------
SSISL    SSYTRS   solves a symmetric indefinite system of
                  linear equations, after factorization by
                  SSICO or SSIFA
----------------------------------------------------------------
SSPCO    SLANSP   symmetric indefinite factorization and
         SSPTRF   condition estimation of a symmetric
         SSPCON   indefinite matrix (packed storage)
----------------------------------------------------------------
SSPDI    SSPTRI   determinant, inertia and inverse of a
                  symmetric indefinite matrix, after
                  factorization by SSPCO or SSPFA
                  (packed storage)
----------------------------------------------------------------
SSPFA    SSPTRF   symmetric indefinite factorization of a
                  symmetric indefinite matrix (packed storage)
----------------------------------------------------------------
SSPSL    SSPTRS   solves a symmetric indefinite system of
                  linear equations, after factorization by
                  SSPCO or SSPFA (packed storage)
----------------------------------------------------------------
SSVDC    SGESVD   all or part of the singular value
                  decomposition of a general matrix
----------------------------------------------------------------
STRCO    STRCON   condition estimation of a triangular matrix
----------------------------------------------------------------
STRDI    STRTRI   determinant and inverse of a triangular
                  matrix
----------------------------------------------------------------
STRSL    STRTRS   solves a triangular system of linear
                  equations
----------------------------------------------------------------
Most of these working notes are available from netlib, where they are available only in PostScript form. To receive a list of the available PostScript reports, send email to netlib@ornl.gov of the form:

send index from lapack/lawns
A Quick Installation Guide (LAPACK Working Note 81) [35] is distributed with the complete package. The Quick Installation Guide provides installation instructions for Unix systems. A comprehensive Installation Guide [3] (LAPACK Working Note 41), which contains descriptions of the testing and timing programs, as well as detailed non-Unix installation instructions, is also available. See also Chapter 6.
LAPACK Users' Guide
Release 2.0
LAPACK has been thoroughly tested before release, on many different types of computers. The LAPACK project supports the package in the sense that reports of errors or poor performance will receive prompt attention from the developers. Such reports - as well as descriptions of interesting applications and other comments - should be sent to:
LAPACK Project
c/o J. J. Dongarra
Computer Science Department
University of Tennessee
Knoxville, TN 37996-1301
USA
Email: lapack@cs.utk.edu
A list of known problems, bugs, and compiler errors for LAPACK, as well as an errata list for this guide, is maintained on netlib. For a copy of this list, send email to netlib of the form:
send release_notes from lapack
As previously mentioned in the Preface, many LAPACK-related software projects are currently available on netlib. In the context of this users' guide, several of these projects require further discussion - LAPACK++, CLAPACK, ScaLAPACK, and LAPACK routines exploiting IEEE arithmetic.
LAPACK++ is an object-oriented C++ extension of the LAPACK library. Traditionally, linear algebra libraries have been available only in Fortran. However, with an increasing number of programmers using C and C++ for scientific software development, there is a need for high-quality numerical libraries in these languages as well. LAPACK++ provides speed and efficiency competitive with native Fortran codes, while allowing programmers to capitalize on the software engineering benefits of object-oriented programming.
LAPACK++ provides matrix classes for vectors and for nonsymmetric, symmetric positive definite, symmetric, banded, triangular, and tridiagonal matrices; however, the current version does not include all of the capabilities of the original Fortran 77 LAPACK. Emphasis is given to routines for solving nonsymmetric linear systems, symmetric positive definite systems, and linear least squares problems. Future versions of LAPACK++ will support eigenvalue problems and singular value decompositions, as well as distributed matrix classes for parallel computer architectures. For a more detailed description of the design of LAPACK++, please see [36]. This paper, as well as an installation manual and users' guide, is available on netlib. To obtain this software or documentation send a message to netlib@ornl.gov of the form:
send index from c++/lapack++

Questions and comments about LAPACK++ can be directed to lapackpp@cs.utk.edu.
The CLAPACK library was built using a Fortran-to-C conversion utility called f2c [40]. The entire Fortran 77 LAPACK library is run through f2c to obtain C code, which is then modified to improve readability. The goal of CLAPACK is to provide LAPACK to users who do not have access to a Fortran compiler.
However, f2c is designed to create C code that is still callable from Fortran, so all arguments must be passed using Fortran calling conventions and data structures. This requirement has several repercussions. The first is that since many compilers require distinct Fortran and C routine namespaces, an underscore (_) is appended to the names of C routines that are to be called from Fortran. f2c has therefore appended this underscore to all the routine names in CLAPACK. So a call that in Fortran would look like:
call dgetrf(...)

becomes in C:

dgetrf_(...);

Second, the user must pass ALL arguments by reference, i.e., as pointers, since this is how Fortran passes arguments. This includes all scalar arguments like M and N. This restriction means that you cannot place numbers directly in the parameter list. For example, consider the LU factorization of a 5-by-5 matrix. If the matrix to be factored is called A, the Fortran call

call dgetrf(5, 5, A, 5, ipiv, info)

becomes in C:

M = N = LDA = 5;
dgetrf_(&M, &N, A, &LDA, ipiv, &info);
Some LAPACK routines take character string arguments. In all but the testing and timing code, only the first character of the string is significant. Therefore, the CLAPACK driver, computational, and auxiliary routines expect only single-character arguments. For example, the Fortran call

call dpotrf( 'Upper', n, a, lda, info )

becomes in C:

char s = 'U';
dpotrf_(&s, &n, a, &lda, &info);
In a future release we hope to provide ``wrapper'' routines that will remove the need for these unnecessary pointers, and automatically allocate (``malloc'') any workspace that is required.
As a final point, we must stress that two-dimensional arrays are defined differently in Fortran and C. A two-dimensional Fortran array declared as

DOUBLE PRECISION A(LDA, N)

is a contiguous piece of LDA x N double-words of memory, stored in column-major order: elements in a column are contiguous, and elements within a row are separated by a stride of LDA double-words.
In C, however, a two-dimensional array is stored in row-major order. The array

double A[LDA][N];

is a contiguous block in which elements within a row are contiguous, so A[i][j] lives at offset i*N + j; this is the transpose of the Fortran layout. (An array of row pointers, double *A[LDA], whose rows can in principle be anywhere in memory, is even further from what is required.) Passing either kind of two-dimensional C array to a CLAPACK routine will almost surely give erroneous results.
Instead, you must use a one-dimensional C array of size LDA x N double-words (or else malloc the same amount of space). We recommend the following code to create the array CLAPACK will be expecting:

double *A;
A = malloc( LDA*N*sizeof(double) );

Note that for best memory utilization, you would set LDA = M, the actual number of rows of A. If you now wish to operate on the matrix A, remember that A is in column-major order. As an example of accessing Fortran-style arrays in C, the following code fragments show how to initialize the array A declared above so that all of column j has the value j:
double *ptr;
ptr = A;
for (j = 0; j < N; j++) {
    for (i = 0; i < M; i++)
        *ptr++ = j;
    ptr += (LDA - M);
}

or, equivalently:

for (j = 0; j < N; j++) {
    for (i = 0; i < M; i++)
        A[j*LDA + i] = j;
}

Note that the loop over the row index i is the inner loop, since column entries are contiguous.
The ScaLAPACK (or Scalable LAPACK) library includes a subset of LAPACK routines redesigned for distributed memory parallel computers. It is currently written in a Single-Program-Multiple-Data style using explicit message passing for interprocessor communication. It assumes matrices are laid out in a two-dimensional block cyclic decomposition. The goal is to have ScaLAPACK routines resemble their LAPACK equivalents as much as possible. Just as LAPACK is built on top of the BLAS, ScaLAPACK relies on the PBLAS (Parallel Basic Linear Algebra Subprograms) and the BLACS (Basic Linear Algebra Communication Subprograms). The PBLAS perform computations analogous to the BLAS but on matrices distributed across multiple processors. The PBLAS rely on the communication protocols of the BLACS. The BLACS are designed for linear algebra applications and provide portable communication across a wide variety of distributed-memory architectures. At the present time, they are available for the Intel Gamma, Delta, and Paragon, Thinking Machines CM-5, IBM SPs, and PVM. They will soon be available for the CRAY T3D. For more information:
echo "send index from scalapack" | mail netlib@ornl.gov

All questions/comments can be directed to scalapack@cs.utk.edu.
We have also explored the advantages of IEEE arithmetic in implementing linear algebra routines. First, the accurate rounding properties of IEEE arithmetic permit high precision arithmetic to be simulated economically in short stretches of code, thereby replacing possibly much more complicated low precision algorithms. Second, the ``friendly'' exception handling capabilities of IEEE arithmetic, such as being able to continue computing past an overflow and to ask later whether an overflow occurred, permit us to use simple, fast algorithms which work almost all the time, and to revert to slower, safer algorithms only if the fast algorithm fails. See [23] for more details.
However, the continuing importance of machines implementing Cray arithmetic, the existence of some machines that implement full IEEE exception handling only by slowing down all floating point operations significantly, and the lack of portable ways to refer to exceptions in Fortran or C have led us not to include these improved algorithms in this release of LAPACK. Since Cray has announced plans to convert to IEEE arithmetic, and some progress is being made on standardizing exception handling [65], we do expect to make these routines available in a future release.
The subroutines in LAPACK are classified as follows:

- driver routines, each of which solves a complete problem, for example solving a system of linear equations, or computing the eigenvalues of a real symmetric matrix;
- computational routines, each of which performs a distinct computational task, for example an LU factorization, or the reduction of a real symmetric matrix to tridiagonal form;
- auxiliary routines, which perform subtasks of block algorithms or common low-level computations.
Both driver routines and computational routines are fully described in this Users' Guide, but not the auxiliary routines. A list of the auxiliary routines, with brief descriptions of their functions, is given in Appendix B.
LAPACK provides the same range of functionality for real and complex data.
For most computations there are matching routines, one for real and one for complex data, but there are a few exceptions. For example, corresponding to the routines for real symmetric indefinite systems of linear equations, there are routines for complex Hermitian and complex symmetric systems, because both types of complex systems occur in practical applications. However, there is no complex analogue of the routine for finding selected eigenvalues of a real symmetric tridiagonal matrix, because a complex Hermitian matrix can always be reduced to a real symmetric tridiagonal matrix.
Matching routines for real and complex data have been coded to maintain a close correspondence between the two, wherever possible. However, in some areas (especially the nonsymmetric eigenproblem) the correspondence is necessarily weaker.
All routines in LAPACK are provided in both single and double precision versions. The double precision versions have been generated automatically, using Toolpack/1 [66].
Double precision routines for complex matrices require the non-standard Fortran data type COMPLEX*16, which is available on most machines where double precision computation is usual.
The name of each LAPACK routine is a coded specification of its function (within the very tight limits of standard Fortran 77 6-character names).
All driver and computational routines have names of the form XYYZZZ, where for some driver routines the 6th character is blank.
The first letter, X, indicates the data type as follows:
S   REAL
D   DOUBLE PRECISION
C   COMPLEX
Z   COMPLEX*16 or DOUBLE COMPLEX
When we wish to refer to an LAPACK routine generically, regardless of data type, we replace the first letter by ``x''. Thus xGESV refers to any or all of the routines SGESV, CGESV, DGESV and ZGESV.
The next two letters, YY, indicate the type of matrix (or of the most significant matrix). Most of these two-letter codes apply to both real and complex matrices; a few apply specifically to one or the other, as indicated in Table 2.1.
BD  bidiagonal
GB  general band
GE  general (i.e., unsymmetric, in some cases rectangular)
GG  general matrices, generalized problem (i.e., a pair of general matrices) (not used in Release 1.0)
GT  general tridiagonal
HB  (complex) Hermitian band
HE  (complex) Hermitian
HG  upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix) (not used in Release 1.0)
HP  (complex) Hermitian, packed storage
HS  upper Hessenberg
OP  (real) orthogonal, packed storage
OR  (real) orthogonal
PB  symmetric or Hermitian positive definite band
PO  symmetric or Hermitian positive definite
PP  symmetric or Hermitian positive definite, packed storage
PT  symmetric or Hermitian positive definite tridiagonal
SB  (real) symmetric band
SP  symmetric, packed storage
ST  (real) symmetric tridiagonal
SY  symmetric
TB  triangular band
TG  triangular matrices, generalized problem (i.e., a pair of triangular matrices) (not used in Release 1.0)
TP  triangular, packed storage
TR  triangular (or in some cases quasi-triangular)
TZ  trapezoidal
UN  (complex) unitary
UP  (complex) unitary, packed storage
When we wish to refer to a class of routines that performs the same function on different types of matrices, we replace the first three letters by ``xyy''. Thus xyySVX refers to all the expert driver routines for systems of linear equations that are listed in Table 2.2.
The last three letters ZZZ indicate the computation performed. Their meanings will be explained in Section 2.3. For example, SGEBRD is a single precision routine that performs a bidiagonal reduction (BRD) of a real general matrix.
The names of auxiliary routines follow a similar scheme except that the 2nd and 3rd characters YY are usually LA (for example, SLASCL or CLARFG). There are two kinds of exception. Auxiliary routines that implement an unblocked version of a block algorithm have similar names to the routines that perform the block algorithm, with the sixth character being ``2'' (for example, SGETF2 is the unblocked version of SGETRF). A few routines that may be regarded as extensions to the BLAS are named according to the BLAS naming schemes (for example, CROT, CSYR).
This section describes the driver routines in LAPACK. Further details on the terminology and the numerical operations they perform are given in Section 2.3, which describes the computational routines.
Two types of driver routines are provided for solving systems of linear equations:

- a simple driver (name ending -SV), which solves the system AX = B by factorizing A and overwriting B with the solution X;
- an expert driver (name ending -SVX), which can also estimate the condition number of A and check for near-singularity, refine the solution and compute forward and backward error bounds, and equilibrate the system if A is poorly scaled.

The expert driver requires roughly twice as much storage as the simple driver in order to perform these extra functions.
Both types of driver routines can handle multiple right hand sides (the columns of B).
Different driver routines are provided to take advantage of special properties or storage schemes of the matrix A, as shown in Table 2.2.
These driver routines cover all the functionality of the computational routines for linear systems, except matrix inversion. It is seldom necessary to compute the inverse of a matrix explicitly, and it is certainly not recommended as a means of solving linear systems.
--------------------------------------------------------------------------
Type of matrix                          Single precision   Double precision
and storage scheme     Operation        real     complex   real     complex
--------------------------------------------------------------------------
general                simple driver    SGESV    CGESV     DGESV    ZGESV
                       expert driver    SGESVX   CGESVX    DGESVX   ZGESVX
--------------------------------------------------------------------------
general band           simple driver    SGBSV    CGBSV     DGBSV    ZGBSV
                       expert driver    SGBSVX   CGBSVX    DGBSVX   ZGBSVX
--------------------------------------------------------------------------
general tridiagonal    simple driver    SGTSV    CGTSV     DGTSV    ZGTSV
                       expert driver    SGTSVX   CGTSVX    DGTSVX   ZGTSVX
--------------------------------------------------------------------------
symmetric/Hermitian    simple driver    SPOSV    CPOSV     DPOSV    ZPOSV
positive definite      expert driver    SPOSVX   CPOSVX    DPOSVX   ZPOSVX
--------------------------------------------------------------------------
symmetric/Hermitian    simple driver    SPPSV    CPPSV     DPPSV    ZPPSV
positive definite      expert driver    SPPSVX   CPPSVX    DPPSVX   ZPPSVX
(packed storage)
--------------------------------------------------------------------------
symmetric/Hermitian    simple driver    SPBSV    CPBSV     DPBSV    ZPBSV
positive definite      expert driver    SPBSVX   CPBSVX    DPBSVX   ZPBSVX
band
--------------------------------------------------------------------------
symmetric/Hermitian    simple driver    SPTSV    CPTSV     DPTSV    ZPTSV
positive definite      expert driver    SPTSVX   CPTSVX    DPTSVX   ZPTSVX
tridiagonal
--------------------------------------------------------------------------
symmetric/Hermitian    simple driver    SSYSV    CHESV     DSYSV    ZHESV
indefinite             expert driver    SSYSVX   CHESVX    DSYSVX   ZHESVX
--------------------------------------------------------------------------
complex symmetric      simple driver             CSYSV              ZSYSV
                       expert driver             CSYSVX             ZSYSVX
--------------------------------------------------------------------------
symmetric/Hermitian    simple driver    SSPSV    CHPSV     DSPSV    ZHPSV
indefinite (packed     expert driver    SSPSVX   CHPSVX    DSPSVX   ZHPSVX
storage)
--------------------------------------------------------------------------
complex symmetric      simple driver             CSPSV              ZSPSV
(packed storage)       expert driver             CSPSVX             ZSPSVX
--------------------------------------------------------------------------
Table 2.2: Driver routines for linear equations
The linear least squares (LLS) problem is:

    minimize_x || b - Ax ||_2                                   (2.1)

where A is an m-by-n matrix, b is a given m-element vector and x is the n-element solution vector.
In the most usual case m >= n and rank(A) = n, and in this case the solution to problem (2.1) is unique, and the problem is also referred to as finding a least squares solution to an overdetermined system of linear equations.
When m < n and rank(A) = m, there are infinitely many solutions x which exactly satisfy b - Ax = 0. In this case it is often useful to find the unique solution x which minimizes ||x||_2, and the problem is referred to as finding a minimum norm solution to an underdetermined system of linear equations.
The driver routine xGELS solves problem (2.1) on the assumption that rank(A) = min(m, n) - in other words, A has full rank - finding a least squares solution of an overdetermined system when m > n, and a minimum norm solution of an underdetermined system when m < n. xGELS uses a QR or LQ factorization of A, and also allows A to be replaced by A^T in the statement of the problem (or by A^H if A is complex).
In the general case when we may have rank(A) < min(m, n) - in other words, A may be rank-deficient - we seek the minimum norm least squares solution x which minimizes both ||x||_2 and ||b - Ax||_2.
The driver routines xGELSX and xGELSS solve this general formulation of problem 2.1, allowing for the possibility that A is rank-deficient; xGELSX uses a complete orthogonal factorization of A, while xGELSS uses the singular value decomposition of A.
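For reference, the SVD-based approach of xGELSS can be summarized in standard notation (this summary is ours, not taken from the routine documentation). Writing the singular value decomposition of A, the minimum norm least squares solution is

```latex
A = U \Sigma V^T, \qquad
x \;=\; V \Sigma^{+} U^T b \;=\; \sum_{\sigma_i > \tau} \frac{u_i^T b}{\sigma_i}\, v_i ,
```

where singular values $\sigma_i$ at or below a chosen threshold $\tau$ are treated as zero; this truncation is how rank deficiency is handled.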
The LLS driver routines are listed in Table 2.3.
All three routines allow several right hand side vectors b and corresponding solutions x to be handled in a single call, storing these vectors as columns of matrices B and X, respectively. Note however that problem (2.1) is solved for each right hand side vector independently; this is not the same as finding a matrix X which minimizes ||B - AX||_F.
-------------------------------------------------------------------
                              Single precision    Double precision
Operation                     real     complex    real     complex
-------------------------------------------------------------------
solve LLS using QR or         SGELS    CGELS      DGELS    ZGELS
  LQ factorization
solve LLS using complete      SGELSX   CGELSX     DGELSX   ZGELSX
  orthogonal factorization
solve LLS using SVD           SGELSS   CGELSS     DGELSS   ZGELSS
-------------------------------------------------------------------
Table 2.3: Driver routines for linear least squares problems
Since its initial public release in February 1992, LAPACK has expanded in both depth and breadth. LAPACK is now available in both Fortran and C. The publication of this second edition of the Users' Guide coincides with the release of version 2.0 of the LAPACK software.
This release of LAPACK introduces new routines and extends the functionality of existing routines. Prominent among the new routines are driver and computational routines for the generalized nonsymmetric eigenproblem, generalized linear least squares problems, the generalized singular value decomposition, a generalized banded symmetric-definite eigenproblem, and divide-and-conquer methods for symmetric eigenproblems. Additional computational routines include the generalized QR and RQ factorizations and reduction of a band matrix to bidiagonal form.
Added functionality has been incorporated into the expert driver routines that involve equilibration (xGESVX, xGBSVX, xPOSVX, xPPSVX, and xPBSVX). The option FACT = 'F' now permits the user to input a prefactored, pre-equilibrated matrix. The expert drivers xGESVX and xGBSVX now return the reciprocal of the pivot growth from Gaussian elimination. xBDSQR has been modified to compute singular values of bidiagonal matrices much more quickly than before, provided singular vectors are not also wanted. The least squares driver routines xGELS, xGELSS, and xGELSX now make available the residual root-sum-squares for each right hand side.
All LAPACK routines reflect the current version number with the date on the routine indicating when it was last modified. For more information on revisions to the LAPACK software or this Users' Guide please refer to the LAPACK release_notes file on netlib. Instructions for obtaining this file can be found in Chapter 1.
On-line manpages (troff files) for LAPACK routines, as well as for most of the BLAS routines, are available on netlib. Refer to Section 1.6 for further details.
We hope that future releases of LAPACK will include routines for reordering eigenvalues in the generalized Schur factorization; solving the generalized Sylvester equation; computing condition numbers for the generalized eigenproblem (for eigenvalues, eigenvectors, clusters of eigenvalues, and deflating subspaces); fast algorithms for the singular value decomposition based on divide and conquer; high accuracy methods for symmetric eigenproblems and the SVD based on Jacobi's algorithm; updating and/or downdating for linear least squares problems; computing singular values by bidiagonal bisection; and computing singular vectors by bidiagonal inverse iteration.
The following additions/modifications have been made to this second edition of the Users' Guide:
Chapter 1 (Essentials) now includes information on accessing LAPACK via the World Wide Web.
Chapter 2 (Contents of LAPACK) has been expanded to discuss new routines.
Chapter 3 (Performance of LAPACK) has been updated with performance results from version 2.0 of LAPACK. In addition, a new section entitled ``LAPACK Benchmark'' has been introduced to present timings for several driver routines.
Chapter 4 (Accuracy and Stability) has been simplified and rewritten. Much of the theory and other details have been separated into ``Further Details'' sections. Example Fortran code segments are included to demonstrate the calculation of error bounds using LAPACK.
Appendices A, B, and D have been expanded to cover the new routines.
Appendix E (LAPACK Working Notes) lists a number of new Working Notes, written during the LAPACK 2 and ScaLAPACK projects (see below) and published by the University of Tennessee. The Bibliography has been updated to give the most recent published references.
The Specifications of Routines have been extended and updated to cover the new routines and revisions to existing routines.
The Bibliography and Index have been moved to the end of the book. The Index has been expanded into two indexes: Index by Keyword and Index by Routine Name. Occurrences of LAPACK, LINPACK, and EISPACK routine names have been cited in the latter index.
The original LAPACK project was funded by the NSF. Since its completion, two follow-up projects, LAPACK 2 and ScaLAPACK, have been funded in the U.S. by the NSF and ARPA in 1990-1994 and 1991-1995, respectively. In addition to making possible the additions and extensions in this release, these grants have supported the following closely related activities.
A major effort is underway to implement LAPACK-type algorithms for distributed memory machines. As a result of these efforts, several new software items are now available on netlib. The new items that have been introduced are distributed memory versions of the core routines from LAPACK; a fully parallel package to solve a symmetric positive definite sparse linear system on a message passing multiprocessor using Cholesky factorization; a package based on Arnoldi's method for solving large-scale nonsymmetric, symmetric, and generalized algebraic eigenvalue problems; and templates for sparse iterative methods for solving Ax = b. For more information on the availability of each of these packages, consult the scalapack and linalg indexes on netlib via netlib@ornl.gov.
We have also explored the advantages of IEEE floating point arithmetic [4] in implementing linear algebra routines. The accurate rounding properties and ``friendly'' exception handling capabilities of IEEE arithmetic permit us to write faster, more robust versions of several algorithms in LAPACK. Since all machines do not yet implement IEEE arithmetic, these algorithms are not currently part of the library [23], although we expect them to be in the future. For more information, please refer to Section 1.11.
LAPACK has been translated from Fortran into C and, in addition, a subset of the LAPACK routines has been implemented in C++ . For more information on obtaining the C or C++ versions of LAPACK, consult Section 1.11 or the clapack or c++ indexes on netlib via netlib@ornl.gov.
We deeply appreciate the careful scrutiny of those individuals who reported mistakes, typographical errors, or shortcomings in the first edition.
We acknowledge with gratitude the support which we have received from the following organizations and the help of individual members of their staff: Cray Research Inc.; NAG Ltd.
We would additionally like to thank the following people, who were not acknowledged in the first edition, for their contributions:
Françoise Chatelin, Inderjit Dhillon, Stan Eisenstat, Vince Fernando, Ming Gu, Rencang Li, Xiaoye Li, George Ostrouchov, Antoine Petitet, Chris Puscasiu, Huan Ren, Jeff Rutter, Ken Stanley, Steve Timson, and Clint Whaley.
Driver routines are provided for two types of generalized linear least squares problems.
The first is the linear equality-constrained least squares (LSE) problem:

    minimize_x || c - Ax ||_2   subject to   Bx = d

where A is an m-by-n matrix and B is a p-by-n matrix, c is a given m-vector, and d is a given p-vector, with p <= n <= m + p. The routine xGGLSE solves this problem using the generalized RQ (GRQ) factorization, on the assumptions that B has full row rank p and the matrix formed by stacking A on top of B has full column rank n. Under these assumptions, the LSE problem has a unique solution.
The second generalized linear least squares problem is

    minimize || y ||_2    subject to    d = Ax + By

where A is an n-by-m matrix, B is an n-by-p matrix, and d is a given n-vector, with m <= n <= m + p. This is sometimes called a general (Gauss-Markov) linear model problem (GLM). When B = I, the problem reduces to an ordinary linear least squares problem. When B is square and nonsingular, the GLM problem is equivalent to the weighted linear least squares problem:

    minimize || B^{-1}(d - Ax) ||_2
The routine xGGGLM solves this problem using the generalized QR (GQR) factorization, on the assumptions that A has full column rank m, and the matrix (A , B) has full row rank n. Under these assumptions, the problem is always consistent, and there are unique solutions x and y. The driver routines for generalized linear least squares problems are listed in Table 2.4.
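The LSE problem above can also be solved, less efficiently than by xGGLSE's GRQ approach, with a null-space method. The following Python/NumPy sketch (with randomly generated data; the names here are illustrative, not part of LAPACK) eliminates the constraint Bx = d and solves an unconstrained least squares problem for the remaining degrees of freedom:

```python
import numpy as np

# Null-space method for the LSE problem: minimize ||c - Ax||_2 subject
# to Bx = d (illustrative sketch; xGGLSE uses the GRQ factorization).
rng = np.random.default_rng(0)
m, n, p = 8, 5, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, n))
c = rng.standard_normal(m)
d = rng.standard_normal(p)

# Particular solution of the constraints B x = d.
x0, *_ = np.linalg.lstsq(B, d, rcond=None)

# Orthonormal basis N for the null space of B: the last n - p right
# singular vectors (B has full row rank p with probability 1 here).
_, _, Vt = np.linalg.svd(B)
N = Vt[p:].T

# Minimize ||c - A(x0 + N z)||_2 over the free variable z.
z, *_ = np.linalg.lstsq(A @ N, c - A @ x0, rcond=None)
x = x0 + N @ z

print(np.allclose(B @ x, d))  # constraints hold at the solution
```

Under the stated rank assumptions the solution is unique, so this sketch and xGGLSE compute the same x.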
------------------------------------------------------------------
                               Single precision   Double precision
Operation                       real    complex    real    complex
------------------------------------------------------------------
solve LSE problem using GRQ    SGGLSE  CGGLSE     DGGLSE  ZGGLSE
solve GLM problem using GQR    SGGGLM  CGGGLM     DGGGLM  ZGGGLM
------------------------------------------------------------------
Table 2.4: Driver routines for generalized linear least squares problems
The symmetric eigenvalue problem is to find the eigenvalues, lambda, and corresponding eigenvectors, z != 0, such that

    Az = lambda z,    A = A^T,    where A is real.

For the Hermitian eigenvalue problem we have

    Az = lambda z,    A = A^H.

For both problems the eigenvalues lambda are real.

When all eigenvalues and eigenvectors have been computed, we write

    A = Z Lambda Z^T    (A = Z Lambda Z^H in the complex case)

where Lambda is a diagonal matrix whose diagonal elements are the eigenvalues, and Z is an orthogonal (or unitary) matrix whose columns are the eigenvectors. This is the classical spectral factorization of A.
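As a quick numerical illustration of the spectral factorization (in Python/NumPy rather than LAPACK's Fortran interface, with a random symmetric test matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2                    # symmetric test matrix

lam, Z = np.linalg.eigh(A)           # eigenvalues (ascending) and eigenvectors
print(np.allclose(Z @ np.diag(lam) @ Z.T, A))   # A = Z Lambda Z^T
print(np.allclose(Z.T @ Z, np.eye(5)))          # Z is orthogonal
print(np.all(np.isreal(lam)))                   # eigenvalues are real
```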
Three types of driver routines are provided for symmetric or Hermitian eigenproblems:
Different driver routines are provided to take advantage of special structure or storage of the matrix A, as shown in Table 2.5.
In the future LAPACK will include routines based on the Jacobi algorithm [76] [69] [24], which are slower than the above routines but can be significantly more accurate.
The nonsymmetric eigenvalue problem is to find the eigenvalues, lambda, and corresponding eigenvectors, v != 0, such that

    Av = lambda v.

A real matrix A may have complex eigenvalues, occurring as complex conjugate pairs. More precisely, the vector v is called a right eigenvector of A, and a vector u != 0 satisfying

    u^H A = lambda u^H

is called a left eigenvector of A.

This problem can be solved via the Schur factorization of A, defined in the real case as

    A = Z T Z^T

where Z is an orthogonal matrix and T is an upper quasi-triangular matrix with 1-by-1 and 2-by-2 diagonal blocks, the 2-by-2 blocks corresponding to complex conjugate pairs of eigenvalues of A. In the complex case the Schur factorization is

    A = Z T Z^H

where Z is unitary and T is a complex upper triangular matrix.
The columns of Z are called the Schur vectors. For each k (1 <= k <= n), the first k columns of Z form an orthonormal basis for the invariant subspace corresponding to the first k eigenvalues on the diagonal of T. Because this basis is orthonormal, it is preferable in many applications to compute Schur vectors rather than eigenvectors. It is possible to order the Schur factorization so that any desired set of k eigenvalues occupy the k leading positions on the diagonal of T.
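The invariant-subspace property of the leading Schur vectors can be checked numerically; the sketch below uses SciPy's complex Schur form (where T is strictly upper triangular, so any leading k works) on random data:

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))

# Complex Schur form A = Z T Z^H, T upper triangular.
T, Z = schur(A, output='complex')
print(np.allclose(Z @ T @ Z.conj().T, A))

# The leading k Schur vectors span an invariant subspace:
# A Z[:, :k] = Z[:, :k] T[:k, :k] because T is upper triangular.
k = 3
print(np.allclose(A @ Z[:, :k], Z[:, :k] @ T[:k, :k]))
```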
Two pairs of drivers are provided, one pair focusing on the Schur factorization, and the other pair on the eigenvalues and eigenvectors as shown in Table 2.5:
The singular value decomposition of an m-by-n matrix A is given by

    A = U Sigma V^T    (A = U Sigma V^H in the complex case)

where U and V are orthogonal (unitary) and Sigma is an m-by-n diagonal matrix with real diagonal elements, sigma_i, such that

    sigma_1 >= sigma_2 >= ... >= sigma_min(m,n) >= 0.

The sigma_i are the singular values of A and the first min(m, n) columns of U and V are the left and right singular vectors of A.

The singular values and singular vectors satisfy:

    A v_i = sigma_i u_i    and    A^T u_i = sigma_i v_i    (A^H u_i = sigma_i v_i in the complex case)

where u_i and v_i are the i-th columns of U and V respectively.
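These defining relations are easy to verify numerically (a NumPy illustration on a random matrix, not the LAPACK calling sequence):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Singular values are nonnegative and in decreasing order.
print(np.all(s >= 0) and np.all(s[:-1] >= s[1:]))

# A v_i = sigma_i u_i  and  A^T u_i = sigma_i v_i for each i.
for i in range(len(s)):
    assert np.allclose(A @ V[:, i], s[i] * U[:, i])
    assert np.allclose(A.T @ U[:, i], s[i] * V[:, i])
```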
A single driver routine xGESVD computes all or part of the singular value decomposition of a general nonsymmetric matrix (see Table 2.5). A future version of LAPACK will include a driver based on divide and conquer, as in section 2.2.4.1.
--------------------------------------------------------------------------
Type of                                  Single precision  Double precision
problem  Function and storage scheme     real     complex  real     complex
--------------------------------------------------------------------------
SEP      simple driver                   SSYEV    CHEEV    DSYEV    ZHEEV
         expert driver                   SSYEVX   CHEEVX   DSYEVX   ZHEEVX
         simple driver (packed storage)  SSPEV    CHPEV    DSPEV    ZHPEV
         expert driver (packed storage)  SSPEVX   CHPEVX   DSPEVX   ZHPEVX
         simple driver (band matrix)     SSBEV    CHBEV    DSBEV    ZHBEV
         expert driver (band matrix)     SSBEVX   CHBEVX   DSBEVX   ZHBEVX
         simple driver (tridiagonal      SSTEV             DSTEV
           matrix)
         expert driver (tridiagonal      SSTEVX            DSTEVX
           matrix)
--------------------------------------------------------------------------
NEP      simple driver for               SGEES    CGEES    DGEES    ZGEES
           Schur factorization
         expert driver for               SGEESX   CGEESX   DGEESX   ZGEESX
           Schur factorization
         simple driver for               SGEEV    CGEEV    DGEEV    ZGEEV
           eigenvalues/vectors
         expert driver for               SGEEVX   CGEEVX   DGEEVX   ZGEEVX
           eigenvalues/vectors
--------------------------------------------------------------------------
SVD      singular values/vectors         SGESVD   CGESVD   DGESVD   ZGESVD
--------------------------------------------------------------------------
Table 2.5: Driver routines for standard eigenvalue and singular value problems
Simple drivers are provided to compute all the eigenvalues and (optionally) the eigenvectors of the following types of generalized symmetric-definite eigenproblems:

1. Az = lambda Bz
2. ABz = lambda z
3. BAz = lambda z

where A and B are symmetric or Hermitian and B is positive definite. For all these problems the eigenvalues lambda are real. The matrices Z of computed eigenvectors satisfy Z^H A Z = Lambda (problem types 1 and 3) or Z^H B A B Z = Lambda (problem type 2), where Lambda is a diagonal matrix with the eigenvalues on the diagonal. Z also satisfies Z^H B Z = I (problem types 1 and 2) or Z^H B^{-1} Z = I (problem type 3).
The routines are listed in Table 2.6.
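A small SciPy illustration of problem type 1 (random data; SciPy's eigh plays the role of the xSYGV driver here and normalizes the eigenvectors so that Z^T B Z = I):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(12)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
B = rng.standard_normal((n, n)); B = B @ B.T + n * np.eye(n)  # positive definite

w, Z = eigh(A, B)                 # solves A z = lambda B z (problem type 1)
print(np.all(np.isreal(w)))                    # eigenvalues are real
print(np.allclose(A @ Z, (B @ Z) * w))         # A Z = B Z Lambda
print(np.allclose(Z.T @ B @ Z, np.eye(n)))     # Z^T B Z = I
```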
Given two square matrices A and B, the generalized nonsymmetric eigenvalue problem is to find the eigenvalues lambda and corresponding eigenvectors z != 0 such that

    Az = lambda Bz

or find the eigenvalues mu and corresponding eigenvectors z != 0 such that

    mu Az = Bz.

Note that these problems are equivalent with mu = 1/lambda and lambda = 1/mu if neither lambda nor mu is zero. In order to deal with the case that lambda or mu is zero, or nearly so, the LAPACK routines return two values, alpha and beta, for each eigenvalue, such that lambda = alpha/beta and mu = beta/alpha.
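The defining relation A z = lambda B z can be checked numerically; the sketch below uses SciPy's generalized eigensolver on random data (B is nonsingular here with probability 1, so every lambda is finite):

```python
import numpy as np
from scipy.linalg import eig

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))   # nonsingular with probability 1

# Generalized eigenvalues lambda and right eigenvectors z of A z = lambda B z.
w, vr = eig(A, B)
for lam, z in zip(w, vr.T):
    assert np.allclose(A @ z, lam * (B @ z))
print("A z = lambda B z verified for all eigenpairs")
```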
More precisely, the vectors z above are called right eigenvectors. A vector q != 0 satisfying

    q^H A = lambda q^H B    (or mu q^H A = q^H B)

is called a left eigenvector.

If the determinant of A - lambda B is zero for all values of lambda, the eigenvalue problem is called singular; this is signaled by some alpha = beta = 0 (in the presence of roundoff, alpha and beta may both be very small). In this case the eigenvalue problem is very ill-conditioned, and in fact some of the other nonzero values of alpha and beta may be indeterminate [21] [80] [71].
The generalized nonsymmetric eigenvalue problem can be solved via the generalized Schur factorization of the pair (A, B), defined in the real case as

    A = QSZ^T,    B = QPZ^T

where Q and Z are orthogonal matrices, P is upper triangular, and S is an upper quasi-triangular matrix with 1-by-1 and 2-by-2 diagonal blocks, the 2-by-2 blocks corresponding to complex conjugate pairs of eigenvalues of (A, B). In the complex case the generalized Schur factorization is

    A = QSZ^H,    B = QPZ^H

where Q and Z are unitary and S and P are both upper triangular.
The columns of Q and Z are called generalized Schur vectors and span pairs of deflating subspaces of A and B [72]. Deflating subspaces are a generalization of invariant subspaces: for each k (1 <= k <= n), the first k columns of Z span a right deflating subspace mapped by both A and B into a left deflating subspace spanned by the first k columns of Q.
Two simple drivers are provided for the generalized nonsymmetric eigenvalue problem: xGEGS computes the generalized Schur factorization, and xGEGV computes the generalized eigenvalues and (optionally) the eigenvectors (see Table 2.6).
The generalized (or quotient) singular value decomposition of an m-by-n matrix A and a p-by-n matrix B is given by the pair of factorizations

    A = U Sigma_1 ( 0  R ) Q^T    and    B = V Sigma_2 ( 0  R ) Q^T

(with Q^H replacing Q^T in the complex case). The matrices in these factorizations have the following properties:

- U is m-by-m, V is p-by-p, and Q is n-by-n, and all three are orthogonal (unitary in the complex case).
- R is r-by-r, upper triangular and nonsingular, and ( 0  R ) is r-by-n, where r <= n is the rank of the (m+p)-by-n matrix formed by stacking A on top of B.
- Sigma_1 is m-by-r and Sigma_2 is p-by-r; both are real, nonnegative, and diagonal, satisfying Sigma_1^T Sigma_1 + Sigma_2^T Sigma_2 = I. Write Sigma_1^T Sigma_1 = diag(alpha_1^2, ..., alpha_r^2) and Sigma_2^T Sigma_2 = diag(beta_1^2, ..., beta_r^2). The ratios alpha_i / beta_i are called the generalized singular values.
Sigma_1 and Sigma_2 have the following detailed structures, depending on whether m - r >= 0 or m - r < 0. In the first case, m - r >= 0,

                      k   l                         k   l
    Sigma_1 =   k   ( I   0 )      Sigma_2 =   l  ( 0   S )
                l   ( 0   C )                 p-l ( 0   0 )
              m-k-l ( 0   0 )

Here l is the rank of B, k = r - l, C and S are diagonal matrices satisfying C^2 + S^2 = I, and S is nonsingular. We may also identify alpha_i = 1 for i = 1, ..., k, alpha_{k+i} = c_i for i = 1, ..., l, beta_i = 0 for i = 1, ..., k, and beta_{k+i} = s_i for i = 1, ..., l. Thus, the first k generalized singular values are infinite, and the remaining l generalized singular values are finite.

In the second case, when m - r < 0,

                     k  m-k  k+l-m
    Sigma_1 =   k  ( I   0    0 )
               m-k ( 0   C    0 )

and

                       k  m-k  k+l-m
    Sigma_2 =   m-k  ( 0   S    0 )
               k+l-m ( 0   0    I )
                 p-l ( 0   0    0 )

Again, l is the rank of B, k = r - l, C and S are diagonal matrices satisfying C^2 + S^2 = I, S is nonsingular, and we may identify alpha_i = 1 for i = 1, ..., k, alpha_{k+i} = c_i for i = 1, ..., m-k, alpha_i = 0 for i = m+1, ..., r, beta_i = 0 for i = 1, ..., k, beta_{k+i} = s_i for i = 1, ..., m-k, and beta_i = 1 for i = m+1, ..., r. Thus, the first k generalized singular values are infinite, and the remaining l generalized singular values are finite.
Here are some important special cases of the generalized singular value decomposition. First, if B is square and nonsingular, then r = n and the generalized singular value decomposition of A and B is equivalent to the singular value decomposition of A B^{-1}, where the singular values of A B^{-1} are equal to the generalized singular values of the pair (A, B):

    A B^{-1} = (U Sigma_1 R Q^T)(V Sigma_2 R Q^T)^{-1} = U (Sigma_1 Sigma_2^{-1}) V^T.
Second, if the columns of the stacked matrix formed from A and B are orthonormal, then r = n, R = I and the generalized singular value decomposition of A and B is equivalent to the CS (Cosine-Sine) decomposition of that stacked matrix [45]:

    ( A )   ( U  0 ) ( Sigma_1 )
    ( B ) = ( 0  V ) ( Sigma_2 ) Q^T.
Third, the generalized eigenvalues and eigenvectors of A^T A - lambda B^T B can be expressed in terms of the generalized singular value decomposition: Let

    X = Q ( I  0      )
          ( 0  R^{-1} ).

Then

    X^T A^T A X = diag(0, Sigma_1^T Sigma_1)    and    X^T B^T B X = diag(0, Sigma_2^T Sigma_2).

Therefore, the columns of X are the eigenvectors of A^T A - lambda B^T B, and the ``nontrivial'' eigenvalues are the squares of the generalized singular values (see also section 2.2.5.1). ``Trivial'' eigenvalues are those corresponding to the leading n - r columns of X, which span the common null space of A^T A and B^T B. The ``trivial eigenvalues'' are not well defined.
A single driver routine xGGSVD computes the generalized singular value decomposition of A and B (see Table 2.6). The computation is based on the method described in [12] [10] [62].
-----------------------------------------------------------------
Type of   Function and           Single precision  Double precision
problem   storage scheme         real     complex  real     complex
-----------------------------------------------------------------
GSEP      simple driver          SSYGV    CHEGV    DSYGV    ZHEGV
          simple driver          SSPGV    CHPGV    DSPGV    ZHPGV
            (packed storage)
          simple driver          SSBGV    CHBGV    DSBGV    ZHBGV
            (band matrices)
-----------------------------------------------------------------
GNEP      simple driver for      SGEGS    CGEGS    DGEGS    ZGEGS
            Schur factorization
          simple driver for      SGEGV    CGEGV    DGEGV    ZGEGV
            eigenvalues/vectors
-----------------------------------------------------------------
GSVD      singular values/       SGGSVD   CGGSVD   DGGSVD   ZGGSVD
            vectors
-----------------------------------------------------------------
Table 2.6: Driver routines for generalized eigenvalue and singular value problems
The development of LAPACK was a natural step after specifications of the Level 2 and 3 BLAS were drawn up in 1984-86 and 1987-88. Research on block algorithms had been ongoing for several years, but agreement on the BLAS made it possible to construct a new software package to take the place of LINPACK and EISPACK, which would achieve much greater efficiency on modern high-performance computers. This also seemed to be a good time to implement a number of algorithmic advances that had been made since LINPACK and EISPACK were written in the 1970's. The proposal for LAPACK was submitted while the Level 3 BLAS were still being developed and funding was obtained from the National Science Foundation (NSF) beginning in 1987.
LAPACK is more than just a more efficient update of its popular predecessors. It extends the functionality of LINPACK and EISPACK by including: driver routines for linear systems; equilibration, iterative refinement and error bounds for linear systems; routines for computing and re-ordering the Schur factorization; and condition estimation routines for eigenvalue problems. LAPACK improves on the accuracy of the standard algorithms in EISPACK by including high accuracy algorithms for finding singular values and eigenvalues of bidiagonal and tridiagonal matrices, respectively, that arise in SVD and symmetric eigenvalue problems.
We have tried to be consistent with our documentation and coding style throughout LAPACK in the hope that LAPACK will serve as a model for other software development efforts. In particular, we hope that LAPACK and this guide will be of value in the classroom. But above all, LAPACK has been designed to be used for serious computation, especially as a source of building blocks for larger applications.
The LAPACK project has been a research project on achieving good performance in a portable way over a large class of modern computers. This goal has been achieved, subject to the following qualifications. For optimal performance, it is necessary, first, that the BLAS are implemented efficiently on the target machine, and second, that a small number of tuning parameters (such as the block size) have been set to suitable values (reasonable default values are provided). Most of the LAPACK code is written in standard Fortran 77, but the double precision complex data type is not part of the standard, so we have had to make some assumptions about the names of intrinsic functions that do not hold on all machines (see section 6.1). Finally, our rigorous testing suite included test problems scaled at the extremes of the arithmetic range, which can vary greatly from machine to machine. On some machines, we have had to restrict the range more than on others.
Since most of the performance improvements in LAPACK come from restructuring the algorithms to use the Level 2 and 3 BLAS, we benefited greatly by having access from the early stages of the project to a complete set of BLAS developed for the CRAY machines by Cray Research. Later, the BLAS library developed by IBM for the IBM RISC/6000 was very helpful in proving the worth of block algorithms and LAPACK on ``super-scalar'' workstations. Many of our test sites, both computer vendors and research institutions, also worked on optimizing the BLAS and thus helped to get good performance from LAPACK. We are very pleased at the extent to which the user community has embraced the BLAS, not only for performance reasons, but also because we feel developing software around a core set of common routines like the BLAS is good software engineering practice.
A number of technical reports were written during the development of LAPACK and published as LAPACK Working Notes, initially by Argonne National Laboratory and later by the University of Tennessee. Many of these reports later appeared as journal articles. Appendix E lists the LAPACK Working Notes, and the Bibliography gives the most recent published reference.
A follow-on project, LAPACK 2, has been funded in the U.S. by the NSF and DARPA. One of its aims will be to add a modest amount of additional functionality to the current LAPACK package - for example, routines for the generalized SVD and additional routines for generalized eigenproblems. These routines will be included in a future release of LAPACK when they are available. LAPACK 2 will also produce routines which implement LAPACK-type algorithms for distributed memory machines, routines which take special advantage of IEEE arithmetic, and versions of parts of LAPACK in C and Fortran 90. The precise form of these other software packages which will result from LAPACK 2 has not yet been decided.
As the successor to LINPACK and EISPACK, LAPACK has drawn heavily on both the software and documentation from those collections. The test and timing software for the Level 2 and 3 BLAS was used as a model for the LAPACK test and timing software, and in fact the LAPACK timing software includes the BLAS timing software as a subset. Formatting of the software and conversion from single to double precision was done using Toolpack/1 [66], which was indispensable to the project. We owe a great debt to our colleagues who have helped create the infrastructure of scientific computing on which LAPACK has been built.
The development of LAPACK was primarily supported by NSF grant ASC-8715728. Zhaojun Bai had partial support from DARPA grant F49620-87-C0065; Christian Bischof was supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under contract W-31-109-Eng-38; James Demmel had partial support from NSF grant DCR-8552474; and Jack Dongarra had partial support from the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contract DE-AC05-84OR21400.
The cover was designed by Alan Edelman at UC Berkeley who discovered the matrix by performing Gaussian elimination on a certain 20-by-20 Hadamard matrix.
We acknowledge with gratitude the support which we have received from the following organizations, and the help of individual members of their staff: Cornell Theory Center; Cray Research Inc.; IBM ECSEC Rome; IBM Scientific Center, Bergen; NAG Ltd.
We also thank many, many people who have contributed code, criticism, ideas and encouragement. We wish especially to acknowledge the contributions of: Mario Arioli, Mir Assadullah, Jesse Barlow, Mel Ciment, Percy Deift, Augustin Dubrulle, Iain Duff, Alan Edelman, Victor Eijkhout, Sam Figueroa, Pat Gaffney, Nick Higham, Liz Jessup, Bo Kågström, Velvel Kahan, Linda Kaufman, L.-C. Li, Bob Manchek, Peter Mayes, Cleve Moler, Beresford Parlett, Mick Pont, Giuseppe Radicati, Tom Rowan, Pete Stewart, Peter Tang, Carlos Tomei, Charlie Van Loan, Kresimir Veselic, Phuong Vu, and Reed Wade.
Finally we thank all the test sites who received three preliminary distributions of LAPACK software and who ran an extensive series of test programs and timing programs for us; their efforts have influenced the final version of the package in numerous ways.
We use the standard notation for a system of simultaneous linear equations:

    Ax = b                                                  (2.4)

where A is the coefficient matrix, b is the right hand side, and x is the solution. In (2.4) A is assumed to be a square matrix of order n, but some of the individual routines allow A to be rectangular. If there are several right hand sides, we write

    AX = B
where the columns of B are the individual right hand sides, and the columns of X are the corresponding solutions. The basic task is to compute X, given A and B.
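In NumPy terms (an illustration, not the LAPACK calling sequence), a multiple-right-hand-side solve factorizes A once and applies the factorization to every column of B:

```python
import numpy as np

rng = np.random.default_rng(5)
n, nrhs = 4, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, nrhs))   # three right hand sides as columns

X = np.linalg.solve(A, B)            # one factorization of A serves every column
print(np.allclose(A @ X, B))
```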
If A is upper or lower triangular, (2.4) can be solved by a straightforward process of backward or forward substitution. Otherwise, the solution is obtained after first factorizing A as a product of triangular matrices (and possibly also a diagonal matrix or permutation matrix).
The form of the factorization depends on the properties of the matrix A. LAPACK provides routines for the following types of matrices, based on the stated factorizations:

general matrices (LU factorization with partial pivoting):

    A = PLU

where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n).

general band matrices (LU factorization with partial pivoting):

    A = LU

where L is a product of permutation and unit lower triangular matrices with kl subdiagonals, and U is upper triangular with kl + ku superdiagonals.

symmetric and Hermitian positive definite matrices (Cholesky factorization):

    A = U^T U  or  A = LL^T    (A = U^H U or A = LL^H in the complex case)

where U is an upper triangular matrix and L is lower triangular.

symmetric and Hermitian positive definite tridiagonal matrices:

    A = U^T DU  or  A = LDL^T    (A = U^H DU or A = LDL^H in the complex case)

where U is a unit upper bidiagonal matrix, L is unit lower bidiagonal, and D is diagonal.

symmetric and Hermitian indefinite matrices:

    A = UDU^T  or  A = LDL^T    (A = UDU^H or A = LDL^H in the complex case)

where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with diagonal blocks of order 1 or 2.
The factorization for a general tridiagonal matrix is like that for a general band matrix with kl = 1 and ku = 1. The factorization for a symmetric positive definite band matrix with k superdiagonals (or subdiagonals) has the same form as for a symmetric positive definite matrix, but the factor U (or L) is a band matrix with k superdiagonals (subdiagonals). Band matrices use a compact band storage scheme described in section 5.3.3. LAPACK routines are also provided for symmetric matrices (whether positive definite or indefinite) using packed storage, as described in section 5.3.2.
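Two of these factorizations can be illustrated quickly with SciPy (random data; this shows the mathematics, not the xGETRF/xPOTRF interfaces):

```python
import numpy as np
from scipy.linalg import lu, cholesky

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))

# General matrix: A = P L U with L unit lower triangular.
P, L, U = lu(A)
print(np.allclose(P @ L @ U, A))
print(np.allclose(np.diag(L), 1.0))

# Symmetric positive definite matrix: Cholesky factorization S = U^T U.
S = A @ A.T + 5 * np.eye(5)
Uc = cholesky(S)                     # upper triangular factor
print(np.allclose(Uc.T @ Uc, S))
```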
While the primary use of a matrix factorization is to solve a system of equations, other related tasks are provided as well. Wherever possible, LAPACK provides routines to perform each of these tasks for each type of matrix and storage scheme (see Tables 2.7 and 2.8). The following list relates the tasks to the last 3 characters of the name of the corresponding computational routine:
Note that some of the above routines depend on the output of others:
The RFS (``refine solution'') routines perform iterative refinement and compute backward and forward error bounds for the solution. Iterative refinement is done in the same precision as the input data. In particular, the residual is not computed with extra precision, as has been traditionally done. The benefit of this procedure is discussed in Section 4.4.
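The refinement loop itself is short; a minimal Python sketch of the idea (not the RFS implementation, which reuses the saved factorization of A and also produces error bounds):

```python
import numpy as np

def refine(A, b, x, steps=2):
    # Iterative refinement: the residual r = b - A x is computed in the
    # same (working) precision as the data, as in the RFS routines.
    # LAPACK reuses the saved factorization of A for each correction
    # solve; np.linalg.solve refactorizes and is used here only to keep
    # the sketch short.
    for _ in range(steps):
        r = b - A @ x
        x = x + np.linalg.solve(A, r)
    return x

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)
x = refine(A, b, np.linalg.solve(A, b))
print(np.linalg.norm(b - A @ x) < 1e-8)
```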
--------------------------------------------------------------------------------
Type of matrix                                   Single precision  Double precision
and storage scheme   Operation                   real     complex  real     complex
--------------------------------------------------------------------------------
general              factorize                   SGETRF   CGETRF   DGETRF   ZGETRF
                     solve using factorization   SGETRS   CGETRS   DGETRS   ZGETRS
                     estimate condition number   SGECON   CGECON   DGECON   ZGECON
                     error bounds for solution   SGERFS   CGERFS   DGERFS   ZGERFS
                     invert using factorization  SGETRI   CGETRI   DGETRI   ZGETRI
                     equilibrate                 SGEEQU   CGEEQU   DGEEQU   ZGEEQU
--------------------------------------------------------------------------------
general band         factorize                   SGBTRF   CGBTRF   DGBTRF   ZGBTRF
                     solve using factorization   SGBTRS   CGBTRS   DGBTRS   ZGBTRS
                     estimate condition number   SGBCON   CGBCON   DGBCON   ZGBCON
                     error bounds for solution   SGBRFS   CGBRFS   DGBRFS   ZGBRFS
                     equilibrate                 SGBEQU   CGBEQU   DGBEQU   ZGBEQU
--------------------------------------------------------------------------------
general              factorize                   SGTTRF   CGTTRF   DGTTRF   ZGTTRF
tridiagonal          solve using factorization   SGTTRS   CGTTRS   DGTTRS   ZGTTRS
                     estimate condition number   SGTCON   CGTCON   DGTCON   ZGTCON
                     error bounds for solution   SGTRFS   CGTRFS   DGTRFS   ZGTRFS
--------------------------------------------------------------------------------
symmetric/Hermitian  factorize                   SPOTRF   CPOTRF   DPOTRF   ZPOTRF
positive definite    solve using factorization   SPOTRS   CPOTRS   DPOTRS   ZPOTRS
                     estimate condition number   SPOCON   CPOCON   DPOCON   ZPOCON
                     error bounds for solution   SPORFS   CPORFS   DPORFS   ZPORFS
                     invert using factorization  SPOTRI   CPOTRI   DPOTRI   ZPOTRI
                     equilibrate                 SPOEQU   CPOEQU   DPOEQU   ZPOEQU
--------------------------------------------------------------------------------
symmetric/Hermitian  factorize                   SPPTRF   CPPTRF   DPPTRF   ZPPTRF
positive definite    solve using factorization   SPPTRS   CPPTRS   DPPTRS   ZPPTRS
(packed storage)     estimate condition number   SPPCON   CPPCON   DPPCON   ZPPCON
                     error bounds for solution   SPPRFS   CPPRFS   DPPRFS   ZPPRFS
                     invert using factorization  SPPTRI   CPPTRI   DPPTRI   ZPPTRI
                     equilibrate                 SPPEQU   CPPEQU   DPPEQU   ZPPEQU
--------------------------------------------------------------------------------
symmetric/Hermitian  factorize                   SPBTRF   CPBTRF   DPBTRF   ZPBTRF
positive definite    solve using factorization   SPBTRS   CPBTRS   DPBTRS   ZPBTRS
band                 estimate condition number   SPBCON   CPBCON   DPBCON   ZPBCON
                     error bounds for solution   SPBRFS   CPBRFS   DPBRFS   ZPBRFS
                     equilibrate                 SPBEQU   CPBEQU   DPBEQU   ZPBEQU
--------------------------------------------------------------------------------
symmetric/Hermitian  factorize                   SPTTRF   CPTTRF   DPTTRF   ZPTTRF
positive definite    solve using factorization   SPTTRS   CPTTRS   DPTTRS   ZPTTRS
tridiagonal          estimate condition number   SPTCON   CPTCON   DPTCON   ZPTCON
                     error bounds for solution   SPTRFS   CPTRFS   DPTRFS   ZPTRFS
--------------------------------------------------------------------------------
Table 2.7: Computational routines for linear equations
Table 2.8: Computational routines for linear equations (continued)
LAPACK provides a number of routines for factorizing a general rectangular m-by-n matrix A, as the product of an orthogonal matrix (unitary if complex) and a triangular (or possibly trapezoidal) matrix.
A real matrix Q is orthogonal if Q^T Q = I; a complex matrix Q is unitary if Q^H Q = I. Orthogonal or unitary matrices have the important property that they leave the two-norm of a vector invariant:

    || Qx ||_2 = || x ||_2.
As a result, they help to maintain numerical stability because they do not amplify rounding errors.
Orthogonal factorizations are used in the solution of linear least squares problems . They may also be used to perform preliminary steps in the solution of eigenvalue or singular value problems.
The most common, and best known, of the factorizations is the QR factorization given by

    A = Q ( R )
          ( 0 )       if m >= n,

where R is an n-by-n upper triangular matrix and Q is an m-by-m orthogonal (or unitary) matrix. If A is of full rank n, then R is nonsingular. It is sometimes convenient to write the factorization as

    A = ( Q1  Q2 ) ( R )
                   ( 0 )

which reduces to

    A = Q1 R

where Q1 consists of the first n columns of Q, and Q2 the remaining m - n columns.

If m < n, R is trapezoidal, and the factorization can be written

    A = Q ( R1  R2 )

where R1 is upper triangular and R2 is rectangular.
The routine xGEQRF computes the QR factorization. The matrix Q is not formed explicitly, but is represented as a product of elementary reflectors, as described in section 5.4. Users need not be aware of the details of this representation, because associated routines are provided to work with Q: xORGQR (or xUNGQR in the complex case) can generate all or part of Q, while xORMQR (or xUNMQR) can pre- or post-multiply a given matrix by Q or Q^T (Q^H if complex).
The QR factorization can be used to solve the linear least squares problem (2.1) when m >= n and A is of full rank, since

    || b - Ax ||_2 = || Q^T b - ( R ) x ||
                     ||         ( 0 )   ||_2.

Writing c = Q^T b, partitioned so that c1 holds the first n elements of c and c2 the remaining m - n elements, c can be computed by xORMQR (or xUNMQR). Then x is the solution of the upper triangular system

    R x = c1

which can be computed by xTRTRS. The residual vector r is given by

    r = b - Ax = Q ( 0  )
                   ( c2 )

and may be computed using xORMQR (or xUNMQR). The residual sum of squares may be computed without forming r explicitly, since

    || r ||_2 = || b - Ax ||_2 = || c2 ||_2.
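The same computation in Python/SciPy terms (an illustration on random data; the QR routines here stand in for xGEQRF/xORMQR/xTRTRS):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(8)
m, n = 7, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Q, R = qr(A)                       # full QR: Q is m-by-m, R is m-by-n
c = Q.T @ b                        # c = Q^T b; c1 = c[:n], c2 = c[n:]
x = solve_triangular(R[:n, :n], c[:n])   # solve R x = c1

# The residual norm equals ||c2||_2, without forming r explicitly.
r = b - A @ x
print(np.allclose(np.linalg.norm(r), np.linalg.norm(c[n:])))
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
```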
The LQ factorization is given by

    A = ( L  0 ) Q = L Q1       if m <= n,

where L is m-by-m lower triangular, Q is n-by-n orthogonal (or unitary), Q1 consists of the first m rows of Q, and Q2 the remaining n - m rows.
This factorization is computed by the routine xGELQF, and again Q is represented as a product of elementary reflectors; xORGLQ (or xUNGLQ in the complex case) can generate all or part of Q, and xORMLQ (or xUNMLQ) can pre- or post-multiply a given matrix by Q or Q^T (Q^H if Q is complex).
The LQ factorization of A is essentially the same as the QR factorization of A^T (A^H if A is complex), since

    A = ( L  0 ) Q    implies    A^T = Q^T ( L^T )
                                           (  0  ).
The LQ factorization may be used to find a minimum norm solution of an underdetermined system of linear equations Ax = b where A is m-by-n with m < n and has rank m. The solution is given by

    x = Q1^T L^{-1} b
and may be computed by calls to xTRTRS and xORMLQ.
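Since the LQ factorization of A is the QR factorization of A^T, the minimum norm solution can be sketched in Python/SciPy as follows (random underdetermined system; not the xGELQF interface):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(9)
m, n = 3, 6                        # underdetermined: m < n, rank m
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# The LQ factorization of A is the QR factorization of A^T:
# A^T = Qt Rt, so A = L Q1 with L = Rt^T and Q1 = Qt^T.
Qt, Rt = qr(A.T, mode='economic')
y = solve_triangular(Rt.T, b, lower=True)   # solve L y = b
x = Qt @ y                                  # x = Q1^T y, minimum norm

print(np.allclose(A @ x, b))
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
```

The second check compares against the least squares solver, which also returns the minimum norm solution for underdetermined systems.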
To solve a linear least squares problem (2.1) when A is not of full rank, or the rank of A is in doubt, we can perform either a QR factorization with column pivoting or a singular value decomposition (see subsection 2.3.6).
The QR factorization with column pivoting is given by

    A P = Q R

where Q and R are as before and P is a permutation matrix, chosen (in general) so that

    | r_11 | >= | r_22 | >= ... >= | r_nn |

and moreover, for each k,

    | r_kk | >= || R_{k:j, j} ||_2    for j = k+1, ..., n.
In exact arithmetic, if rank(A) = k, then the whole of the submatrix R_22 in rows and columns k + 1 to n would be zero. In numerical computation, the aim must be to determine an index k such that the leading submatrix R_11 in the first k rows and columns is well-conditioned, and R_22 is negligible:

    R = ( R_11  R_12 )
        (  0    R_22 ),    with R_22 negligible.
Then k is the effective rank of A. See Golub and Van Loan [45] for a further discussion of numerical rank determination.
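A short SciPy experiment shows the idea: for a matrix of known rank, the pivoted R has decreasing diagonal magnitudes that drop to roundoff level past the rank (random rank-3 data; the tolerance 1e-10 below is an illustrative choice, not a LAPACK default):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(10)
m, n, true_rank = 8, 5, 3
A = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))

Q, R, piv = qr(A, pivoting=True)        # A[:, piv] = Q R
d = np.abs(np.diag(R))
print(np.all(d[:-1] >= d[1:] - 1e-12))  # |r_11| >= |r_22| >= ...

# Effective numerical rank: count diagonal entries above a tolerance.
rank = int(np.sum(d > d[0] * 1e-10))
print(rank == true_rank)
```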
The so-called basic solution to the linear least squares problem (2.1) can be obtained from this factorization as

    x = P ( R_11^{-1} c_1 )
          (       0       )

where c_1 consists of just the first k elements of c = Q^T b.
The routine xGEQPF computes the QR factorization with column pivoting, but does not attempt to determine the rank of A. The matrix Q is represented in exactly the same way as after a call of xGEQRF, and so the routines xORGQR and xORMQR can be used to work with Q (xUNGQR and xUNMQR if Q is complex).
The QR factorization with column pivoting does not enable us to compute a minimum norm solution to a rank-deficient linear least squares problem, unless R_12 = 0. However, by applying further orthogonal (or unitary) transformations from the right to the upper trapezoidal matrix ( R_11  R_12 ), using the routine xTZRQF, R_12 can be eliminated:

    ( R_11  R_12 ) Z = ( T_11  0 ).

This gives the complete orthogonal factorization

    A P = Q ( T_11  0 ) Z
            (  0    0 )

from which the minimum norm solution can be obtained as

    x = P Z^T ( T_11^{-1} c_1 )
              (       0       ).
The QL and RQ factorizations are given by

    A = Q ( 0 )
          ( L )       if m >= n,

and

    A = ( 0  R ) Q    if m <= n.
These factorizations are computed by xGEQLF and xGERQF, respectively; they are less commonly used than either the QR or LQ factorizations described above, but have applications in, for example, the computation of generalized QR factorizations [2].
All the factorization routines discussed here (except xTZRQF) allow arbitrary m and n, so that in some cases the matrices R or L are trapezoidal rather than triangular. A routine that performs pivoting is provided only for the QR factorization.
---------------------------------------------------------------------------
Type of factorization                      Single precision  Double precision
and matrix        Operation                real     complex  real     complex
---------------------------------------------------------------------------
QR, general       factorize with pivoting  SGEQPF   CGEQPF   DGEQPF   ZGEQPF
                  factorize, no pivoting   SGEQRF   CGEQRF   DGEQRF   ZGEQRF
                  generate Q               SORGQR   CUNGQR   DORGQR   ZUNGQR
                  multiply matrix by Q     SORMQR   CUNMQR   DORMQR   ZUNMQR
---------------------------------------------------------------------------
LQ, general       factorize, no pivoting   SGELQF   CGELQF   DGELQF   ZGELQF
                  generate Q               SORGLQ   CUNGLQ   DORGLQ   ZUNGLQ
                  multiply matrix by Q     SORMLQ   CUNMLQ   DORMLQ   ZUNMLQ
---------------------------------------------------------------------------
QL, general       factorize, no pivoting   SGEQLF   CGEQLF   DGEQLF   ZGEQLF
                  generate Q               SORGQL   CUNGQL   DORGQL   ZUNGQL
                  multiply matrix by Q     SORMQL   CUNMQL   DORMQL   ZUNMQL
---------------------------------------------------------------------------
RQ, general       factorize, no pivoting   SGERQF   CGERQF   DGERQF   ZGERQF
                  generate Q               SORGRQ   CUNGRQ   DORGRQ   ZUNGRQ
                  multiply matrix by Q     SORMRQ   CUNMRQ   DORMRQ   ZUNMRQ
---------------------------------------------------------------------------
RQ, trapezoidal   factorize, no pivoting   STZRQF   CTZRQF   DTZRQF   ZTZRQF
---------------------------------------------------------------------------
Table 2.9: Computational routines for orthogonal factorizations
The generalized QR (GQR) factorization of an n-by-m matrix A and an n-by-p matrix B is given by the pair of factorizations
A = QR and B = QTZ
where Q and Z are respectively n-by-n and p-by-p orthogonal matrices (or unitary matrices if A and B are complex). R has the form

    R = ( R11 )
        (  0  )       if n >= m,

or

    R = ( R11  R12 )  if n < m,

where R11 is upper triangular. T has the form

    T = ( 0  T12 )    if n <= p,

or

    T = ( T11 )
        ( T21 )       if n > p,

where T12 or T21 is upper triangular.
Note that if B is square and nonsingular, the GQR factorization of A and B implicitly gives the QR factorization of the matrix B^{-1} A:

    B^{-1} A = Z^T (T^{-1} R)

without explicitly computing the matrix inverse B^{-1} or the product B^{-1} A.
The routine xGGQRF computes the GQR factorization by first computing the QR factorization of A and then the RQ factorization of Q^T B (Q^H B if B is complex). The orthogonal (or unitary) matrices Q and Z can either be formed explicitly or just used to multiply another given matrix in the same way as the orthogonal (or unitary) matrix in the QR factorization (see section 2.3.2).
The GQR factorization was introduced in [63] [49]. The implementation of the GQR factorization here follows [2]. Further generalizations of the GQR factorization can be found in [25].
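The two-step construction can be reproduced with SciPy's QR and RQ routines (random data; a sketch of the mathematics rather than the xGGQRF interface):

```python
import numpy as np
from scipy.linalg import qr, rq

rng = np.random.default_rng(11)
n, m, p = 6, 4, 5
A = rng.standard_normal((n, m))
B = rng.standard_normal((n, p))

Q, R = qr(A)             # step 1: QR factorization A = Q R
T, Z = rq(Q.T @ B)       # step 2: RQ factorization Q^T B = T Z

print(np.allclose(Q @ R, A))                 # A = Q R
print(np.allclose(Q @ T @ Z, B))             # B = Q T Z
print(np.allclose(Z @ Z.T, np.eye(p)))       # Z orthogonal
```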
The GQR factorization can be used to solve the general (Gauss-Markov) linear model problem (GLM) (see (2.3) and [45, page 252]). Using the GQR factorization of A and B, we rewrite the equation d = Ax + By from (2.3) as

    Q^T d = Rx + TZy.
We partition this as
where the vector Q^T d in this partition can be computed by xORMQR (or xUNMQR).
The GLM problem is solved by setting
from which we obtain the desired solutions
which can be computed by xTRSV, xGEMV and xORMRQ (or xUNMRQ).
The generalized RQ (GRQ) factorization of an m-by-n matrix A and a p-by-n matrix B is given by the pair of factorizations
A = RQ and B = ZTQ
where Q and Z are respectively n-by-n and p-by-p orthogonal matrices (or unitary matrices if A and B are complex). R has the form

   R = ( 0  R12 )   if m <= n,      or      R = ( R11 )   if m > n,
                                                ( R21 )

where R12 or R21 is upper triangular. T has the form

   T = ( T11 )   if p >= n,      or      T = ( T11  T12 )   if p < n,
       (  0  )

where T11 is upper triangular.
Note that if B is square and nonsingular, the GRQ factorization of A and B implicitly gives the RQ factorization of the matrix A B^-1:

   A B^-1 = ( R T^-1 ) Z^H

without explicitly computing the matrix inverse B^-1 or the product A B^-1. (In the real case, Z^H is simply Z^T.)
The routine xGGRQF computes the GRQ factorization by first computing the RQ factorization of A and then the QR factorization of B Q^H. The orthogonal (or unitary) matrices Q and Z can either be formed explicitly or just used to multiply another given matrix in the same way as the orthogonal (or unitary) matrix in the RQ factorization (see section 2.3.2).
The GRQ factorization can be used to solve the linear equality-constrained least squares problem (LSE) (see ( 2.2) and Golub and Van Loan, page 567). We use the GRQ factorization of B and A (note that B and A have swapped roles), written as
B = TQ and A = ZRQ
We write the linear equality constraints Bx = d as:
TQx = d
which we partition as:

   ( 0  T12 ) ( x1 ; x2 ) = d,   where   ( x1 ; x2 ) = Q x.

Therefore x2 is the solution of the upper triangular system

   T12 x2 = d.

Furthermore,

   || c - A x ||_2 = || Z^H c - R Q x ||_2.

We partition this expression as:

   ( c1 )   ( R11  R12 ) ( x1 )
   (    ) - (          ) (    )
   ( c2 )   (  0   R22 ) ( x2 )

where ( c1 ; c2 ) = Z^H c, which can be computed by xORMQR (or xUNMQR).

To solve the LSE problem, we set

   R11 x1 + R12 x2 - c1 = 0,

which gives x1 as the solution of the upper triangular system

   R11 x1 = c1 - R12 x2.

Finally, the desired solution is given by

   x = Q^H ( x1 ; x2 ),

which can be computed by xORMRQ (or xUNMRQ).
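For comparison, the LSE problem can also be solved by null-space elimination, which is mathematically equivalent to the GRQ-based approach but is not the algorithm used by LAPACK. A sketch in Python (scipy.linalg is an assumption of this sketch, not part of this guide):

```python
import numpy as np
from scipy.linalg import null_space, lstsq

rng = np.random.default_rng(1)
m, n, p = 8, 5, 2
A = rng.standard_normal((m, n))   # minimize || c - A x ||_2 ...
c = rng.standard_normal(m)
B = rng.standard_normal((p, n))   # ... subject to B x = d
d = rng.standard_normal(p)

# A particular solution of the constraints B x = d ...
x0, *_ = lstsq(B, d)
# ... plus the general solution in the null space of B:
N = null_space(B)                 # columns span { v : B v = 0 }
# Unconstrained least squares in the reduced variables z:
z, *_ = lstsq(A @ N, c - A @ x0)
x = x0 + N @ z

assert np.allclose(B @ x, d)      # the equality constraints hold
```

The GRQ route used by LAPACK achieves the same elimination of the constraints with triangular factors instead of an explicit null-space basis.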
Let A be a real symmetric or complex Hermitian n-by-n matrix. A scalar λ is called an eigenvalue and a nonzero column vector z the corresponding eigenvector if Az = λz. λ is always real when A is real symmetric or complex Hermitian.
The basic task of the symmetric eigenproblem routines is to compute values of λ and, optionally, corresponding vectors z for a given matrix A.
This computation proceeds in the following stages:

1. The real symmetric or complex Hermitian matrix A is reduced to real tridiagonal form T. If A is real symmetric this decomposition is A = QTQ^T with Q orthogonal and T symmetric tridiagonal. If A is complex Hermitian, the decomposition is A = QTQ^H with Q unitary and T, as before, real symmetric tridiagonal.

2. Eigenvalues and eigenvectors of the real symmetric tridiagonal matrix T are computed. The eigenvalues of T are also the eigenvalues of A, and the eigenvectors of T may be transformed (by multiplication by Q) into eigenvectors of A.
In the real case, the decomposition is computed by one of the routines xSYTRD, xSPTRD, or xSBTRD, depending on how the matrix is stored (see Table 2.10). The complex analogues of these routines are called xHETRD, xHPTRD, and xHBTRD. The routine xSYTRD (or xHETRD) represents the matrix Q as a product of elementary reflectors, as described in section 5.4. The routine xORGTR (or in the complex case xUNGTR) is provided to form Q explicitly; this is needed in particular before calling xSTEQR to compute all the eigenvectors of A by the QR algorithm. The routine xORMTR (or in the complex case xUNMTR) is provided to multiply another matrix by Q without forming Q explicitly; this can be used to transform eigenvectors of T computed by xSTEIN back to eigenvectors of A.
When packed storage is used, the corresponding routines for forming Q or multiplying another matrix by Q are xOPGTR and xOPMTR (in the complex case, xUPGTR and xUPMTR).
When A is banded and xSBTRD (or xHBTRD) is used to reduce it to tridiagonal form, Q is determined as a product of Givens rotations, not as a product of elementary reflectors; if Q is required, it must be formed explicitly by the reduction routine. xSBTRD is based on the vectorizable algorithm due to Kaufman [57].
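As a numerical illustration of the tridiagonal reduction, the following Python sketch uses scipy.linalg.hessenberg (which calls the general reduction xGEHRD rather than the storage-efficient symmetric routine xSYTRD, so this is an assumption-laden stand-in, for illustration only). For a symmetric A the Hessenberg form is tridiagonal, and the orthogonal similarity preserves the eigenvalues:

```python
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                     # real symmetric

# For symmetric A the Hessenberg form is tridiagonal: A = Q T Q^T
T, Q = hessenberg(A, calc_q=True)

assert np.allclose(Q @ T @ Q.T, A)
# Entries outside the three central diagonals are (numerically) zero:
assert np.allclose(T - np.triu(np.tril(T, 1), -1), 0)
# The similarity transformation preserves the eigenvalues:
assert np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A))
```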
There are several routines for computing eigenvalues and eigenvectors of T, to cover the cases of computing some or all of the eigenvalues, and some or all of the eigenvectors. In addition, some routines run faster in some computing environments or for some matrices than for others. Also, some routines are more accurate than other routines.
See Table 2.10.
------------------------------------------------------------------------------
Type of matrix                                   Single precision  Double precision
and storage scheme   Operation                   real    complex   real    complex
------------------------------------------------------------------------------
dense symmetric      tridiagonal reduction       SSYTRD  CHETRD    DSYTRD  ZHETRD
(or Hermitian)
------------------------------------------------------------------------------
packed symmetric     tridiagonal reduction       SSPTRD  CHPTRD    DSPTRD  ZHPTRD
(or Hermitian)
------------------------------------------------------------------------------
band symmetric       tridiagonal reduction       SSBTRD  CHBTRD    DSBTRD  ZHBTRD
(or Hermitian)
------------------------------------------------------------------------------
orthogonal/unitary   generate matrix after       SORGTR  CUNGTR    DORGTR  ZUNGTR
                     reduction by xSYTRD
                     multiply matrix after       SORMTR  CUNMTR    DORMTR  ZUNMTR
                     reduction by xSYTRD
------------------------------------------------------------------------------
orthogonal/unitary   generate matrix after       SOPGTR  CUPGTR    DOPGTR  ZUPGTR
(packed storage)     reduction by xSPTRD
                     multiply matrix after       SOPMTR  CUPMTR    DOPMTR  ZUPMTR
                     reduction by xSPTRD
------------------------------------------------------------------------------
symmetric            eigenvalues/                SSTEQR  CSTEQR    DSTEQR  ZSTEQR
tridiagonal          eigenvectors via QR
                     eigenvalues only            SSTERF            DSTERF
                     via root-free QR
                     eigenvalues only            SSTEBZ            DSTEBZ
                     via bisection
                     eigenvectors by             SSTEIN  CSTEIN    DSTEIN  ZSTEIN
                     inverse iteration
------------------------------------------------------------------------------
symmetric            eigenvalues/                SPTEQR  CPTEQR    DPTEQR  ZPTEQR
tridiagonal          eigenvectors
positive definite
------------------------------------------------------------------------------
Table 2.10: Computational routines for the symmetric eigenproblem
Let A be a square n-by-n matrix. A scalar λ is called an eigenvalue and a non-zero column vector v the corresponding right eigenvector if Av = λv. A nonzero column vector u satisfying u^H A = λ u^H is called the left eigenvector. The first basic task of the routines described in this section is to compute, for a given matrix A, all n values of λ and, if desired, their associated right eigenvectors v and/or left eigenvectors u.
A second basic task is to compute the Schur factorization of a matrix A. If A is complex, then its Schur factorization is A = ZTZ^H, where Z is unitary and T is upper triangular. If A is real, its Schur factorization is A = ZTZ^T, where Z is orthogonal and T is upper quasi-triangular (1-by-1 and 2-by-2 blocks on its diagonal). The columns of Z are called the Schur vectors of A. The eigenvalues of A appear on the diagonal of T; complex conjugate eigenvalues of a real A correspond to 2-by-2 blocks on the diagonal of T.
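Both forms of the Schur factorization can be illustrated with scipy.linalg.schur, a wrapper over the LAPACK xGEES driver (SciPy is an assumption of this sketch, not part of this guide):

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

# Real Schur form: A = Z T Z^T with T upper quasi-triangular
T, Z = schur(A, output='real')
assert np.allclose(Z @ T @ Z.T, A)
assert np.allclose(Z @ Z.T, np.eye(4))   # Z is orthogonal

# Complex Schur form: T is strictly upper triangular
Tc, Zc = schur(A.astype(complex), output='complex')
assert np.allclose(Zc @ Tc @ Zc.conj().T, A)
assert np.allclose(np.tril(Tc, -1), 0)
```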
These two basic tasks can be performed in the following stages:

1. A general matrix A is reduced to upper Hessenberg form H, which is zero below the first subdiagonal: A = QHQ^T (or A = QHQ^H if A is complex), where Q is orthogonal (or unitary).

2. The upper Hessenberg matrix H is reduced to Schur form T, giving the Schur factorization H = STS^T (or H = STS^H). The matrix S (whose columns are the Schur vectors of H) may optionally be computed as well, and may be combined with the matrix Q from stage 1 to give Z = QS, the Schur vectors of A. The eigenvalues are obtained from the diagonal of T.

3. Given the eigenvalues, the corresponding eigenvectors may be computed either by inverse iteration on H, or from the (quasi-)triangular factor T.
Other subsidiary tasks may be performed before or after those just described.
The routine xGEBAL may be used to balance the matrix A prior to reduction to Hessenberg form. Balancing involves two steps, either of which is optional:

First, xGEBAL attempts to find a permutation matrix P such that

   P A P^T = A' = ( A'11  A'12  A'13 )
                  (  0    A'22  A'23 )
                  (  0     0    A'33 )

where P is a permutation matrix and A'11 and A'33 are upper triangular. Thus the matrix is already in Schur form outside the central diagonal block A'22 in rows and columns ILO to IHI. Subsequent operations by xGEBAL, xGEHRD or xHSEQR need only be applied to these rows and columns; therefore ILO and IHI are passed as arguments to xGEHRD and xHSEQR. This can save a significant amount of work if ILO > 1 or IHI < n. If no suitable permutation can be found (as is very often the case), xGEBAL sets ILO = 1 and IHI = n, and A'22 is the whole of A.

Secondly, xGEBAL applies a diagonal similarity transformation to make the rows and columns of A'22 as close in norm as possible. This can improve the accuracy of later processing in some cases; see subsection 4.8.1.2.
If A was balanced by xGEBAL, then eigenvectors computed by subsequent operations are eigenvectors of the balanced matrix A'; xGEBAK must then be called to transform them back to eigenvectors of the original matrix A.
The Schur form depends on the order of the eigenvalues on the diagonal of T, and this may optionally be chosen by the user. Suppose the user chooses that λ1, ..., λj, 1 <= j <= n, appear in the upper left corner of T. Then the first j columns of Z span the right invariant subspace of A corresponding to λ1, ..., λj.
The following routines perform this re-ordering and also compute condition numbers for eigenvalues, eigenvectors, and invariant subspaces:
See Table 2.11 for a complete list of the routines.
-----------------------------------------------------------------------------
Type of matrix                                   Single precision  Double precision
and storage scheme   Operation                   real    complex   real    complex
-----------------------------------------------------------------------------
general              Hessenberg reduction        SGEHRD  CGEHRD    DGEHRD  ZGEHRD
                     balancing                   SGEBAL  CGEBAL    DGEBAL  ZGEBAL
                     backtransforming            SGEBAK  CGEBAK    DGEBAK  ZGEBAK
-----------------------------------------------------------------------------
orthogonal/unitary   generate matrix after       SORGHR  CUNGHR    DORGHR  ZUNGHR
                     Hessenberg reduction
                     multiply matrix after       SORMHR  CUNMHR    DORMHR  ZUNMHR
                     Hessenberg reduction
-----------------------------------------------------------------------------
Hessenberg           Schur factorization         SHSEQR  CHSEQR    DHSEQR  ZHSEQR
                     eigenvectors by             SHSEIN  CHSEIN    DHSEIN  ZHSEIN
                     inverse iteration
-----------------------------------------------------------------------------
(quasi)triangular    eigenvectors                STREVC  CTREVC    DTREVC  ZTREVC
                     reordering Schur            STREXC  CTREXC    DTREXC  ZTREXC
                     factorization
                     Sylvester equation          STRSYL  CTRSYL    DTRSYL  ZTRSYL
                     condition numbers of        STRSNA  CTRSNA    DTRSNA  ZTRSNA
                     eigenvalues/vectors
                     condition numbers of        STRSEN  CTRSEN    DTRSEN  ZTRSEN
                     eigenvalue cluster/
                     invariant subspace
-----------------------------------------------------------------------------
Table 2.11: Computational routines for the nonsymmetric eigenproblem
Let A be a general real m-by-n matrix. The singular value decomposition (SVD) of A is the factorization A = UΣV^T, where U and V are orthogonal, and Σ = diag(σ1, ..., σr), r = min(m, n), with σ1 >= ... >= σr >= 0. If A is complex, then its SVD is A = UΣV^H, where U and V are unitary, and Σ is as before with real diagonal elements. The σi are called the singular values, the first r columns of V the right singular vectors and the first r columns of U the left singular vectors.
The routines described in this section, and listed in Table 2.12, are used to compute this decomposition. The computation proceeds in the following stages:

1. The matrix A is reduced to bidiagonal form: A = U1 B V1^T if A is real (A = U1 B V1^H if A is complex), where U1 and V1 are orthogonal (unitary if A is complex), and B is real and bidiagonal: upper bidiagonal when m >= n and lower bidiagonal when m < n, so that B is nonzero only on the main diagonal and either the first superdiagonal (if m >= n) or the first subdiagonal (if m < n).

2. The SVD of the bidiagonal matrix B is computed: B = U2 Σ V2^T, where U2 and V2 are orthogonal and Σ is diagonal as described above. The singular vectors of A are then U = U1 U2 and V = V1 V2.
The reduction to bidiagonal form is performed by the subroutine xGEBRD, or by xGBBRD for a band matrix.
The routine xGEBRD represents U1 and V1 in factored form as products of elementary reflectors, as described in section 5.4. If A is real, the matrices U1 and V1 may be computed explicitly using routine xORGBR, or multiplied by other matrices without forming U1 and V1 using routine xORMBR. If A is complex, one instead uses xUNGBR and xUNMBR, respectively.
If A is banded and xGBBRD is used to reduce it to bidiagonal form, U1 and V1 are determined as products of Givens rotations, rather than as products of elementary reflectors. If U1 or V1 is required, it must be formed explicitly by xGBBRD. xGBBRD uses a vectorizable algorithm, similar to that used by xSBTRD (see Kaufman [57]). xGBBRD may be much faster than xGEBRD when the bandwidth is narrow.
The SVD of the bidiagonal matrix is computed by the subroutine xBDSQR. xBDSQR is more accurate than its counterparts in LINPACK and EISPACK: barring underflow and overflow, it computes all the singular values of A to nearly full relative precision, independent of their magnitudes. It also computes the singular vectors much more accurately. See section 4.9 and [41] [16] [22] for details.
If m >> n, it may be more efficient to first perform a QR factorization of A, using the routine xGEQRF, and then to compute the SVD of the n-by-n matrix R, since if A = QR and R = U_R Σ V^T, then the SVD of A is given by A = (Q U_R) Σ V^T. Similarly, if m << n, it may be more efficient to first perform an LQ factorization of A, using xGELQF. These preliminary QR and LQ factorizations are performed by the driver xGESVD.
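A quick numerical check of this identity in Python with NumPy (an assumption of this sketch, not part of this guide): the singular values of the small triangular factor R agree with those of the tall matrix A:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 200, 5                         # m >> n
A = rng.standard_normal((m, n))

# Direct SVD of the tall matrix A
s_direct = np.linalg.svd(A, compute_uv=False)

# QR first, then SVD of the small n-by-n factor R
Q, R = np.linalg.qr(A)                # reduced QR: R is n-by-n
s_via_qr = np.linalg.svd(R, compute_uv=False)

assert np.allclose(s_direct, s_via_qr)
```

Only an n-by-n SVD remains after the QR step, which is why the preliminary factorization pays off when m is much larger than n.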
The SVD may be used to find a minimum norm solution to a (possibly) rank-deficient linear least squares problem ( 2.1). The effective rank, k, of A can be determined as the number of singular values which exceed a suitable threshold. Let Σ_k be the leading k-by-k submatrix of Σ, and V_k be the matrix consisting of the first k columns of V. Then the solution is given by:

   x = V_k Σ_k^-1 c_k

where c_k consists of the first k elements of c = U^T b. U^T b can be computed using xORMBR, and xBDSQR has an option to multiply a vector by U^T.
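A sketch of this truncated-SVD solution in Python with NumPy (an assumption of this sketch, not part of this guide); it reproduces the minimum norm least squares solution given by the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 8, 5
# Build an m-by-n matrix of rank 3 (deliberately rank-deficient)
A = rng.standard_normal((m, 3)) @ rng.standard_normal((3, n))
b = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A)
tol = max(m, n) * np.finfo(float).eps * s[0]  # a suitable threshold
k = int(np.sum(s > tol))                      # effective rank

c = U.T @ b                                   # c = U^T b
x = Vt[:k].T @ (c[:k] / s[:k])                # x = V_k  Sigma_k^-1  c_k

# Agrees with the pseudoinverse (minimum norm least squares) solution:
assert np.allclose(x, np.linalg.pinv(A) @ b)
```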
-----------------------------------------------------------------------------
Type of matrix                                   Single precision  Double precision
and storage scheme   Operation                   real    complex   real    complex
-----------------------------------------------------------------------------
general              bidiagonal reduction        SGEBRD  CGEBRD    DGEBRD  ZGEBRD
-----------------------------------------------------------------------------
general band         bidiagonal reduction        SGBBRD  CGBBRD    DGBBRD  ZGBBRD
-----------------------------------------------------------------------------
orthogonal/unitary   generate matrix after       SORGBR  CUNGBR    DORGBR  ZUNGBR
                     bidiagonal reduction
                     multiply matrix after       SORMBR  CUNMBR    DORMBR  ZUNMBR
                     bidiagonal reduction
-----------------------------------------------------------------------------
bidiagonal           singular values/            SBDSQR  CBDSQR    DBDSQR  ZBDSQR
                     singular vectors
-----------------------------------------------------------------------------
Table 2.12: Computational routines for the singular value decomposition
This section is concerned with the solution of the generalized eigenvalue problems Az = λBz, ABz = λz, and BAz = λz, where A and B are real symmetric or complex Hermitian and B is positive definite. Each of these problems can be reduced to a standard symmetric eigenvalue problem, using a Cholesky factorization of B as either B = LL^T or B = U^T U (B = LL^H or B = U^H U in the Hermitian case). In the case Az = λBz, if A and B are banded then this may also be exploited to get a faster algorithm.
With B = LL^T, we have

   Az = λBz   =>   (L^-1 A L^-T)(L^T z) = λ (L^T z).

Hence the eigenvalues of Az = λBz are those of Cy = λy, where C is the symmetric matrix C = L^-1 A L^-T and y = L^T z. In the complex case C is Hermitian with C = L^-1 A L^-H and y = L^H z.
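A small numerical check of this reduction in Python with SciPy (an assumption of this sketch, not part of this guide): the eigenvalues of C = L^-1 A L^-T agree with the generalized eigenvalues of Az = λBz:

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                     # symmetric
B = M @ M.T + 4 * np.eye(4)           # symmetric positive definite

L = cholesky(B, lower=True)           # B = L L^T
# C = L^-1 A L^-T, formed with triangular solves instead of inverses
C = solve_triangular(L, solve_triangular(L, A, lower=True).T, lower=True).T
C = (C + C.T) / 2                     # symmetrize against rounding

# Eigenvalues of C match the generalized eigenvalues of A z = lambda B z:
assert np.allclose(np.linalg.eigvalsh(C), eigh(A, B, eigvals_only=True))
```

This is the transformation that xSYGST (or xHEGST) applies in place, given A and the Cholesky factor of B.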
Table 2.13 summarizes how each of the three types of problem may be reduced to standard form , and how the eigenvectors z of the original problem may be recovered from the eigenvectors y of the reduced problem. The table applies to real problems; for complex problems, transposed matrices must be replaced by conjugate-transposes.
Table 2.13: Reduction of generalized symmetric definite eigenproblems to standard problems
Given A and a Cholesky factorization of B, the routines xyyGST overwrite A with the matrix C of the corresponding standard problem (see Table 2.14). This may then be solved using the routines described in subsection 2.3.4. No special routines are needed to recover the eigenvectors z of the generalized problem from the eigenvectors y of the standard problem, because these computations are simple applications of Level 2 or Level 3 BLAS.
If the problem is Az = λBz and the matrices A and B are banded, the matrix C as defined above is, in general, full. We can reduce the problem to a banded standard problem by modifying the definition of C thus:

   C = X^T A X,   where   X = U^-1 Q   or   L^-T Q,

where Q is an orthogonal matrix chosen to ensure that C has bandwidth no greater than that of A. Q is determined as a product of Givens rotations. This is known as Crawford's algorithm (see Crawford [14]). If X is required, it must be formed explicitly by the reduction routine.
A further refinement is possible when A and B are banded, which halves the amount of work required to form C (see Wilkinson [79]). Instead of the standard Cholesky factorization of B as U^T U or LL^T, we use a ``split Cholesky'' factorization B = S^T S (B = S^H S if B is complex), where:

   S = ( U11      )
       ( M21  L22 )

with U11 upper triangular and L22 lower triangular of order approximately n / 2; S has the same bandwidth as B. After B has been factorized in this way by the routine xPBSTF, the reduction of the banded generalized problem to a banded standard problem is performed by the routine xSBGST (or xHBGST for complex matrices). This routine implements a vectorizable form of the algorithm, suggested by Kaufman [57].
--------------------------------------------------------------------
Type of matrix                              Single precision  Double precision
and storage scheme    Operation             real    complex   real    complex
--------------------------------------------------------------------
symmetric/Hermitian   reduction             SSYGST  CHEGST    DSYGST  ZHEGST
--------------------------------------------------------------------
symmetric/Hermitian   reduction             SSPGST  CHPGST    DSPGST  ZHPGST
(packed storage)
--------------------------------------------------------------------
symmetric/Hermitian   split Cholesky        SPBSTF  CPBSTF    DPBSTF  ZPBSTF
banded                factorization
                      reduction             SSBGST  CHBGST    DSBGST  ZHBGST
--------------------------------------------------------------------
Table 2.14: Computational routines for the generalized symmetric definite eigenproblem
Let A and B be n-by-n matrices. A scalar λ is called a generalized eigenvalue and a non-zero column vector x the corresponding right generalized eigenvector if Ax = λBx. A non-zero column vector y satisfying y^H A = λ y^H B (where the superscript H denotes conjugate-transpose) is called the left generalized eigenvector corresponding to λ. (For simplicity, we will usually omit the word ``generalized'' when no confusion is likely to arise.) If B is singular, we can have the infinite eigenvalue λ = ∞, by which we mean Bx = 0. Note that if A is non-singular, then the equivalent problem μAx = Bx is perfectly well-behaved, and the infinite eigenvalue corresponds to μ = 0. To deal with infinite eigenvalues, the LAPACK routines return two values, α and β, for each eigenvalue, with λ = α/β. The first basic task of these routines is to compute all n pairs (α, β) and x and/or y for a given pair of matrices A, B.
If the determinant of A - λB is zero for all values of λ, the eigenvalue problem is called singular; this is signaled by some α = β = 0 (in the presence of roundoff, α and β may be very small). In this case the eigenvalue problem is very ill-conditioned, and in fact some of the other nonzero values of α and β may be indeterminate [43] [21] [80] [71].
The other basic task is to compute the generalized Schur decomposition of the pair A, B. If A and B are complex, then their generalized Schur decomposition is A = QSZ^H and B = QPZ^H, where Q and Z are unitary and S and P are upper triangular. The LAPACK routines normalize P to have non-negative diagonal entries. Note that in this form, the eigenvalues can be easily computed from the diagonals: λ_i = s_ii / p_ii, and so the LAPACK routines return α_i = s_ii and β_i = p_ii. The generalized Schur form depends on the order in which the eigenvalues appear on the diagonal. In a future version of LAPACK, we will supply routines to allow the user to choose this order.
If A and B are real, then their generalized Schur decomposition is A = QSZ^T and B = QPZ^T, where Q and Z are orthogonal, P is upper triangular, and S is quasi-upper triangular with 1-by-1 and 2-by-2 blocks on the diagonal. The 1-by-1 blocks correspond to real generalized eigenvalues, while the 2-by-2 blocks correspond to complex conjugate pairs of generalized eigenvalues. In this case, P is normalized so that diagonal entries of P corresponding to 1-by-1 blocks of S are non-negative, while the (upper triangular) diagonal blocks of P corresponding to 2-by-2 blocks of S are made diagonal. The α and β values corresponding to real eigenvalues may be easily computed from the diagonals of S and P, just as in the complex case. The α and β values corresponding to complex eigenvalues are obtained by first computing λ, then computing the values that would result if the 2-by-2 diagonal blocks of S and P were upper triangularized using unitary transformations, and finally multiplying to get α and β.
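Both forms of the generalized Schur decomposition can be illustrated with scipy.linalg.qz, which wraps the LAPACK QZ routines (SciPy is an assumption of this sketch, not part of this guide):

```python
import numpy as np
from scipy.linalg import qz

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# Real generalized Schur form: A = Q S Z^T, B = Q P Z^T
S, P, Q, Z = qz(A, B, output='real')
assert np.allclose(Q @ S @ Z.T, A)
assert np.allclose(Q @ P @ Z.T, B)

# Complex form: S and P are both upper triangular,
# and the eigenvalues are the ratios of their diagonals
Sc, Pc, Qc, Zc = qz(A, B, output='complex')
assert np.allclose(Qc @ Sc @ Zc.conj().T, A)
assert np.allclose(np.tril(Sc, -1), 0)
assert np.allclose(np.tril(Pc, -1), 0)
lam = np.diag(Sc) / np.diag(Pc)   # generalized eigenvalues lambda_i
```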
The columns of Q and Z are called generalized Schur vectors and span pairs of deflating subspaces of A and B [72]. Deflating subspaces are a generalization of invariant subspaces: the first k columns of Z span a right deflating subspace mapped by both A and B into a left deflating subspace spanned by the first k columns of Q. This pair of deflating subspaces corresponds to the first k eigenvalues appearing at the top of S and P.
The computations proceed in the following stages:

1. The pair A, B is reduced to generalized upper Hessenberg form by xGGHRD: A is reduced to upper Hessenberg form and B to upper triangular form.

2. The QZ algorithm, implemented by xHGEQZ, computes the generalized Schur form of the pair, from which the eigenvalues (represented by the pairs (α, β)) may be obtained.

3. Left and/or right eigenvectors of the resulting (quasi-)triangular pair are computed by xTGEVC, and may optionally be transformed back to eigenvectors of the original pair A, B.
In addition, the routines xGGBAL and xGGBAK may be used to balance the pair A,B prior to reduction to generalized Hessenberg form. Balancing involves premultiplying A and B by one permutation and postmultiplying them by another, to try to make A,B as nearly triangular as possible, and then ``scaling'' the matrices by premultiplying A and B by one diagonal matrix and postmultiplying by another in order to make the rows and columns of A and B as close in norm to 1 as possible. These transformations can improve speed and accuracy of later processing in some cases; however, the scaling step can sometimes make things worse. Moreover, the scaling step will significantly change the generalized Schur form that results. xGGBAL performs the balancing, and xGGBAK back transforms the eigenvectors of the balanced matrix pair.
--------------------------------------------------------------------------
Type of matrix                                   Single precision  Double precision
and storage scheme   Operation                   real    complex   real    complex
--------------------------------------------------------------------------
general              Hessenberg reduction        SGGHRD  CGGHRD    DGGHRD  ZGGHRD
                     balancing                   SGGBAL  CGGBAL    DGGBAL  ZGGBAL
                     back transforming           SGGBAK  CGGBAK    DGGBAK  ZGGBAK
--------------------------------------------------------------------------
Hessenberg           Schur factorization         SHGEQZ  CHGEQZ    DHGEQZ  ZHGEQZ
--------------------------------------------------------------------------
(quasi)triangular    eigenvectors                STGEVC  CTGEVC    DTGEVC  ZTGEVC
--------------------------------------------------------------------------
Table 2.15: Computational routines for the generalized nonsymmetric eigenproblem
A future release of LAPACK will include the routines xTGEXC, xTGSYL, xTGSNA and xTGSEN, which are analogous to the routines xTREXC, xTRSYL, xTRSNA and xTRSEN. They will reorder eigenvalues in generalized Schur form, solve the generalized Sylvester equation, compute condition numbers of generalized eigenvalues and eigenvectors, and compute condition numbers of average eigenvalues and deflating subspaces.
The generalized (or quotient) singular value decomposition of an m-by-n matrix A and a p-by-n matrix B is described in section 2.2.5. The routines described in this section are used to compute the decomposition. The computation proceeds in the following two stages:
where the leading triangular blocks of the reduced A and B are nonsingular upper triangular, and the remaining nonzero block of the reduced A is upper triangular. If m - k - l < 0, the bottom zero block of the reduced A does not appear, and that block is upper trapezoidal. The transformation matrices applied to A and B are orthogonal (or unitary matrices if A and B are complex). l is the rank of B, and k + l is the rank of the matrix formed by stacking A on top of B.
Here U, V and Q are orthogonal (or unitary) matrices, C and S are both real nonnegative diagonal matrices satisfying C^2 + S^2 = I, S is nonsingular, and R is upper triangular and nonsingular.
--------------------------------------------------------
                          Single precision  Double precision
Operation                 real    complex   real    complex
--------------------------------------------------------
triangular reduction      SGGSVP  CGGSVP    DGGSVP  ZGGSVP
of A and B
--------------------------------------------------------
GSVD of a pair of         STGSJA  CTGSJA    DTGSJA  ZTGSJA
triangular matrices
--------------------------------------------------------
Table 2.16: Computational routines for the generalized singular value decomposition
The reduction to triangular form, performed by xGGSVP, uses QR decomposition with column pivoting for numerical rank determination. See [12] for details.
The generalized singular value decomposition of two triangular matrices, performed by xTGSJA, is done using a Jacobi-like method as described in [10] [62].
Note: this chapter presents some performance figures for LAPACK routines. The figures are provided for illustration only, and should not be regarded as a definitive up-to-date statement of performance. They have been selected from performance figures obtained in 1994 during the development of version 2.0 of LAPACK. All reported timings were obtained using the optimized version of the BLAS available on each machine. For the IBM computers, the ESSL BLAS were used. Performance is affected by many factors that may change from time to time, such as details of hardware (cycle time, cache size), compiler, and BLAS. To obtain up-to-date performance figures, use the timing programs provided with LAPACK.
Can we provide portable software for computations in dense linear algebra that is efficient on a wide range of modern high-performance computers? If so, how? Answering these questions - and providing the desired software - has been the goal of the LAPACK project.
LINPACK [26] and EISPACK [44] [70] have for many years provided high-quality portable software for linear algebra; but on modern high-performance computers they often achieve only a small fraction of the peak performance of the machines. Therefore, LAPACK has been designed to supersede LINPACK and EISPACK, principally by achieving much greater efficiency - but at the same time also adding extra functionality, using some new or improved algorithms, and integrating the two sets of algorithms into a single package.
LAPACK was originally targeted to achieve good performance on single-processor vector machines and on shared memory multiprocessor machines with a modest number of powerful processors. Since the start of the project, another class of machines has emerged for which LAPACK software is equally well suited: the high-performance ``super-scalar'' workstations. (LAPACK is intended to be used across the whole spectrum of modern computers, but when considering performance, the emphasis is on machines at the more powerful end of the spectrum.)
Here we discuss the main factors that affect the performance of linear algebra software on these classes of machines.
Designing vectorizable algorithms in linear algebra is usually straightforward. Indeed, for many computations there are several variants, all vectorizable, but with different characteristics in performance (see, for example, [33]). Linear algebra algorithms can come close to the peak performance of many machines - principally because peak performance depends on some form of chaining of vector addition and multiplication operations, and this is just what the algorithms require.
However, when the algorithms are realized in straightforward Fortran 77 code, the performance may fall well short of the expected level, usually because vectorizing Fortran compilers fail to minimize the number of memory references - that is, the number of vector load and store operations. This brings us to the next factor.
What often limits the actual performance of a vector or scalar floating-point unit is the rate of transfer of data between different levels of memory in the machine. Examples include: the transfer of vector operands in and out of vector registers, the transfer of scalar operands in and out of a high-speed scalar processor, the movement of data between main memory and a high-speed cache or local memory, and paging between actual memory and disk storage in a virtual memory system.
It is desirable to maximize the ratio of floating-point operations to memory references, and to re-use data as much as possible while it is stored in the higher levels of the memory hierarchy (for example, vector registers or high-speed cache).
A Fortran programmer has no explicit control over these types of data movement, although one can often influence them by imposing a suitable structure on an algorithm.
The nested loop structure of most linear algebra algorithms offers considerable scope for loop-based parallelism on shared memory machines. This is the principal type of parallelism that LAPACK at present aims to exploit. It can sometimes be generated automatically by a compiler, but often requires the insertion of compiler directives .
How then can we hope to be able to achieve sufficient control over vectorization, data movement, and parallelism in portable Fortran code, to obtain the levels of performance that machines can offer?
The LAPACK strategy for combining efficiency with portability is to construct the software as much as possible out of calls to the BLAS (Basic Linear Algebra Subprograms); the BLAS are used as building blocks.
The efficiency of LAPACK software depends on efficient implementations of the BLAS being provided by computer vendors (or others) for their machines. Thus the BLAS form a low-level interface between LAPACK software and different machine architectures. Above this level, almost all of the LAPACK software is truly portable.
There are now three levels of BLAS:

Level 1 BLAS: for vector operations, such as y <- αx + y
Level 2 BLAS: for matrix-vector operations, such as y <- αAx + βy
Level 3 BLAS: for matrix-matrix operations, such as C <- αAB + βC

Here, A, B and C are matrices, x and y are vectors, and α and β are scalars.
The Level 1 BLAS are used in LAPACK, but for convenience rather than for performance: they perform an insignificant fraction of the computation, and they cannot achieve high efficiency on most modern supercomputers.
The Level 2 BLAS can achieve near-peak performance on many vector processors, such as a single processor of a CRAY Y-MP, CRAY C90, or CONVEX C4 machine. However on other vector processors, such as a CRAY 2, or a RISC workstation, their performance is limited by the rate of data movement between different levels of memory.
This limitation is overcome by the Level 3 BLAS, which perform O(n^3) floating-point operations on O(n^2) data, whereas the Level 2 BLAS perform only O(n^2) operations on O(n^2) data.
The BLAS also allow us to exploit parallelism in a way that is transparent to the software that calls them. Even the Level 2 BLAS offer some scope for exploiting parallelism, but greater scope is provided by the Level 3 BLAS, as Table 3.1 illustrates.
Table 3.1: Speed in megaflops of Level 2 and Level 3 BLAS operations on a CRAY C90
It is comparatively straightforward to recode many of the algorithms in LINPACK and EISPACK so that they call Level 2 BLAS. Indeed, in the simplest cases the same floating-point operations are performed, possibly even in the same order: it is just a matter of reorganizing the software. To illustrate this point we derive the Cholesky factorization algorithm that is used in the LINPACK routine SPOFA, which factorizes a symmetric positive definite matrix as A = U^T U. Writing these equations in partitioned form, with u_j denoting the first j-1 elements of the j-th column of U:

   ( A11    a_j  )   ( U11^T    0   ) ( U11  u_j  )
   ( a_j^T  a_jj ) = ( u_j^T  u_jj  ) (  0   u_jj )

and equating coefficients of the j-th column, we obtain

   a_j  = U11^T u_j
   a_jj = u_j^T u_j + u_jj^2.

Hence, if U11 has already been computed, we can compute u_j and u_jj from the equations

   U11^T u_j = a_j,      u_jj = sqrt( a_jj - u_j^T u_j ).
Here is the body of the code of the LINPACK routine SPOFA , which implements the above method:
      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF (JM1 .LT. 1) GO TO 20
         DO 10 K = 1, JM1
            T = A(K,J) - SDOT(K-1,A(1,K),1,A(1,J),1)
            T = T/A(K,K)
            A(K,J) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A(J,J) - S
C     ......EXIT
         IF (S .LE. 0.0E0) GO TO 40
         A(J,J) = SQRT(S)
   30 CONTINUE
And here is the same computation recoded in ``LAPACK-style'' to use the Level 2 BLAS routine STRSV (which solves a triangular system of equations). The call to STRSV has replaced the loop over K which made several calls to the Level 1 BLAS routine SDOT. (For reasons given below, this is not the actual code used in LAPACK - hence the term ``LAPACK-style''.)
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-unit', J-1, A, LDA,
     $               A(1,J), 1 )
         S = A(J,J) - SDOT( J-1, A(1,J), 1, A(1,J), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A(J,J) = SQRT( S )
   10 CONTINUE
This change by itself is sufficient to make big gains in performance on a number of machines.
For example, on an IBM RISC Sys/6000-550 (using double precision) there is virtually no difference in performance between the LINPACK-style and the LAPACK-style code: both run at a megaflop rate far below the machine's peak performance for matrix-matrix multiplication. To exploit the faster speed of Level 3 BLAS, the algorithms must undergo a deeper level of restructuring, and be re-cast as a block algorithm - that is, an algorithm that operates on blocks or submatrices of the original matrix.
To derive a block form of Cholesky factorization, we write the defining equation in partitioned form thus:

   ( A11  A12  A13 )   ( U11^T    0      0   ) ( U11  U12  U13 )
   (  .   A22  A23 ) = ( U12^T  U22^T    0   ) (  0   U22  U23 )
   (  .    .   A33 )   ( U13^T  U23^T  U33^T ) (  0    0   U33 )

Equating submatrices in the second block of columns, we obtain

   A12 = U11^T U12
   A22 = U12^T U12 + U22^T U22.

Hence, if U11 has already been computed, we can compute U12 as the solution to the equation

   U11^T U12 = A12

by a call to the Level 3 BLAS routine STRSM; and then we can compute U22 from

   U22^T U22 = A22 - U12^T U12.

This involves first updating the symmetric submatrix A22 by a call to the Level 3 BLAS routine SSYRK, and then computing its Cholesky factorization. Since Fortran does not allow recursion, a separate routine must be called (using Level 2 BLAS rather than Level 3), named SPOTF2 in the code below. In this way successive blocks of columns of U are computed. Here is LAPACK-style code for the block algorithm. In this code fragment NB denotes the width of the blocks.
      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-unit', J-1, JB,
     $               ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ),
     $               LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
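The same three-step structure (triangular solve, symmetric rank-k update, unblocked factorization of the diagonal block) can also be sketched in a higher-level language. The following Python fragment is an illustrative translation, not LAPACK code; the comments name the BLAS/LAPACK routines whose roles the loops play.

```python
import math

def cholesky_blocked(A, nb):
    """Blocked Cholesky factorization A = U^T U (U upper triangular),
    mirroring the STRSM / SSYRK / SPOTF2 structure of the code above.
    Only the upper triangle of the symmetric positive definite A is read."""
    n = len(A)
    U = [[A[i][j] if i <= j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        # STRSM step: solve U11^T * U12 = A12 by forward substitution.
        for col in range(j, j + jb):
            for k in range(j):
                t = U[k][col]
                for i in range(k):
                    t -= U[i][k] * U[i][col]
                U[k][col] = t / U[k][k]
        # SSYRK step: A22 := A22 - U12^T * U12 (upper triangle only).
        for r in range(j, j + jb):
            for c in range(r, j + jb):
                U[r][c] -= sum(U[k][r] * U[k][c] for k in range(j))
        # SPOTF2 step: unblocked Cholesky of the updated diagonal block.
        for c in range(j, j + jb):
            for k in range(j, c):
                t = U[k][c]
                for i in range(j, k):
                    t -= U[i][k] * U[i][c]
                U[k][c] = t / U[k][k]
            s = U[c][c] - sum(U[i][c] ** 2 for i in range(j, c))
            if s <= 0.0:
                raise ValueError("matrix is not positive definite")
            U[c][c] = math.sqrt(s)
    return U
```

For any block width nb the result is the same triangular factor; only the order of the operations, and hence the opportunity to use matrix-matrix kernels, changes.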
But that is not the end of the story, and the code given above is not the code that is actually used in the LAPACK routine SPOTRF . We mentioned in subsection 3.1.1 that for many linear algebra computations there are several vectorizable variants, often referred to as i-, j- and k-variants, according to a convention introduced in [33] and used in [45]. The same is true of the corresponding block algorithms.
It turns out that the j-variant that was chosen for LINPACK, and used in the above examples, is not the fastest on many machines, because it is based on solving triangular systems of equations, which can be significantly slower than matrix-matrix multiplication. The variant actually used in LAPACK is the i-variant, which does rely on matrix-matrix multiplication.
Having discussed in detail the derivation of one particular block algorithm, we now describe examples of the performance that has been achieved with a variety of block algorithms. The clock speeds for the computers involved in the timings are listed in Table 3.2.
-------------------------------------------------
                         Clock Speed
-------------------------------------------------
CONVEX C-4640           135 MHz     7.41  ns
CRAY C90                240 MHz     4.167 ns
DEC 3000-500X Alpha     200 MHz     5.0   ns
IBM POWER2 model 590     66 MHz    15.15  ns
IBM RISC Sys/6000-550    42 MHz    23.81  ns
SGI POWER CHALLENGE      75 MHz    13.33  ns
-------------------------------------------------
Table 3.2: Clock speeds of the computers used in the timings
See Gallivan et al. [42] and Dongarra et al. [31] for an alternative survey of algorithms for dense linear algebra on high-performance computers.
The well-known LU and Cholesky factorizations are the simplest block algorithms to derive. They require neither extra floating-point operations nor extra working storage.
Table 3.3 illustrates the speed of the LAPACK routine for LU factorization of a real matrix , SGETRF in single precision on CRAY machines, and DGETRF in double precision on all other machines. This corresponds to 64-bit floating-point arithmetic on all machines tested. A block size of 1 means that the unblocked algorithm is used, since it is faster than - or at least as fast as - a blocked algorithm.
---------------------------------------------------------
                        No. of      Block   Values of n
                        processors  size    100    1000
---------------------------------------------------------
CONVEX C-4640                1        64    274     711
CONVEX C-4640                4        64    379    2588
CRAY C90                     1       128    375     863
CRAY C90                    16       128    386    7412
DEC 3000-500X Alpha          1        32     53      91
IBM POWER2 model 590         1        32    110     168
IBM RISC Sys/6000-550        1        32     33      56
SGI POWER CHALLENGE          1        64     81     201
SGI POWER CHALLENGE          4        64     79     353
---------------------------------------------------------
Table 3.3: Speed in megaflops of SGETRF/DGETRF for square matrices of order n
Table 3.4 gives similar results for Cholesky factorization .
---------------------------------------------------------
                        No. of      Block   Values of n
                        processors  size    100    1000
---------------------------------------------------------
CONVEX C-4640                1        64    120     546
CONVEX C-4640                4        64    150    1521
CRAY C90                     1       128    324     859
CRAY C90                    16       128    453    9902
DEC 3000-500X Alpha          1        32     37      83
IBM POWER2 model 590         1        32    102     247
IBM RISC Sys/6000-550        1        32     40      72
SGI POWER CHALLENGE          1        64     74     199
SGI POWER CHALLENGE          4        64     69     424
---------------------------------------------------------
Table 3.4: Speed in megaflops of SPOTRF/DPOTRF for matrices of order n with UPLO = `U'
LAPACK, like LINPACK, provides a factorization for symmetric indefinite matrices, so that A is factorized as P U D U^T P^T, where P is a permutation matrix, U is upper triangular, and D is block diagonal with blocks of order 1 or 2. A block form of this algorithm has been derived, and is implemented in the LAPACK routine SSYTRF/DSYTRF. It has to duplicate a little of the computation in order to ``look ahead'' to determine the necessary row and column interchanges, but the extra work can be more than compensated for by the greater speed of updating the matrix by blocks, as is illustrated in Table 3.5.
-------------------------
 Block      Values of n
 size       100    1000
-------------------------
   1         62      86
  64         68     165
-------------------------
Table 3.5: Speed in megaflops of DSYTRF for matrices of order n with UPLO = `U' on an IBM POWER2 model 590
LAPACK, like LINPACK, provides LU and Cholesky factorizations of band matrices. The LINPACK algorithms can easily be restructured to use Level 2 BLAS, though that has little effect on performance for matrices of very narrow bandwidth. It is also possible to use Level 3 BLAS, at the price of doing some extra work with zero elements outside the band [39]. This becomes worthwhile for matrices of large order and semi-bandwidth greater than 100 or so.
The traditional algorithm for QR factorization is based on the use of elementary Householder matrices of the general form

   H = I - tau v v^T

where v is a column vector and tau is a scalar. This leads to an algorithm with very good vector performance, especially if coded to use Level 2 BLAS.
The key to developing a block form of this algorithm is to represent a product of b elementary Householder matrices of order n as a block form of a Householder matrix. This can be done in various ways. LAPACK uses the following form [68]:

   H_1 H_2 ... H_b = I - V T V^T

where V is an n-by-b matrix whose columns are the individual vectors v_1, v_2, ..., v_b associated with the Householder matrices H_1, H_2, ..., H_b, and T is an upper triangular matrix of order b. Extra work is required to compute the elements of T, but once again this is compensated for by the greater speed of applying the block form. Table 3.6 summarizes results obtained with the LAPACK routine SGEQRF/DGEQRF.
---------------------------------------------------------
                        No. of      Block   Values of n
                        processors  size    100    1000
---------------------------------------------------------
CONVEX C-4640                1        64     81     521
CONVEX C-4640                4        64     94    1204
CRAY C90                     1       128    384     859
CRAY C90                    16       128    390    7641
DEC 3000-500X Alpha          1        32     50      86
IBM POWER2 model 590         1        32    108     208
IBM RISC Sys/6000-550        1        32     30      61
SGI POWER CHALLENGE          1        64     61     190
SGI POWER CHALLENGE          4        64     39     342
---------------------------------------------------------
Table 3.6: Speed in megaflops of SGEQRF/DGEQRF for square matrices of order n
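The representation H_1 H_2 ... H_b = I - V T V^T can be checked numerically. In the sketch below (plain Python, for illustration only; the vectors and scalars are arbitrary made-up values, since the identity holds for any choice), T is accumulated one column at a time by the recurrence T_ii = tau_i, T[1:i-1, i] = -tau_i T[1:i-1, 1:i-1] (V^T v_i), which is essentially what the LAPACK auxiliary routine SLARFT does.

```python
def matmul(A, B):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def build_T(V, tau):
    """Upper triangular T of order b such that
    (I - tau_1 v_1 v_1^T) ... (I - tau_b v_b v_b^T) = I - V T V^T,
    where v_i is the i-th column of the n-by-b matrix V."""
    n, b = len(V), len(tau)
    T = [[0.0] * b for _ in range(b)]
    for i in range(b):
        T[i][i] = tau[i]
        # w = V[:, :i]^T v_i
        w = [sum(V[r][k] * V[r][i] for r in range(n)) for k in range(i)]
        for r in range(i):
            T[r][i] = -tau[i] * sum(T[r][k] * w[k] for k in range(i))
    return T

def block_reflector(V, tau):
    """Form I - V T V^T explicitly."""
    n, b = len(V), len(tau)
    T = build_T(V, tau)
    Vt = [[V[j][i] for j in range(n)] for i in range(b)]
    VTVt = matmul(matmul(V, T), Vt)
    return [[(1.0 if i == j else 0.0) - VTVt[i][j] for j in range(n)]
            for i in range(n)]
```

Comparing this with the explicit product of the individual reflectors confirms the identity to roundoff.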
Eigenvalue problems have until recently provided a less fertile ground for the development of block algorithms than the factorizations so far described. Version 2.0 of LAPACK includes new block algorithms for the symmetric eigenvalue problem, and future releases will include analogous algorithms for the singular value decomposition.
The first step in solving many types of eigenvalue problems is to reduce the original matrix to a ``condensed form'' by orthogonal transformations.
In the reduction to condensed forms, the unblocked algorithms all use elementary Householder matrices and have good vector performance. Block forms of these algorithms have been developed [34], but all require additional operations, and a significant proportion of the work must still be performed by Level 2 BLAS, so there is less possibility of compensating for the extra operations.
The algorithms concerned are:

- reduction of a symmetric matrix to tridiagonal form (SSYTRD), using the Level 3 BLAS routine SSYR2K; Level 3 BLAS account for at most half the work.

- reduction of a rectangular matrix to bidiagonal form (SGEBRD), using two calls to the Level 3 BLAS routine SGEMM; Level 3 BLAS account for at most half the work.

- reduction of a nonsymmetric matrix to upper Hessenberg form (SGEHRD); Level 3 BLAS account for at most three-quarters of the work.
Note that only in the reduction to Hessenberg form is it possible to use the block Householder representation described in subsection 3.4.2. Extra work must be performed to compute the n-by-b matrices X and Y that are required for the block updates (b is the block size) - and extra workspace is needed to store them.
Nevertheless, the performance gains can be worthwhile on some machines, for example, on an IBM POWER2 model 590, as shown in Table 3.7.
(all matrices are square of order n)
------------------------------
          Block   Values of n
          size    100    1000
------------------------------
DSYTRD      1     137     159
           16      82     169
------------------------------
DGEBRD      1      90     110
           16      90     136
------------------------------
DGEHRD      1     111     113
           16     125     187
------------------------------
Table 3.7: Speed in megaflops of reductions to condensed forms on an IBM POWER2 model 590
Following the reduction of a dense (or band) symmetric matrix to tridiagonal form T, we must compute the eigenvalues and (optionally) eigenvectors of T. Computing the eigenvalues of T alone (using LAPACK routine SSTERF) requires O(n^2) flops, whereas the reduction routine SSYTRD does 4n^3/3 + O(n^2) flops. So eventually the cost of finding eigenvalues alone becomes small compared to the cost of reduction. However, SSTERF does only scalar floating point operations, without scope for the BLAS, so n may have to be large before SSYTRD is slower than SSTERF.
Version 2.0 of LAPACK includes a new algorithm, SSTEDC, for finding all the eigenvalues and eigenvectors of a symmetric tridiagonal matrix T. The new algorithm can exploit Level 2 and 3 BLAS, whereas the previous algorithm, SSTEQR, could not. Furthermore, SSTEDC usually does many fewer flops than SSTEQR, so the speedup is compounded. Briefly, SSTEDC works as follows (for details, see [67] [47]). The tridiagonal matrix T is written as

   T = diag(T1, T2) + H

where T1 and T2 are tridiagonal, and H is a very simple rank-one matrix. Then the eigenvalues and eigenvectors of T1 and T2 are found by applying the algorithm recursively; this yields T1 = Q1 Λ1 Q1^T and T2 = Q2 Λ2 Q2^T, where Λi is a diagonal matrix of eigenvalues, and the columns of Qi are orthonormal eigenvectors. Thus

   T = diag(Q1, Q2) ( diag(Λ1, Λ2) + H' ) diag(Q1, Q2)^T

where H' is again a simple rank-one matrix. The eigenvalues and eigenvectors of diag(Λ1, Λ2) + H' may be found using scalar operations, yielding diag(Λ1, Λ2) + H' = Q' Λ Q'^T. Substituting this into the last displayed expression yields

   T = ( diag(Q1, Q2) Q' ) Λ ( diag(Q1, Q2) Q' )^T = Q Λ Q^T

where the diagonal entries of Λ are the desired eigenvalues of T, and the columns of Q = diag(Q1, Q2) Q' are the eigenvectors. Almost all the work is done in the two matrix multiplies of Q1 and Q2 times the corresponding blocks of Q', which is done using the Level 3 BLAS.
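The splitting step can be seen in a small numerical sketch (illustrative values only, not LAPACK code): subtracting the off-diagonal coupling element beta from the two adjacent diagonal entries turns T into a block diagonal matrix plus the rank-one correction beta v v^T, where v has ones in the two coupled positions and zeros elsewhere.

```python
def split_tridiagonal(diag, off, m):
    """Express the symmetric tridiagonal T(diag, off) as
    blockdiag(T1, T2) + beta * v v^T, splitting after row m-1,
    where beta = off[m-1] couples rows m-1 and m, and v has ones
    in positions m-1 and m.  Returns (T1, T2, beta, v)."""
    n = len(diag)
    beta = off[m - 1]

    def tri(d, o):
        # Build a dense symmetric tridiagonal matrix from its diagonals.
        k = len(d)
        T = [[0.0] * k for _ in range(k)]
        for i in range(k):
            T[i][i] = d[i]
            if i + 1 < k:
                T[i][i + 1] = T[i + 1][i] = o[i]
        return T

    T1 = tri([diag[i] - (beta if i == m - 1 else 0.0) for i in range(m)],
             off[:m - 1])
    T2 = tri([diag[i] - (beta if i == m else 0.0) for i in range(m, n)],
             off[m:])
    v = [1.0 if i in (m - 1, m) else 0.0 for i in range(n)]
    return T1, T2, beta, v
```

Reassembling blockdiag(T1, T2) + beta v v^T recovers T exactly, which is the identity the recursion relies on.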
The same recursive algorithm can be developed for the singular value decomposition of the bidiagonal matrix resulting from reducing a dense matrix with SGEBRD. This software will be completed for a future release of LAPACK. The current LAPACK algorithm for the bidiagonal singular value decomposition, SBDSQR, does not use the Level 2 or Level 3 BLAS.
For computing the eigenvalues and eigenvectors of a Hessenberg matrix - or rather for computing its Schur factorization - yet another flavour of block algorithm has been developed: a multishift QR iteration [8]. Whereas the traditional EISPACK routine HQR uses a double shift (and the corresponding complex routine COMQR uses a single shift), the multishift algorithm uses block shifts of higher order. It has been found that often the total number of operations decreases as the order of shift is increased until a minimum is reached, typically between 4 and 8; for higher orders the number of operations increases quite rapidly. On many machines the speed of applying the shift increases steadily with the order, and the optimum order of shift is typically in the range 8-16. Note however that the performance can be very sensitive to the choice of the order of shift; it also depends on the numerical properties of the matrix. Dubrulle [37] has studied the practical performance of the algorithm, while Watkins and Elsner [77] discuss its theoretical asymptotic convergence rate.
Finally, we note that research into block algorithms for symmetric and nonsymmetric eigenproblems continues [55] [9], and future versions of LAPACK will be updated to contain the best algorithms available.
This section contains performance numbers for selected LAPACK driver routines. These routines provide complete solutions for the most common problems of numerical linear algebra, and are the routines users are most likely to call:
Data is provided for a variety of vector computers, shared memory parallel computers, and high performance workstations. All timings were obtained by using the machine-specific optimized BLAS available on each machine. For the IBM RISC Sys/6000-550 and IBM POWER2 model 590, the ESSL BLAS were used. In all cases the data consisted of 64-bit floating point numbers (single precision on the CRAY C90 and double precision on the other machines). For each machine and each driver, a small problem (N = 100 with LDA = 101) and a large problem (N = 1000 with LDA = 1001) were run. Block sizes NB = 1, 16, 32 and 64 were tried, with data only for the fastest run reported in the tables below. Similarly, UPLO = 'L' and UPLO = 'U' were timed for SSYEVD/DSYEVD, but only times for UPLO = 'U' were reported. For SGEEV/DGEEV, ILO = 1 and IHI = N. The test matrices were generated with randomly distributed entries. All run times are reported in seconds, and block size is denoted by nb. The value of nb was chosen to make N = 1000 optimal. It is not necessarily the best choice for N = 100. See Section 6.2 for details.
The performance data is reported using three or four statistics. First, the run-time in seconds is given. The second statistic measures how well our performance compares to the speed of the BLAS, specifically SGEMM/DGEMM. This ``equivalent matrix multiplies'' statistic is calculated as the time to solve the problem divided by the time for one n-by-n matrix multiplication by SGEMM/DGEMM, and is labeled as such in the tables.
The performance information for the BLAS routines SGEMV/DGEMV (TRANS='N') and SGEMM/DGEMM (TRANSA='N', TRANSB='N') is provided in Table 3.8, along with the clock speed for each machine in Table 3.2.
The third statistic is the true megaflop rating. For the eigenvalue and singular value drivers, a fourth ``synthetic megaflop'' statistic is also presented. We provide this statistic because the number of floating point operations needed to find eigenvalues and singular values depends on the input data, unlike linear equation solving or linear least squares solving with SGELS/DGELS. The synthetic megaflop rating is defined to be the ``standard'' number of flops required to solve the problem, divided by the run-time in microseconds. This ``standard'' number of flops is taken to be the average for a standard algorithm over a variety of problems, as given in Table 3.9 (we ignore terms of order n^2 and lower) [45].
Table 3.8: Execution time and Megaflop rates for SGEMV/DGEMV and SGEMM/DGEMM
Note that the synthetic megaflop rating is much higher than the true megaflop rating for SSYEVD/DSYEVD in Table 3.15; this is because SSYEVD/DSYEVD performs many fewer floating point operations than the standard algorithm, SSYEV/DSYEV.
Table 3.9: ``Standard'' floating point operation counts for LAPACK drivers for n-by-n matrices
Table 3.10: Performance of SGESV/DGESV for n-by-n matrices
Table 3.11: Performance of SGELS/DGELS for n-by-n matrices
Table 3.12: Performance of SGEEV/DGEEV, eigenvalues only
Table 3.13: Performance of SGEEV/DGEEV, eigenvalues and right eigenvectors
Table 3.14: Performance of SSYEVD/DSYEVD, eigenvalues only, UPLO='U'
Table 3.15: Performance of SSYEVD/DSYEVD, eigenvalues and eigenvectors, UPLO='U'
Table 3.16: Performance of SGESVD/DGESVD, singular values only
Table 3.17: Performance of SGESVD/DGESVD, singular values and left and right singular vectors
In addition to providing faster routines than previously available, LAPACK provides more comprehensive and better error bounds. Our ultimate goal is to provide error bounds for all quantities computed by LAPACK.
In this chapter we explain our overall approach to obtaining error bounds, and provide enough information to use the software. The comments at the beginning of the individual routines should be consulted for more details. It is beyond the scope of this chapter to justify all the bounds we present. Instead, we give references to the literature. For example, standard material on error analysis can be found in [45].
In order to make this chapter easy to read, we have labeled sections not essential for a first reading as Further Details. The sections not labeled as Further Details should provide all the information needed to understand and use the main error bounds computed by LAPACK. The Further Details sections provide mathematical background, references, and tighter but more expensive error bounds, and may be read later.
In section 4.1 we discuss the sources of numerical error, in particular roundoff error. Section 4.2 discusses how to measure errors, as well as some standard notation. Section 4.3 discusses further details of how error bounds are derived. Sections 4.4 through 4.12 present error bounds for linear equations, linear least squares problems, generalized linear least squares problems, the symmetric eigenproblem, the nonsymmetric eigenproblem, the singular value decomposition, the generalized symmetric definite eigenproblem, the generalized nonsymmetric eigenproblem and the generalized (or quotient) singular value decomposition respectively. Section 4.13 discusses the impact of fast Level 3 BLAS on the accuracy of LAPACK routines.
The sections on generalized linear least squares problems and the generalized nonsymmetric eigenproblem are ``placeholders'' to be completed in the next versions of the library and manual. The next versions will also include error bounds for new high accuracy routines for the symmetric eigenvalue problem and singular value decomposition.
There are two sources of error whose effects can be measured by the bounds in this chapter: roundoff error and input error. Roundoff error arises from rounding results of floating-point operations during the algorithm. Input error is error in the input to the algorithm from prior calculations or measurements. We describe roundoff error first, and then input error.
Almost all the error bounds LAPACK provides are multiples of machine epsilon, which we abbreviate by ε. Machine epsilon bounds the roundoff in individual floating-point operations. It may be loosely defined as the largest relative error in any floating-point operation that neither overflows nor underflows. (Overflow means the result is too large to represent accurately, and underflow means the result is too small to represent accurately.) Machine epsilon is available either by the function call SLAMCH('Epsilon') (or simply SLAMCH('E')) in single precision, or by the function call DLAMCH('Epsilon') (or DLAMCH('E')) in double precision. See section 4.1.1 and Table 4.1 for a discussion of common values of machine epsilon.
Since overflow generally causes an error message, and underflow is almost always less significant than roundoff, we will not consider overflow and underflow further (see section 4.1.1).
Bounds on input errors, or errors in the input parameters inherited from prior computations or measurements, may be easily incorporated into most LAPACK error bounds. Suppose the input data is accurate to, say, 5 decimal digits (we discuss exactly what this means in section 4.2). Then one simply replaces ε by 10^-5 in the error bounds.
Roundoff error is bounded in terms of the machine precision ε, which is the smallest value satisfying

   | fl(a ⊙ b) - (a ⊙ b) |  ≤  ε · | a ⊙ b | ,

where a and b are floating-point numbers, ⊙ is any one of the four operations +, -, × and ÷, and fl(a ⊙ b) is the floating-point result of a ⊙ b. Machine epsilon, ε, is the smallest value for which this inequality is true for all ⊙, and for all a and b such that a ⊙ b is neither too large (magnitude exceeds the overflow threshold) nor too small (is nonzero with magnitude less than the underflow threshold) to be represented accurately in the machine. We also assume ε bounds the relative error in unary operations like square root:

   | fl(√a) - √a |  ≤  ε · √a .

A precise characterization of ε depends on the details of the machine arithmetic and sometimes even of the compiler. For example, if addition and subtraction are implemented without a guard digit we must redefine ε to be the smallest number such that

   | fl(a ± b) - (a ± b) |  ≤  ε · ( |a| + |b| ) .
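On an IEEE machine the quantity involved can be found by a short loop. The Python sketch below (illustrative only; LAPACK's xLAMCH is more careful, since it must also cope with non-IEEE arithmetics) computes the spacing between 1.0 and the next larger floating-point number.

```python
import sys

def gap_at_one():
    """Smallest power of two eps such that fl(1 + eps) differs from 1;
    this is the spacing between 1.0 and the next floating-point number."""
    eps = 1.0
    while 1.0 + eps / 2.0 != 1.0:
        eps /= 2.0
    return eps

gap = gap_at_one()
# In IEEE double precision the gap is 2**-52.  Because results are rounded
# rather than truncated, the roundoff bound above holds with eps = 2**-53,
# half the gap; that halved value is what DLAMCH('Epsilon') reports on
# IEEE machines.
```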
In order to assure portability, machine parameters such as machine epsilon, the overflow threshold and underflow threshold are computed at runtime by the auxiliary routine xLAMCH. The alternative, keeping a fixed table of machine parameter values, would degrade portability because the table would have to be changed when moving from one machine, or even one compiler, to another.
Actually, most machines, but not yet all, do have the same machine parameters because they implement IEEE Standard Floating Point Arithmetic [5] [4], which exactly specifies floating-point number representations and operations. For these machines, including all modern workstations and PCs, the values of these parameters are given in Table 4.1.
Table 4.1: Values of Machine Parameters in IEEE Floating Point Arithmetic
As stated above, we will ignore overflow and underflow in discussing error bounds. Reference [18] discusses extending error bounds to include underflow, and shows that for many common computations, when underflow occurs it is less significant than roundoff. Overflow generally causes an error message and stops execution, so the error bounds do not apply.
Therefore, most of our error bounds will simply be proportional to machine epsilon. This means, for example, that if the same problem is solved in double precision and single precision, the error bound in double precision will be smaller than the error bound in single precision by a factor of ε_double / ε_single. In IEEE arithmetic, this ratio is 2^-53 / 2^-24 ≈ 10^-9, meaning that one expects the double precision answer to have approximately nine more decimal digits correct than the single precision answer.
LAPACK routines are generally insensitive to the details of rounding, like their counterparts in LINPACK and EISPACK. One newer algorithm (xLASV2) can return significantly more accurate results if addition and subtraction have a guard digit (see the end of section 4.9).
LAPACK routines return four types of floating-point output arguments: scalars, vectors, matrices, and subspaces.
First consider scalars. Let the scalar ŷ be an approximation of the true answer y. We can measure the difference between y and ŷ either by the absolute error |ŷ - y|, or, if y is nonzero, by the relative error |ŷ - y| / |y|. Alternatively, it is sometimes more convenient to use |ŷ - y| / |ŷ| instead of the standard expression for relative error (see section 4.2.1). If the relative error of ŷ is, say 10^-5, then we say that ŷ is accurate to 5 decimal digits.
In order to measure the error in vectors, we need to measure the size or norm of a vector x. A popular norm is the magnitude of the largest component, max_i |x_i|, which we denote ||x||_∞. This is read as the infinity norm of x. See Table 4.2 for a summary of norms.
Table 4.2: Vector and matrix norms
If x̂ is an approximation to the exact vector x, we will refer to ||x̂ - x||_p as the absolute error in x̂ (where p is one of the values in Table 4.2), and refer to ||x̂ - x||_p / ||x||_p as the relative error in x̂ (assuming ||x||_p ≠ 0). As with scalars, we will sometimes use ||x̂ - x||_p / ||x̂||_p for the relative error. As above, if the relative error of x̂ is, say 10^-5, then we say that x̂ is accurate to 5 decimal digits; a relative error of 10^-2 would mean that x̂ approximates x to 2 decimal digits.
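These definitions are straightforward to evaluate. A small Python sketch (with made-up data) for the infinity norm:

```python
def norm_inf(x):
    """Infinity norm of a vector: magnitude of the largest component."""
    return max(abs(xi) for xi in x)

x_true = [1.0, 100.0]    # illustrative exact vector
x_hat = [1.1, 99.0]      # illustrative approximation
abs_err = norm_inf([a - b for a, b in zip(x_hat, x_true)])
rel_err = abs_err / norm_inf(x_true)
# abs_err is 1.0 and rel_err is 0.01, i.e. x_hat agrees with x_true
# to about 2 decimal digits.
```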
Errors in matrices may also be measured with norms. The most obvious generalization of ||x||_∞ to matrices would appear to be ||A|| = max_{i,j} |a_{ij}|, but this does not have certain important mathematical properties that make deriving error bounds convenient (see section 4.2.1). Instead, we will use the matrix norm ||A||_∞ = max_i Σ_j |a_{ij}|, where A is an m-by-n matrix; see Table 4.2 for other matrix norms. As before, ||Â - A||_p is the absolute error in Â, ||Â - A||_p / ||A||_p is the relative error in Â, and a relative error in Â of 10^-5 means Â is accurate to 5 decimal digits; a relative error of 10^-1 would mean Â is accurate to just 1 decimal digit.
Here is some related notation we will use in our error bounds. The condition number of a matrix A is defined as κ_p(A) = ||A||_p · ||A^-1||_p, where A is square and invertible, and p is ∞ or one of the other possibilities in Table 4.2. The condition number measures how sensitive A^-1 is to changes in A; the larger the condition number, the more sensitive is A^-1.
LAPACK error estimation routines typically compute a variable called RCOND, which is the reciprocal of the condition number (or an approximation of the reciprocal). The reciprocal of the condition number is used instead of the condition number itself in order to avoid the possibility of overflow when the condition number is very large. Also, some of our error bounds will use the vector of absolute values of x, |x| (|x|_i = |x_i|), or similarly |A| (|A|_{ij} = |a_{ij}|).
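For a 2-by-2 matrix the condition number, and hence RCOND, can be computed in closed form; the following Python sketch (with illustrative data) uses the infinity norm:

```python
def norm_inf_mat(A):
    """Infinity norm of a matrix: maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def inv2(A):
    """Closed-form inverse of a 2-by-2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[4.0, 2.0], [1.0, 3.0]]
cond = norm_inf_mat(A) * norm_inf_mat(inv2(A))   # kappa_inf(A) = 3
rcond = 1.0 / cond   # the reciprocal, as LAPACK's estimators report it
```

For large matrices LAPACK estimates this quantity in O(n^2) extra operations rather than forming A^-1 explicitly.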
Now we consider errors in subspaces. Subspaces are the outputs of routines that compute eigenvectors and invariant subspaces of matrices. We need a careful definition of error in these cases for the following reason. The nonzero vector x is called a (right) eigenvector of the matrix A with eigenvalue λ if Ax = λx. From this definition, we see that -x, 2x, or any other nonzero multiple βx of x is also an eigenvector. In other words, eigenvectors are not unique. This means we cannot measure the difference between two supposed eigenvectors x̂ and x by computing ||x̂ - x||_2, because this may be large while ||x̂ - βx||_2 is small or even zero for some β. This is true even if we normalize x̂ so that ||x̂||_2 = 1, since both x and -x can be normalized simultaneously. So in order to define error in a useful way, we need to instead consider the set S of all scalar multiples of x. The set S is called the subspace spanned by x, and is uniquely determined by any nonzero member of S. We will measure the difference between two such sets by the acute angle between them. Suppose Ŝ is spanned by {x̂} and S is spanned by {x}. Then the acute angle between Ŝ and S is defined as

   θ(x̂, x) ≡ arccos ( | x̂^T x | / ( ||x̂||_2 · ||x||_2 ) ) .
One can show that θ(x̂, x) does not change when either x̂ or x is multiplied by any nonzero scalar: θ(α x̂, β x) = θ(x̂, x) for any nonzero scalars α and β.
Here is another way to interpret the angle between x̂ and x. Suppose x̂ is a unit vector (||x̂||_2 = 1). Then there is a scalar β such that

   || x̂ - β x ||_2 = 2 sin( θ(x̂, x) / 2 ) ≈ θ(x̂, x) .

The approximation ≈ holds when θ(x̂, x) is much less than 1 (less than .1 will do nicely). If x̂ is an approximate eigenvector with error bound θ(x̂, x) ≤ θ̄, where x is a true eigenvector, there is another true eigenvector x̃ = βx satisfying ||x̂ - x̃||_2 ≲ θ̄.
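The angle θ(x̂, x) and its scale invariance are easy to check directly (Python sketch, with illustrative vectors):

```python
import math

def acute_angle(x_hat, x):
    """theta(x_hat, x) = arccos(|x_hat . x| / (||x_hat||_2 * ||x||_2))."""
    dot = sum(a * b for a, b in zip(x_hat, x))
    n1 = math.sqrt(sum(a * a for a in x_hat))
    n2 = math.sqrt(sum(b * b for b in x))
    # Clamp to guard against roundoff pushing the ratio above 1.
    return math.acos(min(1.0, abs(dot) / (n1 * n2)))

x = [1.0, 0.0]
x_hat = [math.cos(0.05), math.sin(0.05)]   # unit vector 0.05 radians from x
theta = acute_angle(x_hat, x)              # approximately 0.05
# Multiplying either vector by a nonzero scalar (even a negative one)
# leaves the angle unchanged:
theta2 = acute_angle([-2.0 * v for v in x_hat], [3.0 * v for v in x])
```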
Some LAPACK routines also return subspaces spanned by more than one vector, such as the invariant subspaces of matrices returned by xGEESX. The notion of angle between subspaces also applies here; see section 4.2.1 for details.
Finally, many of our error bounds will contain a factor p(n) (or p(m, n)), which grows as a function of matrix dimension n (or dimensions m and n). It represents a potentially different function for each problem. In practice, the true errors usually grow just linearly; using p(n) = 10n in the error bound formulas will often give a reasonable bound. Therefore, we will refer to p(n) as a ``modestly growing'' function of n. However, it can occasionally be much larger; see section 4.2.1. For simplicity, the error bounds computed by the code fragments in the following sections will use p(n) = 1. This means these computed error bounds may occasionally slightly underestimate the true error. For this reason we refer to these computed error bounds as ``approximate error bounds''.
The relative error ||x̂ - x|| / ||x|| in the approximation x̂ of the true solution x has a drawback: it often cannot be computed directly, because it depends on the unknown quantity ||x||. However, we can often instead estimate ||x̂ - x|| / ||x̂||, since x̂ is known (it is the output of our algorithm). Fortunately, these two quantities are necessarily close together, provided either one is small, which is the only time they provide a useful bound anyway. For example, ||x̂ - x|| / ||x̂|| ≤ ε ≤ 1/2 implies

   (1 - ε) · ||x̂ - x|| / ||x̂||  ≤  ||x̂ - x|| / ||x||  ≤  (1 + 2ε) · ||x̂ - x|| / ||x̂|| ,

so they can be used interchangeably.
Table 4.2 contains a variety of norms we will use to measure errors. These norms have the properties that ||Ax||_p ≤ ||A||_p · ||x||_p and ||AB||_p ≤ ||A||_p · ||B||_p, where p is one of 1, 2, ∞, and F. These properties are useful for deriving error bounds.
An error bound that uses a given norm may be changed into an error bound that uses another norm. This is accomplished by multiplying the first error bound by an appropriate function of the problem dimension. Table 4.3 gives the factors f(n) such that ||x||_p ≤ f(n) · ||x||_q, where n is the dimension of x.
Table 4.3: Bounding One Vector Norm in Terms of Another
Table 4.4 gives the factors f(m, n) such that ||A||_p ≤ f(m, n) · ||A||_q, where A is m-by-n.
Table 4.4: Bounding One Matrix Norm in Terms of Another
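Although the conversion factors themselves are tabulated above, the typical chain of inequalities they encode can be checked numerically; for any n-vector x, ||x||_∞ ≤ ||x||_2 ≤ ||x||_1 ≤ n · ||x||_∞ (Python sketch with made-up data):

```python
import math

def norm_1(x):
    """One norm: sum of absolute values."""
    return sum(abs(v) for v in x)

def norm_2(x):
    """Two norm: Euclidean length."""
    return math.sqrt(sum(v * v for v in x))

def norm_inf(x):
    """Infinity norm: largest absolute component."""
    return max(abs(v) for v in x)

x = [3.0, -4.0, 1.0, 0.5]
n = len(x)
# The chain ||x||_inf <= ||x||_2 <= ||x||_1 <= n * ||x||_inf illustrates
# the kind of dimension-dependent factors tabulated in Table 4.3.
ok = norm_inf(x) <= norm_2(x) <= norm_1(x) <= n * norm_inf(x)
```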
The two-norm of A, ||A||_2, is also called the spectral norm of A, and is equal to the largest singular value σ_max(A) of A. We shall also need to refer to the smallest singular value σ_min(A) of A; its value can be defined in a similar way to the definition of the two-norm in Table 4.2, namely as σ_min(A) = min_{x ≠ 0} ||Ax||_2 / ||x||_2 when A has at least as many rows as columns, and defined as σ_min(A) = min_{x ≠ 0} ||A^T x||_2 / ||x||_2 when A has more columns than rows. The two-norm, Frobenius norm, and singular values of a matrix do not change if the matrix is multiplied by a real orthogonal (or complex unitary) matrix.
Now we define subspaces spanned by more than one vector, and angles between subspaces. Given a set of k n-dimensional vectors x_1, ..., x_k, they determine a subspace S consisting of all their possible linear combinations Σ_i β_i x_i, for scalars β_1, ..., β_k. We also say that {x_1, ..., x_k} spans S. The difficulty in measuring the difference between subspaces is that the sets of vectors spanning them are not unique. For example, {x}, {-x} and {2x} all determine the same subspace. This means we cannot simply compare the subspaces spanned by {x̂_1, ..., x̂_k} and {x_1, ..., x_k} by comparing each x̂_i to x_i. Instead, we will measure the angle between the subspaces, which is independent of the spanning set of vectors. Suppose subspace Ŝ is spanned by {x̂_1, ..., x̂_k} and that subspace S is spanned by {x_1, ..., x_k}. If k = 1, we instead write more simply {x̂} and {x}. When k = 1, we defined the angle between Ŝ and S as the acute angle between x̂ and x. When k > 1, we define the acute angle between Ŝ and S as the largest acute angle between any vector x̂ in Ŝ and the closest vector x in S to x̂:

   θ(Ŝ, S) ≡ max_{x̂ ∈ Ŝ, x̂ ≠ 0}  min_{x ∈ S, x ≠ 0}  θ(x̂, x) .
LAPACK routines which compute subspaces return vectors x̂_1, ..., x̂_k spanning a subspace Ŝ which are orthonormal. This means the n-by-k matrix X̂ = [x̂_1, ..., x̂_k] satisfies X̂^H X̂ = I. Suppose also that the vectors x_1, ..., x_k spanning S are orthonormal, so X = [x_1, ..., x_k] also satisfies X^H X = I. Then there is a simple expression for the angle between Ŝ and S:

   θ(Ŝ, S) = arccos σ_min( X̂^H X ) .
For example, if X̂ = [e_1, e_2] and X = [e_1, (0, cos φ, sin φ)^T] in R^3 (with φ acute), then X̂^H X = diag(1, cos φ), whose smallest singular value is cos φ, so θ(Ŝ, S) = arccos(cos φ) = φ.
As stated above, all our bounds will contain a factor p(n) (or p(m, n)), which measures how roundoff errors can grow as a function of matrix dimension n (or m and n). In practice, the true error usually grows just linearly with n, but we can generally only prove much weaker bounds of the form p(n) = O(n^3). This is because we cannot rule out the extremely unlikely possibility of rounding errors all adding together instead of cancelling on average. Using p(n) = n^3 would give very pessimistic and unrealistic bounds, especially for large n, so we content ourselves with describing p(n) as a ``modestly growing'' polynomial function of n. Using p(n) = 10n in the error bound formulas will often give a reasonable bound. For detailed derivations of various p(n), see [78] [45].
There is also one situation where p(n) can grow as large as 2^(n-1): Gaussian elimination. This typically occurs only on specially constructed matrices presented in numerical analysis courses [78, p. 212]. However, the expert drivers for solving linear systems, xGESVX and xGBSVX, provide error bounds incorporating p(n), and so this rare possibility can be detected.
We illustrate standard error analysis with the simple example of evaluating the scalar function y = f(z). Let the output of the subroutine which implements f(z) be denoted alg(z); this includes the effects of roundoff. If alg(z) = f(z + δ) where δ is small, then we say alg is a backward stable algorithm for f, or that the backward error δ is small. In other words, alg(z) is the exact value of f at a slightly perturbed input z + δ.
Suppose now that f is a smooth function, so that we may approximate it near z by a straight line: f(z + δ) ≈ f(z) + f'(z) · δ. Then we have the simple error estimate

   alg(z) - f(z) = f(z + δ) - f(z) ≈ f'(z) · δ .

Thus, if δ is small, and the derivative f'(z) is moderate, the error alg(z) - f(z) will be small. This is often written in the similar form

   | alg(z) - f(z) | / | f(z) |  ≲  | f'(z) · z / f(z) | · | δ / z | .

This approximately bounds the relative error | alg(z) - f(z) | / | f(z) | by the product of the condition number of f at z, κ(f, z) = | f'(z) · z / f(z) |, and the relative backward error | δ / z |. Thus we get an error bound by multiplying a condition number and a backward error (or bounds for these quantities). We call a problem ill-conditioned if its condition number is large, and ill-posed if its condition number is infinite (or does not exist).
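A quick numerical check of this estimate, with f(z) = z^3 (so the condition number | z f'(z) / f(z) | = 3 for every nonzero z) and made-up values:

```python
def f(z):
    return z ** 3

z = 1.7
delta = 1e-8 * z                 # a small backward error in the input
rel_backward = abs(delta / z)    # equals 1e-8 up to roundoff
rel_forward = abs(f(z + delta) - f(z)) / abs(f(z))
# The condition number of f at z is |z * 3z^2 / z^3| = 3, so the relative
# forward error is close to 3 * rel_backward = 3e-8.
```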
If f and z are vector quantities, then f'(z) is a matrix (the Jacobian). So instead of using absolute values as before, we now measure δ by a vector norm ||δ|| and f'(z) by a matrix norm ||f'(z)||. The conventional (and coarsest) error analysis uses a norm such as the infinity norm. We therefore call this normwise backward stability. For example, a normwise stable method for solving a system of linear equations Ax = b will produce a computed solution x̂ satisfying (A + E) x̂ = b + f, where ||E|| / ||A|| and ||f|| / ||b|| are both small (close to machine epsilon). In this case the condition number is κ(A) = ||A|| · ||A^-1|| (see section 4.4 below).
Almost all of the algorithms in LAPACK (as well as LINPACK and EISPACK) are stable in the sense just described: when applied to a matrix A they produce the exact result for a slightly different matrix A + E, where ||E|| / ||A|| is of order ε.
Condition numbers may be expensive to compute exactly. For example, it costs about 2n^3/3 operations to solve Ax = b for a general matrix A, and computing κ_∞(A) exactly costs an additional 4n^3/3 operations, or twice as much. But κ_∞(A) can be estimated in only O(n^2) operations beyond those necessary for solution, a tiny extra cost. Therefore, most of LAPACK's condition numbers and error bounds are based on estimated condition numbers, using the method of [52] [51] [48]. The price one pays for using an estimated rather than an exact condition number is occasional (but very rare) underestimates of the true error; years of experience attest to the reliability of our estimators, although examples where they badly underestimate the error can be constructed [53]. Note that once a condition estimate is large enough (usually >= 1/ε), it confirms that the computed answer may be completely inaccurate, and so the exact magnitude of the condition estimate conveys little information.
The standard error analysis just outlined has a drawback: by using the infinity norm to measure the backward error δ, entries of equal magnitude in δ contribute equally to the final error bound, regardless of the magnitudes of the corresponding entries of z. This means that if z is sparse or has some very tiny entries, a normwise backward stable algorithm may make very large relative changes in these entries compared to their original values. If these tiny values are known accurately by the user, these errors may be unacceptable, or the error bounds may be unacceptably large.
For example, consider solving a diagonal system of linear equations Ax = b. Each component of the solution is computed accurately by Gaussian elimination: x̂_i = b_i / a_ii, with relative error at most ε. The usual error bound is approximately ε · κ_∞(A) = ε · max_i |a_ii| / min_i |a_ii|, which can arbitrarily overestimate the true error, ε, if at least one a_ii is tiny and another one is large.
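The overestimation is easy to see numerically. This illustrative NumPy sketch (the matrix entries are hypothetical, chosen only to make the pivots differ by eight orders of magnitude) solves a 2-by-2 diagonal system and compares the normwise bound with the true componentwise accuracy:

```python
import numpy as np

# Hypothetical diagonal system with one tiny and one large pivot.
d = np.array([1e-8, 1.0])              # diag(A)
b = np.array([2e-8, 3.0])
x = b / d                              # Gaussian elimination: one division per component
eps = np.finfo(float).eps
kappa_inf = d.max() / d.min()          # kappa_inf(A) = max|a_ii| / min|a_ii| = 1e8
normwise_bound = kappa_inf * eps       # the usual bound: wildly pessimistic here
true_rel_err = eps                     # each component is correct to one rounding
```

Here the normwise bound is about 1e8 times larger than the true relative error, even though each solution component is computed to full precision.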
LAPACK addresses this inadequacy by providing some algorithms whose backward error is a tiny relative change in each component of z: |δ_i| <= O(ε) |z_i|. This backward error retains both the sparsity structure of z and the information in tiny entries. These algorithms are therefore called componentwise relatively backward stable. Furthermore, computed error bounds reflect this stronger form of backward error.
If the input data has independent uncertainty in each component, each component must have at least a small relative uncertainty, since each is a floating-point number. In this case, the extra uncertainty contributed by the algorithm is not much worse than the uncertainty in the input data, so one could say the answer provided by a componentwise relatively backward stable algorithm is as accurate as the data warrants [1].
When solving Ax = b using expert driver xyySVX or computational routine xyyRFS, for example, we almost always compute x̂ satisfying (A + E) x̂ = b + f, where e_ij is a small relative change in a_ij and f_i is a small relative change in b_i. In particular, if A is diagonal, the corresponding error bound is always tiny, as one would expect (see the next section).
LAPACK can achieve this accuracy for linear equation solving, the bidiagonal singular value decomposition, and the symmetric tridiagonal eigenproblem, and provides facilities for achieving this accuracy for least squares problems. Future versions of LAPACK will also achieve this accuracy for other linear algebra problems, as discussed below.
Let Ax = b be the system to be solved, and x̂ the computed solution. Let n be the dimension of A. An approximate error bound for x̂ may be obtained in one of the following two ways, depending on whether the solution is computed by a simple driver or an expert driver:
An approximate error bound ERRBD for the relative error ||x̂ - x||_∞ / ||x̂||_∞ can be computed by the following code fragment.
      EPSMCH = SLAMCH( 'E' )
*     Get infinity-norm of A
      ANORM = SLANGE( 'I', N, N, A, LDA, WORK )
*     Solve system; the solution X overwrites B
      CALL SGESV( N, 1, A, LDA, IPIV, B, LDB, INFO )
      IF( INFO.GT.0 ) THEN
         PRINT *,'Singular Matrix'
      ELSE IF ( N.GT.0 ) THEN
*        Get reciprocal condition number RCOND of A
         CALL SGECON( 'I', N, A, LDA, ANORM, RCOND,
     $                WORK, IWORK, INFO )
         RCOND = MAX( RCOND, EPSMCH )
         ERRBD = EPSMCH / RCOND
      END IF
For example, for a sample matrix A and right hand side b, this code fragment returns (to 4 decimal places) values of RCOND and ERRBD, which may be compared with the true reciprocal condition number and the true error. (The numerical values of the original worked example are not reproduced here.)
For example, the following code fragment solves Ax = b and computes an approximate error bound FERR:
      CALL SGESVX( 'E', 'N', N, 1, A, LDA, AF, LDAF, IPIV,
     $             EQUED, R, C, B, LDB, X, LDX, RCOND,
     $             FERR, BERR, WORK, IWORK, INFO )
      IF( INFO.GT.0 ) PRINT *,'(Nearly) Singular Matrix'
For the same A and b as above, the expert driver returns a bound FERR and an actual error that are both smaller than those obtained with the simple driver. (The numerical values are not reproduced here.)
This example illustrates that the expert driver provides an error bound with less programming effort than the simple driver, and also that it may produce a significantly more accurate answer.
Similar code fragments, with obvious adaptations, may be used with all the driver routines for linear equations listed in Table 2.2. For example, if a symmetric system is solved using the simple driver xSYSV, then xLANSY must be used to compute ANORM, and xSYCON must be used to compute RCOND.
LAPACK can solve systems of linear equations, linear least squares problems, eigenvalue problems and singular value problems. LAPACK can also handle many associated computations such as matrix factorizations or estimating condition numbers.
LAPACK contains driver routines for solving standard types of problems, computational routines to perform a distinct computational task, and auxiliary routines to perform a certain subtask or common low-level computation. Each driver routine typically calls a sequence of computational routines. Taken as a whole, the computational routines can perform a wider range of tasks than are covered by the driver routines. Many of the auxiliary routines may be of use to numerical analysts or software developers, so we have documented the Fortran source for these routines with the same level of detail used for the LAPACK routines and driver routines.
Dense and band matrices are provided for, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices. See Chapter 2 for a complete summary of the contents.
The conventional error analysis of linear equation solving goes as follows. Let Ax = b be the system to be solved. Let x̂ be the solution computed by LAPACK (or LINPACK) using any of their linear equation solvers. Let r be the residual r = b - A x̂. In the absence of rounding error r would be zero and x̂ would equal x; with rounding error one can only say the following:
The normwise backward error of the computed solution x̂, with respect to the infinity norm, is the pair E, f which minimizes

   max( ||E||_∞ / ||A||_∞ , ||f||_∞ / ||b||_∞ )
subject to the constraint (A + E) x̂ = b + f. The minimal value of this maximum is given by

   ω = ||r||_∞ / ( ||A||_∞ ||x̂||_∞ + ||b||_∞ ) .
One can show that the computed solution x̂ satisfies ω <= p(n) ε, where p(n) is a modestly growing function of n. The corresponding condition number is κ_∞(A) = ||A||_∞ · ||A^-1||_∞. The error x̂ - x is bounded by

   ||x̂ - x||_∞ / ||x||_∞  <=  2 ω κ_∞(A)  (approximately).
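The formula for ω is cheap to evaluate once x̂ is known. As an illustration outside the Guide's Fortran, this NumPy sketch (random test data; all names are ours) solves a system by LU, as xGESV does internally, and checks that the normwise backward error ω is of order ε:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
a = rng.standard_normal((n, n))
b = rng.standard_normal(n)
xhat = np.linalg.solve(a, b)        # LU with partial pivoting, as in xGESV

def norm_inf(v):
    # max absolute row sum for a matrix, max absolute entry for a vector
    return np.abs(v).sum(axis=1).max() if v.ndim == 2 else np.abs(v).max()

r = b - a @ xhat                    # residual
# omega = ||r||_inf / (||A||_inf ||xhat||_inf + ||b||_inf)
omega = norm_inf(r) / (norm_inf(a) * norm_inf(xhat) + norm_inf(b))
```

For a well-behaved random matrix, ω comes out as a small multiple of machine epsilon, confirming normwise backward stability.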
In the first code fragment in the last section, ε κ_∞(A) is approximated by ERRBD = EPSMCH / RCOND. Approximations of κ_∞(A) - or, strictly speaking, its reciprocal RCOND - are returned by computational routines xyyCON (subsection 2.3.1) or driver routines xyySVX (subsection 2.2.1). The code fragment makes sure RCOND is at least EPSMCH to avoid overflow in computing ERRBD. This limits ERRBD to a maximum of 1, which is no loss of generality since a relative error of 1 or more indicates the same thing: a complete loss of accuracy. Note that the value of RCOND returned by xyySVX may apply to a linear system obtained from Ax = b by equilibration, i.e. scaling the rows and columns of A in order to make the condition number smaller. This is the case in the second code fragment in the last section, where the program chose to scale the rows by the factors returned in R and to scale the columns by the factors returned in C, resulting in a smaller condition number for the equilibrated system.
As stated in section 4.3.2, this approach does not respect the presence of zero or tiny entries in A. In contrast, the LAPACK computational routines xyyRFS (subsection 2.3.1) or driver routines xyySVX (subsection 2.2.1) will (except in rare cases) compute a solution x̂ with the following properties:
The componentwise backward error of the computed solution x̂ is the pair E, f which minimizes

   ω = max_{i,j,k} ( |e_ij| / |a_ij| , |f_k| / |b_k| )
(where we interpret 0 / 0 as 0) subject to the constraint (A + E) x̂ = b + f. The minimal value of ω is given by

   ω = max_i |r_i| / ( |A| |x̂| + |b| )_i .
One can show that for most problems the x̂ computed by xyySVX satisfies ω <= p(n) ε, where p(n) is a modestly growing function of n. In other words, x̂ is the exact solution of the perturbed problem (A + E) x̂ = b + f, where E and f are small relative perturbations in each entry of A and b, respectively. The corresponding condition number is κ_c(A, x̂) = || |A^-1| ( |A| |x̂| + |b| ) ||_∞ / ||x̂||_∞. The error x̂ - x is bounded by

   ||x̂ - x||_∞ / ||x̂||_∞  <=  ω κ_c(A, x̂)  (approximately).
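The componentwise backward error has an equally simple formula once x̂ is known. This NumPy sketch (illustrative only; random data, our own variable names) evaluates it for an LU-based solve:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
a = rng.standard_normal((n, n))
b = rng.standard_normal(n)
xhat = np.linalg.solve(a, b)

r = b - a @ xhat
denom = np.abs(a) @ np.abs(xhat) + np.abs(b)    # (|A| |xhat| + |b|)_i
berr = np.max(np.abs(r) / denom)                # componentwise backward error omega
```

This is the quantity LAPACK's xyyRFS drives down (and returns as BERR); a plain LU solve without refinement usually, but not always, achieves a value of order ε as well.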
The routines xyyRFS and xyySVX return ω, which is called BERR (for Backward ERRor), and a bound on the actual error ||x̂ - x||_∞ / ||x̂||_∞, called FERR (for Forward ERRor), as in the second code fragment in the last section. FERR is actually calculated by the following formula, which can be smaller than the bound ω κ_c(A, x̂) given above:

   FERR = || |A^-1| ( |r̂| + n ε ( |A| |x̂| + |b| ) ) ||_∞ / ||x̂||_∞ .
Here, r̂ is the computed value of the residual b - A x̂, and the norm in the numerator is estimated using the same estimation subroutine used for RCOND.
The value of BERR for the example in the last section is a small multiple of ε. (The numerical value is not reproduced here.)
Even in the rare cases where xyyRFS fails to make BERR close to its minimum ω, the error bound FERR may remain small. See [6] for details.
The linear least squares problem is to find x that minimizes ||Ax - b||_2. We discuss error bounds for the most common case, where A is m-by-n with m > n and has full rank; this is called an overdetermined least squares problem (the following code fragments handle m = n as well).
Let x̂ be the solution computed by one of the driver routines xGELS, xGELSX or xGELSS (see section 2.2.2). An approximate error bound ERRBD for

   ||x̂ - x||_2 / ||x||_2

may be computed in one of the following ways, depending on which type of driver routine is used:
      EPSMCH = SLAMCH( 'E' )
*     Get the 2-norm of the right hand side B
      BNORM = SNRM2( M, B, 1 )
*     Solve the least squares problem; the solution X
*     overwrites B
      CALL SGELS( 'N', M, N, 1, A, LDA, B, LDB, WORK,
     $            LWORK, INFO )
      IF ( MIN( M, N ).GT.0 ) THEN
*        Get the 2-norm of the residual A*X-B
         RNORM = SNRM2( M-N, B( N+1 ), 1 )
*        Get the reciprocal condition number RCOND of A
         CALL STRCON( 'I', 'U', 'N', N, A, LDA, RCOND,
     $                WORK, IWORK, INFO )
         RCOND = MAX( RCOND, EPSMCH )
         IF ( BNORM.GT.0.0 ) THEN
            SINT = RNORM / BNORM
         ELSE
            SINT = 0.0
         END IF
         COST = MAX( SQRT( (1.0E0-SINT)*(1.0E0+SINT) ),
     $               EPSMCH )
         TANT = SINT / COST
         ERRBD = EPSMCH*( 2.0E0/(RCOND*COST) +
     $           TANT / RCOND**2 )
      END IF
For example, for a sample matrix A and right hand side b, this code fragment returns (to 4 decimal places) values of RCOND, SINT, COST and ERRBD, which may be compared with the true error. (The numerical values are not reproduced here.)
      EPSMCH = SLAMCH( 'E' )
*     Get the 2-norm of the right hand side B
      BNORM = SNRM2( M, B, 1 )
*     Solve the least squares problem; the solution X
*     overwrites B
      RCND = 0
      CALL SGELSX( M, N, 1, A, LDA, B, LDB, JPVT, RCND,
     $             RANK, WORK, INFO )
      IF ( RANK.LT.N ) THEN
         PRINT *,'Matrix less than full rank'
      ELSE IF ( MIN( M, N ).GT.0 ) THEN
*        Get the 2-norm of the residual A*X-B
         RNORM = SNRM2( M-N, B( N+1 ), 1 )
*        Get the reciprocal condition number RCOND of A
         CALL STRCON( 'I', 'U', 'N', N, A, LDA, RCOND,
     $                WORK, IWORK, INFO )
         RCOND = MAX( RCOND, EPSMCH )
         IF ( BNORM.GT.0.0 ) THEN
            SINT = RNORM / BNORM
         ELSE
            SINT = 0.0
         END IF
         COST = MAX( SQRT( (1.0E0-SINT)*(1.0E0+SINT) ),
     $               EPSMCH )
         TANT = SINT / COST
         ERRBD = EPSMCH*( 2.0E0/(RCOND*COST) +
     $           TANT / RCOND**2 )
      END IF

The numerical results of this code fragment on the above A and b are the same as for the first code fragment.
      CALL SGELSS( M, N, 1, A, LDA, B, LDB, S, RCND, RANK,
     $             WORK, LWORK, INFO )
and the call to STRCON must be replaced by:
RCOND = S( N ) / S( 1 )
Applied to the same A and b as above, the computed solution x̂ is nearly the same, and the values of RCOND, ERRBD and the true error are nearly identical to those from the previous code fragments. (The numerical values are not reproduced here.)
The conventional error analysis of linear least squares problems goes as follows. As above, let x̂ be the solution computed by LAPACK, using one of the least squares drivers xGELS, xGELSS or xGELSX (see subsection 2.2.2), to the problem of minimizing ||Ax - b||_2. We discuss the most common case, where A is overdetermined (i.e., has more rows than columns) and has full rank [45]:
The computed solution x̂ has a small normwise backward error. In other words x̂ minimizes ||(A + E) x̂ - (b + f)||_2, where E and f satisfy

   max( ||E||_2 / ||A||_2 , ||f||_2 / ||b||_2 ) <= p(n) ε
and p(n) is a modestly growing function of n. We take p(n) = 1 in the code fragments above. Let κ_2(A) = σ_max(A) / σ_min(A) (approximated by 1/RCOND in the above code fragments), let ρ = ||A x̂ - b||_2 (= RNORM above), and let sin(θ) = ρ / ||b||_2 (SINT = RNORM / BNORM above). Here, θ is the acute angle between the vectors A x̂ and b. Then when p(n) ε is small, the error x̂ - x is bounded by

   ||x̂ - x||_2 / ||x||_2  <=  p(n) ε ( 2 κ_2(A) / cos(θ) + tan(θ) κ_2(A)^2 )  (approximately),
where cos(θ) = COST and tan(θ) = TANT in the code fragments above.
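The ERRBD recipe from the code fragments above translates directly into other environments. Here is an illustrative NumPy sketch (random data; kappa2, sint, cost, tant and errbd are our names for the quantities RCOND, SINT, COST, TANT and ERRBD) that computes the same bound from an SVD-based least squares solve:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 10
a = rng.standard_normal((m, n))                 # full-rank, overdetermined
b = a @ rng.standard_normal(n) + 1e-4 * rng.standard_normal(m)

xhat, _, rank, s = np.linalg.lstsq(a, b, rcond=None)
eps = np.finfo(float).eps
kappa2 = s[0] / s[-1]                           # sigma_max / sigma_min = 1/RCOND
rnorm = np.linalg.norm(b - a @ xhat)            # RNORM
bnorm = np.linalg.norm(b)                       # BNORM
sint = rnorm / bnorm                            # SINT
cost = max(np.sqrt((1.0 - sint) * (1.0 + sint)), eps)   # COST
tant = sint / cost                              # TANT
errbd = eps * (2.0 * kappa2 / cost + tant * kappa2**2)  # ERRBD
```

Since κ_2(A) and sin(θ) are both modest for this data, the bound comes out close to machine epsilon.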
We avoid overflow by making sure RCOND and COST are both at least EPSMCH, and by handling the case of a zero B matrix separately (BNORM = 0).
κ_2(A) = σ_max(A) / σ_min(A) may be computed directly from the singular values of A returned by xGELSS (as in the code fragment) or by xGESVD. It may also be approximated by using xTRCON following calls to xGELS or xGELSX. xTRCON estimates κ_∞(R) or κ_1(R) of the triangular factor R instead of κ_2(A), but these can differ from κ_2(A) by at most a factor of n.
If A is rank-deficient, xGELSS and xGELSX can be used to regularize the problem by treating all singular values less than a user-specified threshold (RCND times the largest singular value) as exactly zero. The number of singular values treated as nonzero is returned in RANK. See [45] for error bounds in this case, as well as [45] [19] for the underdetermined case.
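The same truncation rule is used by NumPy's least squares solver, which makes for a convenient cross-check. In this illustrative sketch (hypothetical data with one nearly dependent column; not the Guide's example), singular values below rcond times the largest are treated as zero, exactly as xGELSS treats those below RCND times the largest:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 12, 6
a = rng.standard_normal((m, n))
a[:, -1] = a[:, 0] + 1e-12 * rng.standard_normal(m)   # nearly dependent column
b = rng.standard_normal(m)

# Singular values below rcond * sigma_1 are treated as exactly zero,
# mirroring the RCND threshold of xGELSS; the effective rank is returned.
x, _, rank, s = np.linalg.lstsq(a, b, rcond=1e-6)
```

The tiny singular value created by the nearly dependent column falls below the threshold, so the reported rank is n - 1.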
The solution of the overdetermined, full-rank problem may also be characterized as the solution of the linear system of equations

   [ I    A ] [ r ]   [ b ]
   [ A^T  0 ] [ x ] = [ 0 ] .
By solving this linear system using xyyRFS or xyySVX (see section 4.4), componentwise error bounds can also be obtained [7].
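The augmented system really does reproduce the least squares solution, as this small NumPy sketch confirms (random data, our own names; a dense solve stands in for xyySVX on the augmented matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 20, 8
a = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Augmented system  [ I   A ] [r]   [b]
#                   [ A^T 0 ] [x] = [0]
aug = np.block([[np.eye(m), a], [a.T, np.zeros((n, n))]])
rhs = np.concatenate([b, np.zeros(n)])
sol = np.linalg.solve(aug, rhs)
r, x = sol[:m], sol[m:]

xls = np.linalg.lstsq(a, b, rcond=None)[0]    # reference LS solution
```

The x block of the solution agrees with the least squares solution, and the r block is exactly the residual b - Ax, so componentwise machinery for square systems applies.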
There are two kinds of generalized least squares problems that are discussed in section 2.2.3: the linear equality-constrained least squares problem, and the general linear model problem. Error bounds for these problems will be included in a future version of this manual.
The eigendecomposition of an n-by-n real symmetric matrix A is the factorization A = Z Λ Z^T (A = Z Λ Z^H in the complex Hermitian case), where Z is orthogonal (unitary) and Λ = diag(λ_1, ..., λ_n) is real and diagonal, with λ_1 >= λ_2 >= ... >= λ_n. The λ_i are the eigenvalues of A and the columns z_i of Z are the eigenvectors. This is also often written A z_i = λ_i z_i. The eigendecomposition of a symmetric matrix is computed by the driver routines xSYEV, xSYEVX, xSYEVD, xSBEV, xSBEVX, xSBEVD, xSPEV, xSPEVX, xSPEVD, xSTEV, xSTEVX and xSTEVD. The complex counterparts of these routines, which compute the eigendecomposition of complex Hermitian matrices, are the driver routines xHEEV, xHEEVX, xHEEVD, xHBEV, xHBEVX, xHBEVD, xHPEV, xHPEVX, and xHPEVD (see subsection 2.2.4).
The approximate error bounds for the computed eigenvalues λ̂_i are

   |λ̂_i - λ_i| <= EERRBD .
The approximate error bounds for the computed eigenvectors ẑ_i, which bound the acute angles between the computed eigenvectors and true eigenvectors z_i, are:

   θ(ẑ_i, z_i) <= ZERRBD(i) .
These bounds can be computed by the following code fragment:
      EPSMCH = SLAMCH( 'E' )
*     Compute eigenvalues and eigenvectors of A
*     The eigenvalues are returned in W
*     The eigenvector matrix Z overwrites A
      CALL SSYEV( 'V', UPLO, N, A, LDA, W, WORK, LWORK,
     $            INFO )
      IF( INFO.GT.0 ) THEN
         PRINT *,'SSYEV did not converge'
      ELSE IF ( N.GT.0 ) THEN
*        Compute the norm of A
         ANORM = MAX( ABS( W(1) ), ABS( W(N) ) )
         EERRBD = EPSMCH * ANORM
*        Compute reciprocal condition numbers for eigenvectors
         CALL SDISNA( 'Eigenvectors', N, N, W, RCONDZ, INFO )
         DO 10 I = 1, N
            ZERRBD( I ) = EPSMCH * ( ANORM / RCONDZ( I ) )
   10    CONTINUE
      END IF
For a sample symmetric matrix A, this code fragment produces eigenvalues, approximate error bounds EERRBD and ZERRBD(i), and the corresponding true errors. (The numerical values are not reproduced here.)
The usual error analysis of the symmetric eigenproblem (using any LAPACK routine in subsection 2.2.4 or any EISPACK routine) is as follows [64]:
The computed eigendecomposition Ẑ Λ̂ Ẑ^T is nearly the exact eigendecomposition of A + E: A + E = (Ẑ + δẐ) Λ̂ (Ẑ + δẐ)^T is a true eigendecomposition, so that Ẑ + δẐ is orthogonal, where ||E||_2 / ||A||_2 <= p(n) ε and ||δẐ||_2 <= p(n) ε. Here p(n) is a modestly growing function of n. We take p(n) = 1 in the above code fragment. Each computed eigenvalue λ̂_i differs from a true eigenvalue λ_i by at most

   |λ̂_i - λ_i| <= p(n) ε ||A||_2 .
Thus large eigenvalues (those near max_i |λ_i| = ||A||_2) are computed to high relative accuracy and small ones may not be.
The angular difference between the computed unit eigenvector ẑ_i and a true unit eigenvector z_i satisfies the approximate bound

   θ(ẑ_i, z_i) <= p(n) ε ||A||_2 / gap_i  (approximately)
if p(n) ε is small enough. Here gap_i = min_{j /= i} |λ_i - λ_j| is the absolute gap between λ_i and the nearest other eigenvalue. Thus, if λ_i is close to other eigenvalues, its corresponding eigenvector ẑ_i may be inaccurate. The gaps may be easily computed from the array of computed eigenvalues using subroutine SDISNA. The gaps computed by SDISNA are ensured not to be so small as to cause overflow when used as divisors.
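The gap-based bounds are straightforward to form from the eigenvalue array alone. This NumPy sketch (illustrative, random symmetric data; the tiny-gap floor mimics, but is not identical to, SDISNA's overflow guard) mirrors the Fortran fragment above:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
c = rng.standard_normal((n, n))
a = (c + c.T) / 2.0                     # random symmetric test matrix
w, z = np.linalg.eigh(a)                # ascending eigenvalues, as from SSYEV

eps = np.finfo(float).eps
anorm = max(abs(w[0]), abs(w[-1]))      # ||A||_2 from the extreme eigenvalues
eerrbd = eps * anorm                    # eigenvalue error bound
# gap_i = distance from w[i] to the nearest other eigenvalue
gaps = np.array([np.min(np.abs(np.delete(w, i) - w[i])) for i in range(n)])
zerrbd = eps * anorm / np.maximum(gaps, eerrbd)   # floor tiny gaps to avoid overflow
```

Eigenvalues with small gaps receive correspondingly large eigenvector bounds, exactly as the theory predicts.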
Let Ŝ be the invariant subspace spanned by a collection of computed eigenvectors { ẑ_i , i in I }, where I is a subset of the integers from 1 to n. Let S be the corresponding true subspace. Then

   θ(Ŝ, S) <= p(n) ε ||A||_2 / gap_I  (approximately), where

   gap_I = min { |λ_i - λ_j| : i in I, j not in I }
is the absolute gap between the eigenvalues in I and the nearest other eigenvalue. Thus, a cluster of close eigenvalues which is far away from any other eigenvalue may have a well determined invariant subspace even if its individual eigenvectors are ill-conditioned.
In the special case of a real symmetric tridiagonal matrix T, the eigenvalues and eigenvectors can be computed much more accurately. xSYEV (and the other symmetric eigenproblem drivers) computes the eigenvalues and eigenvectors of a dense symmetric matrix by first reducing it to tridiagonal form T, and then finding the eigenvalues and eigenvectors of T. Reduction of a dense matrix to tridiagonal form T can introduce additional errors, so the following bounds for the tridiagonal case do not apply to the dense case.
The eigenvalues of T may be computed with small componentwise relative backward error (each entry of T is perturbed by at most a small relative amount O(ε)) by using subroutine xSTEBZ (subsection 2.3.4) or driver xSTEVX (subsection 2.2.4). If T is also positive definite, they may also be computed at least as accurately by xPTEQR (subsection 2.3.4). To compute error bounds for the computed eigenvalues we must make some assumptions about T. The bounds discussed here are from [13]. Suppose T is positive definite, and write T = D H D where D = diag(t_11^(1/2), ..., t_nn^(1/2)) and H has unit diagonal. Then the computed eigenvalues λ̂_i can differ from the true eigenvalues λ_i by

   |λ̂_i - λ_i| <= p(n) ε κ_2(H) |λ_i|  (approximately)
where p(n) is a modestly growing function of n. Thus if κ_2(H) is moderate, each eigenvalue will be computed to high relative accuracy, no matter how tiny it is. The eigenvectors ẑ_i computed by xPTEQR can differ from the true eigenvectors z_i by at most about

   θ(ẑ_i, z_i) <= p(n) ε κ_2(H) / relgap_i  (approximately)
if p(n) ε is small enough, where relgap_i = min_{j /= i} |λ_i - λ_j| / (λ_i + λ_j) is the relative gap between λ_i and the nearest other eigenvalue. Since the relative gap may be much larger than the absolute gap, this error bound may be much smaller than the previous one.
κ_2(H) could be computed by applying xPTCON (subsection 2.3.1) to H. The relative gaps are easily computed from the array of computed eigenvalues.
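The key point is that κ_2(H) can be tiny even when κ_2(T) is enormous. This NumPy sketch (a hypothetical positive definite tridiagonal matrix, constructed only to make the point; names are ours) forms D and H and compares the two condition numbers:

```python
import numpy as np

# Hypothetical positive definite tridiagonal T with widely varying diagonal.
d = np.array([1e-6, 1.0, 1e6, 4.0])          # diagonal of T
e = np.array([1e-4, 1e-1, 1e2])              # off-diagonal of T
t = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)

dd = np.sqrt(d)                              # D = diag(sqrt(t_ii))
h = t / np.outer(dd, dd)                     # T = D H D, and H has unit diagonal
kappa_h = np.linalg.cond(h, 2)
kappa_t = np.linalg.cond(t, 2)
```

Here κ_2(T) exceeds 10^12 while κ_2(H) is close to 1, so by the bound above every eigenvalue of T, however tiny, is determined to high relative accuracy.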
Jacobi's method [69] [76] [24] is another algorithm for finding eigenvalues and eigenvectors of symmetric matrices. It is slower than the algorithms based on first tridiagonalizing the matrix, but is capable of computing more accurate answers in several important cases. Routines implementing Jacobi's method and corresponding error bounds will be available in a future LAPACK release.
The nonsymmetric eigenvalue problem is more complicated than the symmetric eigenvalue problem. In this subsection, we state the simplest bounds and leave the more complicated ones to subsequent subsections.
Let A be an n-by-n nonsymmetric matrix, with eigenvalues λ_1, ..., λ_n. Let v_i be a right eigenvector corresponding to λ_i: A v_i = λ_i v_i. Let λ̂_i and v̂_i be the corresponding computed eigenvalues and eigenvectors, computed by the expert driver routine xGEEVX (see subsection 2.2.4).
The approximate error bounds for the computed eigenvalues λ̂_i are

   |λ̂_i - λ_i| <= EERRBD(i) .
The approximate error bounds for the computed eigenvectors v̂_i, which bound the acute angles between the computed eigenvectors and true eigenvectors v_i, are

   θ(v̂_i, v_i) <= VERRBD(i) .
These bounds can be computed by the following code fragment:
      EPSMCH = SLAMCH( 'E' )
*     Compute the eigenvalues and eigenvectors of A
*     WR contains the real parts of the eigenvalues
*     WI contains the imaginary parts of the eigenvalues
*     VL contains the left eigenvectors
*     VR contains the right eigenvectors
      CALL SGEEVX( 'P', 'V', 'V', 'B', N, A, LDA, WR, WI,
     $             VL, LDVL, VR, LDVR, ILO, IHI, SCALE, ABNRM,
     $             RCONDE, RCONDV, WORK, LWORK, IWORK, INFO )
      IF( INFO.GT.0 ) THEN
         PRINT *,'SGEEVX did not converge'
      ELSE IF ( N.GT.0 ) THEN
         DO 10 I = 1, N
            EERRBD(I) = EPSMCH*ABNRM/RCONDE(I)
            VERRBD(I) = EPSMCH*ABNRM/RCONDV(I)
   10    CONTINUE
      END IF
For a sample nonsymmetric matrix A, this code fragment produces true eigenvalues, approximate eigenvalues, approximate error bounds, and true errors. (The numerical values are not reproduced here.)
In this subsection, we will summarize all the available error bounds. Later subsections will provide further details. The reader may also refer to [11].
Bounds for individual eigenvalues and eigenvectors are provided by driver xGEEVX (subsection 2.2.4) or computational routine xTRSNA (subsection 2.3.5). Bounds for clusters of eigenvalues and their associated invariant subspace are provided by driver xGEESX (subsection 2.2.4) or computational routine xTRSEN (subsection 2.3.5).
We let λ̂_i be the i-th computed eigenvalue and λ_i an i-th true eigenvalue. Let v̂_i be the corresponding computed right eigenvector, and v_i a true right eigenvector (so A v_i = λ_i v_i). If I is a subset of the integers from 1 to n, we let λ_I denote the average of the selected eigenvalues, λ_I = ( Σ_{i in I} λ_i ) / ( Σ_{i in I} 1 ), and similarly for λ̂_I. We also let S_I denote the subspace spanned by { v_i , i in I }; it is called a right invariant subspace because if v is any vector in S_I then Av is also in S_I. Ŝ_I is the corresponding computed subspace.
The algorithms for the nonsymmetric eigenproblem are normwise backward stable: they compute the exact eigenvalues, eigenvectors and invariant subspaces of slightly perturbed matrices A + E, where ||E|| <= p(n) ε ||A||. Some of the bounds are stated in terms of ||E||_2 and others in terms of ||E||_F; one may use p(n) ε ||A|| to approximate either quantity. The code fragment in the previous subsection approximates ||E|| by ε · ABNRM, where ABNRM = ||A||_1 is returned by xGEEVX.
xGEEVX (or xTRSNA) returns two quantities for each computed eigenvalue-eigenvector pair λ̂_i, v̂_i: s_i and sep_i. xGEESX (or xTRSEN) returns two quantities for a selected subset I of eigenvalues: s_I and sep_I. s_i (or s_I) is a reciprocal condition number for the computed eigenvalue λ̂_i (or the average λ̂_I), and is referred to as RCONDE by xGEEVX (or xGEESX). sep_i (or sep_I) is a reciprocal condition number for the right eigenvector v̂_i (or the invariant subspace Ŝ_I), and is referred to as RCONDV by xGEEVX (or xGEESX). The approximate error bounds for eigenvalues, averages of eigenvalues, eigenvectors, and invariant subspaces provided in Table 4.5 are true for sufficiently small ||E||, which is why they are called asymptotic.
Table 4.5: Asymptotic error bounds for the nonsymmetric eigenproblem
If the problem is ill-conditioned, the asymptotic bounds may only hold for extremely small ||E||. Therefore, in Table 4.6 we also provide global bounds which are guaranteed to hold for all sufficiently small ||E||.
Table 4.6: Global error bounds for the nonsymmetric eigenproblem
assuming ||E|| is small compared with s · sep.
We also have the following bound, which is true for all E: all the λ_i lie in the union of n disks, where the i-th disk is centered at λ̂_i and has radius n ||E|| / s_i. If k of these disks overlap, so that any two points inside the k disks can be connected by a continuous curve lying entirely inside the k disks, and if no larger set of k + 1 disks has this property, then exactly k of the λ_i lie inside the union of these k disks. Figure 4.1 illustrates this for a 10-by-10 matrix, with 4 such overlapping unions of disks, two containing 1 eigenvalue each, one containing 2 eigenvalues, and one containing 6 eigenvalues.
Figure 4.1: Bounding eigenvalues inside overlapping disks
Finally, the quantities s and sep tell us how best to (block) diagonalize a matrix A by a similarity, V^-1 A V = diag(A_11, ..., A_bb), where each diagonal block A_ii has a selected subset of the eigenvalues of A. Such a decomposition may be useful in computing functions of matrices, for example. The goal is to choose a V with a nearly minimum condition number κ_2(V) which performs this decomposition, since this generally minimizes the error in the decomposition. This may be done as follows. Let A_ii be m_i-by-m_i. Then columns 1 + Σ_{j<i} m_j through Σ_{j<=i} m_j of V span the invariant subspace of A corresponding to the eigenvalues of A_ii; these columns should be chosen to be any orthonormal basis of this space (as computed by xGEESX, for example). Let s_i be the value corresponding to the cluster of eigenvalues of A_ii, as computed by xGEESX or xTRSEN. Then κ_2(V) <= b · max_i (1 / s_i), and no other choice of V can make its condition number smaller than max_i (1 / s_i) [17]. Thus choosing orthonormal subblocks of V gets κ_2(V) to within a factor b of its minimum value.
In the case of a real symmetric (or complex Hermitian) matrix, s = 1 and sep is the absolute gap, as defined in subsection 4.7. The bounds in Table 4.5 then reduce to the bounds in subsection 4.7.
There are two preprocessing steps one may perform on a matrix A in order to make its eigenproblem easier. The first is permutation, or reordering the rows and columns to make A more nearly upper triangular (closer to Schur form): A' = P A P^T, where P is a permutation matrix. If A' is permutable to upper triangular form (or close to it), then no floating-point operations (or very few) are needed to reduce it to Schur form. The second is scaling by a diagonal matrix D to make the rows and columns of A' more nearly equal in norm: A'' = D A' D^-1. Scaling can make the matrix norm smaller with respect to the eigenvalues, and so possibly reduce the inaccuracy contributed by roundoff (see Wilkinson and Reinsch [Chap. II/11]). We refer to these two operations as balancing.
Balancing is performed by driver xGEEVX, which calls computational routine xGEBAL. The user may tell xGEEVX to optionally permute, scale, do both, or do neither; this is specified by input parameter BALANC. Permuting has no effect on the condition numbers or their interpretation as described in previous subsections. Scaling, however, does change their interpretation, as we now describe.
The output parameters of xGEEVX - SCALE (real array of length N), ILO (integer), IHI (integer) and ABNRM (real) - describe the result of balancing a matrix A into a balanced matrix A'', where N is the dimension of A. The matrix A'' is block upper triangular, with at most three blocks: from 1 to ILO - 1, from ILO to IHI, and from IHI + 1 to N. The first and last blocks are upper triangular, and so already in Schur form. These are not scaled; only the block from ILO to IHI is scaled. Details of the scaling and permutation are described in SCALE (see the specification of xGEEVX or xGEBAL for details). The one-norm of A'' is returned in ABNRM.
The condition numbers described in earlier subsections are computed for the balanced matrix A'', and so some interpretation is needed to apply them to the eigenvalues and eigenvectors of the original matrix A. To use the bounds for eigenvalues in Tables 4.5 and 4.6, we must replace ||E||_2 and ||E||_F by ε ||A''|| = ε · ABNRM. To use the bounds for eigenvectors, we also need to take into account that the bounds on rotations of eigenvectors are for the eigenvectors x'' of A'', which are related to the eigenvectors x of A by D P x = x'', or x = P^T D^-1 x''. One coarse but simple way to do this is as follows: let θ'' be the bound on rotations of x'' from Table 4.5 or Table 4.6 and let θ be the desired bound on rotation of x. Let

   κ_2(D) = max_i |d_ii| / min_i |d_ii|
be the condition number of D. Then

   θ <= κ_2(D) · θ''  (approximately).
The numerical example in subsection 4.8 does no scaling, just permutation.
LAPACK is designed to give high efficiency on vector processors, high-performance ``super-scalar'' workstations, and shared memory multiprocessors. LAPACK in its present form is less likely to give good performance on other types of parallel architectures (for example, massively parallel SIMD machines, or distributed memory machines), but work has begun to try to adapt LAPACK to these new architectures. LAPACK can also be used satisfactorily on all types of scalar machines (PC's, workstations, mainframes). See Chapter 3 for some examples of the performance achieved by LAPACK routines.
To explain s and sep , we need to introduce the spectral projector P [56] [72], and the separation of two matrices A and B, sep(A , B) [75] [72].
We may assume the matrix A is in Schur form, because reducing it to this form does not change the values of s and sep. Consider a cluster of m >= 1 eigenvalues, counting multiplicities. Further assume the n-by-n matrix A is

   A = [ A_11  A_12 ]
       [ 0     A_22 ]     (4.1)
where the eigenvalues of the m-by-m matrix A_11 are exactly those in which we are interested. In practice, if the eigenvalues on the diagonal of A are in the wrong order, routine xTREXC can be used to put the desired ones in the upper left corner as shown.
We define the spectral projector, or simply projector, P belonging to the eigenvalues of A_11 as

   P = [ I  R ]
       [ 0  0 ]     (4.2)
where R satisfies the system of linear equations

   A_11 R - R A_22 = A_12 .     (4.3)
Equation (4.3) is called a Sylvester equation. Given the Schur form (4.1), we solve equation (4.3) for R using the subroutine xTRSYL.
We can now define s for the eigenvalues of A_11:

   s = 1 / ||P||_2 .
In practice we do not use this expression since ||P||_2 is hard to compute. Instead we use the more easily computed underestimate

   1 / sqrt( 1 + ||R||_F^2 )
which can underestimate the true value of s by no more than a factor sqrt( min(m, n-m) ). This underestimation makes our error bounds more conservative. This approximation of s is called RCONDE in xGEEVX and xGEESX.
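The relationship among R, the projector P, and s can be verified directly. This illustrative SciPy sketch (a hypothetical 6-by-6 matrix already in Schur form, with distinct eigenvalues; a library Sylvester solver stands in for xTRSYL) solves equation (4.3) and compares the exact s with the cheaper underestimate:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(5)
n, m = 6, 2
t = np.triu(rng.standard_normal((n, n)))      # hypothetical matrix in Schur form
np.fill_diagonal(t, np.arange(1.0, n + 1.0))  # distinct eigenvalues 1,...,6
a11, a12, a22 = t[:m, :m], t[:m, m:], t[m:, m:]

# Solve the Sylvester equation A11 R - R A22 = A12 (equation (4.3)).
# scipy's solve_sylvester handles A X + X B = Q, so pass B = -A22.
r = solve_sylvester(a11, -a22, a12)

# Spectral projector P = [[I, R], [0, 0]]; s = 1/||P||_2, with the cheaper
# underestimate 1/sqrt(1 + ||R||_F^2) playing the role of RCONDE.
p = np.block([[np.eye(m), r],
              [np.zeros((n - m, m)), np.zeros((n - m, n - m))]])
s_exact = 1.0 / np.linalg.norm(p, 2)
s_under = 1.0 / np.sqrt(1.0 + np.linalg.norm(r, 'fro')**2)
```

Since ||R||_F >= ||R||_2, the cheap estimate never exceeds the exact s, consistent with its role as a conservative underestimate.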
The separation sep(A_11, A_22) of the matrices A_11 and A_22 is defined as the smallest singular value of the linear map in (4.3) which takes X to A_11 X - X A_22, i.e.,

   sep(A_11, A_22) = min_{X /= 0} ||A_11 X - X A_22||_F / ||X||_F .
This formulation lets us estimate sep(A_11, A_22) using the condition estimator xLACON [52] [51] [48], which estimates the norm of a linear operator T, given the ability to compute Tx and T^T x quickly for arbitrary x. In our case, multiplying an arbitrary vector by T means solving the Sylvester equation (4.3) with an arbitrary right hand side using xTRSYL, and multiplying by T^T means solving the same equation with A_11 replaced by A_11^T and A_22 replaced by A_22^T. Solving either equation costs at most O(n^3) operations, or as few as O(n^2) if m << n. Since the true value of sep is σ_min(T) but we estimate a 1-norm quantity in its place, our estimate of sep may differ from the true value by as much as a factor sqrt( m(n-m) ). This approximation to sep is called RCONDV by xGEEVX and xGEESX.
Another formulation which in principle permits an exact evaluation of sep(A_11, A_22) is

   sep(A_11, A_22) = σ_min( I ⊗ A_11 - A_22^T ⊗ I )
where X ⊗ Y is the Kronecker product of X and Y. This method is generally impractical, however, because the matrix whose smallest singular value we need is m(n-m) dimensional, which can be as large as n^2/4. Thus we would require as much as n^4/16 extra workspace and far more operations, much more than the estimation method of the last paragraph.
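Although impractical at scale, the Kronecker formulation is affordable for tiny blocks and makes a good cross-check. This NumPy sketch (hypothetical small A_11 and A_22, with A_22 shifted to keep the spectra apart; names are ours) computes sep exactly and verifies it against the variational definition:

```python
import numpy as np

rng = np.random.default_rng(6)
m, k = 3, 4                                   # A11 is m-by-m, A22 is k-by-k (k = n-m)
a11 = rng.standard_normal((m, m))
a22 = rng.standard_normal((k, k)) + 5.0 * np.eye(k)   # shift keeps spectra apart

# T = I kron A11 - A22^T kron I maps vec(X) to vec(A11 X - X A22)
t = np.kron(np.eye(k), a11) - np.kron(a22.T, np.eye(m))
sep = np.linalg.svd(t, compute_uv=False)[-1]  # smallest singular value
```

By definition, sep is a lower bound on ||A_11 X - X A_22||_F / ||X||_F for every nonzero X, which the test below checks with a random X.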
The expression sep(A_11, A_22) measures the ``separation'' of the spectra of A_11 and A_22 in the following sense. It is zero if and only if A_11 and A_22 have a common eigenvalue, and small if there is a small perturbation of either one that makes them have a common eigenvalue. If A_11 and A_22 are both Hermitian matrices, then sep(A_11, A_22) is just the gap, or minimum distance, between an eigenvalue of A_11 and an eigenvalue of A_22. On the other hand, if A_11 and A_22 are non-Hermitian, sep(A_11, A_22) may be much smaller than this gap.
The singular value decomposition (SVD) of a real m-by-n matrix A is defined as follows. Let r = min(m, n). Then the SVD of A is A = U Σ V^T (A = U Σ V^H in the complex case), where U and V are orthogonal (unitary) matrices and Σ = diag(σ_1, ..., σ_r) is diagonal, with σ_1 >= σ_2 >= ... >= σ_r >= 0. The σ_i are the singular values of A, and the leading r columns u_i of U and v_i of V are the left and right singular vectors, respectively. The SVD of a general matrix is computed by xGESVD (see subsection 2.2.4).
The approximate error bounds for the computed singular values σ̂_i are

   |σ̂_i - σ_i| <= SERRBD .
The approximate error bounds for the computed singular vectors û_i and v̂_i, which bound the acute angles between the computed singular vectors and true singular vectors u_i and v_i, are

   θ(û_i, u_i) <= UERRBD(i)
   θ(v̂_i, v_i) <= VERRBD(i) .
These bounds can be computed by the following code fragment.
      EPSMCH = SLAMCH( 'E' )
*     Compute singular value decomposition of A
*     The singular values are returned in S
*     The left singular vectors are returned in U
*     The transposed right singular vectors are returned in VT
      CALL SGESVD( 'S', 'S', M, N, A, LDA, S, U, LDU, VT,
     $             LDVT, WORK, LWORK, INFO )
      IF( INFO.GT.0 ) THEN
         PRINT *,'SGESVD did not converge'
      ELSE IF ( MIN( M, N ).GT.0 ) THEN
         SERRBD = EPSMCH * S(1)
*        Compute reciprocal condition numbers for singular
*        vectors
         CALL SDISNA( 'Left', M, N, S, RCONDU, INFO )
         CALL SDISNA( 'Right', M, N, S, RCONDV, INFO )
         DO 10 I = 1, MIN( M, N )
            VERRBD( I ) = EPSMCH*( S(1)/RCONDV( I ) )
            UERRBD( I ) = EPSMCH*( S(1)/RCONDU( I ) )
   10    CONTINUE
      END IF
For a sample matrix A, this code fragment produces singular values, approximate error bounds, and true errors. (The numerical values are not reproduced here.)
The usual error analysis of the SVD algorithm xGESVD in LAPACK (see subsection 2.2.4) or the routines in LINPACK and EISPACK is as follows [45]:
The SVD algorithm is backward stable. This means that the computed SVD, Û Σ̂ V̂^T, is nearly the exact SVD of A + E where ||E||_2 / ||A||_2 <= p(m,n) ε, and p(m,n) is a modestly growing function of m and n. This means A + E = (Û + δÛ) Σ̂ (V̂ + δV̂)^T is the true SVD, so that Û + δÛ and V̂ + δV̂ are both orthogonal, where ||δÛ||_2 <= p(m,n) ε and ||δV̂||_2 <= p(m,n) ε. Each computed singular value σ̂_i differs from a true σ_i by at most

   |σ̂_i - σ_i| <= p(m,n) ε σ_1
(we take p(m,n) = 1 in the code fragment). Thus large singular values (those near σ_1 = ||A||_2) are computed to high relative accuracy and small ones may not be.
The angular difference between the computed left singular vector û_i and a true u_i satisfies the approximate bound

   θ(û_i, u_i) <= p(m,n) ε ||A||_2 / gap_i  (approximately)
where gap_i = min_{j /= i} |σ_i - σ_j| is the absolute gap between σ_i and the nearest other singular value. We take p(m,n) = 1 in the code fragment. Thus, if σ_i is close to other singular values, its corresponding singular vector û_i may be inaccurate. When n > m, the gap for the smallest singular value must be redefined to include its distance to zero as well. The gaps may be easily computed from the array of computed singular values using subroutine SDISNA. The gaps computed by SDISNA are ensured not to be so small as to cause overflow when used as divisors. The same bound applies to the computed right singular vector v̂_i and a true vector v_i.
Let Ŝ be the space spanned by a collection of computed left singular vectors { û_i , i in I }, where I is a subset of the integers from 1 to n. Let S be the corresponding true space. Then

   θ(Ŝ, S) <= p(m,n) ε ||A||_2 / gap_I  (approximately)
where

   gap_I = min { |σ_i - σ_j| : i in I, j not in I }
is the absolute gap between the singular values in I and the nearest other singular value. Thus, a cluster of close singular values which is far away from any other singular value may have a well determined space even if its individual singular vectors are ill-conditioned. The same bound applies to a set of right singular vectors { v̂_i , i in I }.
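As with the symmetric eigenproblem, the singular value and vector bounds follow directly from the computed singular values. This NumPy sketch (illustrative, random data; the tiny-gap floor mimics, but is not identical to, SDISNA's overflow guard) mirrors the Fortran fragment in the previous subsection:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 8, 5
a = rng.standard_normal((m, n))
u, s, vt = np.linalg.svd(a, full_matrices=False)   # descending s, as from SGESVD

eps = np.finfo(float).eps
serrbd = eps * s[0]                     # bound on each |sigma_hat_i - sigma_i|
# gap_i = distance from s[i] to the nearest other singular value
gaps = np.array([np.min(np.abs(np.delete(s, i) - s[i])) for i in range(n)])
uerrbd = eps * s[0] / np.maximum(gaps, serrbd)     # floor tiny gaps to avoid overflow
```

Singular vectors belonging to well-separated singular values get bounds of order ε, while clustered ones get proportionally weaker bounds.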
In the special case of bidiagonal matrices, the singular values and singular vectors may be computed much more accu