.ds TH "\h'-.50m'\v'-.40m'^\v'.40m'\h'.50m' .ds TD "\h'-.65m'\v'-.40m'~\v'.40m' .ds CT "\(mu\h'-.66m'\(ci .ds NI "\(mo\h'-.65m'/ .nr EP 1 .nr VS 14 .EQ gsize 11 delim @@ ... define | '^\(br^' ... define || '\(br ^ \(br ' define notin '\(mo back 30 / ' define times '\(mu' define ctimes '\(mu back 80 \(ci ^ ' define hhat ' up 5 hat down 5 ' define bl '\0' .EN .TL A Fully Parallel Algorithm for the Symmetric Eigenvalue Problem .AU .ps 11 .in 0 J.J. Dongarra and D. C. Sorensen .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' 9700 South Cass Avenue\h'.20i' Argonne, Illinois 60439\h'.20i' .nr PS 11 .ps 11 .nr VS 16 .vs 16 .nr PD 0.5v .FS Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under Contracts W-31-109-Eng-38, DE-AC05-84OR21400 and DE-FG02-85ER25001. Presented at the Second SIAM Conference on Vector and Parallel Processing in Scientific Computing, Norfolk, Virginia, November 20, 1985. .FE .ps 10 .AB .PP In this paper we present a parallel algorithm for the symmetric algebraic eigenvalue problem. The algorithm is based upon a divide and conquer scheme suggested by Cuppen for computing the eigensystem of a symmetric tridiagonal matrix. We extend this idea to obtain a parallel algorithm that retains a number of active parallel processes that is greater than or equal to the initial number throughout the course of the computation. We give a new deflation technique which, together with a robust root finding technique, will assure computation of an eigensystem to full accuracy in the residuals and in the orthogonality of eigenvectors. A brief analysis of the numerical properties and sensitivity to roundoff error is presented to indicate where numerical difficulties may occur. The algorithm is able to exploit parallelism at all levels of the computation and is well suited to a variety of architectures. .PP Computational results are presented for several machines. These results are very encouraging with respect to both accuracy and speedup. A surprising result is that the parallel algorithm, even when run in serial mode, can be significantly faster than the previously best sequential algorithm on large problems, and is effective even on moderate size problems. .NH Introduction .PP The symmetric eigenvalue problem is one of the most fundamental problems of computational mathematics. It arises in many applications and therefore represents an important area for algorithmic research. The problem has received considerable attention in the literature and was probably the first algebraic eigenvalue problem for which reliable methods were obtained. It would be surprising, therefore, if a new method were to be found that would offer a significant improvement in execution time over the fundamental algorithms available in standard software packages such as EISPACK [11]. However, it is reasonable to expect that eigenvalue calculations might be accelerated through the use of parallel algorithms for the parallel computers that are becoming available. We shall present such an algorithm in this paper. The algorithm is able to exploit parallelism at all levels of the computation and is well suited to a variety of architectures.
However, the surprising result is that the parallel algorithm, even when run in serial mode, is significantly faster than the previously best sequential algorithm on large problems, and is effective even on moderate size (order @>=@ 30) problems. .PP The problem we consider is the following: Given a real @n times n @ symmetric matrix @A@, find all of the eigenvalues and corresponding eigenvectors of @A@. It is well known [12] that under these assumptions .EQ (1.1) A ~=~ Q D Q sup T ~,~~~ roman { with } ~ Q sup T Q ~=~ I ~, .EN so that the columns of the matrix @Q@ are the orthonormal eigenvectors of @A@ and @D ~=~ diag( delta sub 1 ^,^ delta sub 2 ^,^...,^ delta sub n )@ is the diagonal matrix of eigenvalues. The standard algorithm for computing this decomposition is first to use a finite algorithm to reduce @A@ to tridiagonal form using a sequence of Householder transformations, and then to apply a version of the @QR@-algorithm to obtain all the eigenvalues and eigenvectors of the tridiagonal matrix [12]. The primary purpose of this paper is to describe a method for parallelizing the computation of the eigensystem of the tridiagonal matrix. However, the method we present is intended to be used in conjunction with the initial reduction to tridiagonal form to compute the complete eigensystem of the original matrix @A@. We, therefore, briefly touch upon the issues involved in parallelizing this initial reduction and suggest a way to combine the parallel initial reduction with the tridiagonal algorithm to obtain a fully parallel algorithm for the symmetric eigenvalue problem. .PP The method is based upon a divide and conquer algorithm suggested by Cuppen [3]. A fundamental tool used to implement this algorithm is a method that was developed by Bunch, Nielsen, and Sorensen [2] for updating the eigensystem of a symmetric matrix after modification by a rank-one change. This rank-one updating method was inspired by some earlier work of Golub [4] on modified eigenvalue problems. The basic idea of the new method is to use rank-one modifications to tear out selected off-diagonal elements of the tridiagonal problem in order to introduce a number of independent subproblems of smaller size. The subproblems are solved at the lowest level using the subroutine TQL2 from EISPACK, and then the results of these problems are successively glued together using the rank-one modification routine SESUPD that we have developed based upon the ideas presented in [2]. .PP In the following discussion we describe the partitioning of the tridiagonal problem into smaller problems by rank-one tearing. Then we describe the numerical algorithm for gluing the results back together. The organization of the parallel algorithm is laid out, and finally some computational results are presented. Throughout this paper we adhere to the convention that capital Roman letters represent matrices, lower case Roman letters represent column vectors, and lower case Greek letters represent scalars. A superscript @T@ denotes transpose. All matrices and vectors are real, but the results are easily extended to matrices over the complex field. .NH Partitioning by Rank-One Tearing .PP The crux of the algorithm is to divide a given problem into two smaller subproblems.
To do this, we consider the symmetric tridiagonal matrix .EQ (2.1) T mark ~=~ left ( ~ ~ matrix { ccol { T sub 1 above beta e sub 1 e sub k sup T } ccol { beta e sub k e sub 1 sup T above T sub 2 } } ~ ~ right ) .EN .EQ lineup ~=~ left ( ~ matrix { ccol { T hat sub 1 above 0 } ccol { 0 above T hat sub 2 }} ~ right ) ~+~ theta beta left ( ~ matrix { ccol { e sub k above theta sup -1 e sub 1} } ~ right ) ( e sub k sup T ~,~ theta sup -1 e sub 1 sup T ) .EN where @ 1 ~<=~ k ~<~ n @ and @ e sub j @ represents the @j-th@ unit vector of appropriate dimension. The @k-th@ diagonal element of @ T sub 1@ has been modified to give @T hat sub 1 @ and the first diagonal element of @ T sub 2@ has been modified to give @ T hat sub 2 @. Potential numerical difficulties associated with cancellation may be avoided through the appropriate choice of @theta@. If the diagonal entries to be modified are of the same sign then @ theta ~= ~+-~ 1@ is chosen so that @- theta beta @ has this sign and cancellation is avoided. If the two diagonal entries are of opposite sign, then the sign of @theta@ is chosen so that @- theta beta@ has the same sign as one of the elements and the magnitude of @theta@ is chosen to avoid severe loss of significant digits when @theta sup -1 beta@ is subtracted from the other. This is perhaps a minor detail, but it does allow the partitioning to be selected solely on the basis of position and without regard to numerical considerations. .PP Now we have two smaller tridiagonal eigenvalue problems to solve. According to equation (1.1) we compute the two eigensystems .EQ T hat sub 1 ~=~ Q sub 1 D sub 1 Q sub 1 sup T ~,~~~~ T hat sub 2 ~=~ Q sub 2 D sub 2 Q sub 2 sup T ~. .EN This gives .EQ (2.2) T mark ~=~ left ( ~ matrix { ccol { Q sub 1 D sub 1 Q sub 1 sup T above 0 } ccol { 0 above Q sub 2 D sub 2 Q sub 2 sup T } } ~ right ) ~ +~ theta beta left ( ~ matrix { ccol { e sub k above theta sup -1 e sub 1} } ~ right ) ( e sub k sup T ~,~ theta sup -1 e sub 1 sup T ) .EN .EQ lineup ~=~ left ( ~ matrix { ccol { Q sub 1 above 0 } ccol { 0 above Q sub 2 } } ~ right ) left (~ left ( ~ matrix { ccol { D sub 1 above 0 } ccol { 0 above D sub 2 } } ~ right ) +~ theta beta left ( ~ matrix { ccol { q sub 1 above theta sup -1 q sub 2} } ~ right ) ( q sub 1 sup T ~,~ theta sup -1 q sub 2 sup T ) ~ ~ right ) left ( ~ matrix { ccol { Q sub 1 sup T above 0 } ccol { 0 above Q sub 2 sup T } } ~ right ) .EN where @ q sub 1 ~=~ Q sub 1 sup T e sub k@ and @ q sub 2 ~=~ Q sub 2 sup T e sub 1@. The problem at hand now is to compute the eigensystem of the interior matrix in equation (2.2). A numerical method for solving this problem has been provided in [2] and we shall discuss this method in the next section. .PP It should be fairly obvious how to proceed from here to exploit parallelism. One simply repeats the tearing on each of the two halves recursively until the original problem has been divided into the desired number of subproblems, and then the rank-one modification routine may be applied from the bottom up to glue the results together again.
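.PP To make the tearing concrete, the following short sketch carries out one tear and checks equation (2.1) numerically. It is written in Python with NumPy purely for compactness (the implementation discussed in this paper is Fortran); the function name and the random test matrix are illustrative only, and only the equal-sign rule for choosing @theta@ is implemented.
.KS
.nf
.ps 9
# Minimal numerical check of the rank-one tearing (2.1).
# Illustrative sketch; names are not part of the actual code.
import numpy as np

def tear(a, b, k):
    # a: diagonal of T (length n); b: off-diagonal (length n-1);
    # tear between rows k and k+1 (0-based indexing).
    sgn = lambda x: 1.0 if x >= 0.0 else -1.0
    beta = b[k]
    # equal-sign rule from the text: make -theta*beta agree in sign
    # with the entries being modified (opposite-sign case omitted)
    theta = -sgn(beta) * sgn(a[k])
    a1 = a[:k + 1].copy(); a1[-1] -= theta * beta   # diagonal of T1^
    a2 = a[k + 1:].copy(); a2[0] -= beta / theta    # diagonal of T2^
    return a1, a2, theta, beta

rng = np.random.default_rng(0)
n, k = 8, 3
a, b = rng.standard_normal(n), rng.standard_normal(n - 1)
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
a1, a2, theta, beta = tear(a, b, k)
off = np.r_[b[:k], 0.0, b[k + 1:]]                  # coupling removed
That = np.diag(np.r_[a1, a2]) + np.diag(off, 1) + np.diag(off, -1)
v = np.zeros(n); v[k] = 1.0; v[k + 1] = 1.0 / theta # (e_k, 1/theta e_1)
assert np.allclose(T, That + theta * beta * np.outer(v, v))
.ps 11
.fi
.KE
Each half may then be handed to an independent eigensolver, exactly as described above.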
.NH The Updating Problem .PP The general problem we are required to solve is that of computing the eigensystem of a matrix of the form .EQ (3.1) Q hat D hat Q hat sup T ~=~ D ~+~ rho zz sup T .EN where @ D @ is a real @n times n@ diagonal matrix, @rho@ is a scalar, and @z@ is a real vector of order @n@. It is assumed without loss of generality that @z@ has Euclidean norm 1. .PP We seek a formula for an eigen-pair for the matrix on the right hand side of (3.1). If @ q ^,^ lambda @ is such an eigen-pair then .EQ ( D ~+~ rho z z sup T ) q ~=~ lambda q .EN and a simple rearrangement of terms gives .EQ (D ~-~ lambda I )q ~=~ - rho ( z sup T q ) z ~. .EN Multiplying on the left by @ z sup T ( D ~-~ lambda I ) sup -1 @ gives .sp .EQ z sup T q ~=~ -^ rho (z sup T q ) z sup T (D ~-~ lambda I ) sup -1 z ~, .EN and hence .EQ 1 ~+~ rho z sup T (D ~-~ lambda I ) sup -1 z ~=~ 0 .EN must be satisfied by @lambda@. We have tacitly assumed that @z sup T q ~!=~ 0 @; if @ z sup T q ~=~ 0 @ then @Dq ~=~ lambda q@, hence @lambda ~=~ delta sub i @ and @q ~=~ e sub i @ for some @i@. In particular, if @ 0 ~=~ zeta sub i ~==~ e sub i sup T z @ then @ q ~=~ e sub i @ is an eigenvector of @ D ~+~ rho z z sup T @. We have just shown that if @D ~=~ diag( delta sub 1 ^,^ delta sub 2 ^,^ ... , delta sub n ^ ) @ with @ delta sub 1 ^<^ delta sub 2 ^<^ ... <^ delta sub n @ and no component @zeta sub i@ of the vector @z@ is zero, then the updated eigenvalues @ delta \*(TH sub i @ are roots of the equation .EQ (3.2) f ( lambda ) ~==~ 1 ~+~ rho sum from {j = 1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ lambda } ~=~ 0 . .EN Golub [4] refers to this as the secular equation and the behavior of its roots is pictorially described by the following graph: .KS .sp 20 .ce @Figure@ 1. The Secular Equation .KE .sp .PP Moreover, as shown in [2] the eigenvectors (i.e. the columns of @Q hat@ in (3.1)) are given by the formula .EQ (3.3) q hat sub i ~=~ gamma sub i DELTA sub i sup -1 z .EN with @gamma sub i @ chosen to make @ || ^ q hat sub i ^ || ~=~ 1 @, and with @ DELTA sub i ~=~ diag ( delta sub 1 - delta \*(TH sub i ^,^ delta sub 2 - delta \*(TH sub i ^,^...^,^ delta sub n - delta \*(TH sub i ~)@. Due to this structure, an excellent numerical method may be devised to find the roots of the secular equation and as a by-product to compute the eigenvectors to full accuracy. It must be stressed, however, that great care must be exercised in the numerical method used to solve the secular equation and to construct the eigenvectors from formula (3.3). .PP In the following discussion we assume that @rho ~>~ 0@ in (3.2). A simple change of variables may always be used to achieve this, so there is no loss of generality. The method we shall describe was inspired by the work of More@prime@ [8] and Reinsch [9,10], and relies on the use of simple rational approximations to construct an iterative method for the solution of equation (3.2). Given that we wish to find the @i-th@ root @delta \*(TH sub i@ of the function @f@ in (3.2) we may write this function as .EQ f( lambda ) ~=~ 1 ~+~ phi ^ ( lambda ) ~+~ psi ^ ( lambda ) .EN where .EQ psi ^ ( lambda ) ~==~ rho sum from {j ^=^ 1 } to i { zeta sub j sup 2 } over { delta sub j ~-~ lambda } .EN and .EQ phi ^ ( lambda ) ~==~ rho sum from {j ^=^ i+1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ lambda }. .EN .sp From the graph in Figure 1 it is seen that the root @ delta \*(TH sub i @ lies in the open interval @ ( delta sub i ^,^ delta sub i+1 )@ and for @lambda@ in this interval all of the terms of @psi@ are negative and all of the terms of @phi@ are positive.
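.PP Since Figure 1 conveys only the qualitative picture, the following fragment (a sketch in Python with NumPy; the data are invented for illustration) tabulates @f@ near its poles and confirms that exactly one root lies in each interval @( delta sub i ^,^ delta sub i+1 )@, with the last root in @( delta sub n ^,^ delta sub n + rho )@.
.KS
.nf
.ps 9
# Exhibit the interlacing of the roots of the secular function (3.2).
# Sketch only; d, z, and rho are arbitrary illustrative data.
import numpy as np

d = np.array([1.0, 2.0, 3.0])             # distinct diagonal of D
z = np.array([0.60, 0.48, 0.64])          # unit-norm weight vector
rho = 0.75
f = lambda lam: 1.0 + rho * np.sum(z**2 / (d - lam))

left = d + 1e-9                           # just right of each pole
right = np.r_[d[1:] - 1e-9, d[-1] + rho]  # just left of the next pole
for lo, up in zip(left, right):
    assert f(lo) < 0.0 <= f(up)           # sign change: one root inside
.ps 11
.fi
.KE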
We may derive an iterative method for solving the equation .EQ - psi ^ ( lambda ) ~=~ 1 ~+~ phi ^ ( lambda ) .EN by starting with an initial guess @lambda sub 0@ in the appropriate interval and then constructing simple rational interpolants of the form .EQ p over {q ~-~ lambda } ~,~ ~ r ~+~ s over { delta ~-~ lambda } .EN where @delta@ is fixed at @ delta sub i+1 @, and the parameters @p,~q,~r,~s@ are defined by the interpolation conditions .EQ (3.4) p over {q ~-~ lambda sub 0 } ~=~ psi ^ ( lambda sub 0 ) ~,~ ~ r ~+~ s over { delta ~-~ lambda sub 0 } ~=~ phi ^ ( lambda sub 0 ) .EN .EQ p over { ( q ~-~ lambda sub 0 ) sup 2 } ~=~ psi ^ prime ( lambda sub 0 ) ~,~ ~ s over { ( delta ~-~ lambda sub 0 ) sup 2 } ~=~ phi ^ prime ( lambda sub 0 )~. .EN The new approximation @ lambda sub 1 @ to the root @ delta \*(TH sub i @ is then found by solving .EQ (3.5) {-^ p} over {q ~-~ lambda } ~=~ 1 ~+~ r ~+~ s over { delta ~-~ lambda }~. .EN It is possible to construct an initial guess which lies in the open interval (@ delta sub i ^,^ delta \*(TH sub i @). A sequence of iterates {@ lambda sub k @} may then be constructed as we have just described, with @lambda sub k+1@ being derived from @lambda sub k @ as @lambda sub 1@ was derived from @ lambda sub 0@ above. The following theorem, proved in [2], shows that this iteration converges quadratically from one side of the root and does not need any safeguarding. .sp2 @THEOREM@ (3.6) .I Let @rho ~>~ 0 @ in @(3.2)@. If the initial iterate @ lambda sub 0 @ lies in the open interval @(^ delta sub i ^,^ delta \*(TH sub i ^)@ then the sequence of iterates { @ lambda sub k @ } constructed in equations @(3.4)-(3.5)@ is well defined and satisfies @ lambda sub k ~<~ lambda sub k+1 ~<~ delta \*(TH sub i @ for all @k ~>=~ 0 @. Moreover, the sequence converges quadratically to the root @ delta \*(TH sub i @. .R .PP In our implementation of this scheme, equation (3.5) is cast in such a way that we solve for the iterative correction @ tau ~=~ lambda sub 1 - lambda sub 0 @. The quantities @ delta sub j ~-~ lambda sub k@ which are used later in the eigenvector calculations are maintained, and these iterative corrections are applied directly to them as well as to the eigenvalue approximation. Cancellation in the computation of these differences is thus avoided because the corrections become smaller and smaller and are eventually applied to the lowest order bits. These values are then used directly in the calculation of the updated eigenvectors to obtain the highest possible accuracy. The rapid convergence of the iterative method allows the specification of very stringent convergence criteria that will ensure a relative residual and orthogonality of eigenvectors to full machine accuracy. The stopping criterion used is .EQ (i)~ | f( lambda ) | ~~<=~~ eta ^ max( | delta sub 1 | ^,^ | delta sub n | ) ~~ and ~~ (ii) | tau | ~~<=~~ eta ^ min ( | delta sub i ~-~ lambda | ^,^ | delta sub i+1 ~-~ lambda | ) , .EN where @ lambda @ is the current iterate and @ tau @ is the last iterative correction that was computed. The condition on @tau@ is very stringent. The purpose of such a stringent stopping criterion will be clarified in the next section when we discuss orthogonality of computed eigenvectors. Let it suffice at this point to say that condition @(i)@ assures a small residual and that @(ii)@ assures orthogonal eigenvectors. .PP In most problems these criteria are easily satisfied.
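.PP The structure of the whole updating step may be summarized in the following sketch (Python with NumPy; all names are ours and illustrative). To keep the sketch short, a bisection search replaces the quadratically convergent iteration (3.4)-(3.5) used in the actual routine SESUPD, and exact bracketing plays the role of the stopping tests above; the sketch assumes the @delta sub i@ are distinct and no @zeta sub i@ vanishes, which the deflation of the next section enforces.
.KS
.nf
.ps 9
# Sketch of the updating problem (3.1): roots of the secular equation
# (3.2) and eigenvectors via (3.3).  Bisection stands in for the
# rational iteration (3.4)-(3.5) of the text.
import numpy as np

def secular_eig(d, z, rho, iters=100):
    # d strictly increasing; z unit norm, no zero entries; rho > 0
    n = len(d)
    f = lambda lam: 1.0 + rho * np.sum(z**2 / (d - lam))
    upper = np.r_[d[1:], d[-1] + rho]   # i-th root in (d[i], upper[i])
    lam = np.empty(n)
    for i in range(n):
        lo, up = d[i], upper[i]         # f < 0 near lo, f >= 0 at up
        for _ in range(iters):
            mid = 0.5 * (lo + up)
            lo, up = (mid, up) if f(mid) < 0.0 else (lo, mid)
        lam[i] = 0.5 * (lo + up)
    Q = z[:, None] / (d[:, None] - lam[None, :])  # columns Delta_i^{-1} z
    Q /= np.linalg.norm(Q, axis=0)                # gamma_i normalization
    return lam, Q

rng = np.random.default_rng(1)
d = np.sort(rng.standard_normal(6)); rho = 0.5
z = rng.standard_normal(6); z /= np.linalg.norm(z)
lam, Q = secular_eig(d, z, rho)
assert np.allclose(Q.T @ Q, np.eye(6), atol=1e-8)   # orthogonality
assert np.allclose(Q @ np.diag(lam) @ Q.T,
                   np.diag(d) + rho * np.outer(z, z), atol=1e-8)
.ps 11
.fi
.KE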
However, there are pathological situations involving nearly equal eigenvalues that make @(ii)@ very difficult to satisfy. It is here that the basic root finding method described above must be modified. The problem stems from the fact that the iterative corrections cease to modify the value of @f@ due to roundoff error. When working in single precision, this situation can be rectified through the use of extended precision accumulation of inner products in the evaluation of @f@ and its derivative. However, this becomes less attractive when working in 64-bit arithmetic as is done in most scientific calculations. Whether or not we shall be able to provide a root finder that will satisfy these stringent requirements in 64-bit arithmetic for very pathological cases remains to be seen. .PP This has been a brief description of the rank-one updating scheme. Full theoretical details are available in [2]. This calculation represents the workhorse of the parallel algorithm. The method seems to be very robust in practice and exhibits high accuracy. It does not seem to suffer the effects of nearly equal roots as Cuppen suggests [3], but instead has been able to solve such ill-conditioned problems as the Wilkinson matrices @W sub 2k+1 sup + @ [13, p. 308] to full machine precision and with slightly better residual and orthogonality properties than the standard algorithm TQL2 from EISPACK. In the next section we offer some reasons for this behavior. .NH Deflation and Orthogonality of Eigenvectors .PP At the outset of this discussion we made the assumption that the diagonal elements of @D@ were distinct and that no component of the vector @z@ was zero. These conditions are not satisfied in general, so deflation techniques must be employed to ensure their satisfaction. A deflation technique was suggested in [2] to provide distinct eigenvalues, which amounts to rotating the basis for the eigenspace corresponding to a multiple eigenvalue so that only one component of the vector @ z @ corresponding to this space is nonzero in the new basis. Those terms in (3.2) corresponding to zero components of @z@ may simply be dropped. The eigenvalues and eigenvectors corresponding to these zero components remain static. In finite precision arithmetic the situation becomes more interesting. Terms corresponding to small components of @z@ may be dropped. This can have a very dramatic effect upon the amount of work required in our parallel method. As first observed by Cuppen [3], there can be significant deflation in the updating process as the original matrix is rebuilt from the subproblems. .PP Deflation can occur in two ways, either through small components of @z@ or through nearly equal eigenvalues. We shall refine these notions in this section to include "nearly zero" components of @z@ and "nearly equal" eigenvalues. Analysis of the first situation is straightforward. The second situation can be quite delicate in certain pathological cases, however, so we shall discuss it here in some detail. It turns out that the case of nearly equal eigenvalues may be reduced to the case of small components of @z@. .PP It is straightforward to see that when @z sup T e sub i ~=~ 0@ the @i-th @ eigenvalue and corresponding eigenvector of @D@ will be an eigen-pair for @ D ~+~ rho z z sup T @. Let us ask the question "when is an eigen-pair of @D@ a good approximation to an eigen-pair for the modified matrix?" This question is easily answered.
Recall that @ || ^ z ^ || ~=~ 1 @, so .EQ || ^ ( D ~+~ rho z z sup T ) e sub i ~-~ delta sub i e sub i ^ || ~=~ | rho zeta sub i |^ || ^ z ^ || ^ ~=~ | rho zeta sub i | . .EN Thus we may accept this eigen-pair as an eigen-pair for the modified matrix whenever .EQ | rho zeta sub i | ~<=~ tol .EN where @tol ~>~ 0 @ is our error tolerance. Typically @ tol ~=~ macheps ^ ( ^ eta ^ || A || ^ ) ^ @, where @macheps@ is machine precision and @eta@ is a constant of order unity. However, at each stage of the updating process we may use .EQ tol ~=~ macheps ^ eta ^ ( ^ max ( | delta sub 1 | ^,^ | delta sub n | ) ~+~ | rho | ^) .EN since at every stage this will represent a bound on the spectral radius of a principal submatrix of @A@. Let us now suppose there are two eigenvalues of @D@ separated by @epsilon@ so that @epsilon ~=~ delta sub i+1 ~-~ delta sub i @. Consider the 2 @times@ 2 submatrix .EQ left (^ matrix { ccol { delta sub i above bl} ccol {bl above delta sub i+1 } } ^ right ) ~ ~+~ rho left ( ~ matrix { ccol { zeta sub i above zeta sub i+1} } ~ right ) ( zeta sub i ~,~ zeta sub i+1 ) .EN of @D ~+~ rho z z sup T @ and let us construct a Givens transformation to introduce a zero in one of the two corresponding components in @z@. Then .EQ left (^ matrix { ccol { gamma above - sigma} ccol {sigma above gamma } } ^ right ) ~ left ( ^ left (^ matrix { ccol { delta sub i above bl} ccol {bl above delta sub i+1 } } ^ right ) ~ ~+~ rho left ( ~ matrix { ccol { zeta sub i above zeta sub i+1} } ~ right ) ( zeta sub i ~,~ zeta sub i+1 ) ^ right ) ~ left (^ matrix { ccol { gamma above sigma} ccol {- sigma above gamma } } ^ right ) ~ .EN .EQ =~ left (^ matrix { ccol { delta \*(TH sub i above bl} ccol {bl above delta \*(TH sub i+1 } } ^ right ) ~ ~+~ rho left ( ~ matrix { ccol { tau above 0} } ~ right ) ( tau ~,~ 0 ) ~+~ epsilon ^ sigma ^ gamma ^ left (^ matrix { ccol { 0 above 1} ccol { 1 above 0 } } ^ right ) ~ .EN where @gamma sup 2 ~+~ sigma sup 2 ~=~ 1@, @delta \*(TH sub i ~=~ delta sub i gamma sup 2 ~+~ delta sub i+1 sigma sup 2 ~@, @delta \*(TH sub i+1 ~=~ delta sub i sigma sup 2 ~+~ delta sub i+1 gamma sup 2 ~@, and @ tau sup 2 ~=~ zeta sub i sup 2 + zeta sub i+1 sup 2 @. Now, if we put @G sub i @ equal to the @n times n@ Givens transformation constructed from the identity by replacing the appropriate diagonal block with the @2 times 2@ rotation just constructed, we have .sp .EQ (4.1) G sub i (^ D ~+~ rho z z sup T ^ ) G sub i sup T ~=~ D hat ~+~ rho z hat { z hat } sup T ~+~~ E sub i .EN .sp with @e sub i sup T z hat ~=~ tau @ and @e sub i+1 sup T z hat ~=~ 0 @, @ e sub i sup T D hat e sub i ~=~ delta \*(TH sub i @, @ e sub i+1 sup T D hat e sub i+1 ~=~ delta \*(TH sub i+1 @, and .EQ || ^ E sub i ^ || ~=~ | epsilon ^ sigma ^ gamma | ~~. .EN If we choose, instead, to zero out the @i-th@ component of @z@ then we simply apply @G sub i @ on the right and its transpose on the left in equation (4.1) to obtain a similar result; the only difference is that the sign of the matrix @E sub i @ is reversed. Of course this deflation is only done when .EQ | epsilon ^ gamma ^ sigma | ~<=~ tol .EN is satisfied. .PP The result of applying all of the deflations is to replace the updating problem (3.1) with one of smaller size. When appropriate, this is accomplished by applying similarity transformations consisting of several Givens transformations.
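.PP A small numerical illustration may be helpful before stating the general result. The sketch below (Python with NumPy; the function name and data are ours) performs one such deflating rotation and confirms that the discarded off-diagonal perturbation is exactly @epsilon sigma gamma@; the simpler test @| rho zeta sub i | ~<=~ tol@ for dropping a small component outright is noted in a comment.
.KS
.nf
.ps 9
# One deflating Givens rotation for a pair of close eigenvalues,
# following (4.1).  Sketch only; names are not from the actual code.
# (A component with |rho * zi| <= tol would simply be dropped first.)
import numpy as np

def deflate_pair(di, di1, zi, zi1, tol):
    tau = np.hypot(zi, zi1)              # tau^2 = zi^2 + zi1^2
    gamma, sigma = zi / tau, zi1 / tau   # rotation zeroing second weight
    if abs((di1 - di) * gamma * sigma) > tol:
        return None                      # deflation test fails
    return (di * gamma**2 + di1 * sigma**2,
            di * sigma**2 + di1 * gamma**2, tau)

di, di1, zi, zi1, rho, tol = 1.0, 1.0 + 1e-4, 0.6, 0.8, 2.0, 1e-3
dhi, dhi1, tau = deflate_pair(di, di1, zi, zi1, tol)
gamma, sigma = zi / tau, zi1 / tau
G = np.array([[gamma, sigma], [-sigma, gamma]])
B = np.diag([di, di1]) + rho * np.outer([zi, zi1], [zi, zi1])
E = G @ B @ G.T - np.diag([dhi, dhi1]) \
    - rho * tau**2 * np.outer([1.0, 0.0], [1.0, 0.0])
assert np.allclose(E, (di1 - di) * sigma * gamma
                   * np.array([[0.0, 1.0], [1.0, 0.0]]))
.ps 11
.fi
.KE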
If @G@ represents the product of these transformations, the result is .EQ G(~D ~+~ rho z z sup T ~ ) G sup T ~=~ left ( ~ matrix { ccol {{ ~D sub 1 ~+~ rho z sub 1 z sub 1 sup T } above 0} ccol { 0 above D sub 2}} ~right ) ~~+~~ E ~, .EN where .EQ || ~ E ~ || ~<=~ tol ^ eta sub 2 .EN with @ eta sub 2 @ of order unity. The cumulative effect of such errors is additive, and thus the final computed eigensystem @ Q hat D hat Q hat sup T @ satisfies .EQ || ~ A ~ - ~ Q hat D hat Q hat sup T || ~<=~ eta sub 3 tol .EN where @ eta sub 3 @ is again of order 1 in magnitude. The reduction in size of @ D sub 1 ~+~ rho z sub 1 z sub 1 sup T @ over the original rank-one modification can be spectacular in certain cases, and the effect of such deflation on efficiency is dramatic, for the amount of computation required to perform the updating is greatly reduced. .PP Let us now consider the possible limitations on orthogonality of eigenvectors due to nearly equal roots. Our first result is a perturbation lemma that will indicate the inherent difficulty associated with nearly equal roots. .sp2 @LEMMA (4.2)@ .I Let .EQ (4.3) q sub lambda sup T ~==~ ( ^ { zeta sub 1 } over { delta sub 1 - lambda } ^,^ { zeta sub 2 } over { delta sub 2 - lambda } ^,^...^ { zeta sub n } over { delta sub n - lambda } ^ ) left [ { rho } over { f prime ( lambda ) } right ] sup half , .EN where @f@ is defined by formula (3.2). Then for any @ lambda , mu ~ notin ~ "{" delta sub i ^:^ i ~=~ 1,...,n "}" @ .EQ (4.4) | q sub lambda sup T q sub mu | ~=~ 1 over { | lambda ~-~ mu | } {{ | ^ f ( lambda ) ~-~ f ( mu ) ^ | } over { left [ f prime ( lambda ) f prime ( mu ) right ] sup half }} . .EN .sp @PROOF@: .R Note that .EQ (4.5) q sub lambda sup T q sub mu ~=~ left ( sum from {j = 1 } to n { zeta sub j sup 2 } over { (^ delta sub j ~-~ lambda ^) (^ delta sub j ~-~ mu ^) } right ) { | ^ rho ^ | } over { left [ f prime ( lambda ) f prime ( mu ) right ] sup half } ~. .EN But, .EQ { lambda ~-~ mu } over { (^ delta sub j ~-~ lambda ^) (^ delta sub j ~-~ mu ^) } ~=~ 1 over { (^ delta sub j ~-~ lambda ^)} ~-~ 1 over { (^ delta sub j ~-~ mu ^) } . .EN Thus .EQ ( lambda ~-~ mu ) rho sum from {j = 1 } to n { zeta sub j sup 2 } over { (^ delta sub j ~-~ lambda ^) (^ delta sub j ~-~ mu ^) } ~=~ rho sum from {j = 1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ lambda } ~-~ rho sum from {j = 1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ mu } ~=~ f( lambda ) ~-~ f ( mu ) .EN and the result follows. @\(sq@ .PP Note that in (4.3) @ q sub lambda @ is always a vector of unit length, and the set of @n@ vectors selected by setting @lambda@ equal to the roots of the secular equation is the set of eigenvectors for @ D ~+~ rho z z sup T @. Moreover, equation (4.4) shows that those eigenvectors are mutually orthogonal whenever @ lambda @ and @ mu @ are set to distinct roots of @f@. Finally, the term @ | lambda ~-~ mu | @ appearing in the denominator of (4.4) sends up a warning that it may be difficult to attain orthogonal eigenvectors when the roots @lambda @ and @ mu @ are close. We wish to examine this situation now. We will show that, as a result of the deflation process, the @ "{" delta sub i "}" @ are sufficiently separated, and the weights @ zeta sub i @ are uniformly large enough, that the roots of @f@ are bounded away from each other. This statement is made explicit in the following lemma. .sp2 @LEMMA@ (4.6) .I Let @lambda@ be the root of @f@ in the @i-th@ subinterval @( delta sub i ^,^ delta sub i+1 )@.
If the deflation test is satisfied then either .EQ (i) | ^ delta sub i+1 ~-~ lambda ^ | ~>=~ half | delta sub i+1 ~-~ delta sub i | ~and~ | ^ delta sub i ~-~ lambda ^ | ~>=~ tol sup 2 over { | rho ( ^ delta sub i+1 - delta sub i ^ ) | ~+~ 2 rho sup 2 } ~, .EN or .EQ (ii) | ^ delta sub i ~-~ lambda ^ | ~>=~ half | delta sub i+1 ~-~ delta sub i | ~and~ | ^ delta sub i+1 ~-~ lambda ^ | ~>=~ { tol sup 2 } over { 2 rho sup 2 } ( delta sub i+1 ~-~ delta sub i ) .EN .sp .R @PROOF@: In case (i) the fact that @f( lambda ) ~=~ 0 @ provides .EQ rho sum from {j = 1 } to i { zeta sub j sup 2 } over { lambda ~-~ delta sub j } ~=~ 1 ~+~ rho sum from {j = i+1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ lambda } . .EN From this equation we find that .EQ | rho | { zeta sub i sup 2 } over { lambda ~-~ delta sub i } ~<=~ 1 + { | rho | } over { delta sub i+1 ~-~ lambda } ~<=~ 1 + { 2 | rho | } over { delta sub i+1 ~-~ delta sub i } , .EN and it follows readily that .EQ | delta sub i+1 ~-~ delta sub i | left ( { | rho | zeta sub i sup 2 } over { | delta sub i+1 ~-~ delta sub i | ~+~ 2 | rho | } right ) ~<=~ | delta sub i ~-~ lambda | ~. .EN The deflation rules assure us that @| rho | zeta sub i sup 2 ~>=~ tol sup 2 / | rho | @ and the result follows. Case (ii) is similar. @\(sq@ .PP The form of the result given in (4.6) will have importance in the following lemma, which gives better insight into the quality of numerical orthogonality attainable with this scheme. The bounds obtained are certainly less than one would hope for. Moreover, it is unfortunate that only the magnitudes of the weights enter into the estimate. This obviously does not take into account the deflation due to nearly equal eigenvalues. Unfortunately, we have not been able to improve these estimates by other means and are left with this crude bound. .sp2 @LEMMA@ (4.7) .I Suppose that @ lambda \*(TH @ and @ mu hat @ are numerical approximations to exact roots @ lambda @ and @ mu @ of @f@. Assume that these roots are distinct and let the relative errors for the quantities @ delta sub i ~-~ lambda @ and @ delta sub i ~-~ mu @ be denoted by @ theta sub i @ and @ eta sub i @ respectively. That is, the computed quantities satisfy .EQ (4.8) delta sub i ~-~ lambda \*(TH ~=~ ( delta sub i ~-~ lambda )( 1 ~+~ theta sub i ) ~ ~and ~ ~ delta sub i ~-~ mu hat ~=~ ( delta sub i ~-~ mu ) ( 1 ~+~ eta sub i ), .EN for @ i ~=~ 1,2,...,n@. Let @ q sub lambda hat @ and @ q sub mu hat @ be defined according to formula (4.3) using the computed quantities given in (4.8). If @ | theta sub i | , | eta sub i | ~<=~ epsilon ~ << ~ 1 @, then .EQ | q sub lambda hat sup T q sub mu hat | ~=~ | q sub lambda sup T E q sub mu | ~<=~ epsilon ^ ( 2 ^+^ epsilon ) left ( { 1 ^+^ epsilon } over { 1 ^-^ epsilon } right ) sup 2 ~, .EN with @E@ a diagonal matrix whose @i-th@ diagonal element is .EQ E sub ii ~=~ { theta sub i ~+~ eta sub i ~+~ theta sub i eta sub i } over { ( ^ 1 ~+~ theta sub i ^) ( ^ 1 ~+~ eta sub i ^) } left [ { f prime ( lambda ) f prime ( mu ) } over { f prime ( lambda \*(TH ) f prime ( mu hat ) } right ] sup half .
.EN .sp .R @PROOF@: From formula (4.3) we have .EQ -^ q sub lambda hat sup T q sub mu hat ~ =~ -^ left ( ^ sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) ( delta sub j ~-~ mu ) ( 1 ~+~ theta sub j ) ( 1 ~+~ eta sub j ) } ^ right ) { | ^ rho ^ | } over { left [ f prime ( lambda \*(TH ) f prime ( mu hat ) right ] sup half } .EN .EQ ~=~ left ( ^ sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) ( delta sub j ~-~ mu ) } ~-~ sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) ( delta sub j ~-~ mu ) ( 1 ~+~ theta sub j ) ( 1 ~+~ eta sub j ) } ^ right ) { | ^ rho ^ | } over { left [ f prime ( lambda \*(TH ) f prime ( mu hat ) right ] sup half } .EN due to the orthogonality of the exact vectors. Thus, .EQ | q sub lambda hat sup T q sub mu hat | ~ =~ | sum from {j = 1 } to n left ( ^ { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) ( delta sub j ~-~ mu ) } ^ right ) left ( ^ 1 ~-~ 1 over { ( ^ 1 ~+~ theta sub j ^) ( ^ 1 ~+~ eta sub j ^) } ^ right ) rho over { left [ f prime ( lambda \*(TH ) f prime ( mu hat ) right ] sup half } | . .EN Since the quantity .EQ { f prime ( lambda ) } over { f prime ( lambda \*(TH ) } ~=~ { sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) sup 2 } } over { sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) sup 2 ( 1 ~+~ theta sub j ) sup 2 } } ~<= ~ ( 1 ~+~ epsilon ) sup 2 .EN and since .EQ max E sub ii ~<=~ { epsilon ~+~ epsilon ~+~ epsilon sup 2 } over { (1 ~-~ epsilon ) ( 1 ~-~ epsilon ) } .EN the result follows. @\(sq@ .PP This shows that orthogonality can be assured whenever it is possible to provide small relative errors when computing the differences @ delta sub i ~-~ lambda @. Since, as we mentioned in Section 3, these quantities are updated with the iterative corrections to @ lambda @, and since deflation has guaranteed that the quantities in Lemma (4.6) are bounded away from zero, it will be possible in theory to provide small relative errors. Obviously, the analysis given here, simple as it may be, could form the core of a rigorous error analysis of the method. As yet we lack a root finder that could be proven to satisfy numerically the relative error requirements. The one described above will do so in finite precision arithmetic but may require extended precision accumulation of inner products in pathological cases. In practice, however, we have not experienced difficulty except with contrived examples. Cuppen [3] has shown that the computed eigenvectors may be safely reorthogonalized when needed, though we hope to avoid this alternative. .NH Reduction to Tridiagonal Form and Related Issues .PP This algorithm is designed to work in conjunction with Householder's reduction of a symmetric matrix to tridiagonal form. In this standard technique a sequence of @n-1@ Householder transformations is applied to form .EQ A sub k+1 ~=~ (I~-~ alpha sub k w sub k w sub k sup T ) A sub k (I~-~ alpha sub k w sub k w sub k sup T )~, .EN with .EQ A sub k ~=~ left ( matrix { ccol { T sub k above 0 } ccol { 0 above A hat sub k } } right ) ~, .EN where @ T sub k @ is a tridiagonal matrix of order @k-1@. See [11] for details. We note here that we can form @(I~-~ alpha ww sup T )A(I~-~ alpha w w sup T ) @ using the following algorithm. .KS .TS center; l. \f1Algorithm 5.1 1. @v ~=~ Aw@ 2. @y sup T ~=~ v sup T ~-~ alpha ( w sup T v ) w sup T@ 3. Replace @A@ by @ A ~-~ alpha vw sup T ~-~ alpha w y sup T@ .TE .KE
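.PP In a modern notation the algorithm reads as follows (a sketch in Python with NumPy, given purely for illustration; the function name is ours). With @alpha ~=~ 2/( w sup T w )@ the factor @I ~-~ alpha w w sup T @ is an orthogonal Householder matrix, and the sketch checks the identity directly.
.KS
.nf
.ps 9
# Sketch of Algorithm 5.1: (I - a w w^T) A (I - a w w^T) computed by
# one matrix-vector product plus a symmetric rank-two update.
import numpy as np

def apply_householder(A, w, alpha):
    v = A @ w                            # step 1
    y = v - alpha * (w @ v) * w          # step 2
    A -= alpha * np.outer(v, w)          # step 3: rank-two correction
    A -= alpha * np.outer(w, y)
    return A

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n)); A = A + A.T   # symmetric test matrix
w = rng.standard_normal(n)
alpha = 2.0 / (w @ w)                          # Householder choice of alpha
H = np.eye(n) - alpha * np.outer(w, w)
assert np.allclose(apply_householder(A.copy(), w, alpha), H @ A @ H)
.ps 11
.fi
.KE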
Steps 1 and 3 may both be parallelized because @A@ may be partitioned into blocks of columns @A ~=~ ( A sub 1 ^,^ A sub 2 ^,^ ... ^,^ A sub j )@ and each of the contributions corresponding to these columns may be carried out independently in steps 1 and 3. For example, in step 3, .EQ A sub i ~<-~ A sub i ~-~ alpha v w sub i sup T ~-~ alpha w y sub i sup T .EN where the vectors @w@ and @y@ have been partitioned in a corresponding manner. In step 2 the partial results will have to be stored in temporary locations until all are completed, and then they may be added together to obtain the final result. Note also that when the calculations are arranged this way, advantage may be taken of vector operations when they are available. .PP Algorithm 5.1 has some disadvantages when incorporated into the reduction of a symmetric matrix to tridiagonal form. At the k-th stage of the reduction we have .EQ left ( ~ ~ matrix { ccol { T sup (k) above beta e sub 1 e sub k sup T } ccol { beta e sub k e sub 1 sup T above A sup (k) } } ~ ~ right ) .EN The reduction is advanced one step through the application of Algorithm 5.1 to the submatrix @ A sup (k) @. First, a fork-join synchronization construct is imposed, since the matrix-vector product requires the entire matrix @ A sup (k) @ to be in place before this product can be completed, so that no portion of Step 2 may begin until Step 1 is finished. Second, due to symmetry, only the lower triangle of @ A sup (k) @ need be computed, and this implies that vector lengths shorten during the computation. .PP When used in conjunction with the rank-one tearing scheme, these drawbacks may be overcome. Suppose the final result of this decomposition is @T@ and that this matrix would be partitioned into @ (~ T sub 1 ^,^ T sub 2 ^,...,^ T sub m ~) @ by the rank-one tearing if it were known. It is not necessary to wait until the entire reduction is completed, for @T sup (k) @ represents a leading principal submatrix of @T@. Thus, as soon as the first sub-tridiagonal matrix @ T sub 1 @ is exposed, the process of computing its eigensystem may be initiated. Similarly, as soon as @ T sub 2 @ is exposed, its eigensystem may be computed. Then a rank-one update may occur, and so on. In this way a number of independent processes may be spawned early on, and the number of such processes ready to execute will remain at a reasonable level throughout the course of the computation. An efficient implementation of this scheme is difficult and will be the subject of a subsequent paper. Issues such as the proper level of partitioning will have added importance due to the desire to have parallel processes set in motion as soon as possible. .PP When we do not wish to find eigenvectors there is no reason to store the product @Q@ of these Householder transformations. Nor is it necessary to accumulate the product of the successive eigenvector transformations resulting from the updating problem. That is, we do not need to overwrite @Q@ with .EQ Q ~<-~ Q left ( ~ matrix { ccol { Q sub 1 above 0 } ccol { 0 above Q sub 2 } } ~ right ) Q hat .EN where @Q sub 1 @, @Q sub 2 @ and @Q hat @ are the matrices appearing in equations (2.2) and (3.1) above. Instead, we may simply discard @Q@. Then the vector @q sub 1 @ may be formed as @T hat sub 1 @ is transformed to @D sub 1@ in (2.2) by accumulating the products of the transformations constructed in TQL2 that make up @Q sub 1@ against the vector @ e sub k @.
If there is more than one division then @Q sub 1 @ will have been calculated with the updating scheme. In this case we do not calculate all of @Q sub 1@ but instead use the component-wise formula for the eigenvectors to pick off the appropriate components needed to form @q sub 1 @. .PP This algorithm can be generalized to handle band matrices as well. Instead of performing a rank-one tearing to split the matrix into two independent subproblems, we proceed by making @(m+1)m/2@ rank-one changes designed to split the matrix, where @m@ is the half bandwidth (@m ~=~ 1@ for tridiagonal matrices).
.KS
.sp 2
.ce 2
[Diagram: a symmetric band matrix with half bandwidth @m ~=~ 2@,
showing the entries of rows @i@ and @i+1@ that are torn out of the band.]
.sp 2
.KE
The three rank-one changes for this band matrix (@m~=~2@) involve the following elements: .TS center; l. @a sub i-1,i-1 @, @ a sub i-1,i+1 @, @a sub i+1,i-1 @, @ a sub i+1,i+1 @, @a sub i,i @, @ a sub i,i+1 @, @a sub i+1,i @, @ a sub i+1,i+1 @, and @a sub i,i @, @ a sub i,i+2 @, @a sub i+2,i @, @ a sub i+2,i+2 @. .TE The order in which they are applied does not matter in exact arithmetic. We have not studied the numerical properties of this scheme, however. .NH The Parallel Algorithm .PP Although it is fairly straightforward from Section 2 to see how to obtain a parallel algorithm, certain details are worth discussing further. We shall begin by describing the partitioning phase. This phase amounts to constructing a binary tree with each node representing a rank-one tear and hence a partition into two subproblems. A tree of height 3 therefore represents a splitting of the original problem into 8 smaller eigenvalue problems. Thus, there are two standard symmetric tridiagonal eigenvalue problems to be solved at each leaf of the tree. Each of these problems may be spawned independently without fear of data conflicts. The tree is then traversed in reverse order with the eigenvalue updating routine SESUPD applied at each node, joining the results from the left son and right son calculations. The leaves each define independent rank-one updating problems, and again there are no data conflicts between them. The only data dependency at a node is that the left and right son calculations must have been completed. When this condition is satisfied, the results of two adjacent eigenvalue subproblems are ready to be joined through the rank-one updating process, and this node may spawn the updating process immediately. Information required at a node to define the problem consists of the index of the element torn out, together with the dimensions of the left and right son problems. For example, if @n ~=~ 50@ with a tree of height 3 we have
.sp 5
.KS
.cs 1 24
.nf
.ps 7
.ce 6
26
25   25
13   38
12   13   12   13
7   19   32   44
6 6   6 7   6 6   6 7
.fi
.cs 1
.ps 11
.ce
@Figure@ 2. The Computational Tree
.KE
.sp 2
This tree defines 8 subproblems at the lowest level. There are two calls to TQL2 required at each leaf of the tree in Figure 2 to solve tridiagonal eigenvalue problems. The beginning indices of these problems are 1, 7, 13, 19, 26, 32, 38, 44, and the dimension of each of them may be read off from left to right at the lowest level as 6, 6, 6, 7, 6, 6, 6, 7, respectively. As soon as the calculations for the problems beginning at indices 1 and 7 have been completed, a rank-one update may proceed on the problem beginning at index 1 with dimension 12. The remaining updating problems at this level begin at indices 13, 26, and 38. There are then two updating problems, at indices 1 and 26, each of dimension 25, and a final updating problem at index 1 of dimension 50.
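.PP This partition is easy to generate mechanically. The short sketch below (Python; the function name is ours and illustrative) reproduces the starting indices and dimensions quoted above by halving each segment recursively.
.KS
.nf
.ps 9
# Generate the subproblem partition of Figure 2: (start index,
# dimension) pairs for the leaves of a tearing tree of given height.
def partition(start, n, height):
    if height == 0:
        return [(start, n)]
    k = n // 2                       # tear between rows k and k + 1
    return (partition(start, k, height - 1)
            + partition(start + k, n - k, height - 1))

print(partition(1, 50, 3))
# [(1, 6), (7, 6), (13, 6), (19, 7), (26, 6), (32, 6), (38, 6), (44, 7)]
.ps 11
.fi
.KE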
.PP Evidently, we lose a degree of large grain parallelism as we move up a level on the tree. However, there is more parallelism to be found at the root finding level, and the amount of this parallelism increases as we travel up the tree, so there is ample opportunity for load balancing in this scheme. The parallelism at the root finding level stems from the fact that each of the root calculations is independent and requires only read access to all but one array; the exception is the array that contains the diagonal entries of the matrix @ DELTA sub i@ described in Section 3. For computational efficiency we may decide on an advantageous number of processes to create at the outset. In the example above that number was 8. Then, as we travel up the tree, the root finding procedure is split into 2, 4, and finally 8 parallel parts at each node at levels 3, 2, and 1, respectively. As these computations are roughly equivalent in complexity on a given level, it is reasonable to expect to keep all processors devoted to this computation busy throughout. .NH Implementation and Library Issues .PP The implementation described here is for computers with a shared memory architecture. The algorithm itself is not limited to this memory model, however; Jessop [] has an implementation of Cuppen's algorithm for a hypercube. .PP Obviously, the implementation of this scheme is not so straightforward as simply parallelizing a loop using a fork-join construct. The synchronization mechanism requires more sophistication, since it must be able to spawn processes at the root finding level based upon computation that has taken place. This process allocation problem is dynamic rather than static since, due to the possibility of deflation, it will not be known in advance how many roots will have to be calculated at a given level. If there are only a few roots to be found, then the desire for reasonable granularity will dictate that fewer processes be spawned. .PP In addition to requiring a certain level of sophistication in the synchronization scheme, we would like to adhere as much as possible to the principles of transportability. This algorithm is obviously a candidate for a library subroutine and potentially will find use on a wide variety of machines. When designing library subroutines, one wishes to conceal machine dependencies as much as possible from the user. These important considerations seem to be difficult to accommodate if we are to invoke parallelism at the level described above. It would appear that the user must be conscious of the number of parallel processes required by the library subroutines throughout his program. Should the library routines be called from multiple branches of a user's parallel program, he could inadvertently attempt to create many more processes than the physical limitations allow. .PP We believe there is hope for implementing this algorithm while adhering to the goals set out in the previous two paragraphs. It may be accomplished by adopting a programming methodology inspired by the work of Lusk and Overbeek [6,7] and of Babb [1] on implementing transportable parallel codes. We use a package called SCHEDULE [] that we have been developing while this algorithm was being devised and tested.
SCHEDULE is a package of Fortran subroutines designed to aid in programming explicitly parallel algorithms for numerical calculations. The design goal of SCHEDULE is to allow a programmer familiar with a Fortran programming environment to implement a parallel algorithm in a style of Fortran programming that will lend itself to transporting the resulting program across a wide variety of parallel machines. .NH Performance .PP In this section we present and analyze the results of this algorithm on a number of machines. The same algorithm has been run on a VAX 11/785, a Denelcor HEP, an Alliant FX/8, and a CRAY X-MP-4. .PP We have compared our implementation of the algorithm described in this paper to TQL2 from the EISPACK collection. Table 8.1 gives the ratio of execution time for TQL2 from EISPACK to that of the algorithm presented here, with both run sequentially on the same machine. In all cases, the TQL2 code used to test against and the TQL2 code used at lowest level in our algorithm were exactly the same code. Both the parallel algorithm and the serial version of TQL2 were called by a common driver test routine during the same computer run. This assures that both routines were compiled under the same compiler options and ran in the same computing environment to produce this timing data. The timings for TQL2 were obtained by executing it as a single process, and it should be emphasized that in all cases the computations were carried out as though the tridiagonal matrix had come from Householder's reduction of a dense symmetric matrix to tridiagonal form. The identity was passed in place of the orthogonal basis that would have been provided by this reduction, but the arithmetic operations performed were the same as those that would have been required to transform that basis into the eigenvectors of the original symmetric matrix.
.sp2
.KS
.ce
Table 8.1
.nr LL 6.i
.ll 6.i
.TS
center;
c|c c c c c
c|c c c c c
c|c c c c c.
@n@  Ratio of time  TQL2  SESUPD  TQL2  SESUPD
  TQL2/SESUPD  @ || Ax ~-~ lambda x || @  @ || Ax ~-~ lambda x || @  @ || X sup T X ~-~ I || @  @ || X sup T X ~-~ I || @
_  _  _  _  _  _
50  .78  @1.6 times 10 sup -12 @  @4.8 times 10 sup -13 @  @1.9 times 10 sup -12@  @1.2 times 10 sup -13 @
100  1.26  @2.6 times 10 sup -12 @  @7.6 times 10 sup -13 @  @3.9 times 10 sup -12@  @2.5 times 10 sup -12 @
200  1.93  @6.1 times 10 sup -12 @  @1.8 times 10 sup -12 @  @7.0 times 10 sup -12@  @5.8 times 10 sup -11 @
300  3.62  @1.1 times 10 sup -11 @  @3.3 times 10 sup -12 @  @1.1 times 10 sup -11@  @2.8 times 10 sup -12 @
400  4.70  @1.2 times 10 sup -11 @  @3.8 times 10 sup -12 @  @1.4 times 10 sup -11@  @3.8 times 10 sup -12 @
.TE
.ce
.I
A comparison on the CRAY X-MP for TQL2 vs the
.ce
parallel algorithm run sequentially
.KE
.R
.sp
.PP As can be seen in Table 8.1, the performance of the parallel algorithm as implemented to run on a sequential machine is quite impressive. The surprising result here is the observed speedup even in serial mode of execution. This is unusual for a parallel algorithm: often more work is associated with the synchronization and computational overhead required to split the problem into parallel parts. Table 8.2 gives the ratio of execution time for TQL2 run sequentially to that of the algorithm presented here run in parallel on several machines.
.sp2
.KS
.ce
Table 8.2
.nr LL 6.i
.ll 6.i
.TS
center;
c|cp8 cp8 cp8 cp8 cp8 cp8
c|np8 np8 np8 np8 np8 np8
c|np8 np8 np8 np8 np8 np8.
  VAX 785/FPA  Denelcor HEP  Alliant FX/8  CRAY X-MP-1  CRAY X-MP-4
_
random  2.6  12  15.2  1.8  4.5
matrix
.TE
.ce
order = 150
.sp 2
.ce
.I
Ratio of execution time @{ TQL2 ~ time } over {parallel~time } @
.R
.KE
.PP These results are remarkable because in all cases speedups greater than the number of physical processors were obtained. The gain is due to the numerical properties of the deflation portion of the parallel algorithm. We do not fully understand why the algorithm performs as much deflation as is apparent from these comparisons. In all cases the word length was 64 bits and the same level of accuracy was achieved by both methods. The measurement of accuracy used was the maximum 2-norm of the residuals @Tq ~-~lambda q @ and of the columns of @Q sup T Q ~-~ I @. The results in Table 8.1 are typical of the performance of this algorithm on random problems, with speedups becoming more dramatic as the matrix order increases. In problems of order 500, speedups of 15 have been observed on the CRAY X-MP-4 and speedups over 50 have been observed on the Alliant FX/8, which are 4 and 8 processor machines, respectively. The CRAY results can actually be improved because parallelism at the root finding level was not exploited in the implementation run on the CRAY but was fully exploited on the Alliant. .PP In Tables 8.3 and 8.4 we show a more complete set of runs on the Alliant FX/8.
.sp2
.KS
.ce
Table 8.3
.nr LL 6.i
.ll 6.i
.TS
center;
c|c c c c c
c|c c c c c
c|c c c c c.
@n@  Ratio of time  TQL2  SESUPD  TQL2  SESUPD
  TQL2/SESUPD  @ || Ax ~-~ lambda x || @  @ || Ax ~-~ lambda x || @  @ || X sup T X ~-~ I || @  @ || X sup T X ~-~ I || @
_  _  _  _  _  _
100  9.4  @ 9.5 times 10 sup -15 @  @1.9 times 10 sup -15@  @ 7.8 times 10 sup -15 @  @5.5 times 10 sup -16 @
200  15.4  @1.7 times 10 sup -14 @  @2.7 times 10 sup -15 @  @1.1 times 10 sup -14 @  @2.2 times 10 sup -15 @
300  17.7  @1.9 times 10 sup -14 @  @3.2 times 10 sup -15 @  @1.7 times 10 sup -14 @  @2.6 times 10 sup -15 @
400  20.0  @2.9 times 10 sup -14 @  @4.0 times 10 sup -15 @  @2.0 times 10 sup -14 @  @9.2 times 10 sup -15 @
.TE
.I
.ce
A comparison on the Alliant FX/8 for TQL2 vs the
.ce
parallel algorithm on (1,2,1) matrix
.R
.KE
.sp2
.KS
.ce
Table 8.4
.nr LL 6.i
.ll 6.i
.TS
center;
c|c c c c c
c|c c c c c
c|c c c c c.
@n@  Ratio of time  TQL2  SESUPD  TQL2  SESUPD
  TQL2/SESUPD  @ || Ax ~-~ lambda x || @  @ || Ax ~-~ lambda x || @  @ || X sup T X ~-~ I || @  @ || X sup T X ~-~ I || @
_  _  _  _  _  _
100  12.1  @1.6 times 10 sup -14 @  @ 1.9 times 10 sup -13 @  @ 1.6 times 10 sup -14 @  @ 2.4 times 10 sup -15 @
200  19.5  @2.1 times 10 sup -14 @  @ 2.2 times 10 sup -13 @  @ 2.8 times 10 sup -14 @  @ 2.3 times 10 sup -15 @
300  38.8  @4.2 times 10 sup -14 @  @ 8.8 times 10 sup -13 @  @ 3.7 times 10 sup -14 @  @ 5.2 times 10 sup -15 @
400  60.7  @3.5 times 10 sup -14 @  @ 8.2 times 10 sup -13 @  @ 3.7 times 10 sup -14 @  @ 4.6 times 10 sup -14 @
.TE
.I
.ce
A comparison on the Alliant FX/8 for TQL2 vs the
.ce
parallel algorithm on a random matrix
.R
.KE
.sp2
These test problems show two types of behavior. In the (1,2,1) matrix not very much deflation takes place. On the random matrix considerable deflation takes place. Cuppen's observation [3] that this deflation occurs in many cases was very fortunate: it brought the algorithm to our attention, but we did not really expect the remarkable performance observed here. In the case where many eigenvectors are sought along with the eigenvalues this algorithm seems to be very promising.
However, as we mentioned in Section 5, it is not necessary to compute the eigenvectors if they are not needed. We cannot recommend this algorithm when only a few eigenvalues and eigenvectors are sought; a multisectioning algorithm such as the one developed by Lo, Philippe, and Sameh [5] is to be preferred in that case. .NH References .R .IP [1] R. G. Babb II, .I Parallel Processing with Large Grain Data Flow Techniques, .R IEEE Computer, Vol. 17, No. 7, pp. 55-61, July 1984. .sp .R .IP [2] J.R. Bunch, C.P. Nielsen, and D.C. Sorensen, .I Rank-One Modification of the Symmetric Eigenproblem, .R Numerische Mathematik 31, pp. 31-48, 1978. .sp .IP [3] J.J.M. Cuppen, .I A Divide and Conquer Method for the Symmetric Tridiagonal Eigenproblem, .R Numerische Mathematik 36, pp. 177-195, 1981. .sp .IP [4] G.H. Golub, .I Some Modified Matrix Eigenvalue Problems, .R SIAM Review, 15, pp. 318-334, 1973. .sp .IP [5] S. Lo, B. Philippe, and A. Sameh, .I A Parallel Algorithm for the Real Symmetric Tridiagonal Eigenvalue Problem, .R Center for Supercomputing Research and Development Report, University of Illinois, Urbana, Illinois, Nov. 1985. .sp .IP [6] E. Lusk and R. Overbeek, .I Implementation of Monitors with Macros: A Programming Aid for the HEP and Other Parallel Processors, .R ANL-83-97, Argonne National Laboratory, 1983. .sp .IP [7] E. Lusk and R. Overbeek, .I An Approach to Programming Multiprocessing Algorithms on the Denelcor HEP, .R ANL-83-96, Argonne National Laboratory, 1983. .sp .IP [8] J.J. More@prime@, .I The Levenberg-Marquardt Algorithm: Implementation and Theory, .R Proceedings of the Dundee Conference on Numerical Analysis, G.A. Watson, ed., Springer-Verlag, 1978. .sp .IP [9] C.H. Reinsch, .I Smoothing by Spline Functions, .R Numerische Mathematik 10, pp. 177-183, 1967. .sp .IP [10] C.H. Reinsch, .I Smoothing by Spline Functions II, .R Numerische Mathematik 16, pp. 451-454, 1971. .sp .IP [11] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd edition, Springer-Verlag, Berlin, 1976. .sp .IP [12] G.W. Stewart, .I Introduction to Matrix Computations, .R Academic Press, New York, 1973. .sp .IP [13] J.H. Wilkinson, .I The Algebraic Eigenvalue Problem, .R Clarendon Press, Oxford, 1965. .sp .IP [] L. Jessop, .I Solving the Symmetric Tridiagonal Eigenvalue Problem on the Hypercube, .R Yale Computer Science Research Report, in preparation. .sp .IP [] D. Sorensen, .I SCHEDULE Users Guide, .R ANL-MCS-TM-??, Argonne National Laboratory, May 1986.