.ds TH "\h'-.50m'\v'-.40m'^\v'.40m'\h'.50m' .ds TD "\h'-.65m'\v'-.40m'~\v'.40m' .ds CT "\(mu\h'-.66m'\(ci .ds NI "\(mo\h'-.65m'/ .nr EP 1 .nr VS 14 .EQ gsize 11 delim @@ ... define | '^\(br^' ... define || '\(br ^ \(br ' define notin '\(mo back 30 / ' define times '\(mu' define ctimes '\(mu back 80 \(ci ^ ' define hhat ' up 5 hat down 5 ' define bl '\0' .EN .TL A Fully Parallel Algorithm for the Symmetric Eigenvalue Problem .AU .ps 11 .in 0 J.J. Dongarra and D. C. Sorensen .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' 9700 South Cass Avenue\h'.20i' Argonne, Illinois 60439\h'.20i' .nr PS 11 .ps 11 .nr VS 16 .vs 16 .nr PD 0.5v .FS Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under Contracts W-31-109-Eng-38, DE-AC05-84OR21400 and DE-FG02-85ER25001. Presented at the Second SIAM Conference on Vector and Parallel Processing in Scientific Computing, Norfolk, Virginia, November 20, 1985. .FE .ps 10 .AB .PP In this paper we present a parallel algorithm for the symmetric algebraic eigenvalue problem. The algorithm is based upon a divide and conquer scheme suggested by Cuppen for computing the eigensystem of a symmetric tridiagonal matrix. We extend this idea to obtain a parallel algorithm that retains a number of active parallel processes that is greater than or equal to the initial number throughout the course of the computation. We give a new deflation technique which, together with a robust root finding technique, will assure computation of an eigensystem to full accuracy in the residuals and in the orthogonality of eigenvectors. A brief analysis of the numerical properties and sensitivity to roundoff error is presented to indicate where numerical difficulties may occur. The algorithm is able to exploit parallelism at all levels of the computation and is well suited to a variety of architectures. .PP Computational results are presented for several machines. These results are very encouraging with respect to both accuracy and speedup. A surprising result is that the parallel algorithm, even when run in serial mode, can be significantly faster than the previously best sequential algorithm on large problems, and is effective even on moderate size problems. .NH Introduction .PP The symmetric eigenvalue problem is one of the most fundamental problems of computational mathematics. It arises in many applications and therefore represents an important area for algorithmic research. The problem has received considerable attention in the literature and was probably the first algebraic eigenvalue problem for which reliable methods were obtained. It would be surprising, therefore, if a new method were to be found that would offer a significant improvement in execution time over the fundamental algorithms available in standard software packages such as EISPACK [11]. However, it is reasonable to expect that eigenvalue calculations might be accelerated through the use of parallel algorithms for the parallel computers that are becoming available. We shall present such an algorithm in this paper. The algorithm is able to exploit parallelism at all levels of the computation and is well suited to a variety of architectures.
However, the surprising result is that the parallel algorithm, even when run in serial mode, is significantly faster than the previously best sequential algorithm on large problems, and is effective even on moderate size (order @>=@ 30) problems. .PP The problem we consider is the following: Given a real @n times n @ symmetric matrix @A@, find all of the eigenvalues and corresponding eigenvectors of @A@. It is well known [12] that under these assumptions .EQ (1.1) A ~=~ Q D Q sup T ~,~~~ roman { with } ~ Q sup T Q ~=~ I ~, .EN so that the columns of the matrix @Q@ are the orthonormal eigenvectors of @A@ and @D ~=~ diag( delta sub 1 ^,^ delta sub 2 ^,^...,^ delta sub n )@ is the diagonal matrix of eigenvalues. The standard algorithm for computing this decomposition is first to use a finite algorithm to reduce @A@ to tridiagonal form using a sequence of Householder transformations, and then to apply a version of the @QR@-algorithm to obtain all the eigenvalues and eigenvectors of the tridiagonal matrix [12]. The primary purpose of this paper is to describe a method for parallelizing the computation of the eigensystem of the tridiagonal matrix. However, the method we present is intended to be used in conjunction with the initial reduction to tridiagonal form to compute the complete eigensystem of the original matrix @A@. We, therefore, briefly touch upon the issues involved in parallelizing this initial reduction and suggest a way to combine the parallel initial reduction with the tridiagonal algorithm to obtain a fully parallel algorithm for the symmetric eigenvalue problem. .PP The method is based upon a divide and conquer algorithm suggested by Cuppen [3]. A fundamental tool used to implement this algorithm is a method that was developed by Bunch, Nielsen, and Sorensen [2] for updating the eigensystem of a symmetric matrix after modification by a rank-one change. This rank-one updating method was inspired by some earlier work of Golub [4] on modified eigenvalue problems. The basic idea of the new method is to use rank-one modifications to tear out selected off-diagonal elements of the tridiagonal problem in order to introduce a number of independent subproblems of smaller size. The subproblems are solved at the lowest level using the subroutine TQL2 from EISPACK, and then the results of these problems are successively glued together using the rank-one modification routine SESUPD that we have developed based upon the ideas presented in [2]. .PP In the following discussion we describe the partitioning of the tridiagonal problem into smaller problems by rank-one tearing. Then we describe the numerical algorithm for gluing the results back together. The organization of the parallel algorithm is laid out, and finally some computational results are presented. Throughout this paper we adhere to the convention that capital Roman letters represent matrices, lower case Roman letters represent column vectors, and lower case Greek letters represent scalars. A superscript @T@ denotes transpose. All matrices and vectors are real, but the results are easily extended to matrices over the complex field. .NH Partitioning by Rank-One Tearing .PP The crux of the algorithm is to divide a given problem into two smaller subproblems.
To do this, we consider the symmetric tridiagonal matrix .EQ (2.1) T mark ~=~ left ( ~ ~ matrix { ccol { T sub 1 above beta e sub 1 e sub k sup T } ccol { beta e sub k e sub 1 sup T above T sub 2 } } ~ ~ right ) .EN .EQ lineup ~=~ left ( ~ matrix { ccol { T hat sub 1 above 0 } ccol { 0 above T hat sub 2 }} ~ right ) ~+~ theta beta left ( ~ matrix { ccol { e sub k above theta sup -1 e sub 1} } ~ right ) ( e sub k sup T ~,~ theta sup -1 e sub 1 sup T ) .EN where @ 1 ~<=~ k ~<~ n @ and @ e sub j @ represents the @j-th@ unit vector of appropriate dimension. The @k-th@ diagonal element of @ T sub 1@ has been modified to give @T hat sub 1 @ and the first diagonal element of @ T sub 2@ has been modified to give @ T hat sub 2 @. Potential numerical difficulties associated with cancellation may be avoided through the appropriate choice of @theta@. If the diagonal entries to be modified are of the same sign then @ theta ~= ~+-~ 1@ is chosen so that @- theta beta @ has this sign and cancellation is avoided. If the two diagonal entries are of opposite sign, then the sign of @theta@ is chosen so that @- theta beta@ has the same sign as one of the elements and the magnitude of @theta@ is chosen to avoid severe loss of significant digits when @theta sup -1 beta@ is subtracted from the other. This is perhaps a minor detail, but it does allow the partitioning to be selected solely on the basis of position and without regard to numerical considerations. .PP Now we have two smaller tridiagonal eigenvalue problems to solve. According to equation (1.1) we compute the two eigensystems .EQ T hat sub 1 ~=~ Q sub 1 D sub 1 Q sub 1 sup T ~,~~~~ T hat sub 2 ~=~ Q sub 2 D sub 2 Q sub 2 sup T ~. .EN This gives .EQ (2.2) T mark ~=~ left ( ~ matrix { ccol { Q sub 1 D sub 1 Q sub 1 sup T above 0 } ccol { 0 above Q sub 2 D sub 2 Q sub 2 sup T } } ~ right ) ~ +~ theta beta left ( ~ matrix { ccol { e sub k above theta sup -1 e sub 1} } ~ right ) ( e sub k sup T ~,~ theta sup -1 e sub 1 sup T ) .EN .EQ lineup ~=~ left ( ~ matrix { ccol { Q sub 1 above 0 } ccol { 0 above Q sub 2 } } ~ right ) left (~ left ( ~ matrix { ccol { D sub 1 above 0 } ccol { 0 above D sub 2 } } ~ right ) +~ theta beta left ( ~ matrix { ccol { q sub 1 above theta sup -1 q sub 2} } ~ right ) ( q sub 1 sup T ~,~ theta sup -1 q sub 2 sup T ) ~ ~ right ) left ( ~ matrix { ccol { Q sub 1 sup T above 0 } ccol { 0 above Q sub 2 sup T } } ~ right ) .EN where @ q sub 1 ~=~ Q sub 1 sup T e sub k@ and @ q sub 2 ~=~ Q sub 2 sup T e sub 1@. The problem at hand now is to compute the eigensystem of the interior matrix in equation (2.2). A numerical method for solving this problem has been provided in [2] and we shall discuss this method in the next section. .PP It should be fairly obvious how to proceed from here to exploit parallelism. One simply repeats the tearing on each of the two halves recursively until the original problem has been divided into the desired number of subproblems, and then the rank-one modification routine may be applied from the bottom up to glue the results together again.
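.PP To make the tearing concrete, the following short sketch carries out one tear and checks equation (2.1) numerically. It is written in Python with NumPy purely for compactness (the implementation discussed in this paper is Fortran); the function name and the random test matrix are illustrative only, and only the equal-sign rule for choosing @theta@ is implemented.
.KS
.nf
.ps 9
# Minimal numerical check of the rank-one tearing (2.1).
# Illustrative sketch; names are not part of the actual code.
import numpy as np

def tear(a, b, k):
    # a: diagonal of T (length n); b: off-diagonal (length n-1);
    # tear between rows k and k+1 (0-based indexing).
    sgn = lambda x: 1.0 if x >= 0.0 else -1.0
    beta = b[k]
    # equal-sign rule from the text: make -theta*beta agree in sign
    # with the entries being modified (opposite-sign case omitted)
    theta = -sgn(beta) * sgn(a[k])
    a1 = a[:k + 1].copy(); a1[-1] -= theta * beta   # diagonal of T1^
    a2 = a[k + 1:].copy(); a2[0] -= beta / theta    # diagonal of T2^
    return a1, a2, theta, beta

rng = np.random.default_rng(0)
n, k = 8, 3
a, b = rng.standard_normal(n), rng.standard_normal(n - 1)
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
a1, a2, theta, beta = tear(a, b, k)
off = np.r_[b[:k], 0.0, b[k + 1:]]                  # coupling removed
That = np.diag(np.r_[a1, a2]) + np.diag(off, 1) + np.diag(off, -1)
v = np.zeros(n); v[k] = 1.0; v[k + 1] = 1.0 / theta # (e_k, 1/theta e_1)
assert np.allclose(T, That + theta * beta * np.outer(v, v))
.ps 11
.fi
.KE
Each half may then be handed to an independent eigensolver, exactly as described above.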
.NH The Updating Problem .PP The general problem we are required to solve is that of computing the eigensystem of a matrix of the form .EQ (3.1) Q hat D hat Q hat sup T ~=~ D ~+~ rho zz sup T .EN where @ D @ is a real @n times n@ diagonal matrix, @rho@ is a scalar, and @z@ is a real vector of order @n@. It is assumed without loss of generality that @z@ has Euclidean norm 1. .PP We seek a formula for an eigen-pair for the matrix on the right hand side of (3.1). If @ q ^,^ lambda @ is such an eigen-pair then .EQ ( D ~+~ rho z z sup T ) q ~=~ lambda q .EN and a simple rearrangement of terms gives .EQ (D ~-~ lambda I )q ~=~ - rho ( z sup T q ) z ~. .EN Multiplying on the left by @ z sup T ( D ~-~ lambda I ) sup -1 @ gives .sp .EQ z sup T q ~=~ -^ rho (z sup T q ) z sup T (D ~-~ lambda I ) sup -1 z ~, .EN and hence .EQ 1 ~+~ rho z sup T (D ~-~ lambda I ) sup -1 z ~=~ 0 .EN must be satisfied by @lambda@. We have tacitly assumed that @z sup T q ~!=~ 0 @; if @ z sup T q ~=~ 0 @ then @Dq ~=~ lambda q@, hence @lambda ~=~ delta sub i @ and @q ~=~ e sub i @ for some @i@. In particular, if @ 0 ~=~ zeta sub i ~==~ e sub i sup T z @ then @ q ~=~ e sub i @ is an eigenvector of @ D ~+~ rho z z sup T @. We have just shown that if @D ~=~ diag( delta sub 1 ^,^ delta sub 2 ^,^ ... , delta sub n ^ ) @ with @ delta sub 1 ^<^ delta sub 2 ^<^ ... <^ delta sub n @ and no component @zeta sub i@ of the vector @z@ is zero, then the updated eigenvalues @ delta \*(TH sub i @ are roots of the equation .EQ (3.2) f ( lambda ) ~==~ 1 ~+~ rho sum from {j = 1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ lambda } ~=~ 0 . .EN Golub [4] refers to this as the secular equation and the behavior of its roots is pictorially described by the following graph: .KS .sp 20 .ce @Figure@ 1. The Secular Equation .KE .sp .PP Moreover, as shown in [2] the eigenvectors (i.e. the columns of @Q hat@ in (3.1)) are given by the formula .EQ (3.3) q hat sub i ~=~ gamma sub i DELTA sub i sup -1 z .EN with @gamma sub i @ chosen to make @ || ^ q hat sub i ^ || ~=~ 1 @, and with @ DELTA sub i ~=~ diag ( delta sub 1 - delta \*(TH sub i ^,^ delta sub 2 - delta \*(TH sub i ^,^...^,^ delta sub n - delta \*(TH sub i ~)@. Due to this structure, an excellent numerical method may be devised to find the roots of the secular equation and as a by-product to compute the eigenvectors to full accuracy. It must be stressed, however, that great care must be exercised in the numerical method used to solve the secular equation and to construct the eigenvectors from formula (3.3). .PP In the following discussion we assume that @rho ~>~ 0@ in (3.2). A simple change of variables may always be used to achieve this, so there is no loss of generality. The method we shall describe was inspired by the work of More@prime@ [8] and Reinsch [9,10], and relies on the use of simple rational approximations to construct an iterative method for the solution of equation (3.2). Given that we wish to find the @i-th@ root @delta \*(TH sub i@ of the function @f@ in (3.2) we may write this function as .EQ f( lambda ) ~=~ 1 ~+~ phi ^ ( lambda ) ~+~ psi ^ ( lambda ) .EN where .EQ psi ^ ( lambda ) ~==~ rho sum from {j ^=^ 1 } to i { zeta sub j sup 2 } over { delta sub j ~-~ lambda } .EN and .EQ phi ^ ( lambda ) ~==~ rho sum from {j ^=^ i+1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ lambda }. .EN .sp From the graph in Figure 1 it is seen that the root @ delta \*(TH sub i @ lies in the open interval @ ( delta sub i ^,^ delta sub i+1 )@ and for @lambda@ in this interval all of the terms of @psi@ are negative and all of the terms of @phi@ are positive.
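.PP Since Figure 1 conveys only the qualitative picture, the following fragment (a sketch in Python with NumPy; the data are invented for illustration) tabulates @f@ near its poles and confirms that exactly one root lies in each interval @( delta sub i ^,^ delta sub i+1 )@, with the last root in @( delta sub n ^,^ delta sub n + rho )@.
.KS
.nf
.ps 9
# Exhibit the interlacing of the roots of the secular function (3.2).
# Sketch only; d, z, and rho are arbitrary illustrative data.
import numpy as np

d = np.array([1.0, 2.0, 3.0])             # distinct diagonal of D
z = np.array([0.60, 0.48, 0.64])          # unit-norm weight vector
rho = 0.75
f = lambda lam: 1.0 + rho * np.sum(z**2 / (d - lam))

left = d + 1e-9                           # just right of each pole
right = np.r_[d[1:] - 1e-9, d[-1] + rho]  # just left of the next pole
for lo, up in zip(left, right):
    assert f(lo) < 0.0 <= f(up)           # sign change: one root inside
.ps 11
.fi
.KE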
We may derive an iterative method for solving the equation .EQ - psi ^ ( lambda ) ~=~ 1 ~+~ phi ^ ( lambda ) .EN by starting with an initial guess @lambda sub 0@ in the appropriate interval and then constructing simple rational interpolants of the form .EQ p over {q ~-~ lambda } ~,~ ~ r ~+~ s over { delta ~-~ lambda } .EN where @delta@ is fixed at @ delta sub i+1 @, and the parameters @p,~q,~r,~s@ are defined by the interpolation conditions .EQ (3.4) p over {q ~-~ lambda sub 0 } ~=~ psi ^ ( lambda sub 0 ) ~,~ ~ r ~+~ s over { delta ~-~ lambda sub 0 } ~=~ phi ^ ( lambda sub 0 ) .EN .EQ p over { ( q ~-~ lambda sub 0 ) sup 2 } ~=~ psi ^ prime ( lambda sub 0 ) ~,~ ~ s over { ( delta ~-~ lambda sub 0 ) sup 2 } ~=~ phi ^ prime ( lambda sub 0 )~. .EN The new approximation @ lambda sub 1 @ to the root @ delta \*(TH sub i @ is then found by solving .EQ (3.5) {-^ p} over {q ~-~ lambda } ~=~ 1 ~+~ r ~+~ s over { delta ~-~ lambda }~. .EN It is possible to construct an initial guess which lies in the open interval (@ delta sub i ^,^ delta \*(TH sub i @). A sequence of iterates {@ lambda sub k @} may then be constructed as we have just described, with @lambda sub k+1@ being derived from @lambda sub k @ as @lambda sub 1@ was derived from @ lambda sub 0@ above. The following theorem, proved in [2], shows that this iteration converges quadratically from one side of the root and does not need any safeguarding. .sp2 @THEOREM@ (3.6) .I Let @rho ~>~ 0 @ in @(3.2)@. If the initial iterate @ lambda sub 0 @ lies in the open interval @(^ delta sub i ^,^ delta \*(TH sub i ^)@ then the sequence of iterates { @ lambda sub k @ } constructed in equations @(3.4)-(3.5)@ is well defined and satisfies @ lambda sub k ~<~ lambda sub k+1 ~<~ delta \*(TH sub i @ for all @k ~>=~ 0 @. Moreover, the sequence converges quadratically to the root @ delta \*(TH sub i @. .R .PP In our implementation of this scheme, equation (3.5) is cast in such a way that we solve for the iterative correction @ tau ~=~ lambda sub 1 - lambda sub 0 @. The quantities @ delta sub j ~-~ lambda sub k@ which are used later in the eigenvector calculations are maintained, and these iterative corrections are applied directly to them as well as to the eigenvalue approximation. Cancellation in the computation of these differences is thus avoided because the corrections become smaller and smaller and are eventually applied to the lowest order bits. These values are then used directly in the calculation of the updated eigenvectors to obtain the highest possible accuracy. The rapid convergence of the iterative method allows the specification of very stringent convergence criteria that will ensure a relative residual and orthogonality of eigenvectors to full machine accuracy. The stopping criterion used is .EQ (i)~ | f( lambda ) | ~~<=~~ eta ^ max( | delta sub 1 | ^,^ | delta sub n | ) ~~ and ~~ (ii) | tau | ~~<=~~ eta ^ min ( | delta sub i ~-~ lambda | ^,^ | delta sub i+1 ~-~ lambda | ) , .EN where @ lambda @ is the current iterate and @ tau @ is the last iterative correction that was computed. The condition on @tau@ is very stringent. The purpose of such a stringent stopping criterion will be clarified in the next section when we discuss orthogonality of computed eigenvectors. Let it suffice at this point to say that condition @(i)@ assures a small residual and that @(ii)@ assures orthogonal eigenvectors. .PP In most problems these criteria are easily satisfied.
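.PP The structure of the whole updating step may be summarized in the following sketch (Python with NumPy; all names are ours and illustrative). To keep the sketch short, a bisection search replaces the quadratically convergent iteration (3.4)-(3.5) used in the actual routine SESUPD, and exact bracketing plays the role of the stopping tests above; the sketch assumes the @delta sub i@ are distinct and no @zeta sub i@ vanishes, which the deflation of the next section enforces.
.KS
.nf
.ps 9
# Sketch of the updating problem (3.1): roots of the secular equation
# (3.2) and eigenvectors via (3.3).  Bisection stands in for the
# rational iteration (3.4)-(3.5) of the text.
import numpy as np

def secular_eig(d, z, rho, iters=100):
    # d strictly increasing; z unit norm, no zero entries; rho > 0
    n = len(d)
    f = lambda lam: 1.0 + rho * np.sum(z**2 / (d - lam))
    upper = np.r_[d[1:], d[-1] + rho]   # i-th root in (d[i], upper[i])
    lam = np.empty(n)
    for i in range(n):
        lo, up = d[i], upper[i]         # f < 0 near lo, f >= 0 at up
        for _ in range(iters):
            mid = 0.5 * (lo + up)
            lo, up = (mid, up) if f(mid) < 0.0 else (lo, mid)
        lam[i] = 0.5 * (lo + up)
    Q = z[:, None] / (d[:, None] - lam[None, :])  # columns Delta_i^{-1} z
    Q /= np.linalg.norm(Q, axis=0)                # gamma_i normalization
    return lam, Q

rng = np.random.default_rng(1)
d = np.sort(rng.standard_normal(6)); rho = 0.5
z = rng.standard_normal(6); z /= np.linalg.norm(z)
lam, Q = secular_eig(d, z, rho)
assert np.allclose(Q.T @ Q, np.eye(6), atol=1e-8)   # orthogonality
assert np.allclose(Q @ np.diag(lam) @ Q.T,
                   np.diag(d) + rho * np.outer(z, z), atol=1e-8)
.ps 11
.fi
.KE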
However, there are pathological situations involving nearly equal eigenvalues that make @(ii)@ very difficult to satisfy. It is here that the basic root finding method described above must be modified. The problem stems from the fact that the iterative corrections cease to modify the value of @f@ due to roundoff error. When working in single precision, this situation can be rectified through the use of extended precision accumulation of inner products in the evaluation of @f@ and its derivative. However, this becomes less attractive when working in 64-bit arithmetic as is done in most scientific calculations. Whether or not we shall be able to provide a root finder that will satisfy these stringent requirements in 64-bit arithmetic for very pathological cases remains to be seen. .PP This has been a brief description of the rank-one updating scheme. Full theoretical details are available in [2]. This calculation represents the workhorse of the parallel algorithm. The method seems to be very robust in practice and exhibits high accuracy. It does not seem to suffer the effects of nearly equal roots as Cuppen suggests [3], but instead has been able to solve such ill-conditioned problems as the Wilkinson matrices @W sub 2k+1 sup + @ [13, p. 308] to full machine precision and with slightly better residual and orthogonality properties than the standard algorithm TQL2 from EISPACK. In the next section we offer some reasons for this behavior. .NH Deflation and Orthogonality of Eigenvectors .PP At the outset of this discussion we made the assumption that the diagonal elements of @D@ were distinct and that no component of the vector @z@ was zero. These conditions are not satisfied in general, so deflation techniques must be employed to ensure their satisfaction. A deflation technique was suggested in [2] to provide distinct eigenvalues, which amounts to rotating the basis for the eigenspace corresponding to a multiple eigenvalue so that only one component of the vector @ z @ corresponding to this space is nonzero in the new basis. Those terms in (3.2) corresponding to zero components of @z@ may simply be dropped. The eigenvalues and eigenvectors corresponding to these zero components remain static. In finite precision arithmetic the situation becomes more interesting. Terms corresponding to small components of @z@ may be dropped. This can have a very dramatic effect upon the amount of work required in our parallel method. As first observed by Cuppen [3], there can be significant deflation in the updating process as the original matrix is rebuilt from the subproblems. .PP Deflation can occur in two ways, either through small components of @z@ or through nearly equal eigenvalues. We shall refine these notions in this section to include "nearly zero" components of @z@ and "nearly equal" eigenvalues. Analysis of the first situation is straightforward. The second situation can be quite delicate in certain pathological cases, however, so we shall discuss it here in some detail. It turns out that the case of nearly equal eigenvalues may be reduced to the case of small components of @z@. .PP It is straightforward to see that when @z sup T e sub i ~=~ 0@ the @i-th @ eigenvalue and corresponding eigenvector of @D@ will be an eigen-pair for @ D ~+~ rho z z sup T @. Let us ask the question "when is an eigen-pair of @D@ a good approximation to an eigen-pair for the modified matrix?" This question is easily answered.
Recall that @ || ^ z ^ || ~=~ 1 @, so .EQ || ^ ( D ~+~ rho z z sup T ) e sub i ~-~ delta sub i e sub i ^ || ~=~ | rho zeta sub i |^ || ^ z ^ || ^ ~=~ | rho zeta sub i | . .EN Thus we may accept this eigen-pair as an eigen-pair for the modified matrix whenever .EQ | rho zeta sub i | ~<=~ tol .EN where @tol ~>~ 0 @ is our error tolerance. Typically @ tol ~=~ macheps ^ ( ^ eta ^ || A || ^ ) ^ @, where @macheps@ is machine precision and @eta@ is a constant of order unity. However, at each stage of the updating process we may use .EQ tol ~=~ macheps ^ eta ^ ( ^ max ( | delta sub 1 | ^,^ | delta sub n | ) ~+~ | rho | ^) .EN since at every stage this will represent a bound on the spectral radius of a principal submatrix of @A@. Let us now suppose there are two eigenvalues of @D@ separated by @epsilon@ so that @epsilon ~=~ delta sub i+1 ~-~ delta sub i @. Consider the 2 @times@ 2 submatrix .EQ left (^ matrix { ccol { delta sub i above bl} ccol {bl above delta sub i+1 } } ^ right ) ~ ~+~ rho left ( ~ matrix { ccol { zeta sub i above zeta sub i+1} } ~ right ) ( zeta sub i ~,~ zeta sub i+1 ) .EN of @D ~+~ rho z z sup T @ and let us construct a Givens transformation to introduce a zero in one of the two corresponding components in @z@. Then .EQ left (^ matrix { ccol { gamma above - sigma} ccol {sigma above gamma } } ^ right ) ~ left ( ^ left (^ matrix { ccol { delta sub i above bl} ccol {bl above delta sub i+1 } } ^ right ) ~ ~+~ rho left ( ~ matrix { ccol { zeta sub i above zeta sub i+1} } ~ right ) ( zeta sub i ~,~ zeta sub i+1 ) ^ right ) ~ left (^ matrix { ccol { gamma above sigma} ccol {- sigma above gamma } } ^ right ) ~ .EN .EQ =~ left (^ matrix { ccol { delta \*(TH sub i above bl} ccol {bl above delta \*(TH sub i+1 } } ^ right ) ~ ~+~ rho left ( ~ matrix { ccol { tau above 0} } ~ right ) ( tau ~,~ 0 ) ~+~ epsilon ^ sigma ^ gamma ^ left (^ matrix { ccol { 0 above 1} ccol { 1 above 0 } } ^ right ) ~ .EN where @gamma sup 2 ~+~ sigma sup 2 ~=~ 1@, @delta \*(TH sub i ~=~ delta sub i gamma sup 2 ~+~ delta sub i+1 sigma sup 2 ~@, @delta \*(TH sub i+1 ~=~ delta sub i sigma sup 2 ~+~ delta sub i+1 gamma sup 2 ~@, and @ tau sup 2 ~=~ zeta sub i sup 2 + zeta sub i+1 sup 2 @. Now, if we put @G sub i @ equal to the @n times n@ Givens transformation constructed from the identity by replacing the appropriate diagonal block with the @2 times 2@ rotation just constructed, we have .sp .EQ (4.1) G sub i (^ D ~+~ rho z z sup T ^ ) G sub i sup T ~=~ D hat ~+~ rho z hat { z hat } sup T ~+~~ E sub i .EN .sp with @e sub i sup T z hat ~=~ tau @ and @e sub i+1 sup T z hat ~=~ 0 @, @ e sub i sup T D hat e sub i ~=~ delta \*(TH sub i @, @ e sub i+1 sup T D hat e sub i+1 ~=~ delta \*(TH sub i+1 @, and .EQ || ^ E sub i ^ || ~=~ | epsilon ^ sigma ^ gamma | ~~. .EN If we choose, instead, to zero out the @i-th@ component of @z@ then we simply apply @G sub i @ on the right and its transpose on the left in equation (4.1) to obtain a similar result; the only difference is that the sign of the matrix @E sub i @ is reversed. Of course this deflation is only done when .EQ | epsilon ^ gamma ^ sigma | ~<=~ tol .EN is satisfied. .PP The result of applying all of the deflations is to replace the updating problem (3.1) with one of smaller size. When appropriate, this is accomplished by applying similarity transformations consisting of several Givens transformations.
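.PP A small numerical illustration may be helpful before stating the general result. The sketch below (Python with NumPy; the function name and data are ours) performs one such deflating rotation and confirms that the discarded off-diagonal perturbation is exactly @epsilon sigma gamma@; the simpler test @| rho zeta sub i | ~<=~ tol@ for dropping a small component outright is noted in a comment.
.KS
.nf
.ps 9
# One deflating Givens rotation for a pair of close eigenvalues,
# following (4.1).  Sketch only; names are not from the actual code.
# (A component with |rho * zi| <= tol would simply be dropped first.)
import numpy as np

def deflate_pair(di, di1, zi, zi1, tol):
    tau = np.hypot(zi, zi1)              # tau^2 = zi^2 + zi1^2
    gamma, sigma = zi / tau, zi1 / tau   # rotation zeroing second weight
    if abs((di1 - di) * gamma * sigma) > tol:
        return None                      # deflation test fails
    return (di * gamma**2 + di1 * sigma**2,
            di * sigma**2 + di1 * gamma**2, tau)

di, di1, zi, zi1, rho, tol = 1.0, 1.0 + 1e-4, 0.6, 0.8, 2.0, 1e-3
dhi, dhi1, tau = deflate_pair(di, di1, zi, zi1, tol)
gamma, sigma = zi / tau, zi1 / tau
G = np.array([[gamma, sigma], [-sigma, gamma]])
B = np.diag([di, di1]) + rho * np.outer([zi, zi1], [zi, zi1])
E = G @ B @ G.T - np.diag([dhi, dhi1]) \
    - rho * tau**2 * np.outer([1.0, 0.0], [1.0, 0.0])
assert np.allclose(E, (di1 - di) * sigma * gamma
                   * np.array([[0.0, 1.0], [1.0, 0.0]]))
.ps 11
.fi
.KE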
If @G@ represents the product of these transformations, the result is .EQ G(~D ~+~ rho z z sup T ~ ) G sup T ~=~ left ( ~ matrix { ccol {{ ~D sub 1 ~+~ rho z sub 1 z sub 1 sup T } above 0} ccol { 0 above D sub 2}} ~right ) ~~+~~ E ~, .EN where .EQ || ~ E ~ || ~<=~ tol ^ eta sub 2 .EN with @ eta sub 2 @ of order unity. The cumulative effect of such errors is additive, and thus the final computed eigensystem @ Q hat D hat Q hat sup T @ satisfies .EQ || ~ A ~ - ~ Q hat D hat Q hat sup T || ~<=~ eta sub 3 tol .EN where @ eta sub 3 @ is again of order 1 in magnitude. The reduction in size of @ D sub 1 ~+~ rho z sub 1 z sub 1 sup T @ over the original rank-one modification can be spectacular in certain cases, and the effect of such deflation on efficiency is dramatic, for the amount of computation required to perform the updating is greatly reduced. .PP Let us now consider the possible limitations on orthogonality of eigenvectors due to nearly equal roots. Our first result is a perturbation lemma that will indicate the inherent difficulty associated with nearly equal roots. .sp2 @LEMMA (4.2)@ .I Let .EQ (4.3) q sub lambda sup T ~==~ ( ^ { zeta sub 1 } over { delta sub 1 - lambda } ^,^ { zeta sub 2 } over { delta sub 2 - lambda } ^,^...^ { zeta sub n } over { delta sub n - lambda } ^ ) left [ { rho } over { f prime ( lambda ) } right ] sup half , .EN where @f@ is defined by formula (3.2). Then for any @ lambda , mu ~ notin ~ "{" delta sub i ^:^ i ~=~ 1,...,n "}" @ .EQ (4.4) | q sub lambda sup T q sub mu | ~=~ 1 over { | lambda ~-~ mu | } {{ | ^ f ( lambda ) ~-~ f ( mu ) ^ | } over { left [ f prime ( lambda ) f prime ( mu ) right ] sup half }} . .EN .sp @PROOF@: .R Note that .EQ (4.5) q sub lambda sup T q sub mu ~=~ left ( sum from {j = 1 } to n { zeta sub j sup 2 } over { (^ delta sub j ~-~ lambda ^) (^ delta sub j ~-~ mu ^) } right ) { | ^ rho ^ | } over { left [ f prime ( lambda ) f prime ( mu ) right ] sup half } ~. .EN But, .EQ { lambda ~-~ mu } over { (^ delta sub j ~-~ lambda ^) (^ delta sub j ~-~ mu ^) } ~=~ 1 over { (^ delta sub j ~-~ lambda ^)} ~-~ 1 over { (^ delta sub j ~-~ mu ^) } . .EN Thus .EQ ( lambda ~-~ mu ) rho sum from {j = 1 } to n { zeta sub j sup 2 } over { (^ delta sub j ~-~ lambda ^) (^ delta sub j ~-~ mu ^) } ~=~ rho sum from {j = 1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ lambda } ~-~ rho sum from {j = 1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ mu } ~=~ f( lambda ) ~-~ f ( mu ) .EN and the result follows. @\(sq@ .PP Note that in (4.3) @ q sub lambda @ is always a vector of unit length, and the set of @n@ vectors selected by setting @lambda@ equal to the roots of the secular equation is the set of eigenvectors for @ D ~+~ rho z z sup T @. Moreover, equation (4.4) shows that those eigenvectors are mutually orthogonal whenever @ lambda @ and @ mu @ are set to distinct roots of @f@. Finally, the term @ | lambda ~-~ mu | @ appearing in the denominator of (4.4) sends up a warning that it may be difficult to attain orthogonal eigenvectors when the roots @lambda @ and @ mu @ are close. We wish to examine this situation now. We will show that, as a result of the deflation process, the @ "{" delta sub i "}" @ are sufficiently separated, and the weights @ zeta sub i @ are uniformly large enough, that the roots of @f@ are bounded away from each other. This statement is made explicit in the following lemma. .sp2 @LEMMA@ (4.6) .I Let @lambda@ be the root of @f@ in the @i-th@ subinterval @( delta sub i ^,^ delta sub i+1 )@.
If the deflation test is satisfied then either .EQ (i) | ^ delta sub i+1 ~-~ lambda ^ | ~>=~ half | delta sub i+1 ~-~ delta sub i | ~and~ | ^ delta sub i ~-~ lambda ^ | ~>=~ tol sup 2 over { | rho ( ^ delta sub i+1 - delta sub i ^ ) | ~+~ 2 rho sup 2 } ~, .EN or .EQ (ii) | ^ delta sub i ~-~ lambda ^ | ~>=~ half | delta sub i+1 ~-~ delta sub i | ~and~ | ^ delta sub i+1 ~-~ lambda ^ | ~>=~ { tol sup 2 } over { 2 rho sup 2 } ( delta sub i+1 ~-~ delta sub i ) .EN .sp .R @PROOF@: In case (i) the fact that @f( lambda ) ~=~ 0 @ provides .EQ rho sum from {j = 1 } to i { zeta sub j sup 2 } over { lambda ~-~ delta sub j } ~=~ 1 ~+~ rho sum from {j = i+1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ lambda } . .EN From this equation we find that .EQ | rho | { zeta sub i sup 2 } over { lambda ~-~ delta sub i } ~<=~ 1 + { | rho | } over { delta sub i+1 ~-~ lambda } ~<=~ 1 + { 2 | rho | } over { delta sub i+1 ~-~ delta sub i } , .EN and it follows readily that .EQ | delta sub i+1 ~-~ delta sub i | left ( { | rho | zeta sub i sup 2 } over { | delta sub i+1 ~-~ delta sub i | ~+~ 2 | rho | } right ) ~<=~ | delta sub i ~-~ lambda | ~. .EN The deflation rules assure us that @| rho | zeta sub i sup 2 ~>=~ tol sup 2 / | rho | @ and the result follows. Case (ii) is similar. @\(sq@ .PP The form of the result given in (4.6) will have importance in the following lemma, which gives better insight into the quality of numerical orthogonality attainable with this scheme. The bounds obtained are certainly less than one would hope for. Moreover, it is unfortunate that only the magnitudes of the weights enter into the estimate. This obviously does not take into account the deflation due to nearly equal eigenvalues. Unfortunately, we have not been able to improve these estimates by other means and are left with this crude bound. .sp2 @LEMMA@ (4.7) .I Suppose that @ lambda \*(TH @ and @ mu hat @ are numerical approximations to exact roots @ lambda @ and @ mu @ of @f@. Assume that these roots are distinct and let the relative errors for the quantities @ delta sub i ~-~ lambda @ and @ delta sub i ~-~ mu @ be denoted by @ theta sub i @ and @ eta sub i @ respectively. That is, the computed quantities satisfy .EQ (4.8) delta sub i ~-~ lambda \*(TH ~=~ ( delta sub i ~-~ lambda )( 1 ~+~ theta sub i ) ~ ~and ~ ~ delta sub i ~-~ mu hat ~=~ ( delta sub i ~-~ mu ) ( 1 ~+~ eta sub i ), .EN for @ i ~=~ 1,2,...,n@. Let @ q sub lambda hat @ and @ q sub mu hat @ be defined according to formula (4.3) using the computed quantities given in (4.8). If @ | theta sub i | , | eta sub i | ~<=~ epsilon ~ << ~ 1 @, then .EQ | q sub lambda hat sup T q sub mu hat | ~=~ | q sub lambda sup T E q sub mu | ~<=~ epsilon ^ ( 2 ^+^ epsilon ) left ( { 1 ^+^ epsilon } over { 1 ^-^ epsilon } right ) sup 2 ~, .EN with @E@ a diagonal matrix whose @i-th@ diagonal element is .EQ E sub ii ~=~ { theta sub i ~+~ eta sub i ~+~ theta sub i eta sub i } over { ( ^ 1 ~+~ theta sub i ^) ( ^ 1 ~+~ eta sub i ^) } left [ { f prime ( lambda ) f prime ( mu ) } over { f prime ( lambda \*(TH ) f prime ( mu hat ) } right ] sup half .
.EN .sp .R @PROOF@: From formula (4.3) we have .EQ -^ q sub lambda hat sup T q sub mu hat ~ =~ -^ left ( ^ sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) ( delta sub j ~-~ mu ) ( 1 ~+~ theta sub j ) ( 1 ~+~ eta sub j ) } ^ right ) { | ^ rho ^ | } over { left [ f prime ( lambda \*(TH ) f prime ( mu hat ) right ] sup half } .EN .EQ ~=~ left ( ^ sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) ( delta sub j ~-~ mu ) } ~-~ sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) ( delta sub j ~-~ mu ) ( 1 ~+~ theta sub j ) ( 1 ~+~ eta sub j ) } ^ right ) { | ^ rho ^ | } over { left [ f prime ( lambda \*(TH ) f prime ( mu hat ) right ] sup half } .EN due to the orthogonality of the exact vectors. Thus, .EQ | q sub lambda hat sup T q sub mu hat | ~ =~ | sum from {j = 1 } to n left ( ^ { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) ( delta sub j ~-~ mu ) } ^ right ) left ( ^ 1 ~-~ 1 over { ( ^ 1 ~+~ theta sub j ^) ( ^ 1 ~+~ eta sub j ^) } ^ right ) rho over { left [ f prime ( lambda \*(TH ) f prime ( mu hat ) right ] sup half } | . .EN Since the quantity .EQ { f prime ( lambda ) } over { f prime ( lambda \*(TH ) } ~=~ { sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) sup 2 } } over { sum from {j = 1 } to n { zeta sub j sup 2 } over { ( delta sub j ~-~ lambda ) sup 2 ( 1 ~+~ theta sub j ) sup 2 } } ~<= ~ ( 1 ~+~ epsilon ) sup 2 .EN and since .EQ max E sub ii ~<=~ { epsilon ~+~ epsilon ~+~ epsilon sup 2 } over { (1 ~-~ epsilon ) ( 1 ~-~ epsilon ) } .EN the result follows. @\(sq@ .PP This shows that orthogonality can be assured whenever it is possible to provide small relative errors when computing the differences @ delta sub i ~-~ lambda @. Since, as we mentioned in Section 3, these quantities are updated with the iterative corrections to @ lambda @, and since deflation has guaranteed that the quantities in Lemma (4.6) are bounded away from zero, it will be possible in theory to provide small relative errors. Obviously, the analysis given here, simple as it may be, could form the core of a rigorous error analysis of the method. As yet we lack a root finder that could be proven to satisfy numerically the relative error requirements. The one described above will do so in finite precision arithmetic but may require extended precision accumulation of inner products in pathological cases. In practice, however, we have not experienced difficulty except with contrived examples. Cuppen [3] has shown that the computed eigenvectors may be safely reorthogonalized when needed, though we hope to avoid this alternative. .NH Reduction to Tridiagonal Form and Related Issues .PP This algorithm is designed to work in conjunction with Householder's reduction of a symmetric matrix to tridiagonal form. In this standard technique a sequence of @n-1@ Householder transformations is applied to form .EQ A sub k+1 ~=~ (I~-~ alpha sub k w sub k w sub k sup T ) A sub k (I~-~ alpha sub k w sub k w sub k sup T )~, .EN with .EQ A sub k ~=~ left ( matrix { ccol { T sub k above 0 } ccol { 0 above A hat sub k } } right ) ~, .EN where @ T sub k @ is a tridiagonal matrix of order @k-1@. See [11] for details. We note here that we can form @(I~-~ alpha ww sup T )A(I~-~ alpha w w sup T ) @ using the following algorithm. .KS .TS center; l. \f1Algorithm 5.1 1. @v ~=~ Aw@ 2. @y sup T ~=~ v sup T ~-~ alpha ( w sup T v ) w sup T@ 3. Replace @A@ by @ A ~-~ alpha vw sup T ~-~ alpha w y sup T@ .TE .KE
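.PP In a modern notation the algorithm reads as follows (a sketch in Python with NumPy, given purely for illustration; the function name is ours). With @alpha ~=~ 2/( w sup T w )@ the factor @I ~-~ alpha w w sup T @ is an orthogonal Householder matrix, and the sketch checks the identity directly.
.KS
.nf
.ps 9
# Sketch of Algorithm 5.1: (I - a w w^T) A (I - a w w^T) computed by
# one matrix-vector product plus a symmetric rank-two update.
import numpy as np

def apply_householder(A, w, alpha):
    v = A @ w                            # step 1
    y = v - alpha * (w @ v) * w          # step 2
    A -= alpha * np.outer(v, w)          # step 3: rank-two correction
    A -= alpha * np.outer(w, y)
    return A

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n)); A = A + A.T   # symmetric test matrix
w = rng.standard_normal(n)
alpha = 2.0 / (w @ w)                          # Householder choice of alpha
H = np.eye(n) - alpha * np.outer(w, w)
assert np.allclose(apply_householder(A.copy(), w, alpha), H @ A @ H)
.ps 11
.fi
.KE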
Steps 1 and 3 may both be parallelized because @A@ may be partitioned into blocks of columns @A ~=~ ( A sub 1 ^,^ A sub 2 ^,^ ... ^,^ A sub j )@ and each of the contributions corresponding to these columns may be carried out independently in steps 1 and 3. For example, in step 3, .EQ A sub i ~<-~ A sub i ~-~ alpha v w sub i sup T ~-~ alpha w y sub i sup T .EN where the vectors @w@ and @y@ have been partitioned in a corresponding manner. In step 2 the partial results will have to be stored in temporary locations until all are completed, and then they may be added together to obtain the final result. Note also that when the calculations are arranged this way, advantage may be taken of vector operations when they are available. .PP Algorithm 5.1 has some disadvantages when incorporated into the reduction of a symmetric matrix to tridiagonal form. At the k-th stage of the reduction we have .EQ left ( ~ ~ matrix { ccol { T sup (k) above beta e sub 1 e sub k sup T } ccol { beta e sub k e sub 1 sup T above A sup (k) } } ~ ~ right ) .EN The reduction is advanced one step through the application of Algorithm 5.1 to the submatrix @ A sup (k) @. First, a fork-join synchronization construct is imposed, since the matrix-vector product requires the entire matrix @ A sup (k) @ to be in place before this product can be completed, so that no portion of Step 2 may begin until Step 1 is finished. Second, due to symmetry, only the lower triangle of @ A sup (k) @ need be computed, and this implies that vector lengths shorten during the computation. .PP When used in conjunction with the rank-one tearing scheme, these drawbacks may be overcome. Suppose the final result of this decomposition is @T@ and that this matrix would be partitioned into @ (~ T sub 1 ^,^ T sub 2 ^,...,^ T sub m ~) @ by the rank-one tearing if it were known. It is not necessary to wait until the entire reduction is completed, for @T sup (k) @ represents a leading principal submatrix of @T@. Thus, as soon as the first sub-tridiagonal matrix @ T sub 1 @ is exposed, the process of computing its eigensystem may be initiated. Similarly, as soon as @ T sub 2 @ is exposed, its eigensystem may be computed. Then a rank-one update may occur, and so on. In this way a number of independent processes may be spawned early on, and the number of such processes ready to execute will remain at a reasonable level throughout the course of the computation. An efficient implementation of this scheme is difficult and will be the subject of a subsequent paper. Issues such as the proper level of partitioning will have added importance due to the desire to have parallel processes set in motion as soon as possible. .PP When we do not wish to find eigenvectors there is no reason to store the product @Q@ of these Householder transformations. Nor is it necessary to accumulate the product of the successive eigenvector transformations resulting from the updating problem. That is, we do not need to overwrite @Q@ with .EQ Q ~<-~ Q left ( ~ matrix { ccol { Q sub 1 above 0 } ccol { 0 above Q sub 2 } } ~ right ) Q hat .EN where @Q sub 1 @, @Q sub 2 @ and @Q hat @ are the matrices appearing in equations (2.2) and (3.1) above. Instead, we may simply discard @Q@. Then the vector @q sub 1 @ may be formed as @T hat sub 1 @ is transformed to @D sub 1@ in (2.2) by accumulating the products of the transformations constructed in TQL2 that make up @Q sub 1@ against the vector @ e sub k @.
If there is more than one division then @Q sub 1 @ will have been calculated with the updating scheme. In this case we do not calculate all of @Q sub 1@ but instead use the component-wise formula for the eigenvectors to pick off the appropriate components needed to form @q sub 1 @. .PP This algorithm can be generalized to handle band matrices as well. Instead of performing a rank-one tearing to split the matrix into two independent subproblems, we proceed by making @(m+1)m/2@ rank-one changes designed to split the matrix, where @m@ is the half bandwidth (@m ~=~ 1@ for tridiagonal matrices).
.KS
.sp 2
.ce 2
[Diagram: a symmetric band matrix with half bandwidth @m ~=~ 2@,
showing the entries of rows @i@ and @i+1@ that are torn out of the band.]
.sp 2
.KE
The three rank-one changes for this band matrix (@m~=~2@) involve the following elements: .TS center; l. @a sub i-1,i-1 @, @ a sub i-1,i+1 @, @a sub i+1,i-1 @, @ a sub i+1,i+1 @, @a sub i,i @, @ a sub i,i+1 @, @a sub i+1,i @, @ a sub i+1,i+1 @, and @a sub i,i @, @ a sub i,i+2 @, @a sub i+2,i @, @ a sub i+2,i+2 @. .TE The order in which they are applied does not matter in exact arithmetic. We have not studied the numerical properties of this scheme, however. .NH The Parallel Algorithm .PP Although it is fairly straightforward from Section 2 to see how to obtain a parallel algorithm, certain details are worth discussing further. We shall begin by describing the partitioning phase. This phase amounts to constructing a binary tree with each node representing a rank-one tear and hence a partition into two subproblems. A tree of height 3 therefore represents a splitting of the original problem into 8 smaller eigenvalue problems. Thus, there are two standard symmetric tridiagonal eigenvalue problems to be solved at each leaf of the tree. Each of these problems may be spawned independently without fear of data conflicts. The tree is then traversed in reverse order with the eigenvalue updating routine SESUPD applied at each node, joining the results from the left son and right son calculations. The leaves each define independent rank-one updating problems, and again there are no data conflicts between them. The only data dependency at a node is that the left and right son calculations must have been completed. When this condition is satisfied, the results of two adjacent eigenvalue subproblems are ready to be joined through the rank-one updating process, and this node may spawn the updating process immediately. Information required at a node to define the problem consists of the index of the element torn out, together with the dimensions of the left and right son problems. For example, if @n ~=~ 50@ with a tree of height 3 we have
.sp 5
.KS
.cs 1 24
.nf
.ps 7
.ce 6
26
25   25
13   38
12   13   12   13
7   19   32   44
6 6   6 7   6 6   6 7
.fi
.cs 1
.ps 11
.ce
@Figure@ 2. The Computational Tree
.KE
.sp 2
This tree defines 8 subproblems at the lowest level. There are two calls to TQL2 required at each leaf of the tree in Figure 2 to solve tridiagonal eigenvalue problems. The beginning indices of these problems are 1, 7, 13, 19, 26, 32, 38, 44, and the dimension of each of them may be read off from left to right at the lowest level as 6, 6, 6, 7, 6, 6, 6, 7, respectively. As soon as the calculations for the problems beginning at indices 1 and 7 have been completed, a rank-one update may proceed on the problem beginning at index 1 with dimension 12. The remaining updating problems at this level begin at indices 13, 26, and 38. There are then two updating problems, at indices 1 and 26, each of dimension 25, and a final updating problem at index 1 of dimension 50.
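.PP This partition is easy to generate mechanically. The short sketch below (Python; the function name is ours and illustrative) reproduces the starting indices and dimensions quoted above by halving each segment recursively.
.KS
.nf
.ps 9
# Generate the subproblem partition of Figure 2: (start index,
# dimension) pairs for the leaves of a tearing tree of given height.
def partition(start, n, height):
    if height == 0:
        return [(start, n)]
    k = n // 2                       # tear between rows k and k + 1
    return (partition(start, k, height - 1)
            + partition(start + k, n - k, height - 1))

print(partition(1, 50, 3))
# [(1, 6), (7, 6), (13, 6), (19, 7), (26, 6), (32, 6), (38, 6), (44, 7)]
.ps 11
.fi
.KE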
.PP Evidently, we lose a degree of large grain parallelism as we move up a level on the tree. However, there is more parallelism to be found at the root finding level, and the amount of this parallelism increases as we travel up the tree, so there is ample opportunity for load balancing in this scheme. The parallelism at the root finding level stems from the fact that each of the root calculations is independent and requires only read access to all but one array; the exception is the array that contains the diagonal entries of the matrix @ DELTA sub i@ described in Section 3. For computational efficiency we may decide on an advantageous number of processes to create at the outset. In the example above that number was 8. Then, as we travel up the tree, the root finding procedure is split into 2, 4, and finally 8 parallel parts at each node at levels 3, 2, and 1, respectively. As these computations are roughly equivalent in complexity on a given level, it is reasonable to expect to keep all processors devoted to this computation busy throughout. .NH Implementation and Library Issues .PP The implementation described here is for computers with a shared memory architecture. The algorithm itself is not limited to this memory model, however; Jessop [] has an implementation of Cuppen's algorithm for a hypercube. .PP Obviously, the implementation of this scheme is not so straightforward as simply parallelizing a loop using a fork-join construct. The synchronization mechanism requires more sophistication, since it must be able to spawn processes at the root finding level based upon computation that has taken place. This process allocation problem is dynamic rather than static since, due to the possibility of deflation, it will not be known in advance how many roots will have to be calculated at a given level. If there are only a few roots to be found, then the desire for reasonable granularity will dictate that fewer processes be spawned. .PP In addition to requiring a certain level of sophistication in the synchronization scheme, we would like to adhere as much as possible to the principles of transportability. This algorithm is obviously a candidate for a library subroutine and potentially will find use on a wide variety of machines. When designing library subroutines, one wishes to conceal machine dependencies as much as possible from the user. These important considerations seem to be difficult to accommodate if we are to invoke parallelism at the level described above. It would appear that the user must be conscious of the number of parallel processes required by the library subroutines throughout his program. Should the library routines be called from multiple branches of a user's parallel program, he could inadvertently attempt to create many more processes than the physical limitations allow. .PP We believe there is hope for implementing this algorithm while adhering to the goals set out in the previous two paragraphs. It may be accomplished by adopting a programming methodology inspired by the work of Lusk and Overbeek [6,7] and of Babb [1] on implementing transportable parallel codes. We use a package called SCHEDULE [] that we have been developing while this algorithm was being devised and tested.
SCHEDULE is a package of Fortran subroutines designed to aid in programming explicitly parallel algorithms for numerical calculations. The design goal of SCHEDULE is to allow a programmer familiar with a Fortran programming environment to implement a parallel algorithm in a style of Fortran programming that will lend itself to transporting the resulting program across a wide variety of parallel machines. .NH Performance .PP In this section we present and analyze the results of this algorithm on a number of machines. The same algorithm has been run on a VAX 11/785, a Denelcor HEP, an Alliant FX/8, and a CRAY X-MP-4. .PP We have compared our implementation of the algorithm described in this paper to TQL2 from the EISPACK collection. Table 8.1 gives the ratio of execution time for TQL2 from EISPACK to that of the algorithm presented here, with both run sequentially on the same machine. In all cases, the TQL2 code used to test against and the TQL2 code used at lowest level in our algorithm were exactly the same code. Both the parallel algorithm and the serial version of TQL2 were called by a common driver test routine during the same computer run. This assures that both routines were compiled under the same compiler options and ran in the same computing environment to produce this timing data. The timings for TQL2 were obtained by executing it as a single process, and it should be emphasized that in all cases the computations were carried out as though the tridiagonal matrix had come from Householder's reduction of a dense symmetric matrix to tridiagonal form. The identity was passed in place of the orthogonal basis that would have been provided by this reduction, but the arithmetic operations performed were the same as those that would have been required to transform that basis into the eigenvectors of the original symmetric matrix.
.sp2
.KS
.ce
Table 8.1
.nr LL 6.i
.ll 6.i
.TS
center;
c|c c c c c
c|c c c c c
c|c c c c c.
@n@  Ratio of time  TQL2  SESUPD  TQL2  SESUPD
  TQL2/SESUPD  @ || Ax ~-~ lambda x || @  @ || Ax ~-~ lambda x || @  @ || X sup T X ~-~ I || @  @ || X sup T X ~-~ I || @
_  _  _  _  _  _
50  .78  @1.6 times 10 sup -12 @  @4.8 times 10 sup -13 @  @1.9 times 10 sup -12@  @1.2 times 10 sup -13 @
100  1.26  @2.6 times 10 sup -12 @  @7.6 times 10 sup -13 @  @3.9 times 10 sup -12@  @2.5 times 10 sup -12 @
200  1.93  @6.1 times 10 sup -12 @  @1.8 times 10 sup -12 @  @7.0 times 10 sup -12@  @5.8 times 10 sup -11 @
300  3.62  @1.1 times 10 sup -11 @  @3.3 times 10 sup -12 @  @1.1 times 10 sup -11@  @2.8 times 10 sup -12 @
400  4.70  @1.2 times 10 sup -11 @  @3.8 times 10 sup -12 @  @1.4 times 10 sup -11@  @3.8 times 10 sup -12 @
.TE
.ce
.I
A comparison on the CRAY X-MP for TQL2 vs the
.ce
parallel algorithm run sequentially
.KE
.R
.sp
.PP As can be seen in Table 8.1, the performance of the parallel algorithm as implemented to run on a sequential machine is quite impressive. The surprising result here is the observed speedup even in serial mode of execution. This is unusual for a parallel algorithm: often more work is associated with the synchronization and computational overhead required to split the problem into parallel parts. Table 8.2 gives the ratio of execution time for TQL2 run sequentially to that of the algorithm presented here run in parallel on several machines.
.sp2
.KS
.ce
Table 8.2
.nr LL 6.i
.ll 6.i
.TS
center;
c|cp8 cp8 cp8 cp8 cp8 cp8
c|np8 np8 np8 np8 np8 np8
c|np8 np8 np8 np8 np8 np8.
  VAX 785/FPA  Denelcor HEP  Alliant FX/8  CRAY X-MP-1  CRAY X-MP-4
_
random  2.6  12  15.2  1.8  4.5
matrix
.TE
.ce
order = 150
.sp 2
.ce
.I
Ratio of execution time @{ TQL2 ~ time } over {parallel~time } @
.R
.KE
.PP These results are remarkable because in all cases speedups greater than the number of physical processors were obtained. The gain is due to the numerical properties of the deflation portion of the parallel algorithm. We do not fully understand why the algorithm performs as much deflation as is apparent from these comparisons. In all cases the word length was 64 bits and the same level of accuracy was achieved by both methods. The measurement of accuracy used was the maximum 2-norm of the residuals @Tq ~-~lambda q @ and of the columns of @Q sup T Q ~-~ I @. The results in Table 8.1 are typical of the performance of this algorithm on random problems, with speedups becoming more dramatic as the matrix order increases. In problems of order 500, speedups of 15 have been observed on the CRAY X-MP-4 and speedups over 50 have been observed on the Alliant FX/8, which are 4 and 8 processor machines, respectively. The CRAY results can actually be improved because parallelism at the root finding level was not exploited in the implementation run on the CRAY but was fully exploited on the Alliant. .PP In Tables 8.3 and 8.4 we show a more complete set of runs on the Alliant FX/8.
.sp2
.KS
.ce
Table 8.3
.nr LL 6.i
.ll 6.i
.TS
center;
c|c c c c c
c|c c c c c
c|c c c c c.
@n@  Ratio of time  TQL2  SESUPD  TQL2  SESUPD
  TQL2/SESUPD  @ || Ax ~-~ lambda x || @  @ || Ax ~-~ lambda x || @  @ || X sup T X ~-~ I || @  @ || X sup T X ~-~ I || @
_  _  _  _  _  _
100  9.4  @ 9.5 times 10 sup -15 @  @1.9 times 10 sup -15@  @ 7.8 times 10 sup -15 @  @5.5 times 10 sup -16 @
200  15.4  @1.7 times 10 sup -14 @  @2.7 times 10 sup -15 @  @1.1 times 10 sup -14 @  @2.2 times 10 sup -15 @
300  17.7  @1.9 times 10 sup -14 @  @3.2 times 10 sup -15 @  @1.7 times 10 sup -14 @  @2.6 times 10 sup -15 @
400  20.0  @2.9 times 10 sup -14 @  @4.0 times 10 sup -15 @  @2.0 times 10 sup -14 @  @9.2 times 10 sup -15 @
.TE
.I
.ce
A comparison on the Alliant FX/8 for TQL2 vs the
.ce
parallel algorithm on (1,2,1) matrix
.R
.KE
.sp2
.KS
.ce
Table 8.4
.nr LL 6.i
.ll 6.i
.TS
center;
c|c c c c c
c|c c c c c
c|c c c c c.
@n@  Ratio of time  TQL2  SESUPD  TQL2  SESUPD
  TQL2/SESUPD  @ || Ax ~-~ lambda x || @  @ || Ax ~-~ lambda x || @  @ || X sup T X ~-~ I || @  @ || X sup T X ~-~ I || @
_  _  _  _  _  _
100  12.1  @1.6 times 10 sup -14 @  @ 1.9 times 10 sup -13 @  @ 1.6 times 10 sup -14 @  @ 2.4 times 10 sup -15 @
200  19.5  @2.1 times 10 sup -14 @  @ 2.2 times 10 sup -13 @  @ 2.8 times 10 sup -14 @  @ 2.3 times 10 sup -15 @
300  38.8  @4.2 times 10 sup -14 @  @ 8.8 times 10 sup -13 @  @ 3.7 times 10 sup -14 @  @ 5.2 times 10 sup -15 @
400  60.7  @3.5 times 10 sup -14 @  @ 8.2 times 10 sup -13 @  @ 3.7 times 10 sup -14 @  @ 4.6 times 10 sup -14 @
.TE
.I
.ce
A comparison on the Alliant FX/8 for TQL2 vs the
.ce
parallel algorithm on a random matrix
.R
.KE
.sp2
These test problems show two types of behavior. In the (1,2,1) matrix not very much deflation takes place. On the random matrix considerable deflation takes place. Cuppen's observation [3] that this deflation occurs in many cases was very fortunate: it brought the algorithm to our attention, but we did not really expect the remarkable performance observed here. In the case where many eigenvectors are sought along with the eigenvalues this algorithm seems to be very promising.
However, as we mentioned in Section 5, it is not necessary to compute the eigenvectors if they are not needed. We cannot recommend this algorithm when only a few eigenvalues and eigenvectors are sought; a multisectioning algorithm such as the one developed by Lo, Philippe, and Sameh [5] is to be preferred in that case. .NH References .R .IP [1] R. G. Babb II, .I Parallel Processing with Large Grain Data Flow Techniques, .R IEEE Computer, Vol. 17, No. 7, pp. 55-61, July 1984. .sp .R .IP [2] J.R. Bunch, C.P. Nielsen, and D.C. Sorensen, .I Rank-One Modification of the Symmetric Eigenproblem, .R Numerische Mathematik 31, pp. 31-48, 1978. .sp .IP [3] J.J.M. Cuppen, .I A Divide and Conquer Method for the Symmetric Tridiagonal Eigenproblem, .R Numerische Mathematik 36, pp. 177-195, 1981. .sp .IP [4] G.H. Golub, .I Some Modified Matrix Eigenvalue Problems, .R SIAM Review, 15, pp. 318-334, 1973. .sp .IP [5] S. Lo, B. Philippe, and A. Sameh, .I A Parallel Algorithm for the Real Symmetric Tridiagonal Eigenvalue Problem, .R Center for Supercomputing Research and Development Report, University of Illinois, Urbana, Illinois, Nov. 1985. .sp .IP [6] E. Lusk and R. Overbeek, .I Implementation of Monitors with Macros: A Programming Aid for the HEP and Other Parallel Processors, .R ANL-83-97, Argonne National Laboratory, 1983. .sp .IP [7] E. Lusk and R. Overbeek, .I An Approach to Programming Multiprocessing Algorithms on the Denelcor HEP, .R ANL-83-96, Argonne National Laboratory, 1983. .sp .IP [8] J.J. More@prime@, .I The Levenberg-Marquardt Algorithm: Implementation and Theory, .R Proceedings of the Dundee Conference on Numerical Analysis, G.A. Watson, ed., Springer-Verlag, 1978. .sp .IP [9] C.H. Reinsch, .I Smoothing by Spline Functions, .R Numerische Mathematik 10, pp. 177-183, 1967. .sp .IP [10] C.H. Reinsch, .I Smoothing by Spline Functions II, .R Numerische Mathematik 16, pp. 451-454, 1971. .sp .IP [11] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd edition, Springer-Verlag, Berlin, 1976. .sp .IP [12] G.W. Stewart, .I Introduction to Matrix Computations, .R Academic Press, New York, 1973. .sp .IP [13] J.H. Wilkinson, .I The Algebraic Eigenvalue Problem, .R Clarendon Press, Oxford, 1965. .sp .IP [] L. Jessop, .I Solving the Symmetric Tridiagonal Eigenvalue Problem on the Hypercube, .R Yale Computer Science Research Report, in preparation. .sp .IP [] D. Sorensen, .I SCHEDULE Users Guide, .R ANL-MCS-TM-??, Argonne National Laboratory, May 1986.