Let $A$ be a square $n$-by-$n$ matrix. A scalar $\lambda$ is called an eigenvalue and a non-zero column vector $v$ the corresponding right eigenvector if $Av = \lambda v$. A nonzero column vector $u$ satisfying $u^H A = \lambda u^H$ is called a left eigenvector. The first basic task of the routines described in this section is to compute, for a given matrix $A$, all $n$ values of $\lambda$ and, if desired, their associated right eigenvectors $v$ and/or left eigenvectors $u$.
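For concreteness, here is a small worked example (ours, not from the original text). For
\[
A = \begin{pmatrix} 2 & 1 \\ 0 & 3 \end{pmatrix}, \qquad
A \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 3 \end{pmatrix} = 3 \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
\begin{pmatrix} 0 & 1 \end{pmatrix} A = \begin{pmatrix} 0 & 3 \end{pmatrix} = 3 \begin{pmatrix} 0 & 1 \end{pmatrix},
\]
so $\lambda = 3$ is an eigenvalue with right eigenvector $v = (1, 1)^T$ and left eigenvector $u = (0, 1)^T$; the other eigenvalue is $\lambda = 2$, with right eigenvector $(1, 0)^T$.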
A second basic task is to compute the Schur factorization of a matrix $A$. If $A$ is complex, its Schur factorization is $A = ZTZ^H$, where $Z$ is unitary and $T$ is upper triangular. If $A$ is real, its Schur factorization is $A = ZTZ^T$, where $Z$ is orthogonal and $T$ is upper quasi-triangular (1-by-1 and 2-by-2 blocks on its diagonal). The columns of $Z$ are called the Schur vectors of $A$. The eigenvalues of $A$ appear on the diagonal of $T$; complex conjugate eigenvalues of a real $A$ correspond to 2-by-2 blocks on the diagonal of $T$.
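A small illustration (ours): the rotation matrix
\[
A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}
\]
is already in real Schur form, with $Z = I$ and $T = A$ consisting of a single 2-by-2 diagonal block; that block carries the complex conjugate eigenvalue pair $\pm i$, which no real triangular $T$ could exhibit on its diagonal.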
We are implementing two parallel algorithms for computing the Schur factorization. The first is a parallelization of the conventional sequential algorithm, and the second is a novel divide-and-conquer algorithm. We discuss each briefly in turn.
The ScaLAPACK solution has elements in common with LAPACK. For example, both reduce the matrix to upper Hessenberg form initially, and both use an approach based on the implicit QR algorithm. The QR algorithm involves bulge chasing, and LAPACK chases a single bulge. ScaLAPACK obtains parallelism by chasing multiple bulges, staggered far enough apart that all nodes have computation to perform (except during pipeline start-up and wind-down). The broadcast of the Householder reflections, which is critical to any QR algorithm, is blocked independently of the distribution block size. The application of the Householder reflections is also done in a blocked fashion (independent of the broadcast blocking above) to increase data reuse.
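To illustrate the blocked application of Householder reflections, the sketch below uses the compact WY representation $Q = H_1 H_2 \cdots H_k = I - V T V^T$ (a standard technique; the function names are ours, and this is not the actual ScaLAPACK code). Applying $Q^T$ to a block of columns then costs three matrix-matrix multiplies instead of a sequence of rank-1 updates, which is the source of the data reuse mentioned above.

```python
import numpy as np

def form_T(V, tau):
    """Build the upper triangular T of the compact WY representation
    Q = H_1 H_2 ... H_k = I - V T V^T from Householder vectors
    (columns of V, unit lower trapezoidal) and scalars tau."""
    k = V.shape[1]
    T = np.zeros((k, k))
    for j in range(k):
        T[j, j] = tau[j]
        if j > 0:
            T[:j, j] = -tau[j] * (T[:j, :j] @ (V[:, :j].T @ V[:, j]))
    return T

def apply_block_reflector(V, T, C):
    """Apply Q^T = (I - V T V^T)^T to C with three matrix multiplies;
    this is the blocked (LARFB-style) update that increases data reuse."""
    return C - V @ (T.T @ (V.T @ C))

# Self-check against the explicit product of the individual reflectors.
rng = np.random.default_rng(0)
m, k, n = 8, 3, 5
V = np.tril(rng.standard_normal((m, k)), -1)
V[np.arange(k), np.arange(k)] = 1.0          # unit diagonal, LAPACK convention
tau = rng.standard_normal(k)
Q = np.eye(m)
for j in range(k):
    Q = Q @ (np.eye(m) - tau[j] * np.outer(V[:, j], V[:, j]))
C = rng.standard_normal((m, n))
assert np.allclose(Q.T @ C, apply_block_reflector(V, form_T(V, tau), C))
```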
Our second algorithm is based on the sign function of a matrix [3, 4, 5]. This algorithm offers the flexibility of computing just part of the Schur form (corresponding, say, to the eigenvalues with positive real parts), and it uses more highly parallelized ScaLAPACK building blocks (matrix inversion, matrix multiplication, and QR factorization) than the HQR algorithm of the last paragraph, but it also requires more floating-point operations; it remains to be seen which algorithm is faster for which problems on which machines. The sign function approach also entails a dynamic load balancing scheme [7] to implement its divide-and-conquer strategy most efficiently.
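To make the approach concrete, here is a minimal NumPy sketch (ours, not the ScaLAPACK implementation) of the classical Newton iteration for the matrix sign function, $X_{k+1} = (X_k + X_k^{-1})/2$, followed by the divide step: a rank-revealing QR factorization of the spectral projector $(I + \mathrm{sign}(A))/2$ splits the spectrum about the imaginary axis.

```python
import numpy as np
from scipy.linalg import qr

def matrix_sign(A, tol=1e-12, maxit=100):
    """Newton iteration X <- (X + X^{-1}) / 2 for the matrix sign function.
    Assumes A (real) has no eigenvalues on the imaginary axis."""
    X = np.array(A, dtype=float)
    for _ in range(maxit):
        X_new = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(X_new - X, 1) <= tol * np.linalg.norm(X, 1):
            return X_new
        X = X_new
    return X

def split_spectrum(A):
    """Divide step: find orthogonal Q so that Q^T A Q is block upper
    triangular, the leading k-by-k block holding the eigenvalues of A
    with positive real part."""
    n = A.shape[0]
    P = 0.5 * (matrix_sign(A) + np.eye(n))   # projector onto the invariant subspace
    k = int(round(np.trace(P)))              # its dimension = #{Re(lambda) > 0}
    Q, _, _ = qr(P, pivoting=True)           # rank-revealing QR: range(P) comes first
    return Q, Q.T @ A @ Q, k                 # recurse on the two diagonal blocks
```

Each of the two diagonal blocks of $Q^T A Q$ is a smaller independent eigenproblem, which is the divide step that the dynamic load balancing scheme of [7] distributes across processors.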