A standard technique in parallel computing is to build new algorithms from existing high-performance building blocks. For example, the LAPACK linear algebra library [1] is written in terms of the Basic Linear Algebra Subprograms (BLAS) [22][23][38], for which efficient implementations are available on many workstations, vector processors, and shared memory parallel machines. The recently released ScaLAPACK 1.0 (beta) linear algebra library [26] is written in terms of the Parallel Block BLAS (PB-BLAS) [15], the Basic Linear Algebra Communication Subroutines (BLACS) [25], the BLAS, and LAPACK. ScaLAPACK includes routines for LU, QR and Cholesky factorizations and matrix inversion, and has been ported to the Intel Gamma, Delta and Paragon, the Thinking Machines CM-5, and PVM clusters. The Connection Machine Scientific Software Library (CMSSL) [54] provides analogous functionality and high performance for the CM-5.
In this work, we use these high-performance kernels to build two new algorithms for finding eigenvalues and invariant subspaces of nonsymmetric matrices on distributed memory parallel computers. Both algorithms perform spectral divide and conquer: they recursively divide the matrix into smaller submatrices, each of which has a subset of the original eigenvalues as its spectrum. One algorithm uses the matrix sign function evaluated by Newton iteration [4][6][42][8]. The other avoids the matrix inverse required by the Newton iteration, and so is called the inverse-free algorithm [7][44][10][30]. Both are simply constructed from a small set of highly parallelizable building blocks, including matrix multiplication, QR decomposition and matrix inversion, as we describe in section 2.
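To make the sign-function approach concrete, the following is a minimal serial sketch in NumPy/SciPy. It is an illustration only, not the distributed ScaLAPACK/CMSSL implementation discussed in this paper; the function name `sign_divide`, the stopping test, and the tolerance are our own illustrative choices.

```python
import numpy as np
from scipy.linalg import qr

def sign_divide(A, tol=1e-12, maxit=50):
    """Illustrative sketch: split the spectrum of A across the
    imaginary axis using the matrix sign function."""
    n = A.shape[0]
    X = A.astype(complex)
    for _ in range(maxit):
        # Newton iteration X <- (X + X^{-1}) / 2 converges to sign(A)
        # when A has no purely imaginary eigenvalues.
        X_new = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(X_new - X, 1) <= tol * np.linalg.norm(X, 1):
            X = X_new
            break
        X = X_new
    # P = (I + sign(A)) / 2 is the spectral projector onto the invariant
    # subspace belonging to the right-half-plane eigenvalues.
    P = 0.5 * (np.eye(n) + X)
    k = int(round(np.trace(P).real))   # number of such eigenvalues
    # A rank-revealing QR of P puts an orthonormal basis of that
    # subspace in the leading k columns of Q, so Q^H A Q is block
    # upper triangular (up to roundoff): its leading k-by-k block
    # carries the right-half-plane eigenvalues.
    Q, _, _ = qr(P, pivoting=True)
    return Q.conj().T @ A @ Q, Q, k
```

Recursing on the two diagonal blocks completes the divide and conquer; as section 2 describes, the inverse-free variant replaces the explicit inverse in the Newton step with QR decompositions.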
By using existing high-performance kernels in ScaLAPACK and CMSSL, we have achieved high efficiency. On a 256-processor Intel Touchstone Delta system, the sign function algorithm reached 31% efficiency with respect to the underlying matrix multiplication (PUMMA [16]) for 4000-by-4000 matrices, and 82% efficiency with respect to the underlying ScaLAPACK 1.0 matrix inversion. On a 32-processor Thinking Machines CM-5 with vector units, the hybrid Newton-Schulz sign function algorithm obtained 41% efficiency with respect to the CMSSL 3.2 matrix multiplication for 2048-by-2048 matrices.
The nonsymmetric spectral decomposition problem has until recently resisted attempts at parallelization. The conventional method is the Hessenberg QR algorithm: one first reduces the matrix to Schur form, and then swaps the desired eigenvalues along the diagonal to group them together, thereby forming the desired invariant subspace [1]. The algorithm had appeared to require fine-grained parallelism and to be difficult to parallelize [57][27][5], but recently Henry and van de Geijn [32] have shown that the Hessenberg QR phase can be parallelized effectively on distributed memory parallel computers with up to 100 processors. Although parallel QR does not appear to be as scalable as the algorithms presented in this paper, it may be faster on a wide range of distributed memory parallel computers. Our algorithms perform several times as many floating point operations as QR, but nearly all of them lie within Level 3 BLAS, whereas the implementations of QR that perform the fewest floating point operations rely on the less efficient Level 1 and 2 BLAS. A thorough comparison of these algorithms will be the subject of a future paper.
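For reference, the conventional serial approach is available through LAPACK; a small SciPy illustration (our own example, here grouping the left-half-plane eigenvalues first) is:

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))

# Real Schur form with the left-half-plane eigenvalues ordered first;
# LAPACK's GEES performs the reduction and the diagonal reordering.
T, Z, sdim = schur(A, output='real', sort='lhp')

# The leading sdim columns of Z span the invariant subspace belonging
# to the sdim eigenvalues in the left half plane.
V = Z[:, :sdim]
```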
Other parallel eigenproblem algorithms that have been developed include earlier parallelizations of the QR algorithm [55][56][50][29], Hessenberg divide-and-conquer algorithms using either Newton's method [24] or homotopies [40][39][17], and Jacobi's method [49][48][47][28]. All these methods suffer from the use of fine-grained parallelism, from instability, from slow convergence or misconvergence in the presence of clustered eigenvalues of the original problem or of some constructed subproblems [20], or from all three.
The methods in this paper may be less stable than the QR algorithm, and may fail to converge in a number of circumstances. Fortunately, such failures are easy to detect, and one can compensate by dividing the spectrum at a slightly different location. Compared with the other approaches mentioned above, we believe the algorithms discussed in this paper offer an effective tradeoff between parallelizability and stability.
The algorithms most closely related to the approaches used here may be found in [36][9][3], which treat symmetric matrices or, more generally, matrices with real spectra.
Another advantage of the algorithms described in this paper is that they can compute just those eigenvalues (and the corresponding invariant subspace) lying in a user-specified region of the complex plane. To help the user specify this region, we describe a graphical user interface for the algorithms.
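As a rough illustration of such region selection (section 2 gives the constructions actually used, which also cover circles), splitting along an arbitrary line in the complex plane reduces to the half-plane split sketched earlier, since shifting and rotating a matrix does not change its invariant subspaces:

```python
import numpy as np

# sign_divide is the illustrative sketch from above. Eigenvalues of A
# with Re(exp(-i*theta) * (lambda - sigma)) > 0, i.e. those on one side
# of the line through sigma at angle theta, become right-half-plane
# eigenvalues of the shifted, rotated matrix B.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
sigma, theta = 0.5, np.pi / 4

B = np.exp(-1j * theta) * (A - sigma * np.eye(A.shape[0]))
T, Q, k = sign_divide(B)   # k eigenvalues of A lie on that side of the line
```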
The rest of this paper is organized as follows. In section 2, we present our two algorithms for spectral divide and conquer in a single framework, show how to divide the spectrum along arbitrary circles and lines in the complex plane, and discuss implementation details. In section 3, we discuss the performance of our algorithms on the Intel Delta and CM-5. In section 4, we present a model for the performance of our algorithms, and demonstrate that it can predict the execution time reasonably accurately. Section 5 describes the design of an X-window user interface. Section 6 draws conclusions and outlines our future work.