BlockSolve utilizes the preconditioned conjugate gradient algorithm for symmetric positive definite matrices and the preconditioned SYMMLQ algorithm for symmetric indefinite matrices. For basic information on these algorithms, we refer the reader to [2]. One important note is that the SYMMLQ algorithm requires a positive definite preconditioner, and this requirement can be a serious limitation if the matrix being solved is very indefinite.
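The preconditioned conjugate gradient iteration can be summarized in a short sketch. The following is a minimal, self-contained illustration of the textbook PCG algorithm with a diagonal preconditioner, not BlockSolve's implementation; the dense list-of-lists matrix storage is chosen purely for clarity (BlockSolve operates on distributed sparse matrices).

```python
def matvec(A, x):
    # dense matrix-vector product, for illustration only
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def pcg(A, b, M_inv_diag, tol=1e-10, max_it=100):
    """Preconditioned CG for SPD A; M_inv_diag holds the inverse of a
    diagonal preconditioner M."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                          # r = b - A x, x = 0
    z = [m * ri for m, ri in zip(M_inv_diag, r)]      # z = M^{-1} r
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_it):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            break
        z = [m * ri for m, ri in zip(M_inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# small SPD test problem with its diagonal preconditioner
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
M_inv = [1.0 / A[i][i] for i in range(3)]
x = pcg(A, b, M_inv)
```

SYMMLQ follows the same overall pattern (matrix-vector products, inner products, and vector updates), which is why the parallelization strategy described below applies to both solvers.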
The user has the option of selecting a combination of four preconditioners. The first option is a simple diagonal scaling of the matrix. We advocate always diagonally scaling the matrix, whether or not one of the other preconditioners is selected. The other preconditioning options are incomplete Cholesky factorization, SSOR, and block Jacobi (where the blocks are the cliques of the graph associated with the sparse matrix). We recommend that the user select the incomplete Cholesky factorization with diagonal scaling for symmetric positive definite matrices. This is the algorithm that BlockSolve was designed for, and it has proven useful for a variety of practical problems.
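Symmetric diagonal scaling replaces the system Ax = b by (D^{-1/2} A D^{-1/2})(D^{1/2} x) = D^{-1/2} b, where D is the diagonal of A; the scaled matrix has a unit diagonal. The sketch below illustrates this transformation on a dense matrix, which is an assumption made for brevity (BlockSolve applies the scaling to its distributed sparse storage):

```python
def diag_scale(A, b):
    """Return A_hat = D^{-1/2} A D^{-1/2}, b_hat = D^{-1/2} b, and the
    scaling vector d = diag(D^{-1/2}) needed to recover x = d * x_hat."""
    n = len(A)
    d = [A[i][i] ** -0.5 for i in range(n)]           # D^{-1/2}
    A_hat = [[d[i] * A[i][j] * d[j] for j in range(n)] for i in range(n)]
    b_hat = [d[i] * b[i] for i in range(n)]
    return A_hat, b_hat, d

A = [[4.0, 1.0], [1.0, 9.0]]
b = [1.0, 2.0]
A_hat, b_hat, d = diag_scale(A, b)
# A_hat now has ones on its diagonal; symmetry is preserved
```

Because the scaling is symmetric, the transformed matrix remains symmetric (and positive definite if A was), so it can safely be handed to CG or combined with the other preconditioners.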
BlockSolve does not partition the matrices across the processors for the user. BlockSolve simply accepts an already partitioned matrix with the assumption that the partitioning is a good one; its performance is limited by the quality of the partition. We believe that this is a reasonable approach for the linear system solver because the user must also have a good partition for the other aspects of an application to perform well. Therefore, we view partitioning as a separate but important problem. We assume that the right-hand side and the solution vector are partitioned in the same manner as the rows of the sparse matrix.
We achieve parallelism in the conjugate gradient (SYMMLQ) portion of the code by partitioning the vectors used in the algorithms in the same manner that the rows of the matrix are partitioned across the processors. Then it is a simple matter of executing inner products and daxpy's in parallel.
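The division of labor is simple: a daxpy touches only locally owned vector entries and requires no communication, while an inner product needs one global sum reduction (in practice an MPI_Allreduce) over the per-processor partial sums. The sketch below simulates the row partitioning with plain Python lists, one list per "processor"; it is an illustration of the pattern, not BlockSolve's communication code.

```python
def local_dot(u_part, v_part):
    # each processor computes the dot product of its local pieces
    return sum(ui * vi for ui, vi in zip(u_part, v_part))

def global_dot(u_parts, v_parts):
    # a sum reduction over the local contributions stands in for
    # MPI_Allreduce in a real distributed implementation
    return sum(local_dot(u, v) for u, v in zip(u_parts, v_parts))

def daxpy(alpha, x_part, y_part):
    # y <- alpha * x + y on the local partition; no communication at all
    return [alpha * xi + yi for xi, yi in zip(x_part, y_part)]

# the vector [1, 2, 3, 4] split across two "processors"
u_parts = [[1.0, 2.0], [3.0, 4.0]]
v_parts = [[1.0, 1.0], [1.0, 1.0]]
total = global_dot(u_parts, v_parts)
```

Since the vectors inherit the matrix-row partitioning, every vector operation in CG and SYMMLQ is either embarrassingly parallel or requires only this single reduction per inner product.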
To achieve scalable parallel performance in the incomplete Cholesky and SSOR preconditioners, we color the graph of the sparse matrix using a parallel coloring algorithm [7]. The combination of coloring a general symmetric sparse matrix and the incomplete Cholesky algorithm has proven very successful for solving large problems on scalable parallel architectures [4], [6]. We have addressed the issue of convergence of this combination of algorithms in [5].
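The key property a coloring provides is that vertices (matrix rows) sharing a color are mutually nonadjacent, so the corresponding rows can be eliminated or updated concurrently. The sequential greedy sketch below illustrates only this property; the parallel coloring algorithm of [7] is considerably more involved.

```python
def greedy_color(adj):
    """Sequential greedy coloring of an undirected graph given as a list
    of neighbor lists; rows with equal colors have no edge between them
    and can be processed in parallel."""
    color = [-1] * len(adj)
    for v in range(len(adj)):
        used = {color[w] for w in adj[v] if color[w] >= 0}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# path graph 0-1-2-3: two colors suffice
adj = [[1], [0, 2], [1, 3], [2]]
colors = greedy_color(adj)
```

In the preconditioner, the colors define a sequence of phases; within each phase, all same-colored rows are updated simultaneously, and the number of synchronization points is proportional to the number of colors rather than the number of rows.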
To achieve good performance on each node, we reorder the matrix to allow the use of the higher-level dense BLAS. This is particularly important on machines that use high-performance RISC chips on which good performance can be achieved only by using such operations. The reordering of the matrices is based on the identification of identical nodes and cliques in the graph associated with the matrix. Identical nodes typically exist when multiple degrees of freedom are associated with each node in the graph. Cliques are found in many graphs associated with sparse matrices, but larger ones are typically found in graphs where multiple degrees of freedom are associated with each node and the local connectivity of the graph is large. For example, if one uses a second-order, three-dimensional finite element in a typical structural engineering problem, clique sizes of up to 81 can be found. In general, the larger the cliques or identical nodes, the better the performance. This technique has been used with great success in direct matrix factorization methods.
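Detecting identical nodes amounts to grouping matrix rows whose nonzero structure coincides; the grouped rows form a small dense block that can be handled with dense BLAS kernels. The following is a simplified illustration under the assumption that each row's structure is given as a set of column indices (BlockSolve's actual detection works on the adjacency graph and also searches for cliques):

```python
def group_identical_rows(patterns):
    """patterns: one frozenset of column indices (the nonzero structure)
    per matrix row. Rows with identical patterns are grouped together so
    the group can be treated as a dense block."""
    groups = {}
    for i, pat in enumerate(patterns):
        groups.setdefault(pat, []).append(i)
    return list(groups.values())

# e.g. three degrees of freedom per grid node: rows 0-2 and rows 3-5
# share their sparsity patterns and collapse into two dense blocks
patterns = [frozenset({0, 1, 2, 3}), frozenset({0, 1, 2, 3}),
            frozenset({0, 1, 2, 3}), frozenset({2, 3, 4, 5}),
            frozenset({2, 3, 4, 5}), frozenset({2, 3, 4, 5})]
grps = group_identical_rows(patterns)
```

Once the blocks are identified, operations that would otherwise be sequences of scalar sparse updates become small dense matrix-matrix products, which is precisely where the Level 3 BLAS deliver their advantage on RISC-based nodes.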