Concurrent matrix algorithms were among the first to be studied on the hypercubes at Caltech [Fox:82a], and they have also been intensively studied at other institutions, notably Yale [Ipsen:87b], [Johnsson:87b;89a], [Saad:85a], and Oak Ridge National Laboratory [Geist:86a;89a], [Romine:87a;90a]. The motivation for this interest is the fact that matrix algorithms play a prominent role in many scientific and engineering computations. In this chapter, we study the so called-full or dense (and closely related banded) matrix algorithms where essentially all elements of the matrix are nonzero. In Chapters 9 and 12, we will treat the more common case of problems, which, if formulated as matrix equations, are represented by sparse matrices. Here, most of the elements of the matrix are zero; one can apply full matrix algorithms to such sparse cases, but there are much better algorithms that exploit the sparseness to reduce the computational complexity. Within CP, we found two classes of important problems that needed full matrix algorithms. In Sections 8.2 and 8.3, we cover two chemical scattering problems, which involve relatively small full matrices-where the matrix rows and columns are labelled by the different reaction channels. The currently most important real-world use of full matrix algorithms is computational electromagnetic simulations [Edelman:92a]. These are used by the defense industry to design aircraft and other military vehicles with low radar cross sections. Solutions of large sets of linear equations come from the method of moments approach to electromagnetic equations [Wang:91b]. We investigated this method successfully at JPL [Simoni:89a] but in this book, we only describe (in Section 9.4) the alternative approaches, finite elements, to electromagnetic simulations. Such sparse matrix formulation will be more efficient for large electromagnetic problems, but the moment method and associated full matrix is probably the most common and numerically reliable approach.
Early work at Caltech on full matrices (1983 to 1987) focused on specific algorithms, such as matrix multiplication, matrix-vector products, and LU decomposition. A major issue in determining the optimal algorithm for these problems is choosing a decomposition which has good load balance and low communication overhead. Many matrix algorithms proceed in a series of steps in which rows and/or columns are successively made inactive. The scattered decomposition described in Section 8.1.2 is usually used to balance the load in such cases. The block decomposition, also described in Section 8.1.2, generally minimizes the amount of data communicated, but results in sending several short messages rather than a few longer messages. Thus, a block decomposition is optimal for a multiprocessor with low message latency, or startup cost, such as the Caltech/JPL Mark II hypercube. For machines with high message latency, such as the Intel iPSC/1, a row decomposition may be preferable. The best decomposition, therefore, depends crucially on the characteristics of the concurrent hardware.
In recent years (1988-1990), interest has centered on the development of libraries of concurrent linear algebra routines. As discussed in Section 8.1.7, two approaches have been followed at Caltech. One approach by Fox, et al. has led to a library of routines that are optimal for low latency, homogeneous hypercubes, such as the Caltech/JPL Mark II hypercubes. In contrast, Van de Velde has developed a library of routines that are generally suboptimal, but which may be ported to a wider range of multiprocessor architectures, and are suitable for incorporation into programs with dynamically changing data distributions.
HPFA Applications and Paradigms