The computer used for this work is a 64-processor Mark IIIfp hypercube. The Crystalline Operating System (CrOS), which provides channel-addressed synchronous communication, supplies the library routines that handle communication between nodes [Fox:85d;85h;88a]. The programs are written in the C programming language, except for the time-consuming two-dimensional quadratures and matrix inversions, which are optimized in assembly language.
The hypercube was configured as a two-dimensional array of processors. The mapping is done using binary Gray codes [Gilbert:58a], [Fox:88a], [Salmon:84b], which give the Cartesian coordinates in processor space and the communication-channel tags for each processor's nearest neighbors.
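This Gray-code embedding can be sketched as follows. Here `gray` is the standard binary-reflected Gray code, and `mesh_to_node` is a hypothetical helper (not a CrOS routine) that packs the row and column codes into a node address, assuming a grid of 2^rbits rows by 2^cbits columns:

```c
/* Binary-reflected Gray code: consecutive integers map to hypercube
 * node numbers differing in exactly one bit, so adjacent mesh rows
 * and columns land on nearest-neighbor hypercube nodes. */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

/* Hypothetical helper: node number of mesh position (r, c), with the
 * row and column Gray codes packed into disjoint bit fields. */
unsigned mesh_to_node(unsigned r, unsigned c, unsigned cbits) {
    return (gray(r) << cbits) | gray(c);
}
```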
We mapped the matrices onto processor space by local decomposition. Let p_r and p_c be the number of processors in the rows and columns of the hypercube configuration, respectively. Element (i, j) of an n x n matrix is placed in processor row [i p_r / n] and column [j p_c / n], where [x] means the integer part of x.
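Assuming the integer-part placement rule just described takes the usual block form, the owner of an element can be computed as below; the function names are illustrative, not from the original program:

```c
/* Owner of element (i, j) of an n x n matrix on a p_r x p_c processor
 * grid under the local (block) decomposition.  C integer division
 * supplies the "integer part" of the text's placement rule. */
int owner_row(int i, int n, int p_r) { return (i * p_r) / n; }
int owner_col(int j, int n, int p_c) { return (j * p_c) / n; }
```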
The parallel code implemented on the hypercube consists of five major steps.
Step one constructs, for each value of the propagation coordinate, a primitive basis set composed of products of Wigner rotation matrices, associated Legendre functions, and the numerical one-dimensional functions mentioned in Section 8.2.2, which are obtained by solving the corresponding one-dimensional eigenvalue-eigenvector differential equation using a finite difference method. This requires that a subset of the eigenvalues and eigenvectors of a tridiagonal matrix be found.
A bisection method [Fox:84g], [Ipsen:87a;87c], which accomplishes the eigenvalue computation using the TRIDIB routine from EISPACK [Smith:76a], was ported to the Mark IIIfp. This implementation of the bisection method allows computation of any number of consecutive eigenvalues specified by their indices. Eigenvectors are obtained using the EISPACK inverse iteration routine TINVIT with modified Gram-Schmidt orthogonalization. Each processor solves independent tridiagonal eigenproblems, since the number of eigenvalues desired from each tridiagonal system is small but the number of distinct tridiagonal systems is large. To achieve load balancing, we distributed subsets of the primitive functions among the processors in such a way that no processor computes more than one eigenvalue-eigenvector pair more than any other. These large-grain tasks are most easily implemented on MIMD machines; SIMD (Single Instruction Multiple Data) machines would require more extensive modifications and would be less efficient because of the sequential nature of effective eigenvalue iteration procedures. The one-dimensional bases obtained are then broadcast to all the other nodes.
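The balance criterion above, that no processor hold more than one eigenproblem more than any other, amounts to the usual remainder split; `tasks_on` is an illustrative helper, not code from the original program:

```c
/* Of T independent tridiagonal eigenproblems spread over P processors,
 * processor p (0 <= p < P) takes T/P tasks, plus one extra if p falls
 * among the first T%P processors; counts then differ by at most one. */
int tasks_on(int p, int T, int P) { return T / P + (p < T % P ? 1 : 0); }
```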
In step two, the large number of two-dimensional quadratures involving the primitive basis functions that are needed for the variational procedure is evaluated. These quadratures are highly parallel, requiring no communication overhead once each processor has the necessary subset of functions; each processor calculates its subset of integrals independently.
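The communication-free structure of this step can be illustrated with a toy product rule; the two-point Gauss-Legendre rule below stands in for the much larger rules of the actual code, and `integrand` is a placeholder for a product of primitive basis functions:

```c
/* Toy product Gauss-Legendre rule on [-1,1] x [-1,1]: two points per
 * dimension, both weights equal to 1; exact for polynomials of degree
 * <= 3 in each variable.  Each processor would apply such a rule to
 * its own subset of basis-function pairs, with no communication. */
double integrand(double x, double y) { return x * x * y * y; }

double quad2d(double (*f)(double, double)) {
    static const double x[2] = { -0.5773502691896258, 0.5773502691896258 };
    double sum = 0.0;
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            sum += f(x[a], x[b]);   /* both weights are 1 */
    return sum;
}
```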
Step three assembles these integrals into the real symmetric dense matrices of the variational problem, which are distributed over processor space. The entire spectrum of eigenvalues and eigenvectors for the associated generalized eigensystem is sought. With the parallel implementation of the
Householder method [Fox:84h],
[Patterson:86a], this generalized eigensystem is tridiagonalized
and the resulting single tridiagonal matrix is solved completely in each processor with the QR
algorithm [Wilkinson:71a]. The QR
implementation is purely sequential since each processor obtains the
entire solution to the eigensystem. However, each processor retains only a different subset of the solution for the evaluation of the interaction and overlap matrices in step four. This part of the algorithm is not time consuming, so the straightforward sequential approach was chosen. It has the further advantage that the resulting solutions are fully distributed, so no communication is required.
Step four evaluates the two-dimensional quadratures needed for the
interaction and
overlap
matrices. The same type of algorithm is used as in step two. By
far, the most expensive part of the sequential version of the surface
function calculation is the calculation of the large number of
two-dimensional numerical integrals required by steps two and four.
These steps are, however, highly parallel and well suited for the
hypercube.
Step five uses Manolopoulos' [Manolopoulos:86a] algorithm to integrate
the coupled linear ordinary differential equations. The parallel
implementation of this algorithm is discussed elsewhere [Hipes:88b].
The algorithm is dominated by parallel Gauss-Jordan matrix inversion and is
I/O intensive, requiring the input of one interaction
matrix per integration step. To reduce the I/O overhead, a second
source of parallelism is exploited. The entire interaction and overlap matrix data sets, at all integration steps, are loaded across the processors, and many collision energies are
calculated simultaneously. This strategy works because the same set of
data is used for each collision energy, and because enough main memory
is available. Calculation of scattering matrices from the final
logarithmic derivative matrices is not computationally intensive, and
is done sequentially.
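A minimal sketch of the Gauss-Jordan kernel that dominates this step is given below; it omits the pivoting and the row/column distribution of the parallel version described in [Hipes:88b], so it is an illustration of the technique rather than the production code:

```c
/* In-place Gauss-Jordan inversion of an n x n row-major matrix a.
 * The identity matrix is kept implicitly: the pivot column of the
 * growing inverse overwrites the eliminated column of a.
 * Returns 0 on success, -1 on a zero pivot (which the pivoted,
 * distributed version of [Hipes:88b] would handle). */
int gauss_jordan(double *a, int n) {
    for (int k = 0; k < n; k++) {
        double piv = a[k * n + k];
        if (piv == 0.0) return -1;      /* would need pivoting */
        a[k * n + k] = 1.0;             /* implicit identity column */
        for (int j = 0; j < n; j++) a[k * n + j] /= piv;
        for (int i = 0; i < n; i++) {   /* eliminate column k elsewhere */
            if (i == k) continue;
            double m = a[i * n + k];
            a[i * n + k] = 0.0;
            for (int j = 0; j < n; j++) a[i * n + j] -= m * a[k * n + j];
        }
    }
    return 0;
}
```

In the parallel setting, each such inversion is further overlapped across processors by assigning different collision energies to different nodes, as described above.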
The program steps were all run on the Weitek coprocessor, which supports only 32-bit arithmetic. Experimentation has shown that this precision is sufficient for the work reported below. The 64-bit arithmetic hardware needed for larger calculations was installed after the present calculations were completed.