The necessity of evaluating large numbers of primitive two-electron integrals makes the SMC procedure a natural candidate for parallelization on a coarse-grain MIMD machine. With a large memory per processor, it is feasible to load the integral evaluator on each node and to distribute the evaluation of the primitive integrals among all the processors. Provided issues of load balance and subsequent data reduction can be addressed successfully, high parallel efficiency may be anticipated, since the stage of the calculation which typically consumes the bulk of the computation time is thereby made perfectly parallel.
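The following sketch illustrates this stage in miniature. It is not drawn from the actual code: mpi4py is used here as a convenient stand-in for the message-passing layers actually employed, and evaluate_primitive is a hypothetical placeholder for the integral evaluator loaded on each node.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, nproc = comm.Get_rank(), comm.Get_size()

    n_integrals = 100000   # illustrative count of primitive integrals

    def evaluate_primitive(idx):
        # Hypothetical stand-in for the Gaussian/plane-wave integral
        # evaluator; the real routine is far more involved.
        return np.cos(idx) / (idx + 1.0)

    # Each node evaluates an interleaved share of the index range; this
    # stage requires no communication and is therefore perfectly parallel.
    local_values = np.array([evaluate_primitive(i)
                             for i in range(rank, n_integrals, nproc)])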
In planning the decomposition of the set of integrals onto the nodes, two principal issues must be considered. First, there are too many integrals to store in memory simultaneously, and certain indices must therefore be processed sequentially. Second, the transformation from primitive integrals to physical matrix elements, which necessarily involves interprocessor communication, should be as efficient and transparent as possible. With these considerations in mind, the approach chosen was to configure the nodes logically as a two-torus, onto which is mapped an array of integrals whose columns are labeled by Gaussian pairs and whose rows are labeled by directions; the remaining indices are processed sequentially. With this decomposition, the transformation steps and associated interprocessor communication can be localized and ``hidden'' in parallel matrix multiplications. This approach is both simple and efficient, and results in a program that is easily ported to new machines.
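A minimal sketch of the two-torus decomposition follows; the MPI Cartesian-communicator calls (via mpi4py) stand in for the CrOS III channels of the original implementation, and the array dimensions are purely illustrative.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    nrow, ncol = MPI.Compute_dims(comm.Get_size(), 2)   # e.g., 4 x 4 on 16 nodes
    torus = comm.Create_cart(dims=[nrow, ncol],
                             periods=[True, True],      # wraparound links form the torus
                             reorder=True)
    my_row, my_col = torus.Get_coords(torus.Get_rank())

    n_dirs, n_pairs = 192, 4096          # illustrative global dimensions
    local_dirs  = n_dirs  // nrow        # slice of directions held by this node row
    local_pairs = n_pairs // ncol        # slice of Gaussian pairs held by this node column

    # Local block of the distributed primitive-integral array; the remaining
    # integral indices are looped over sequentially outside this fragment.
    local_block = np.zeros((local_dirs, local_pairs), dtype=complex)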
Care was needed in designing the parallel transformation procedure. Direct emulation of the sequential code, that is, transforming first to molecular-orbital integrals and then to physical matrix elements, is undesirable, because the latter step would entail a parallel routine of great complexity governing the flow of a relatively limited amount of data between processors. Instead, the two transformations are combined into a single step: the logical outline of the original molecular-orbital-to-physical-matrix-element routine is reused in a perfectly parallel routine that builds a distributed transformation matrix. The combined transformations are then accomplished by a single series of large, nearly full complex-arithmetic matrix multiplications on the primitive-integral data set.
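The algebra of the combined step can be sketched as follows, under the assumption that both transformations can be represented as complex matrices; the names C1, C2, and B and all dimensions are illustrative rather than taken from the actual code.

    import numpy as np

    n_prim, n_mo, n_phys = 4096, 300, 120    # illustrative dimensions

    # Hypothetical transformation matrices: C1 maps primitive quantities to
    # molecular-orbital quantities, and C2 maps those to physical matrix
    # elements.  In the parallel code the product is built as a distributed
    # matrix, but the algebra is the same.
    C1 = np.random.rand(n_mo, n_prim).astype(complex)
    C2 = np.random.rand(n_phys, n_mo).astype(complex)

    B = C2 @ C1                              # combined transformation, built once

    # A single large complex matrix multiplication then carries a block of
    # primitive integrals directly to physical matrix elements.
    primitive_block = np.random.rand(n_prim, 64).astype(complex)
    physical_block  = B @ primitive_block    # shape (n_phys, 64)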
The remainder of the parallel implementation involves relatively straightforward modifications of the sequential CRAY code, with the exception of a series of integrations over angles arising in the evaluation of the matrix elements, and of the solution of a system of linear equations in the final phase of the calculation. The angular integration, done by Gauss-Legendre quadrature, is compactly and efficiently coded as a distributed matrix multiplication. The solution of the linear system is performed by a distributed LU solver [Hipes:89b] modified for complex arithmetic.
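The structure of both operations can be sketched as follows; the dimensions are illustrative, and an ordinary dense solve stands in for the distributed LU solver of [Hipes:89b].

    import numpy as np

    n_quad, n_elem = 48, 512
    x, w = np.polynomial.legendre.leggauss(n_quad)       # nodes and weights on [-1, 1]

    # Sampling many integrands at the quadrature nodes turns the angular
    # integration into a single matrix product, which the parallel code
    # performs as a distributed multiplication.
    F = np.random.rand(n_elem, n_quad).astype(complex)   # integrand samples
    integrals = F @ w                                    # one value per matrix element

    # The final linear system is complex; locally this is an ordinary
    # LU-based solve, which the parallel code replaces with a distributed
    # LU solver.
    A = np.random.rand(n_elem, n_elem) + 1j * np.random.rand(n_elem, n_elem)
    b = np.random.rand(n_elem).astype(complex)
    solution = np.linalg.solve(A, b)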
The implementation described above has proven quite successful [Hipes:90a], [Winstead:91d] on the Mark IIIfp architecture for which it was originally designed, and has since been ported with modest effort to the Intel iPSC/860 and subsequently to the 528-processor Intel Touchstone Delta. No algorithmic modifications were necessary in porting to the Intel machines; modifications to improve efficiency will be described below. Complications did arise from the somewhat different communication model embodied in Intel's NX system, as compared to the more rigidly structured, loosely synchronous CrOS III operating system of the Mark IIIfp described in Chapter 5. These problems were overcome by careful typing of all interprocessor messages, essentially the assignment of sequence numbers and source labels. In porting to the Delta, the major difficulty was the absence of a host processor. Our original parallel version of the SMC code left certain initialization and end-stage routines, which were computationally trivial but logically complicated, as sequential routines to be executed on the host. In porting to the Delta, we chose to parallelize these routines as well rather than allocate an extra node to serve as host. There is thus only one program to maintain and to port to subsequent machines, and a rectangular array of nodes, suitable for a matrix-oriented computation, is preserved.
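The idea behind the message typing can be sketched as follows, with MPI tags (via mpi4py) standing in for the NX message types used in the actual port; the tag scheme shown is illustrative only.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    SEQ_BASE = 1000                    # illustrative tag scheme

    if comm.Get_size() >= 2:
        if rank == 0:
            for seq in range(3):
                buf = np.full(4, float(seq))
                comm.Send(buf, dest=1, tag=SEQ_BASE + seq)   # sequence number carried in the tag
        elif rank == 1:
            for seq in range(3):
                buf = np.empty(4)
                # Accept only the message with the expected sequence number
                # and only from the expected source, so messages cannot be
                # confused even if they are delivered in an unexpected order.
                comm.Recv(buf, source=0, tag=SEQ_BASE + seq)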