The necessity of evaluating large numbers of primitive two-electron integrals makes the SMC procedure a natural candidate for parallelization on a coarse-grain MIMD machine. With a large memory per processor, it is feasible to load the integral evaluator on each node and to distribute the evaluation of the primitive integrals among all the processors. Provided issues of load balance and subsequent data reduction can be addressed successfully, high parallel efficiency may be anticipated, since the stage of the calculation which typically consumes the bulk of the computation time is thereby made perfectly parallel.
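As a rough illustration of why this stage parallelizes so well, the sketch below statically partitions a list of primitive-integral labels across processors, with each node evaluating only its own share. The index ranges, the round-robin partition, and the `evaluate_primitive` stub are illustrative assumptions, not the actual SMC code.

```python
import numpy as np

def evaluate_primitive(a, b, g, k_dir):
    # Stand-in for the real primitive two-electron integral evaluator that is
    # loaded onto every node; it returns a dummy complex value here.
    return complex(a + b, g) * float(np.dot(k_dir, k_dir))

# Labels of the primitive integrals to be evaluated (illustrative index ranges).
k_dirs = [np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0])]
labels = [(a, b, g, kd) for a in range(4) for b in range(4)
          for g in range(4) for kd in k_dirs]

nprocs = 8
for rank in range(nprocs):                      # each iteration mimics one node
    my_labels = labels[rank::nprocs]            # static round-robin share of the work
    local = [evaluate_primitive(*lab) for lab in my_labels]
    # No communication is needed until the later transformation/reduction stage,
    # which is why this phase of the calculation is perfectly parallel.
```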
In planning the decomposition of the set of integrals onto the nodes, two principal issues must be considered.  First, there are too many integrals to store in memory simultaneously, and certain indices must therefore be processed sequentially.  Second, the transformation from primitive integrals to physical matrix elements, which necessarily involves interprocessor communication, should be as efficient and transparent as possible.  With these considerations in mind, the approach chosen was to configure the nodes logically as a two-torus, on which is mapped an array of integrals whose columns are labeled by Gaussian pairs $\alpha\beta$ and whose rows are labeled by directions $\hat{k}$; the indices $\gamma$ and $k$ are processed sequentially.  With this decomposition, the transformation steps and associated interprocessor communication can be localized and ``hidden'' in parallel matrix multiplications.  This approach is both simple and efficient, and results in a program that is easily ported to new machines.
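To make the data layout concrete, the following single-process sketch maps such an integral array in two-dimensional blocks onto a logical process grid, with rows indexed by quadrature directions and columns by Gaussian pairs, while the remaining indices are looped over sequentially. The grid shape, array dimensions, and block mapping are assumptions chosen for illustration rather than values taken from the actual code.

```python
import numpy as np

Pr, Pc = 4, 4                  # logical 4 x 4 process grid (two-torus)
n_dirs, n_pairs = 64, 240      # rows: directions k-hat; columns: Gaussian pairs (alpha, beta)

def local_block(full, pr, pc):
    # The contiguous block of the global integral array owned by node (pr, pc).
    r0, r1 = pr * n_dirs // Pr, (pr + 1) * n_dirs // Pr
    c0, c1 = pc * n_pairs // Pc, (pc + 1) * n_pairs // Pc
    return full[r0:r1, c0:c1]

# The remaining indices (gamma and the magnitude k) do not fit in memory at once,
# so they are processed sequentially in an outer loop, one 2-D slice at a time.
for gamma in range(3):
    for k_mag in range(2):
        full = np.random.rand(n_dirs, n_pairs) + 1j * np.random.rand(n_dirs, n_pairs)
        blocks = {(pr, pc): local_block(full, pr, pc)
                  for pr in range(Pr) for pc in range(Pc)}
        # In the real code each node computes and keeps only its own block;
        # the full array is materialized here only to make the mapping explicit.
```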
Care was needed in designing the parallel transformation procedure. Direct emulation of the sequential code (that is, transformation first to molecular-orbital integrals and then to physical matrix elements) is undesirable, because the latter step would entail a parallel routine of great complexity governing the flow of a relatively limited amount of data between processors. Instead, the two transformations are combined into a single step: the logical outline of the original molecular-orbital-to-physical-matrix-element routine is reused in a perfectly parallel routine that builds a distributed transformation matrix. The combined transformations are then accomplished by a single series of large, almost-full complex-arithmetic matrix multiplications on the primitive-integral data set.
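The idea of folding two successive linear transformations into one matrix multiplication can be sketched as follows; the matrix names and dimensions are illustrative assumptions, not the actual SMC data structures.

```python
import numpy as np

n_prim, n_mo, n_phys = 500, 120, 40   # illustrative dimensions

# Primitive-integral block (complex), columns labeled by primitive Gaussian pairs.
prim = np.random.rand(64, n_prim) + 1j * np.random.rand(64, n_prim)

# Step 1 of the sequential code: primitive -> molecular-orbital integrals.
C_prim_to_mo = np.random.rand(n_prim, n_mo)
# Step 2: molecular-orbital integrals -> physical matrix elements.
C_mo_to_phys = np.random.rand(n_mo, n_phys) + 1j * np.random.rand(n_mo, n_phys)

# Combined approach: build one transformation matrix up front, then apply it
# to the primitive-integral data in a single large multiplication.
T = C_prim_to_mo @ C_mo_to_phys            # n_prim x n_phys
phys_combined = prim @ T

# The result agrees with performing the two transformations in sequence.
phys_two_step = (prim @ C_prim_to_mo) @ C_mo_to_phys
assert np.allclose(phys_combined, phys_two_step)
```

Associativity of the matrix product is what makes the two routes equivalent; building the combined matrix first replaces the awkward second transformation with one more large, regular matrix multiplication.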
The remainder of the parallel implementation involves relatively straightforward modifications of the sequential CRAY code, with the exception of a series of integrations over angles $\hat{k}$ arising in the evaluation of the $VG_P^{(+)}V$ matrix elements, and of the solution of a system of linear equations in the final phase of the calculation.  The angular integration, done by Gauss-Legendre quadrature, is compactly and efficiently coded as a distributed matrix multiplication of the form $C_{mn}=\sum_i w_i\,A^{*}_{im}B_{in}$, where $i$ runs over the quadrature directions and $w_i$ are the quadrature weights.  The solution of the linear system is performed by a distributed LU solver [Hipes:89b] modified for complex arithmetic.
The implementation described above has proven quite successful [Hipes:90a], [Winstead:91d] on the Mark IIIfp architecture for which it was originally designed, and has since been ported with modest effort to the Intel iPSC/860 and subsequently to the 528-processor Intel Touchstone Delta. No algorithmic modifications were necessary in porting to the Intel machines; modifications to improve efficiency will be described below. Complications did arise from the somewhat different communication model embodied in Intel's NX system, as compared to the more rigidly structured, loosely synchronous CrOS III operating system of the Mark IIIfp described in Chapter 5. These problems were overcome by careful typing of all interprocessor messages, essentially the assignment of sequence numbers and source labels to each message.
In porting to the Delta, the major difficulty was the absence of a host processor. Our original parallel version of the SMC code left certain initialization and end-stage routines, which were computationally trivial but logically complicated, as sequential routines to be executed on the host. In porting to the Delta, we chose to parallelize these routines as well rather than allocate an extra node to serve as host. There is thus only one program to maintain and to port to subsequent machines, and a rectangular array of nodes, suitable for a matrix-oriented computation, is preserved.
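A rough, hypothetical illustration of this kind of message typing, written with mpi4py tags rather than the NX or CrOS III primitives actually involved; the tag-encoding scheme and the ring exchange are assumptions made for the example.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def make_tag(phase, seq):
    # Encode a phase label and a sequence number into the message tag, so that
    # messages from different stages of the calculation cannot be confused even
    # when the runtime delivers them in a looser order than CrOS III would.
    return phase * 1000 + seq

# Each node sends one small payload to its right-hand neighbour on a logical ring,
# receiving explicitly by (source, tag) rather than accepting any pending message.
right = (rank + 1) % size
left = (rank - 1) % size
payload = {"from": rank, "data": [rank] * 4}

req = comm.isend(payload, dest=right, tag=make_tag(phase=2, seq=0))
incoming = comm.recv(source=left, tag=make_tag(phase=2, seq=0))
req.wait()
```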