To make good use of MIMD distributed-memory machines like hypercubes, one
should employ domain decomposition. That is, the domain of the problem
should be divided into subdomains of equal size, one for each processor in
the hypercube; and communication routines should be written to take care of
data transfer across the processor boundaries. Thus, for a lattice
calculation, the **N** sites are distributed among the processors
using a decomposition that ensures that processors assigned to adjacent
subdomains are directly linked by a communication channel in the hypercube
topology. Each processor then independently works through its subdomain of
sites, updating each one in turn, and only communicating with
neighboring processors when doing boundary sites. This communication
enforces ``loose synchronization,'' which
stops any one processor from racing ahead of the others. Load
balancing is achieved with equal-size domains. If the nodes contain at
least two sites of the lattice, all the nodes can update in parallel,
satisfying detailed balance, since loose
synchronicity guarantees that all nodes will be doing black, then red
sites alternately.
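The structure of such an update loop might look like the following minimal
sketch, assuming a one-dimensional decomposition and a simple scalar site
variable; MPI stands in for the original hypercube communication routines,
and `update_site` is a hypothetical placeholder for the real link update:

```c
/* Sketch: red/black (checkerboard) sweep over a 1-D strip of lattice
 * sites owned by this node, exchanging boundary sites with the two
 * neighboring nodes before each half-sweep. */
#include <mpi.h>

#define NSITES 64                 /* sites per node (>= 2, so both colors exist) */

static double site[NSITES + 2];   /* local sites plus two ghost slots */

static void update_site(int i)    /* placeholder for the real link update */
{
    site[i] = 0.5 * (site[i - 1] + site[i + 1]);
}

static void exchange_boundaries(int left, int right)
{
    /* Blocking exchange: no node can start its next half-sweep until its
     * neighbors have supplied fresh boundary data -- this is the "loose
     * synchronization" that keeps nodes in step. */
    MPI_Sendrecv(&site[1], 1, MPI_DOUBLE, left, 0,
                 &site[NSITES + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&site[NSITES], 1, MPI_DOUBLE, right, 1,
                 &site[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank - 1 + size) % size;   /* periodic lattice */
    int right = (rank + 1) % size;

    for (int sweep = 0; sweep < 100; sweep++) {
        for (int color = 0; color < 2; color++) {   /* black, then red */
            exchange_boundaries(left, right);
            for (int i = 1; i <= NSITES; i++)
                if ((i + rank * NSITES) % 2 == color)
                    update_site(i);
        }
    }
    MPI_Finalize();
    return 0;
}
```

The blocking exchange before each half-sweep is exactly the loose
synchronization described above: a node cannot begin its red sites until its
neighbors have finished their black ones.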

The characteristic timescale of the communication, $t_{\rm comm}$, corresponds
roughly to the time taken to transfer a single matrix from one node to
its neighbor. Similarly, we can characterize the calculational part of the
algorithm by a timescale, $t_{\rm calc}$, which is roughly the time taken to
multiply together two matrices. For all hypercubes built *without*
floating-point accelerator chips, $t_{\rm comm} \ll t_{\rm calc}$ and, hence,
QCD simulations are extremely ``efficient,'' where efficiency $\epsilon$
(Equations 3.10 and 3.11) is defined by the relation

$$\epsilon = \frac{T_1}{k\,T_k},$$

where $T_k$ is the time taken for **k** processors to perform the given
calculation and $T_1$ is the single-processor time. Typically, such
calculations have efficiencies very close to unity, which means they are
ideally suited to this type of computation, since doubling the number of
processors nearly halves the total computational time required for solution.
However, as we shall see (for the Mark IIIfp hypercube, for example), the
picture changes dramatically when fast floating-point chips are used; then
$t_{\rm comm}$ becomes comparable to $t_{\rm calc}$, and one must take some
care in coding to obtain maximum performance.
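To make the tradeoff concrete, here is a toy model, not Mark IIIfp data: a
strip of $N$ sites on $k$ nodes costs $(N/k)\,t_{\rm calc}$ of arithmetic per
sweep plus a fixed boundary exchange of $2\,t_{\rm comm}$, giving
$\epsilon = 1/(1 + 2k\,t_{\rm comm}/(N\,t_{\rm calc}))$. The constants in
the sketch below are illustrative assumptions:

```c
/* Toy model: how the ratio t_comm/t_calc sets the efficiency
 * epsilon = T_1 / (k * T_k) for a 1-D strip of N sites on k nodes. */
#include <stdio.h>

int main(void)
{
    double N = 4096.0;                /* total lattice sites (illustrative) */
    double t_calc = 1.0;              /* time for one matrix multiply */
    double ratios[] = { 0.01, 1.0 };  /* t_comm/t_calc: slow vs. fast FPUs */

    for (int r = 0; r < 2; r++) {
        for (int k = 2; k <= 128; k *= 2) {
            double T1 = N * t_calc;                                /* 1 node  */
            double Tk = (N / k) * t_calc + 2.0 * ratios[r] * t_calc; /* k nodes */
            printf("t_comm/t_calc=%.2f  k=%3d  efficiency=%.3f\n",
                   ratios[r], k, T1 / (k * Tk));
        }
    }
    return 0;
}
```

Even at $t_{\rm comm} = t_{\rm calc}$, efficiency stays high only while each
node keeps enough sites to amortize the boundary exchange.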

Rather than describe every calculation done on the Caltech hypercubes, we shall concentrate on one calculation that has been done several times as the machine evolved: the heavy quark potential calculation (``heavy'' because the quenched approximation is used).

QCD provides an explanation of why quarks are confined inside hadrons, since
lattice calculations reveal that the inter-quark potential rises linearly as
the separation between the quarks increases. Thus, at large separations, the
attractive force (the derivative of the potential) is independent of the
separation, unlike other forces, which usually decrease rapidly with distance. This
force, called the ``string tension,'' is carried by the gluons, which form a
kind of ``string'' between the quarks. On the other hand, at short distances,
quarks and gluons are
``asymptotically free'' and behave like electrons and photons, interacting
via a Coulomb-like force. Thus, the quark potential **V** is written as

$$V(R) = -\frac{\alpha}{R} + \sigma R,$$

where **R** is the separation of the quarks, $\alpha$ is the coefficient of the
Coulombic potential, and $\sigma$ is the string tension. In fitting
experimental charmonium data to this Coulomb-plus-linear potential, Eichten
et al. [Eichten:80a] estimated that $\alpha = 0.52$ and
$\sigma = 0.18\,{\rm GeV}^2$. Thus, a goal of the lattice calculations is to
reproduce these numbers. En route to this goal, it is necessary to show that
the numbers from the lattice are ``scaling''; that is, if one calculates a
physical observable on lattices with different spacings, then one gets the
same answer. This means that the artifacts due to the finiteness of the
lattice spacing have disappeared and continuum physics can be extracted.
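As a small worked example, the sketch below evaluates this potential and its
derivative at a few separations, using the fit values quoted above (the
fm-to-${\rm GeV}^{-1}$ conversion via $\hbar c = 0.1973$ GeV fm is an added
convenience, not part of the original text):

```c
/* Sketch: the Coulomb-plus-linear potential V(R) = -alpha/R + sigma*R
 * and its derivative (the force), which tends to sigma at large R. */
#include <stdio.h>

int main(void)
{
    const double hbarc = 0.1973;   /* GeV fm */
    const double alpha = 0.52;     /* Coulomb coefficient (dimensionless) */
    const double sigma = 0.18;     /* string tension, GeV^2 */

    for (int i = 1; i <= 6; i++) {
        double r_fm = 0.2 * i;                 /* separation in fm */
        double R = r_fm / hbarc;               /* separation in GeV^-1 */
        double V = -alpha / R + sigma * R;     /* potential, GeV */
        double F = alpha / (R * R) + sigma;    /* force, GeV^2 */
        printf("R = %.1f fm   V = %+.3f GeV   F = %.3f GeV^2\n", r_fm, V, F);
    }
    return 0;
}
```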

The first heavy quark potential calculation using a Caltech hypercube was performed on the 64-node Mark I in 1984, on a lattice with the coupling $\beta$ ranging over several values [Otto:84a]. The measured values of $\alpha$ and of the string tension (converted to a dimensionless ratio) were quite a bit off from the charmonium data, but the string tension did appear to be scaling, albeit in a narrow window of $\beta$.

The next time around, in 1986, the calculation was repeated on the 128-node Mark II hypercube [Flower:86b]. The dimensionless string tension decreased somewhat to 83, but clear violations of scaling were observed: The lattice was still too coarse to see continuum physics.

Therefore, the last (1989) calculation, using the Caltech/JPL 32-node Mark IIIfp hypercube, concentrated on one value of the coupling $\beta$ and investigated several different lattice sizes [Ding:90b]. Scaling was not investigated; however, the measured values of $\alpha$ and of the string tension, $\sigma = 0.15\,{\rm GeV}^2$, compare favorably with the charmonium data. This work is based on about 1300 CPU hours on the 32-node Mark IIIfp hypercube, whose performance is roughly twice that of a CRAY X-MP processor. The whole 128-node machine runs this code at 100 times the speed of a single node; this speedup of 100 corresponds to an efficiency of 78 percent. These figures are for the most highly optimized code. The original version of the code, written in C, ran on both the Motorola chips and the much faster Weitek floating-point chips. The communication time, which is roughly the same for both, is less than a 2 percent overhead for the former but nearly 30 percent for the latter. When the computationally intensive parts of the calculation are written in assembly code for the Weitek, this overhead grows to almost 50 percent. This communication time, shown in lines two and three of Table 4.4, is dominated by the hardware/software message startup overhead (latency), because for the Mark IIIfp the node-to-node communication time, $t_{\rm comm}$, is given by

$$t_{\rm comm} = t_{\rm start} + W\,t_{\rm word},$$

where **W** is the number of words transmitted and the startup time
$t_{\rm start}$ dominates for short messages. To speed up the
communication, we update all even (or odd) links (eight in our case) in
each node, allowing us to transfer eight matrix products at a time
instead of sending one in each message. This batching reduces the
communication time per matrix by a factor of roughly

$$\frac{8\,(t_{\rm start} + W\,t_{\rm word})}{t_{\rm start} + 8W\,t_{\rm word}}.$$

On all hypercubes with fast floating-point chips, and on most hypercubes without these chips for less computationally intensive codes, such ``vectorization'' of communication is often important. In Figure 4.10, the speedups for many different lattice sizes are shown. For the largest lattice size, the speedup is 100 on the 128-node machine, and the speedup is almost linear in the number of nodes. As the total lattice volume increases, the speedup increases, because the ratio of calculation to communication increases. For more information on this performance analysis, see [Ding:90c].
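A back-of-the-envelope sketch of why batching pays off under the latency
model above. The values of $t_{\rm start}$ and $t_{\rm word}$ below are
illustrative assumptions, not measured Mark IIIfp constants; $W = 18$ assumes
a 3x3 complex matrix stored as 18 real words:

```c
/* Sketch: one batched message vs. eight separate messages under
 * t_comm(W) = t_start + W*t_word. */
#include <stdio.h>

int main(void)
{
    const double t_start = 300.0;  /* startup latency, microseconds (illustrative) */
    const double t_word  = 2.0;    /* per-word transfer time, microseconds */
    const int W = 18;              /* words per 3x3 complex matrix */
    const int n = 8;               /* matrices per even/odd half-sweep */

    double one_by_one = n * (t_start + W * t_word);  /* n separate messages */
    double batched    = t_start + n * W * t_word;    /* one combined message */
    printf("one-by-one: %.0f us   batched: %.0f us   gain: %.2fx\n",
           one_by_one, batched, one_by_one / batched);
    return 0;
}
```

When $t_{\rm start} \gg W\,t_{\rm word}$, the gain approaches the full factor
of eight.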

**Table 4.4:** Link Update Time (msec) on Mark IIIfp Node for Various Levels
of Programming

**Figure 4.10:** Speedups for QCD on the Mark IIIfp
