To make good use of MIMD distributed-memory machines like hypercubes, one should employ domain decomposition. That is, the domain of the problem should be divided into subdomains of equal size, one for each processor in the hypercube, and communication routines should be written to handle data transfer across the processor boundaries. Thus, for a lattice calculation, the $N$ sites are distributed among the processors using a decomposition that ensures that processors assigned to adjacent subdomains are directly linked by a communication channel in the hypercube topology. Each processor then works independently through its subdomain of sites, updating each one in turn, and communicates with neighboring processors only when updating boundary sites. This communication enforces ``loose synchronization,'' which stops any one processor from racing ahead of the others. Load balancing is achieved by making the subdomains equal in size. If each node contains at least two sites of the lattice, all the nodes can update in parallel while satisfying detailed balance, since loose synchronicity guarantees that all nodes will be updating black sites, then red sites, alternately.
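The mapping requirement described above can be illustrated with a minimal sketch. It assumes a one-dimensional, periodic split of the sites and uses the binary-reflected Gray code, a standard way to place neighboring subdomains on directly linked hypercube nodes; it is not taken from the original hypercube codes. The function names and the labeling of colors as 0 (black) and 1 (red) are hypothetical.

```python
def gray(i):
    """Binary-reflected Gray code of i: consecutive values differ in one bit."""
    return i ^ (i >> 1)


def decompose_ring(n_sites, dim):
    """Split n_sites into 2**dim equal slabs; slab s is assigned to node gray(s).

    Returns a dict mapping node number -> range of site indices it owns.
    Equal slab sizes give the load balance discussed in the text.
    """
    n_nodes = 1 << dim
    if n_sites % n_nodes != 0:
        raise ValueError("sites must divide evenly among the nodes")
    per_node = n_sites // n_nodes
    return {gray(s): range(s * per_node, (s + 1) * per_node)
            for s in range(n_nodes)}


def neighbors_are_linked(dim):
    """Check that adjacent slabs (with periodic wraparound) sit on nodes whose
    numbers differ in exactly one bit, i.e. on directly linked nodes."""
    n_nodes = 1 << dim
    return all(
        bin(gray(s) ^ gray((s + 1) % n_nodes)).count("1") == 1
        for s in range(n_nodes)
    )


def site_color(coords):
    """Checkerboard color of a lattice site: 0 (black) or 1 (red).
    Sites of one color have only neighbors of the other color."""
    return sum(coords) % 2


if __name__ == "__main__":
    owners = decompose_ring(n_sites=128, dim=3)
    for node in sorted(owners):
        print(node, owners[node])
    print("adjacent slabs on linked nodes:", neighbors_are_linked(3))
```

A multidimensional lattice would apply the same Gray-code mapping independently along each decomposed direction, using a separate group of address bits per direction.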
The characteristic timescale of the communication, $t_{\rm comm}$, corresponds roughly to the time taken to transfer a single matrix from one node to its neighbor. Similarly, we can characterize the calculational part of the algorithm by a timescale, $t_{\rm calc}$, which is roughly the time taken to multiply together two matrices. For all hypercubes built without floating-point accelerator chips, $t_{\rm comm} \ll t_{\rm calc}$ and, hence, QCD simulations are extremely ``efficient,'' where efficiency (Equations 3.10 and 3.11) is defined by the relation
\[ \epsilon = \frac{T_1}{k\,T_k}, \]
where $T_k$ is the time taken for $k$ processors to perform the given calculation. Typically, such calculations have efficiencies close to unity, which means they are ideally suited to this type of computation, since doubling the number of processors nearly halves the total computational time required for solution. However, as we shall see (for the Mark IIIfp hypercube, for example), the picture changes dramatically when fast floating-point chips are used; then $t_{\rm comm}$ becomes comparable to $t_{\rm calc}$ and one must take some care in coding to obtain maximum performance.
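As a small worked example, the sketch below takes efficiency to be the speedup divided by the number of processors, which is an assumption here but agrees with the figures quoted later in this section (a speedup of 100 on 128 nodes, efficiency 78 percent). The function names and the timings are hypothetical.

```python
def speedup(t_1, t_k):
    """Speedup: time on one processor divided by time on k processors."""
    return t_1 / t_k


def efficiency(t_1, t_k, k):
    """Efficiency: speedup per processor, T_1 / (k * T_k)."""
    return t_1 / (k * t_k)


if __name__ == "__main__":
    # Hypothetical timings chosen so that 128 nodes give a speedup of 100.
    t_1, t_k, k = 100.0, 1.0, 128
    print("speedup    =", speedup(t_1, t_k))
    print("efficiency =", efficiency(t_1, t_k, k))
```

An efficiency of 1 would mean that doubling the number of processors exactly halves the run time; communication overhead is what pulls the value below 1.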
Rather than describe every calculation done on the Caltech hypercubes, we shall concentrate on one calculation that has been done several times as the machine evolved: the heavy quark potential calculation (``heavy'' because the quenched approximation is used).
QCD provides an explanation of why quarks are confined inside hadrons, since lattice calculations reveal that the inter-quark potential rises linearly as the separation between the quarks increases. Thus, the attractive force (the derivative of the potential) is independent of the separation, unlike other forces, which usually decrease rapidly with distance. This force, called the ``string tension,'' is carried by the gluons, which form a kind of ``string'' between the quarks. On the other hand, at short distances, quarks and gluons are ``asymptotically free'' and behave like electrons and photons, interacting via a Coulomb-like force. Thus, the quark potential $V$ is written as
\[ V(R) = -\frac{\alpha}{R} + \sigma R, \]
where $R$ is the separation of the quarks, $\alpha$ is the coefficient of the Coulombic potential, and $\sigma$ is the string tension. In fitting experimental charmonium data to this Coulomb plus linear potential, Eichten et al. [Eichten:80a] estimated that $\alpha = 0.52$ and $\sigma = 0.18$ GeV$^2$. Thus, a goal of the lattice calculations is to reproduce these numbers. En route to this goal, it is necessary to show that the numbers from the lattice are ``scaling,'' that is, if one calculates a physical observable on lattices with different spacings, then one gets the same answer. This means that the artifacts due to the finiteness of the lattice spacing have disappeared and continuum physics can be extracted.
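To make the fitting step concrete, here is a minimal sketch that assumes the usual Coulomb plus linear form $V(R) = -\alpha/R + \sigma R$. Because this form is linear in $\alpha$ and $\sigma$, an unweighted least-squares fit reduces to a 2x2 system of normal equations. The function names, the synthetic data, and the parameter values are illustrative only; a real lattice analysis would also weight the points by their statistical errors.

```python
def potential(r, alpha, sigma):
    """Coulomb plus linear potential V(R) = -alpha/R + sigma*R."""
    return -alpha / r + sigma * r


def force(r, alpha, sigma):
    """dV/dR = alpha/R**2 + sigma; tends to the constant sigma at large R."""
    return alpha / r**2 + sigma


def fit_coulomb_linear(rs, vs):
    """Unweighted least-squares fit of V = alpha*(-1/R) + sigma*R.

    Solves the 2x2 normal equations for the basis functions
    f1 = -1/R and f2 = R by Cramer's rule. Returns (alpha, sigma).
    """
    a11 = sum(1.0 / r**2 for r in rs)           # sum of f1*f1
    a12 = -float(len(rs))                       # sum of f1*f2 = -1 per point
    a22 = sum(r**2 for r in rs)                 # sum of f2*f2
    b1 = sum(-v / r for r, v in zip(rs, vs))    # sum of f1*V
    b2 = sum(v * r for r, v in zip(rs, vs))     # sum of f2*V
    det = a11 * a22 - a12 * a12
    alpha = (b1 * a22 - a12 * b2) / det
    sigma = (a11 * b2 - a12 * b1) / det
    return alpha, sigma


if __name__ == "__main__":
    # Synthetic, noise-free data generated from illustrative parameters.
    rs = [float(r) for r in range(1, 9)]
    vs = [potential(r, 0.52, 0.18) for r in rs]
    print(fit_coulomb_linear(rs, vs))
    print("force at large R:", force(100.0, 0.52, 0.18))
```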
The first heavy quark potential calculation using a Caltech hypercube was performed on the 64-node Mark I in 1984, over a range of values of the lattice coupling parameter $\beta$ [Otto:84a]. The resulting values of $\alpha$ and of the string tension (converted to a dimensionless ratio) are quite a bit off from the charmonium data, but the string tension did appear to be scaling, albeit only in a narrow window of $\beta$.
The next time around, in 1986, the 128-node Mark II hypercube was used [Flower:86b]. The dimensionless string tension decreased somewhat to 83, but clear violations of scaling were observed: the lattice was still too coarse to see continuum physics.
Therefore, the last (1989) calculation, using the Caltech/JPL 32-node Mark IIIfp hypercube, concentrated on a single value of $\beta$ and investigated four different lattice sizes [Ding:90b]. Scaling was not investigated; however, the resulting values of $\alpha$ and the string tension, the latter corresponding to $\sigma = 0.15$ GeV$^2$, compare favorably with the charmonium data. This work is based on about 1300 CPU hours on the 32-node Mark IIIfp hypercube, which has a performance of roughly twice that of a CRAY X-MP processor.
Measured against the rate of a single node, the performance of the whole 128-node machine corresponds to a speedup of 100 and, hence, an efficiency of $100/128 \approx 78$ percent. These figures are for the most highly optimized code. The original version of the code, written in C, ran on both the Motorola chips and the Weitek chips. The communication time, which is roughly the same for both, is less than a 2 percent overhead for the former but nearly 30 percent for the latter. When the computationally intensive parts of the calculation are written in assembly code for the Weitek, this overhead becomes almost 50 percent. This communication time, shown in lines two and three of Table 4.4, is dominated by the hardware/software message startup overhead (latency), because for the Mark IIIfp the node-to-node communication time, $t_{\rm comm}$, is given by
where $W$ is the number of words transmitted. To speed up the communication, we update all even (or odd) links (eight in our case) in each node, allowing us to transfer eight matrix products at a time instead of sending just one in each message. Because the startup cost is then paid once per message rather than once per matrix product, the communication time drops substantially. On all hypercubes with fast floating-point chips, and on most hypercubes without these chips for less computationally intensive codes, such ``vectorization'' of communication is often important. In Figure 4.10, the speedups for many different lattice sizes are shown. For the largest lattice size, the speedup is 100 on the 128-node machine, and the speedup is almost linear in the number of nodes. As the total lattice volume increases, the speedup increases, because the ratio of calculation to communication increases. For more information on this performance analysis, see [Ding:90c].
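The benefit of sending several matrix products per message can be sketched with a simple cost model. The model assumes that each message costs a fixed startup time plus a time proportional to the number of words sent; the startup and per-word constants below are hypothetical placeholders in arbitrary units, not measured Mark IIIfp values. The figure of 18 words per matrix assumes a 3x3 complex matrix stored as real numbers.

```python
def message_time(words, startup, per_word):
    """Assumed cost model for one message: fixed latency plus a per-word cost."""
    return startup + per_word * words


def transfer_time(n_matrices, words_per_matrix, batch, startup, per_word):
    """Total time to send n_matrices, packing `batch` matrices per message."""
    if n_matrices % batch != 0:
        raise ValueError("batch must divide the number of matrices")
    n_messages = n_matrices // batch
    return n_messages * message_time(batch * words_per_matrix, startup, per_word)


if __name__ == "__main__":
    STARTUP = 100    # hypothetical latency per message (arbitrary units)
    PER_WORD = 1     # hypothetical cost per word (same units)
    WORDS = 18       # assumed: 3x3 complex matrix = 18 real words

    one_at_a_time = transfer_time(8, WORDS, 1, STARTUP, PER_WORD)
    all_at_once = transfer_time(8, WORDS, 8, STARTUP, PER_WORD)
    print("eight separate messages:", one_at_a_time)   # 8 * (100 + 18) = 944
    print("one batched message:    ", all_at_once)     # 100 + 8 * 18 = 244
    print("reduction factor:       ", one_at_a_time / all_at_once)
```

With these placeholder constants the latency term dominates a single-matrix message, so batching pays the startup cost once instead of eight times. The gain shrinks as the per-word cost grows relative to the startup cost.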
Table 4.4: Link Update Time (msec) on Mark IIIfp Node for Various Levels of Programming
Figure 4.10: Speedups for QCD on the Mark IIIfp