To make good use of MIMD distributed-memory machines like hypercubes, one
should employ domain decomposition. That is, the domain of the problem
should be divided into subdomains of equal size, one for each processor in
the hypercube; and communication routines should be written to take care of
data transfer across the processor boundaries. Thus, for a lattice
calculation, the N sites are distributed among the processors
using a decomposition that ensures that processors assigned to adjacent
subdomains are directly linked by a communication channel in the hypercube
topology. Each processor then independently works through its subdomain of
sites, updating each one in turn, and only communicating with
neighboring processors when doing boundary sites. This communication
enforces ``loose synchronization,'' which
stops any one processor from racing ahead of the others. Load
balancing is achieved with equal-size domains. If the nodes contain at
least two sites of the lattice, all the nodes can update in parallel,
satisfying detailed balance, since loose
synchronicity guarantees that all nodes will be doing black, then red
sites alternately.
The characteristic timescale of the communication, , corresponds
roughly to the time taken to transfer a single
matrix from one node to
its neighbor. Similarly, we can characterize the calculational part of the
algorithm by a timescale,
, which is roughly the time taken to
multiply together two
matrices. For all hypercubes built without floating-point accelerator chips,
and, hence,
QCD simulations are extremely ``efficient,'' where efficiency
(Equations 3.10 and 3.11) is defined by
the relation
where is the time taken for k processors to perform the given
calculation. Typically, such calculations have efficiencies in the range
, which means they are ideally suited to this type of
computation since doubling the number of processors nearly halves the
total computational time required for solution. However, as we shall see
(for the Mark IIIfp hypercube, for example), the picture changes dramatically
when fast floating-point chips are used; then
and
one must take some care in coding to obtain maximum performance.
Rather than describe every calculation done on the Caltech hypercubes, we shall concentrate on one calculation that has been done several times as the machine evolved-the heavy quark potential calculation (``heavy'' because the quenched approximation is used).
QCD provides an explanation of why quarks are confined inside hadrons, since lattice calculations reveal that the inter-quark potential rises linearly as the separation between the quarks increases. Thus, the attractive force (the derivative of the potential) is independent of the separation, unlike other forces, which usually decrease rapidly with distance. This force, called the ``string tension,'' is carried by the gluons, which form a kind of ``string'' between the quarks. On the other hand, at short distances, quarks and gluons are ``asymptotically free'' and behave like electrons and photons, interacting via a Coulomb-like force. Thus, the quark potential V is written as
where R is the separation of the quarks, is the coefficient of the
Coulombic potential and
is the string tension. In fitting
experimental charmonium data to this Coulomb plus linear potential, Eichten
et al. [Eichten:80a] estimated that
and
=0.18GeV
. Thus, a goal of the lattice calculations is to
reproduce these numbers. Enroute to this goal, it is necessary to show that
the numbers from the lattice are ``scaling,'' that is, if one calculates a
physical observable on lattices with different spacings then one gets the
same answer. This means that the artifacts due to the finiteness of the
lattice spacing have disappeared and continuum physics can be extracted.
The first heavy quark potential calculation using a Caltech hypercube was
performed on the 64-node Mark I in 1984 on a lattice with
ranging from
to
[Otto:84a]. The value of
was found to be
and the string tension (converting to
the dimensionless ratio)
. The numbers
are quite a bit off from the charmonium data but the string tension did
appear to be scaling, albeit in the narrow window
.
The next time around, in 1986, the 128-node Mark II hypercube was used on a
lattice with
[Flower:86b]. The
dimensionless string tension decreased somewhat to 83, but clear violations
of scaling were observed: The lattice was still too coarse to see continuum
physics.
Therefore, the last (1989) calculation using the Caltech/JPL 32-node
Mark IIIfp hypercube concentrated on one value,
, and
investigated different lattice sizes:
,
,
,
[Ding:90b]. Scaling was not
investigated; however, the values of
and
, that is,
= 0.15GeV
,
compare favorably with the charmonium data. This work is based on
about 1300 CPU hours on the 32-node Mark IIIfp hypercube, which has a
performance of roughly twice a CRAY X-MP processor. The
whole 128-node machine performs at
. As each node
runs at
, this corresponds to a speedup of 100, and
hence, an efficiency of 78 percent. These figures are for the most
highly optimized code. The original version of the code written in C
ran on the Motorola chips at
and on the Weitek
chips at
. The communication time, which is roughly
the same for both, is less than a 2 percent overhead for the former but
nearly 30 percent for the latter. When the computationally intensive
parts of the calculation are written in assembly code for the Weitek,
this overhead becomes almost 50 percent. This
of
communication, shown in lines two and three in Table 4.4, is
dominated by the hardware/software message startup overhead
(latency), because for the Mark IIIfp the node-to-node
communication time,
, is given by
where W is the number of words transmitted. To speed up the
communication, we update all even (or odd) links (eight in our case) in
each node, allowing us to transfer eight matrix products at a time,
instead of just sending one in each message. This reduces the by a factor of
to . On all hypercubes with fast floating-point
chips-and on most hypercubes without these chips for less
computationally intensive codes-such ``vectorization'' of
communication is often important. In Figure 4.10, the
speedups for many different lattice sizes are shown. For the largest
lattice size, the speedup is 100 on the 128-node machine. The speedup
is almost linear in number of nodes. As the total lattice volume
increases, the speedup increases, because the ratio of
calculation/communication increases. For more information on this
performance analysis,
see [Ding:90c].
Table 4.4: Link Update Time (msec) on Mark IIIfp Node for Various Levels
of Programming
Figure 4.10: Speedups for QCD on the Mark IIIfp