Consider, first, the assignment of processes, (p,q), to physical
processors. In general,
more than one process may be assigned to a processor, so the problem may be
overdecomposed. To avoid load imbalance the same number of processes should
be assigned to each processor as nearly as possible. If this condition is
satisfied, the assignment of processes to processors can still affect
performance by influencing the communication overhead.
On recent distributed memory machines, such as the Intel Delta and CM-5,
the time to send a single message between two processors is largely
independent of their physical location
[27, 42, 43],
and hence the assignment of processes to
processors does not have much direct effect on performance. However, when
a collective communication task, such as
a broadcast, is being done, contention for physical resources can degrade
performance. Thus, the way in which processes are assigned to processors
can affect performance if some assignments result in differing amounts
of contention. Logarithmic contention-free broadcast algorithms have
been developed for processors connected as a two-dimensional mesh
[6, 45],
so on
such machines process (p,q) is usually mapped to the processor at position
(p,q) in the mesh of processors. Such an assignment also ensures that the
multiple one-dimensional broadcasts of and along the rows and
columns of the template, respectively, do not give rise to contention.