Figure 1 already showed one subclass of this type of machine. In fact, the single-processor vector machine discussed there is a special case of a more general type: the figure shows that one system may contain more than one FPU and/or VPU.
The main problem in shared-memory systems is the connection of the CPUs to each other and to the memory. As more CPUs are added, the collective bandwidth to the memory should ideally increase linearly with the number of processors, while each processor should preferably communicate directly with all others without the much slower alternative of using the memory as an intermediate stage. Unfortunately, full interconnection is quite costly: the number of connections grows as O(n^2) while the number of processors grows only as O(n). So various alternatives have been tried. Figure 3 shows some of the interconnection structures that are (and have been) used.
As can be seen from the figure, a crossbar uses n^2 connections and an Ω-network uses n log2 n connections, while with the central bus there is only one connection. This is reflected in the use of each connection path for the different types of interconnection: for a crossbar each data path is direct and does not have to be shared with other elements. In the case of the Ω-network there are log2 n switching stages, and as many data items may have to compete for any path. For the central data bus all data have to share the same bus, so n data items may compete at any time.
Figure 3: Some examples of interconnection structures used in
shared-memory MIMD systems.
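To make the difference in cost concrete, the following minimal sketch tabulates the number of connections for the three structures. It assumes the usual counts of n^2 for a crossbar, n log2 n for an Ω-network, and a single connection for a central bus, and it assumes that n is a power of two. The function name `connection_counts` is purely illustrative.

```python
def connection_counts(n):
    """Return the assumed number of connections for n processors.

    Assumes n is a power of two so that log2(n) is an integer.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1          # exact integer log2(n)
    return {
        "crossbar": n * n,               # every input wired to every output
        "omega": n * stages,             # n links in each of log2(n) stages
        "bus": 1,                        # one shared connection
    }


if __name__ == "__main__":
    print(f"{'n':>5} {'crossbar':>9} {'omega':>7} {'bus':>4}")
    for n in (4, 16, 64, 256):
        c = connection_counts(n)
        print(f"{n:>5} {c['crossbar']:>9} {c['omega']:>7} {c['bus']:>4}")
```

The ratio between the crossbar and Ω-network counts is n / log2 n, so the gap widens steadily as processors are added.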
The bus connection is the least expensive solution, but it has the
obvious drawback that bus contention may occur, thus slowing down the
computations. Various intricate strategies have been devised using
caches associated with the CPUs to minimise the bus traffic. This
leads, however, to a more complicated bus structure, which raises the
costs. In practice it has proved very hard to design buses that are
fast enough, especially as processor speeds have been increasing very
quickly. This imposes an upper bound on the number of processors thus
connected, which in practice appears not to exceed 10-20. In 1992 a new
standard (IEEE 1596) was defined for a fast bus to connect either
internal system components or external systems. This bus, called the
Scalable Coherent Interface (SCI), should provide a point-to-point
bandwidth of 200-1,000 Mbyte/s. It is in fact used in the HP/Convex
SPP-1200, but could also be used within a network of workstations. The
SCI is much more than a simple bus and it can act as the hardware
network framework for distributed computing, see [#sci#].
The Ω-network is a structure situated somewhere in between a bus and a
crossbar with respect to potential capacity and costs. Of the commercially
available machines, the IBM SP2, the Meiko CS-2, and the Cenju-3 use this
network structure at this moment, and a number of experimental machines have
also used this or a similar kind of interconnection. The BBN TC2000, which
acted as a virtual shared-memory MIMD system, used an analogous type of network
(a Butterfly network), and it is quite conceivable that new machines may use
it, especially as the number of processors grows. For a large number of
processors the n log2 n connections quickly become more attractive than the
n^2 used in crossbars. Of course, the switches at the intermediate levels
should be sufficiently fast to cope with the bandwidth required.
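How data items can end up competing for a path in a multistage network may be illustrated with the textbook destination-tag routing scheme for an Ω-network. This is a sketch only: it assumes n = 2^k ports, a perfect shuffle in front of each of the k stages, and 2x2 switches that select their output from one bit of the destination address. The helper names `route` and `has_conflict` are hypothetical, and nothing here is claimed to match the switches of the machines named above.

```python
def route(src, dst, k):
    """Return the link position of a message after each of the k stages."""
    n = 1 << k
    pos = src
    path = []
    for stage in range(k):
        # Perfect shuffle: rotate the k-bit position left by one bit.
        pos = ((pos << 1) | (pos >> (k - 1))) & (n - 1)
        # 2x2 switch: the next destination bit (most significant first)
        # selects the upper (0) or lower (1) switch output.
        bit = (dst >> (k - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def has_conflict(pairs, k):
    """True if two messages need the same link after the same stage."""
    paths = [route(s, d, k) for s, d in pairs]
    for stage in range(k):
        links = [p[stage] for p in paths]
        if len(set(links)) != len(links):
            return True
    return False


if __name__ == "__main__":
    k = 3  # 8 ports, 3 switching stages
    print(route(5, 3, k))                      # positions after each stage
    print(has_conflict([(0, 0), (4, 1)], k))   # two messages sharing a link
```

After k stages every bit of the source position has been replaced by a destination bit, so a message always arrives at its destination. Two messages sent at the same time may nevertheless need the same link, and one of them then has to wait, which is the competition a crossbar avoids.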
In all present-day multi-processor vector processors crossbars are used. This
is still feasible because the maximum number of processors in a system is still
rather small (32 at most presently). Should the number of processors
increase, however, technological problems might arise. Not only does it become
harder to build a crossbar of sufficient speed for larger numbers of
processors, but the processors themselves generally also increase in speed
individually, compounding the problem of matching the speed of the crossbar to
the bandwidth required by the processors.
Whichever network is used, the type of processor could in principle be
arbitrary for any topology. In practice, however, bus-structured
machines do not have vector processors, as their speeds would be
grossly mismatched with any bus that could be constructed at reasonable
cost. RISC processors, however, are employed in bus-oriented systems.
Caches can sometimes alleviate the bandwidth problem, but only when the
data-access pattern allows them to be used sufficiently.
The systems discussed in this subsection are of the MIMD type and therefore
different tasks may run on different processors simultaneously. In many cases
synchronisation between tasks is required, and here again the interconnection
structure is very important. Most vector processors employ special
communication registers within the CPUs by which they can communicate directly
with the other CPUs they have to synchronise with. A minority of systems
synchronise via the shared memory. Generally this is much slower, but it may
still be acceptable when synchronisation occurs relatively seldom. Of
course, for bus-based systems communication also has to be done via a bus.
This bus is mostly separate from the data bus to ensure maximum speed for the
synchronisation.
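As a software analogy for synchronisation via shared memory, the sketch below lets several tasks write partial results into a shared list and meet at a barrier before one of them combines the results. It uses Python threads and the standard `threading.Barrier` purely for illustration; it says nothing about the hardware communication registers mentioned above, and the function name `parallel_sum` and the task count are assumptions.

```python
import threading


def parallel_sum(data, n_tasks=4):
    """Sum data with n_tasks cooperating tasks that share memory."""
    partial = [0] * n_tasks               # shared memory visible to all tasks
    result = []
    barrier = threading.Barrier(n_tasks)  # synchronisation point

    def task(rank):
        # Each task handles every n_tasks-th element, starting at its rank.
        partial[rank] = sum(data[rank::n_tasks])
        # No task proceeds until all partial results have been written.
        barrier.wait()
        if rank == 0:
            result.append(sum(partial))

    threads = [threading.Thread(target=task, args=(r,)) for r in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(parallel_sum(list(range(100))))
```

The tasks synchronise only once, after all the arithmetic is done, which corresponds to the case where synchronisation occurs relatively seldom and its cost matters little.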
Jack Dongarra
Sat Feb 10 15:12:38 EST 1996