The UltraSPARC IV is the fourth generation of the UltraSPARC family. Like HP with the PA-RISC 8800 and IBM with the POWER4/5, Sun has put two processor cores on a chip. The CPU cores are in fact slightly modified UltraSPARC III processors. The chip is built in 0.13 µm CMOS technology and currently runs at a clock frequency of 1.2 GHz. UltraSPARCs are used in all Sun products, from workstations to the heavy Fire 3800-15K servers. We show a block diagram of the processor core and its embedding in the UltraSPARC IV chip in Figure 15a.
Figure 15a: Block diagram of the UltraSPARC IV processor core.
The processor is characterised by a large number of caches of various sorts,
as can be seen in Figure 15a. Apart from a 4-way set-associative data cache of
64 KB, the Data Cache Unit (DCU) also contains a write cache and a pre-fetch
cache, both of 2 KB. All these L1 caches operate at half speed: loads and
stores from the processor can be done in 2 cycles.
The pre-fetch cache is independent of
the data cache and can load data when this is deemed appropriate. The write
cache defers writes to the L2 cache and so may avoid unnecessary writes of
individual bytes until entire cache lines have to be updated. The Instruction
Issue Unit (IIU) contains the 32 KB 4-way set-associative instruction cache
together with the instruction TLB, which is called the Instruction Translation
Buffer in Sun's terminology. The IIU also contains a so-called miss queue that
holds instructions that are immediately available to the execute units when a
branch has been mispredicted. Branch prediction is fully static in the
UltraSPARC III. It is implemented as a 16 KB table in the IIU that is pipelined
because of its size.
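The benefit of deferring writes in the write cache, as described above, can be
illustrated with a small model. The sketch below is not Sun's implementation:
the 64-byte line size, the LRU replacement policy, and all function names are
assumptions made for illustration only. Only the 2 KB capacity comes from the
text.

```python
from collections import OrderedDict

LINE_SIZE = 64                 # bytes per line: ASSUMED, not given in the text
WRITE_CACHE_BYTES = 2 * 1024   # 2 KB write cache, as stated in the text


def l2_writes_without_write_cache(store_addresses):
    """Every store is sent to the L2 cache individually."""
    return len(store_addresses)


def l2_writes_with_write_cache(store_addresses,
                               line_size=LINE_SIZE,
                               capacity_bytes=WRITE_CACHE_BYTES):
    """Count L2 writes when stores to the same line are merged first.

    A line is written to L2 only when it is evicted (LRU policy, an
    assumption) or when the remaining lines are flushed at the end.
    """
    max_lines = capacity_bytes // line_size
    pending = OrderedDict()    # line index -> set of dirty byte offsets
    l2_writes = 0
    for addr in store_addresses:
        line = addr // line_size
        if line in pending:
            pending.move_to_end(line)        # mark as most recently used
        else:
            if len(pending) == max_lines:
                pending.popitem(last=False)  # evict least recently used line
                l2_writes += 1
            pending[line] = set()
        pending[line].add(addr % line_size)
    return l2_writes + len(pending)          # final flush of pending lines


# 256 single-byte stores to consecutive addresses touch only 4 lines.
byte_stores = list(range(256))
print(l2_writes_without_write_cache(byte_stores))  # 256
print(l2_writes_with_write_cache(byte_stores))     # 4
```

In this toy model the 256 byte stores collapse into 4 line-sized writes to L2,
which is the kind of saving the write cache is meant to provide.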
The Integer Execute Unit (IEU) has two Add/Logical Units and a branch
unit. Integer adds and multiplies are pipelined but the divide
operation is not. It is performed by an Arithmetic Special Unit (not
shown in the figure) that does not burden the pipelines for the ALUs.
The integer register file is effectively divided in two parts, which Sun calls
the Working and Architectural Register Files. Operands are
accessed and results stored in the working registers. When an exception
occurs, the results to be undone in the working registers are overwritten
by those from the architectural file. One of the enhancements with
respect to the original UltraSPARC III design is the addition of hash indexing
for the write cache. This should decrease the number of write misses and thus
leave more store bandwidth for results that need storing.
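The interplay between the working and architectural registers described above
can be sketched as follows. This is a conceptual model only: the class and
method names, the register count, and the explicit commit step are assumptions
introduced for illustration and are not taken from Sun's documentation.

```python
class SplitRegisterFile:
    """Toy model of a working/architectural register file pair."""

    def __init__(self, n_regs=32):           # register count is an assumption
        self.architectural = [0] * n_regs    # committed, known-good state
        self.working = [0] * n_regs          # state seen by executing code

    def read(self, reg):
        # Operands are always read from the working registers.
        return self.working[reg]

    def write(self, reg, value):
        # Results are first stored in the working registers only.
        self.working[reg] = value

    def commit(self, reg):
        # ASSUMED step: once an instruction completes without an exception,
        # its result is copied into the architectural file.
        self.architectural[reg] = self.working[reg]

    def exception(self):
        # Undo uncommitted results by overwriting the working registers
        # with the contents of the architectural file.
        self.working = list(self.architectural)


rf = SplitRegisterFile()
rf.write(1, 10)
rf.commit(1)       # r1 = 10 is now committed
rf.write(1, 99)    # not yet committed
rf.write(2, 7)     # not yet committed
rf.exception()     # roll back to the committed state
print(rf.read(1), rf.read(2))  # 10 0
```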
The floating-point unit (FPU) has two independent pipelined units for
addition and multiplication and a non-pipelined unit for floating-point
division and square-root computation, which require in the order of
20--25 cycles. The FPU also contains graphics hardware (not shown in
Figure 15a) that shares the pipelined adder
and multiplier with general 64-bit calculations. For the chips
delivered at 1.2 GHz, the theoretical peak performance is 2.4 Gflop/s per
processor core. It is expected that the UltraSPARC IV technology can be
shrunk to reach a slightly higher clock frequency by the end of its life
cycle. In the UltraSPARC IV the FPUs are enhanced by adding hardware
support for handling IEEE 754 floating-point errors (which can otherwise be
very costly when handled properly).
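The quoted peak performance follows directly from the clock frequency and the
two pipelined floating-point units. The short calculation below makes this
explicit, assuming that each core can deliver one addition and one
multiplication result per cycle. The per-chip figure is derived here from the
two cores per chip and is not quoted in the text; the function name is
illustrative.

```python
CLOCK_HZ = 1.2e9       # 1.2 GHz, as stated in the text
FLOPS_PER_CYCLE = 2    # ASSUMED: one add plus one multiply result per cycle
CORES_PER_CHIP = 2     # two processor cores per chip, as stated in the text


def peak_gflops(clock_hz, flops_per_cycle, cores=1):
    """Theoretical peak in Gflop/s: clock rate x flops per cycle x cores."""
    return clock_hz * flops_per_cycle * cores / 1e9


per_core = peak_gflops(CLOCK_HZ, FLOPS_PER_CYCLE)
per_chip = peak_gflops(CLOCK_HZ, FLOPS_PER_CYCLE, CORES_PER_CHIP)
print(f"{per_core:.1f} Gflop/s per core")  # 2.4 Gflop/s per core
print(f"{per_chip:.1f} Gflop/s per chip")  # 4.8 Gflop/s per chip
```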
The size of the L2 cache is 2×8 MB, each part being targeted by one of the
processor cores. The L2 cache lines have been shortened from 512 B to 128 B in
the UltraSPARC IV to reduce data contention.
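To give a feeling for what the shorter L2 cache lines mean, the sketch below
computes the number of lines in one 8 MB half of the L2 cache for both line
sizes and checks whether two addresses fall into the same line. The function
names and the example addresses are chosen for illustration only, and
associativity is deliberately left out because it is not given in the text.

```python
L2_BYTES_PER_CORE = 8 * 1024 * 1024   # 8 MB per core, as stated in the text


def number_of_lines(capacity_bytes, line_bytes):
    """Number of cache lines that fit in the given capacity."""
    return capacity_bytes // line_bytes


def same_line(addr_a, addr_b, line_bytes):
    """True if both byte addresses map to the same cache line."""
    return addr_a // line_bytes == addr_b // line_bytes


print(number_of_lines(L2_BYTES_PER_CORE, 512))  # 16384
print(number_of_lines(L2_BYTES_PER_CORE, 128))  # 65536

# Two data items 256 bytes apart (hypothetical addresses) share one
# 512 B line but occupy different 128 B lines.
print(same_line(0, 256, 512))  # True
print(same_line(0, 256, 128))  # False
```

With 128 B lines, unrelated data items that lie a few hundred bytes apart no
longer end up in the same line, which is one way in which shorter lines can
reduce contention.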
The chip layout is shown in Figure 15b.
Figure 15b: Chip layout of the UltraSPARC IV processor.
As is evident from Figure 15b,
the Memory Control Unit (MCU) is on chip, as are the L2
cache controller (in the MCU) and the L2 cache tags. This shortens the latency
of accesses to both memory levels. In addition, both controllers communicate
with the System Interface Unit (SIU), which is also on chip, to keep in touch
with the snoop pipe controller in the SIU. The processor has been built with
multi-processing in mind and the snoop controller keeps track of data requests
in the whole system to ensure coherency of the caches when required.
The UltraSPARC IV has been available since February 2004. Sun refers to having
the two processor cores on a chip and running one execution thread on each of
them as Chip Multithreading (CMT). This is not quite what one would normally
understand as multi-threading because one would then expect more execution
threads per processor core. So, the CMT terminology is somewhat confusing and
one would hope that Sun will drop it in favour of the common use of the term.
At the moment it is not entirely clear what lies ahead with respect to the SPARC
family. On the one hand, the cooperation with Fujitsu-Siemens has intensified
and Sun will market its systems with the SPARC64 implementation of
Fujitsu-Siemens. The SPARC64 line is "classic" in the sense that it is a
standard RISC design, although dual-core versions will reach the market by the
end of 2004. On the other hand, as of April 2004 Sun has shelved its own plans
to produce the UltraSPARC V and VI processors in favour of processor designs
with many (≥ 8) processor cores, each capable of handling several execution
threads. The first of these processors, named Niagara, is in the testing phase.
It has 8 cores, each of which can execute 4 threads. It will probably be ready
for the market by the end of 2005. Sun wants to pursue this kind of processing,
dubbed "throughput processing", in subsequent generations as well: the so-called
Rock processor will contain more cores and more threads per core. It is expected
that some kind of specialisation in the cores will be introduced because it
would be difficult to manage such massive execution thread bundles optimally.