Intel Pentium 4/Xeon Nocona

Although Pentium processors are not used in integrated parallel systems these days, they play a major role in the cluster community, as most compute nodes in Beowulf clusters are of this type. We therefore also briefly discuss this type of processor. In practice it is mostly the Xeon that is used, as it can accommodate a larger cache than the Pentium 4 and has some provisions for use in a multiprocessor environment. As virtually all Intel-based clusters use 2-processor nodes, the Xeon is the most appropriate processor. Apart from the differences mentioned here, the architectures of the Pentium 4 and the Xeon are identical.
Intel provides only scant information on its processors. A rough block diagram of the P4 processor can therefore only be synthesized from various sources. It is shown in Figure 13.

Figure 13: Block diagram of the Intel Xeon Nocona.

We show here the Xeon variant with the large secondary cache, which, together with the other additional features of the Nocona, fits on the chip because of the advanced 90 nm technology used to fabricate it. There are a number of distinctive features with respect to the earlier Pentium generations.
There are two main ways to increase the performance of a processor: raising the clock frequency and increasing the number of instructions per cycle (IPC). These two approaches are generally in conflict: increasing the IPC makes the chip more complicated, which has a negative impact on the clock frequency because more work has to be done and organised within the same clock cycle. Chip designers very seldom succeed in raising both clock frequency and IPC simultaneously, and this could not be done in the Pentium 4 either. Intel opted for a high clock speed (initially about 40% higher than that of the Pentium III with the same fabrication technology) while the IPC decreased by 10–20%. This still gives a net performance gain, even if no other changes had been made to the processor.
To sustain the very high clock rate of the present processors, currently about 3.8 GHz, a very deep instruction pipeline is required: it has no fewer than 31 stages, where the Pentium III had 10. Although this favours a high clock rate, the penalty for a pipeline miss (e.g., a branch mis-prediction) is much heavier, and therefore Intel has improved the branch prediction by increasing the size of the Branch Target Buffer from 0.5 to 4 KB. In addition, the Pentium 4 has an execution trace cache that holds partly decoded instructions of earlier execution traces that can be drawn upon, thus forgoing the instruction decode phase that might produce holes in the instruction pipeline. The allocator dispatches the decoded instructions, "micro operations" (µops), to the appropriate µop queue: one for memory operations, another for integer and floating-point operations.
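The clock-versus-IPC trade-off and the cost of a deep pipeline can be checked with a little arithmetic. The sketch below is illustrative only: the clock and IPC ratios come from the figures quoted above, but the workload numbers (fraction of branches, mis-prediction rate) and the assumption that the mis-prediction penalty is roughly equal to the pipeline depth are hypothetical and not taken from Intel documentation.

```python
def net_speedup(clock_ratio, ipc_ratio):
    """Relative performance is the product of the clock ratio and the IPC ratio."""
    return clock_ratio * ipc_ratio


def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction, including branch mis-prediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Figures from the text: clock about 40% higher, IPC 10-20% lower.
low = net_speedup(1.40, 0.80)
high = net_speedup(1.40, 0.90)
print(f"net gain over the Pentium III: {low:.2f}x to {high:.2f}x")

# Hypothetical workload: base CPI of 1.0, 20% branches, 5% of them mis-predicted.
# The penalty is taken as the pipeline depth in stages (a rough assumption).
for stages in (10, 31):
    cpi = effective_cpi(1.0, 0.20, 0.05, stages)
    print(f"{stages:2d}-stage pipeline: effective CPI {cpi:.2f}")
```

Under these assumed numbers the same mis-prediction rate costs about three times as many stall cycles on the 31-stage pipeline, which is why better branch prediction matters more as the pipeline gets deeper.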
Two integer Arithmetic/Logical Units (ALUs) are kept simple so that they can run at twice the clock speed. In addition there is an ALU for complex integer operations that cannot be executed within one cycle. There is only one floating-point functional unit, which delivers one result per cycle. However, besides the normal floating-point unit there are additional units that execute the Streaming SIMD Extensions 2 and 3 (SSE2/3) repertoire of instructions, a 144-member instruction set that is especially meant for multimedia and 3-D visualisation applications. The operands for these units are 128 bits long. The Intel compilers are able to address the SSE2/3 units, which in principle makes it possible to achieve twice the floating-point performance.
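The factor of two follows from the operand length: a 128-bit operand holds two 64-bit floating-point numbers, so one packed instruction can do the work of two scalar ones. The following is a minimal sketch of that counting argument in plain Python. It does not use the SSE2/3 units; the function names are made up for illustration.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements that fit side by side in one SIMD operand."""
    return register_bits // element_bits


def packed_add(a, b, lanes):
    """Add two equal-length lists 'lanes' elements at a time.

    Returns the element-wise sum and the number of (simulated) add
    instructions that were needed.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result, instructions = [], 0
    for i in range(0, len(a), lanes):
        result.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        instructions += 1
    return result, instructions


SSE_OPERAND_BITS = 128                             # operand length given in the text
double_lanes = simd_lanes(SSE_OPERAND_BITS, 64)    # 64-bit floating-point numbers
single_lanes = simd_lanes(SSE_OPERAND_BITS, 32)    # 32-bit floating-point numbers

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
scalar_sum, scalar_instr = packed_add(a, b, 1)
packed_sum, packed_instr = packed_add(a, b, double_lanes)
print(double_lanes, single_lanes, scalar_instr, packed_instr)
```

The halved instruction count is an upper bound on the gain; whether a real code reaches it depends on whether the compiler can vectorise the loops in question.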
The Xeon Nocona boasts so-called Hyperthreading: under some circumstances two threads can run concurrently on the processor. This may, for instance, be used for speculative execution of if branches. Experiments have shown that performance improvements of up to 30% can be attained for a variety of codes. In practice, however, the performance gain is about 3–5%.
The primary cache was quite small by today's standards: 8 KB. Since the so-called Prescott implementation of the processor it has been doubled to 16 KB, however at the cost of a higher latency in shipping data to the functional units: where it was 2 cycles before, it is now 3 cycles.
The largest difference with the former processors, however, is the ability to run (and address) 64-bit codes. In this Intel follows AMD, in fact copying the approach used in the AMD Opteron and Athlon processors. Intel calls the technique Extended Memory 64 Technology (EM64T). In principle it uses "unused bits" in the instruction words of the x86 instruction set to signal whether a 64-bit version of an instruction should be executed. Of course some additional devices are needed for operating in 64-bit mode. These include 8 new general purpose registers (GPRs), 8 new registers for SSE2/3 support, and 64-bit wide GPRs and instruction pointers.
Much will depend on the availability of compilers that are able to take advantage of all the facilities present in the Nocona processor (Intel claims that a 30% performance improvement is possible). If they can, the processor could be an interesting basis for HPC clusters.
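To make the signalling mechanism more concrete: in the published 64-bit extension of the x86 instruction set, the extra bits are carried by an optional one-byte prefix, called REX in the AMD and Intel manuals, with the bit layout 0100WRXB. The W bit selects the 64-bit operand size, and the R, X and B bits each add a fourth bit to a 3-bit register field, which is how the 8 additional registers become addressable. The sketch below decodes such a prefix; the function names are illustrative, and it assumes the byte stream is being interpreted in 64-bit mode.

```python
def decode_rex(byte):
    """Return the REX flag bits if 'byte' is a REX prefix (0x40-0x4F), else None."""
    if (byte & 0xF0) != 0x40:
        return None
    return {
        "W": (byte >> 3) & 1,  # 1 selects the 64-bit operand size
        "R": (byte >> 2) & 1,  # fourth bit for the ModRM 'reg' field
        "X": (byte >> 1) & 1,  # fourth bit for the SIB 'index' field
        "B": byte & 1,         # fourth bit for the ModRM 'r/m' (or SIB base) field
    }


def register_number(extension_bit, three_bit_field):
    """Combine a REX extension bit with a 3-bit field into a register number 0-15."""
    return (extension_bit << 3) | three_bit_field


# The bytes 48 89 D8 encode "mov rax, rbx"; without the 0x48 prefix,
# 89 D8 is the 32-bit "mov eax, ebx".
code = bytes([0x48, 0x89, 0xD8])
rex = decode_rex(code[0])
modrm = code[2]
reg = register_number(rex["R"], (modrm >> 3) & 7)  # source register
rm = register_number(rex["B"], modrm & 7)          # destination register
print(rex, reg, rm)
```

With the R or B bit set, the same 3-bit fields address registers 8 to 15, which are the new GPRs mentioned above.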

Aad van der Steen
Mon Oct 11 15:53:45 CEST 2004