Even if the vector lengths are long enough to overcome the vector startup overhead, the peak rate of the arithmetic pipelines may not be realised because of the delays associated with obtaining data from the cache or main memory of the computer. The POLY1 and POLY2 benchmarks quantify this dependence of computer performance on memory access bottlenecks. The computational intensity, f, of a DO-loop is defined as the number of floating-point operations performed per memory reference to an element of a vector variable [19]. The asymptotic performance, \rinf, of a computer is observed to increase with the computational intensity, because as f becomes larger the effects of memory access delays become negligible compared with the time spent on arithmetic. This effect is characterised by the two parameters (\rhat, \fhalf), where \rhat is the peak hardware performance of the arithmetic pipeline, and \fhalf is the computational intensity required to achieve half this rate. That is to say, the asymptotic performance is given by:
\rinf = \frac{\rhat}{(1+\fhalf/f)} (1)
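The behaviour of Eqn.(1) can be checked with a short sketch (the function name and the example parameter values are illustrative, not taken from the benchmarks):

```python
def r_inf(r_hat, f_half, f):
    """Asymptotic rate from Eqn (1): r_inf = r_hat / (1 + f_half / f).

    r_hat  -- peak hardware rate of the arithmetic pipeline (Mflop/s)
    f_half -- computational intensity giving half the peak rate
    f      -- computational intensity of the loop being run
    """
    return r_hat / (1.0 + f_half / f)

# At f = f_half the pipeline delivers exactly half its peak rate,
# and for f >> f_half the rate approaches r_hat.
```

This makes the role of \fhalf concrete: a machine with a large \fhalf needs a very arithmetic-heavy loop before the memory bottleneck stops dominating.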
If memory access and arithmetic are not overlapped, then \fhalf can be shown to be the ratio of arithmetic speed (in Mflop/s) to memory access speed (in Mword/s) [19]. The parameter \fhalf, like \nhalf, measures an unwanted overhead and should be as small as possible. In order to vary f and allow the peak performance to be approached, we choose a kernel loop that can be computed with maximum efficiency on any hardware: the evaluation of a polynomial by Horner's rule, for which the computational intensity equals the order of the polynomial, and both the multiply and add pipelines can be used in parallel. To measure \fhalf, the order of the polynomial is increased from one to ten, and the measured performance for long vectors is fitted to Eqn.(1).
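The Horner kernel, and why its computational intensity equals the polynomial order, can be sketched as follows (a plain illustration, not the Fortran loop used in the benchmarks):

```python
def horner_kernel(coeffs, x):
    """Evaluate a polynomial at every element of x by Horner's rule.

    coeffs[0] is the highest-order coefficient.  For a polynomial of
    order n, each element of x costs n multiplies and n adds (2n flops)
    against two vector-memory references (load x[i], store y[i]), so
    the computational intensity is f = 2n / 2 = n, the order itself.
    """
    y = []
    for xi in x:
        acc = coeffs[0]
        for c in coeffs[1:]:
            # one multiply and one add per coefficient after the first
            acc = acc * xi + c
        y.append(acc)
    return y
```

Because each multiply is paired with an add, a machine with independent multiply and add pipelines can keep both busy, which is why this kernel can approach the peak hardware rate \rhat.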
The POLY1 benchmark repeats the polynomial evaluation for each order typically 1000 times, for vector lengths up to 10,000 that would normally fit into the cache of a cache-based processor. Except for the first evaluation, the data will therefore be found in the cache, making POLY1 an in-cache test of the memory bottleneck between the arithmetic registers of the processor and its cache.
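A POLY1-style measurement can be sketched as below. This is a minimal illustration of the protocol only: the function name, repetition count, and use of Python timing are assumptions, and the real benchmark is a compiled Fortran code in which the kernel, not the interpreter, dominates the time.

```python
import time

def time_poly1(coeffs, x, nrep=1000):
    """Time nrep repeated Horner evaluations over one vector and
    return the rate in Mflop/s.  The vector is assumed small enough
    to be cache-resident after the first pass, so all but the first
    evaluation measure the register-to-cache bottleneck."""
    n = len(coeffs) - 1            # polynomial order = computational intensity
    t0 = time.perf_counter()
    for _ in range(nrep):
        y = []
        for xi in x:
            acc = coeffs[0]
            for c in coeffs[1:]:
                acc = acc * xi + c
            y.append(acc)
    t1 = time.perf_counter()
    flops = 2.0 * n * len(x) * nrep    # n multiplies + n adds per element
    return flops / (t1 - t0) / 1.0e6
```

Repeating the measurement for orders one to ten and fitting the long-vector rates to Eqn.(1) yields the in-cache (\rhat, \fhalf) pair.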
POLY2, on the other hand, flushes the cache prior to each different order and then performs only one polynomial evaluation, for vector lengths from 10,000 up to 100,000, which would normally exceed the cache size. Data will have to be brought from off-chip memory, and POLY2 is an out-of-cache test of the memory bottleneck between off-chip memory and the arithmetic registers.
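The cache-flushing step that distinguishes POLY2 can be sketched as follows. The buffer size and stride are assumptions (the buffer must exceed the last-level cache of the machine under test), and in the real benchmark this is done in compiled code; the Python version only illustrates the protocol of evicting the working set before the single timed evaluation.

```python
def flush_cache(nbytes=32 * 1024 * 1024):
    """Evict previously cached benchmark data by streaming through a
    buffer assumed larger than any cache level, touching one byte per
    assumed 64-byte cache line.  Returns the accumulated sum so the
    traversal cannot be optimised away."""
    buf = bytearray(nbytes)
    s = 0
    for i in range(0, nbytes, 64):
        s += buf[i]
    return s
```

POLY2 calls such a flush before each order and then times a single evaluation, so every operand must travel from off-chip memory; fitting those rates to Eqn.(1) gives a larger \fhalf than the in-cache POLY1 value.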
The POLY1 benchmark exists as MOD1G of the EuroBen benchmarks [20]. POLY2 exists as part of the Hockney benchmarks.