SINGLE PROCESSOR LOW-LEVEL BENCHMARKS


The single-processor low-level benchmarks provided by PARKBENCH aim to measure performance parameters that characterise the basic architecture of the computer and the compiler software through which it is used. For this reason, such benchmarks have also, appropriately, been called basic architectural benchmarks. Following the methodology of Euroben, the aim is that these hardware/compiler parameters will be used in performance formulae that predict the timing and performance of the more complex kernels and compact applications. They are therefore a set of synthetic benchmarks contrived to measure theoretical parameters that describe the severity of some overhead or potential bottleneck, or the properties of some item of hardware.

The fundamental measurement in any benchmarking is that of elapsed wall-clock time. Because the computer clocks on each node of a multi-node MPP are not synchronized, all benchmark time measurements must be made with a single clock on one node of the system. The benchmarks TICK1 and TICK2 have been designed, respectively, to measure the resolution and to check the absolute value of this clock. These benchmarks should be run with satisfactory results before any further benchmark measurements are made.

All of these low-level kernels are available in the current distribution from the netlib repository.

  1. Timer resolution (TICK1). TICK1 measures the interval between ticks of the clock being used in the benchmark measurements, that is to say, the resolution of the clock. A succession of calls to the timer routine is placed in a loop and executed many times, and the differences between successive values returned by the timer are then examined. If the changes in the clock value (or ticks) occur less frequently than the time taken to enter and leave the timer routine, then most of these differences will be zero. When a tick takes place, however, a difference equal to the tick value will be recorded, surrounded by many zero differences. This is the case with clocks of poor resolution, for example most UNIX clocks, which typically tick every 10 ms. Such poor UNIX clocks can still be used for low-level benchmark measurements if the benchmark is repeated, say, 10,000 times, and the timer calls are made outside this repeat loop.

    With some computers, such as the CRAY series, the clock ticks every cycle of the computer, that is to say every 6 ns on the Y-MP. The resolution of the CRAY clock is therefore approximately one million times better than a UNIX clock, and that is quite a difference! If TICK1 is used on such a computer, the difference between successive values of the timer is a very accurate measure of how long it takes to execute the instructions of the timer routine, and therefore is never zero. TICK1 takes the minimum of all such differences, and all it is possible to say is that the clock tick is less than or equal to this value. Typically this minimum will be several hundred clock ticks. With a clock ticking every computer cycle, we can make low-level benchmark measurements without a repeat loop. Such measurements can even be made on a busy timeshared system (where many users are contending for memory access) by taking the minimum time recorded from a sample of, say, 10,000 single-execution measurements. In this case, the minimum can usually be said to apply to a case when there was no memory access delay caused by other users.
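
The TICK1 procedure described above can be sketched as follows. This is only an illustrative Python sketch, not the PARKBENCH source: the function name estimate_tick, the default timer time.perf_counter and the sample count are assumptions made for the example. It records successive timer values, counts the zero differences, and returns the smallest nonzero difference, which equals the tick for a coarse clock and is only an upper bound on the tick for a fine one.

```python
import time


def estimate_tick(timer=time.perf_counter, samples=100_000):
    """Return (smallest nonzero difference, number of zero differences).

    The timer and sample count are illustrative choices, not PARKBENCH's.
    """
    # Call the timer back to back, as in the TICK1 loop.
    readings = [timer() for _ in range(samples)]
    # Differences between successive timer values.
    diffs = [b - a for a, b in zip(readings, readings[1:])]
    nonzero = [d for d in diffs if d > 0]
    zeros = len(diffs) - len(nonzero)
    # Coarse clock: many zeros, and the minimum nonzero difference is the tick.
    # Fine clock: no zeros, and the minimum is only an upper bound on the tick.
    bound = min(nonzero) if nonzero else None
    return bound, zeros


if __name__ == "__main__":
    bound, zeros = estimate_tick()
    print("clock tick <= %s s (%d zero differences)" % (bound, zeros))
```

A large count of zero differences indicates a clock of poor resolution, for which a repeat loop around the benchmark is needed.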

  2. Timer value (TICK2). TICK2 confirms that the absolute values returned by the computer clock are correct, by comparing its measurement of a given time interval with that of an external wall-clock (actually the benchmarker's wristwatch). Parallel benchmark performance can only be measured using the elapsed wall-clock time, because the objective of parallel execution is to reduce this time. Measurements made with a CPU-timer (which only records time when its job is executing in the CPU) are clearly incorrect, because the clock does not record waiting time when the job is out of the CPU. TICK2 will immediately detect the incorrect use of a CPU-time-for-this-job-only clock. An example of a timer that claims to measure elapsed time but is actually a CPU-timer is the value returned by the popular Sun UNIX timer ETIME. TICK2 also checks that the correct multiplier is being used in the computer system software to convert clock ticks to true seconds.
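
The difference between an elapsed-time clock and a CPU-timer can be illustrated with a short Python sketch. It does not reproduce TICK2 itself, which relies on an external wristwatch; the function name compare_clocks and the waiting interval are assumptions chosen for the example. The job is put to sleep, so that it is out of the CPU, and the interval is measured with both kinds of clock.

```python
import time


def compare_clocks(wait_seconds=0.5):
    """Measure one idle interval with a wall clock and with a CPU-timer."""
    wall_start = time.perf_counter()   # elapsed (wall-clock) time
    cpu_start = time.process_time()    # CPU time of this process only
    time.sleep(wait_seconds)           # the job waits, out of the CPU
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = compare_clocks()
    print("wall-clock: %.3f s, CPU-timer: %.3f s" % (wall, cpu))
```

The wall-clock value should be close to the requested interval, whereas the CPU-timer records almost nothing, which is the kind of discrepancy a comparison against an external clock exposes.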

  3. Basic arithmetic operations (RINF1). This benchmark takes a set of common Fortran DO-loops and analyzes their time of execution in terms of the two parameters, RINF and NHALF. RINF is the asymptotic performance rate in Mflop/s which is approached as the loop (or vector) length, n, becomes longer. NHALF (the half-performance length) expresses how rapidly, in terms of increasing vector length, the actual performance, r, approaches RINF. It is defined as the vector length required to achieve a performance of one half of RINF.
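
The two parameters can be related to measured times with a small Python sketch. It assumes the usual linear timing model t = (n + NHALF) / RINF, which is consistent with the definitions above since it gives r = RINF / (1 + NHALF/n) and hence r = RINF/2 at n = NHALF. The function names, the least-squares fit and the synthetic data are assumptions made for illustration, and are not necessarily how RINF1 obtains its values.

```python
def fit_rinf_nhalf(lengths, times):
    """Fit the straight line t = (n + n_half) / r_inf by least squares.

    Returns (r_inf, n_half); r_inf is in operations per unit of `times`.
    """
    m = len(lengths)
    mean_n = sum(lengths) / m
    mean_t = sum(times) / m
    sxx = sum((n - mean_n) ** 2 for n in lengths)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(lengths, times))
    slope = sxy / sxx                    # slope of the line is 1 / r_inf
    intercept = mean_t - slope * mean_n  # intercept is n_half / r_inf
    return 1.0 / slope, intercept / slope


def rate(n, r_inf, n_half):
    """Performance at vector length n under the same model."""
    return r_inf / (1.0 + n_half / n)


if __name__ == "__main__":
    # Synthetic timings generated from r_inf = 100 and n_half = 50.
    lengths = [10, 20, 50, 100, 200, 500, 1000]
    times = [(n + 50) / 100.0 for n in lengths]
    r_inf, n_half = fit_rinf_nhalf(lengths, times)
    print(r_inf, n_half, rate(n_half, r_inf, n_half))
```

With times in microseconds and one floating-point operation per loop element, the fitted r_inf would be in Mflop/s.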

  4. Memory bottleneck benchmarks (POLY1 and POLY2). Even if the vector lengths are long enough to overcome the vector startup overhead, the peak rate of the arithmetic pipelines may not be realised because of the delays associated with obtaining data from the cache or main memory of the computer. The POLY1 and POLY2 benchmarks quantify this dependence of computer performance on memory access bottlenecks.

    The POLY1 benchmark repeats the polynomial evaluation for each order, typically 1000 times, for vector lengths up to 10,000, which would normally fit into the cache of a cache-based processor. Except for the first evaluation, the data will therefore be found in the cache. POLY1 is therefore an in-cache test of the memory bottleneck between the arithmetic registers of the processor and its cache.

    POLY2, on the other hand, flushes the cache prior to each different order and then performs only one polynomial evaluation, for vector lengths from 10,000 up to 100,000, which would normally exceed the cache size. Data will have to be brought from off-chip memory, and POLY2 is an out-of-cache test of the memory bottleneck between off-chip memory and the arithmetic registers.
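
The structure of such a measurement can be sketched in Python. This is a structural illustration only: the use of Horner's rule, the names horner and time_poly, and the flop count of two operations per polynomial order per element are assumptions, not taken from the PARKBENCH source. Interpreter overhead will mask any real cache effect, and the cache flush performed by POLY2 is not reproduced.

```python
import time


def horner(coeffs, xs, ys):
    """Evaluate a polynomial (highest-order coefficient first) at each x."""
    for i, x in enumerate(xs):
        acc = coeffs[0]
        for c in coeffs[1:]:
            acc = acc * x + c   # one multiply and one add per order
        ys[i] = acc


def time_poly(order, length, repeats):
    """Return (seconds per evaluation, flops per evaluation)."""
    coeffs = [1.0] * (order + 1)
    xs = [0.5] * length
    ys = [0.0] * length
    # Timer calls sit outside the repeat loop, as for a coarse clock.
    start = time.perf_counter()
    for _ in range(repeats):
        horner(coeffs, xs, ys)
    elapsed = (time.perf_counter() - start) / repeats
    return elapsed, 2 * order * length


if __name__ == "__main__":
    # Many repeats of a short vector (POLY1 style), then one pass over a
    # long vector (POLY2 style); the sizes are scaled down for Python.
    for order, length, repeats in [(4, 1_000, 100), (4, 100_000, 1)]:
        seconds, flops = time_poly(order, length, repeats)
        print(order, length, flops / seconds / 1e6, "Mflop/s")
```

In a compiled language the same harness, run with the repeat counts and vector lengths quoted above, would contrast the in-cache and out-of-cache rates.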



Last Modified May 14, 1996