The fundamental measurement in any benchmark is elapsed wall-clock time. Because the computer clocks on each node of a multi-node MPP are not synchronized, all benchmark time measurements must be made with a single clock on one node of the system. The benchmarks TICK1 and TICK2 are designed, respectively, to measure the resolution and to check the absolute value of this clock. These benchmarks should be run with satisfactory results before any further benchmark measurements are made.
All of these low-level kernels are available in the current distribution from the netlib repository.
With some computers, such as the CRAY series, the clock ticks every cycle of the computer, that is to say every 6 ns on the Y-MP. The resolution of the CRAY clock is therefore approximately one million times better than that of a UNIX clock. If TICK1 is used on such a computer, the difference between successive values of the timer is a very accurate measure of how long it takes to execute the instructions of the timer routine, and is therefore never zero. TICK1 takes the minimum of all such differences, and all that can be said is that the clock tick is less than or equal to this value. Typically this minimum will be several hundred clock ticks. With a clock ticking every computer cycle, we can make low-level benchmark measurements without a repeat loop. Such measurements can even be made on a busy timeshared system (where many users are contending for memory access) by taking the minimum time recorded from a sample of, say, 10,000 single-execution measurements. In this case, the minimum can usually be said to apply to a case when there was no memory access delay caused by other users.
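The two minimum-taking procedures described above can be sketched as follows. This is an illustrative Python sketch and not the TICK1 source: the function names estimate_tick and min_single_execution, the default sample count, and the choice of time.perf_counter_ns as the clock are all assumptions made for the example. Zero differences are skipped so that the same routine also gives a usable bound on a coarse clock whose value does not change between successive calls.

```python
import time


def estimate_tick(samples=10000, clock=time.perf_counter_ns):
    """Upper bound on the clock tick: the smallest nonzero difference
    between successive clock readings (None if the clock never advanced)."""
    smallest = None
    previous = clock()
    for _ in range(samples):
        current = clock()
        delta = current - previous
        previous = current
        # Ignore zero differences, which occur when the clock is coarser
        # than the cost of calling it.
        if delta > 0 and (smallest is None or delta < smallest):
            smallest = delta
    return smallest


def min_single_execution(kernel, samples=10000, clock=time.perf_counter_ns):
    """Minimum elapsed time over many single executions of kernel()."""
    best = None
    for _ in range(samples):
        start = clock()
        kernel()
        elapsed = clock() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


if __name__ == "__main__":
    print("tick is at most", estimate_tick(), "ns")
    print("fastest run:", min_single_execution(lambda: sum(range(100))), "ns")
```

Both functions accept the clock as a parameter, so the same logic can be exercised with any timer that returns a monotonically non-decreasing number.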
The POLY1 benchmark repeats the polynomial evaluation for each order, typically 1000 times, for vector lengths up to 10,000, which would normally fit into the cache of a cache-based processor. Except for the first evaluation, the data will therefore be found in the cache. POLY1 is therefore an in-cache test of the memory bottleneck between the arithmetic registers of the processor and its cache.
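The structure of such a repeated in-cache measurement can be illustrated with a short Python sketch. It is not the POLY1 code: the use of Horner's rule, the names horner and time_in_cache, the coefficient and data values, and the flop count of two operations per order per element are assumptions made for the example. In interpreted Python the interpreter overhead dominates the timing, so the sketch shows only the shape of the loop (one warm-up pass, then many timed repeats over the same data), not a realistic cache measurement.

```python
import time


def horner(coeffs, xs):
    """Evaluate a polynomial (coefficients highest order first) at every x."""
    ys = [coeffs[0]] * len(xs)
    for c in coeffs[1:]:
        # One multiply and one add per element for each remaining coefficient.
        ys = [y * x + c for y, x in zip(ys, xs)]
    return ys


def time_in_cache(order, length, repeats=1000):
    """Average time per evaluation when the same vectors are reused."""
    coeffs = [1.0] * (order + 1)
    xs = [0.5] * length
    horner(coeffs, xs)  # first evaluation brings the data into the cache
    start = time.perf_counter()
    for _ in range(repeats):
        horner(coeffs, xs)
    elapsed = (time.perf_counter() - start) / repeats
    flops = 2 * order * length  # assumed operation count for Horner's rule
    return elapsed, flops


if __name__ == "__main__":
    seconds, flops = time_in_cache(order=4, length=10000, repeats=100)
    print(f"{flops} flop in {seconds:.6f} s per evaluation")
```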
POLY2, on the other hand, flushes the cache prior to each different order and then performs only one polynomial evaluation, for vector lengths from 10,000 up to 100,000, which would normally exceed the cache size. Data will have to be brought from off-chip memory, and POLY2 is therefore an out-of-cache test of the memory bottleneck between off-chip memory and the arithmetic registers.
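The flush-then-measure-once pattern can be sketched in Python as follows. This is an illustration under stated assumptions, not the POLY2 code: the names flush_cache and time_after_flush are hypothetical, the 32 MiB scratch buffer is an arbitrary size assumed to exceed the cache, and sweeping a large buffer is only one common way of making it likely that earlier data has been evicted.

```python
import time


def flush_cache(scratch):
    """Read every byte of a large scratch buffer so that previously cached
    data is likely to be evicted. Returns the byte sum so the sweep is used."""
    return sum(scratch)


def time_after_flush(kernel, scratch, clock=time.perf_counter):
    """Flush, then time exactly one execution of kernel()."""
    flush_cache(scratch)
    start = clock()
    result = kernel()
    elapsed = clock() - start
    return elapsed, result


if __name__ == "__main__":
    scratch = bytearray(32 * 1024 * 1024)  # assumed to be larger than the cache
    xs = [0.5] * 100000
    seconds, _ = time_after_flush(lambda: [x * x + 1.0 for x in xs], scratch)
    print(f"single out-of-cache evaluation: {seconds:.6f} s")
```

Because only one execution is timed, the result is meaningful only if the clock resolution is much finer than the measured interval.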