Raúl de la Cruz (delacruz@bsc.es)
Computer Application department (CASE)
Barcelona Supercomputing Center (BSC) - Spain

Microbenchmark Source Code
==========================

The following is a brief description of how to compile and run the stencil
codes using our modified Stencil Probe microbenchmark. For further information
about this procedure please consult the README file included in the software
bundle.

   The Stencil Probe suite consists of the following stencil codes:

   * probe (Naive)
   * semi_probe (Naive + Semi-stencil)
   * blocked_probe (Rivera)
   * blocked_semi_probe (Rivera + Semi-stencil)

   * timeskew_probe (Time-skewing)
   * timeskew_semi_probe (Time-skewing + Semi-stencil)
   * oblivious_probe (Cache-oblivious)
   * oblivious_semi_probe (Cache-oblivious + Semi-stencil)

   The suite also includes a testing program (test) to check the correctness
of each algorithm's results for a particular case. To compile the source
code, the GNU make tool is required. Note that, depending on the target
architecture, the compiler flags (CC, CFLAGS and OMPFLAG) should be modified
in the Makefile. Currently, flags for AMD64, x86-64, POWER6, POWER7 and
BlueGene/P architectures are already available. The TIMER variable should also
be set. The available timing methods are: the PAPI library, an
architecture-specific cycle counter and the GETTIMEOFDAY system call. After
setting those variables,
the command

% make OPTS="-DPLOT -DNUM_TRIALS=2 -DFISSION_3LOOPS" \
          SUFFIX_DEF=".3loops.xlc" OPENMP=1

will compile the whole bundle using the specific configuration. The OPTS
parameter contains the compiler flags referring to source code features and
optimizations. For instance, -DPLOT sets pretty-printing for the tabulated
output of performance results; -DNUM_TRIALS specifies the number of executions
(default is 5); the modifier -DFISSION_3LOOPS enables fission optimization
using 3 segments for the inner stencil loop. The SUFFIX_DEF parameter sets the
suffix for the binary name, which may be useful when several
optimization-specific binaries must coexist. Finally, the OPENMP parameter,
shown in this
example, enables the parallel flags for OpenMP with the current compiler. Other
useful options are also supported by the Makefile, such as additional loop
optimizations and some tracing methods to obtain hardware counters.
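
   As an illustration of how a compile-time trial count and a GETTIMEOFDAY
based timer can fit together, the following is a minimal sketch in C. It is not
taken from the suite: the helper names wall_seconds, best_time and count_calls
are hypothetical, and keeping the best of all trials is an assumption about how
timings are reduced. Only the NUM_TRIALS default of 5 and the use of
gettimeofday come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Default stated in the text; override at build time with -DNUM_TRIALS=<n>. */
#ifndef NUM_TRIALS
#define NUM_TRIALS 5
#endif

/* Wall-clock time in seconds, read through gettimeofday(). */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Hypothetical helper: runs kernel(arg) NUM_TRIALS times and returns the
   smallest elapsed time in seconds (assumed reduction, see lead-in). */
static double best_time(void (*kernel)(void *), void *arg)
{
    double best = 0.0;
    int have_best = 0;

    assert(kernel != NULL);
    for (int trial = 0; trial < NUM_TRIALS; trial++) {
        double start = wall_seconds();
        kernel(arg);
        double elapsed = wall_seconds() - start;
        if (!have_best || elapsed < best) {
            best = elapsed;
            have_best = 1;
        }
    }
    return best;
}

/* Dummy kernel for demonstration: counts how often it has been called. */
static void count_calls(void *arg)
{
    int *calls = arg;
    (*calls)++;
}
```

   Calling best_time(count_calls, &calls) with an int counter initialised to
zero leaves the counter equal to NUM_TRIALS; in a real run the kernel would be
one full stencil execution.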

   Once the stencil binaries have been generated, benchmarks can be run on the
target platform using the following command:

% ./probe nx ny nz tx ty tz timesteps length

where nx, ny and nz give the problem size in each dimension, ordered from the
unit-stride dimension to the largest-stride one; and tx, ty and tz are the
tiling parameters for space-blocking algorithms. Finally, timesteps specifies
the number of updates to perform over the whole domain and length sets the l
parameter of the stencil operator.
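
   To make the role of these parameters concrete, the sketch below shows one
time step of a naive sweep and of a tiled sweep over the same grid. It is an
illustration under stated assumptions, not the suite's source: the function
names, the star-shaped stencil with length neighbours per direction on each
axis, the coefficient array coef, and the convention that nx, ny and nz include
a boundary layer of width length are all hypothetical. The tiled version is
generic loop tiling over tx, ty and tz rather than any specific published
blocking scheme.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index of (i, j, k): x is unit-stride, z has the largest stride. */
static size_t idx(int i, int j, int k, int nx, int ny)
{
    return (size_t)i + (size_t)nx * ((size_t)j + (size_t)ny * (size_t)k);
}

/* Update of one point. coef[0] weights the centre and coef[r] the six
   neighbours at distance r (illustrative coefficients, an assumption). */
static double stencil_point(const double *in, const double *coef,
                            int i, int j, int k, int nx, int ny, int length)
{
    double acc = coef[0] * in[idx(i, j, k, nx, ny)];
    for (int r = 1; r <= length; r++) {
        acc += coef[r] * (in[idx(i - r, j, k, nx, ny)] + in[idx(i + r, j, k, nx, ny)]
                        + in[idx(i, j - r, k, nx, ny)] + in[idx(i, j + r, k, nx, ny)]
                        + in[idx(i, j, k - r, nx, ny)] + in[idx(i, j, k + r, nx, ny)]);
    }
    return acc;
}

/* Naive sweep: visits every interior point once, x innermost. */
static void naive_sweep(const double *in, double *out, const double *coef,
                        int nx, int ny, int nz, int length)
{
    for (int k = length; k < nz - length; k++)
        for (int j = length; j < ny - length; j++)
            for (int i = length; i < nx - length; i++)
                out[idx(i, j, k, nx, ny)] =
                    stencil_point(in, coef, i, j, k, nx, ny, length);
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* Tiled sweep: same points, visited in tiles of tx * ty * tz points. */
static void blocked_sweep(const double *in, double *out, const double *coef,
                          int nx, int ny, int nz,
                          int tx, int ty, int tz, int length)
{
    assert(tx > 0 && ty > 0 && tz > 0);
    for (int kk = length; kk < nz - length; kk += tz)
        for (int jj = length; jj < ny - length; jj += ty)
            for (int ii = length; ii < nx - length; ii += tx)
                for (int k = kk; k < min_int(kk + tz, nz - length); k++)
                    for (int j = jj; j < min_int(jj + ty, ny - length); j++)
                        for (int i = ii; i < min_int(ii + tx, nx - length); i++)
                            out[idx(i, j, k, nx, ny)] =
                                stencil_point(in, coef, i, j, k, nx, ny, length);
}
```

   Because both sweeps call the same point update and only change the order in
which points are visited, their outputs can be compared element by element,
which is the kind of check a correctness test for a single case needs.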

   In addition, a set of scripts is included under the script directory of the
software package. They are designed to facilitate benchmarking tasks such as
running a batch of tests to collect timings and hardware counters, and
generating filtered and sorted results of the best serial and parallel times
together with their hardware counter metrics.

