Raúl de la Cruz (delacruz@bsc.es)
Computer Application department (CASE)
Barcelona Supercomputing Center (BSC) - Spain

README stencilprobe
===================

The Stencil Probe is a compact and self-contained serial microbenchmark.  This
microbenchmark was developed to explore the behavior of 3D stencil-based
computations without having to modify any application code.

The Stencil Probe benchmark has been extended to benchmark different kinds of
algorithms for Finite Difference (FD) methods. The currently available
algorithms are:

  * Naive
  * Rivera
  * Timeskewing
  * Cache oblivious
  * Semi-stencil versions of all of the above algorithms
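
As a rough illustration of the kind of kernel these algorithms operate on, the
following C sketch shows a naive sweep of a 7-point 3D stencil and a spatially
blocked variant in the spirit of Rivera's tiling (x and y are tiled, z is
traversed fully inside each tile). It is not taken from the stencilprobe
sources: the function names, the coefficients c0 and c1, the 7-point shape and
the memory layout are assumptions made for this example.

```c
#include <stddef.h>

/* Linear index into an nx*ny*nz grid stored with x as the fastest dimension
 * (assumed layout for this sketch). */
static size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* Update of one interior point: centre weighted by c0, six neighbours by c1
 * (hypothetical coefficients). */
static double point7(const double *in, size_t x, size_t y, size_t z,
                     size_t nx, size_t ny, double c0, double c1)
{
    return c0 * in[idx(x, y, z, nx, ny)]
         + c1 * (in[idx(x - 1, y, z, nx, ny)] + in[idx(x + 1, y, z, nx, ny)]
               + in[idx(x, y - 1, z, nx, ny)] + in[idx(x, y + 1, z, nx, ny)]
               + in[idx(x, y, z - 1, nx, ny)] + in[idx(x, y, z + 1, nx, ny)]);
}

/* Naive sweep: visit every interior point in plain z, y, x order. */
void sweep_naive(const double *in, double *out,
                 size_t nx, size_t ny, size_t nz, double c0, double c1)
{
    for (size_t z = 1; z + 1 < nz; z++)
        for (size_t y = 1; y + 1 < ny; y++)
            for (size_t x = 1; x + 1 < nx; x++)
                out[idx(x, y, z, nx, ny)] = point7(in, x, y, z, nx, ny, c0, c1);
}

/* Blocked sweep: the x and y loops are tiled with tile sizes tx and ty
 * (both must be > 0); the z loop runs over the whole grid inside each tile. */
void sweep_blocked(const double *in, double *out,
                   size_t nx, size_t ny, size_t nz, double c0, double c1,
                   size_t tx, size_t ty)
{
    for (size_t yy = 1; yy + 1 < ny; yy += ty)
        for (size_t xx = 1; xx + 1 < nx; xx += tx) {
            size_t ymax = (yy + ty < ny - 1) ? yy + ty : ny - 1;
            size_t xmax = (xx + tx < nx - 1) ? xx + tx : nx - 1;
            for (size_t z = 1; z + 1 < nz; z++)
                for (size_t y = yy; y < ymax; y++)
                    for (size_t x = xx; x < xmax; x++)
                        out[idx(x, y, z, nx, ny)] =
                            point7(in, x, y, z, nx, ny, c0, c1);
        }
}
```

Both sweeps write the same values to the interior points and leave the boundary
untouched; only the traversal order differs, which is what the tiling
parameters of the benchmark are meant to explore.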

How to compile stencilprobe and obtain performance times:
---------------------------------------------------------

 1. Edit the Makefile and select the best flags for your compiler and architecture.
    Currently, there are flags for AMD64, x86_64, POWER6, POWER7 and Blue Gene/P.
    For timing, set the TIMER variable to the most suitable option (available
    options are: PAPI, a specific cycle counter or the GETTIMEOFDAY system call).

 2. Compile each specific version that you require:

    The following command (without specifying a binary target),

    % make OPTS="-DPLOT -DNUM_TRIALS=2 -DFISSION_3LOOPS" SUFFIX_DEF=".3loops.xlc"

    will generate the binaries for each algorithm, with the following options set:

     +OPTS: Set specific options for the compilation step.
        -DPLOT: Pretty-print the output for batch execution of the benchmark.
        -DNUM_TRIALS: Number of executions to run for each benchmark (default is 5).
        -DFISSION_3LOOPS: Split the internal stencil loop into 3 loops (loop fission).
                          The alternatives are (the default is -DFUSION):
                             -DFUSION: perform the stencil computation in one loop.
                             -DFISSION_2LOOP: perform the stencil computation in two loops.

     +SUFFIX_DEF: Add a suffix to the binary names. Useful if you want to generate
        several binaries with different optimizations. By default, no suffix is added.

 3. Run the tests on your architecture:
    An example batch script is available in the repository for this purpose. This
    script runs the binaries of the different algorithms while varying parameters
    such as size, tiling parameters, stencil length and timesteps. The script can be
    found under the scripts directory. Adapt it to your needs.

    % ./batch.sh

 4. Generate your sorted data for further plotting:
    Your execution results will be stored in the runs/ directory. Create a
    subdirectory in runs/ and move all generated output there (e.g. runs/jugene).
    Add an entry for your architecture to the dirs, plat, pname and size arrays in
    the scripts/generate-plots.pl script, then execute that script. It will generate
    timing data tables for all stencilprobe algorithms, sorted by platform and
    tiling parameters.

    % cd runs
    % mkdir jugene
    % mv *jugene* jugene
    % vi ../scripts/generate-plots.pl (add one entry for your platform directory specifying path and max size)
    % ../scripts/generate-plots.pl
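
To make the -DFUSION and -DFISSION_3LOOPS options from step 2 more concrete,
here is a minimal C sketch of loop fusion versus loop fission for one row of a
7-point stencil. It assumes that "fission" means classic loop fission, with one
loop per axis; the actual split used in the stencilprobe kernels may differ.
The function names, the row pointers (c, ym, yp, zm, zp) and the coefficients
c0 and c1 are hypothetical.

```c
#include <stddef.h>

/* Fused version: a single loop reads all five input rows in every iteration.
 * c is the centre row; ym/yp and zm/zp are the neighbouring rows in y and z. */
void row_fused(double *out, const double *c,
               const double *ym, const double *yp,
               const double *zm, const double *zp,
               size_t nx, double c0, double c1)
{
    for (size_t x = 1; x + 1 < nx; x++) {
        double t = c0 * c[x] + c1 * (c[x - 1] + c[x + 1]);
        t += c1 * (ym[x] + yp[x]);
        t += c1 * (zm[x] + zp[x]);
        out[x] = t;
    }
}

/* Fissioned version: the same work split into three loops, one per axis.
 * Each loop reads fewer input streams, at the cost of revisiting out[]. */
void row_fission3(double *out, const double *c,
                  const double *ym, const double *yp,
                  const double *zm, const double *zp,
                  size_t nx, double c0, double c1)
{
    for (size_t x = 1; x + 1 < nx; x++)      /* centre and x neighbours */
        out[x] = c0 * c[x] + c1 * (c[x - 1] + c[x + 1]);
    for (size_t x = 1; x + 1 < nx; x++)      /* y neighbours */
        out[x] += c1 * (ym[x] + yp[x]);
    for (size_t x = 1; x + 1 < nx; x++)      /* z neighbours */
        out[x] += c1 * (zm[x] + zp[x]);
}
```

Both functions accumulate the contributions in the same order, so they are
intended to produce the same values; which one runs faster depends on the
architecture, which is why it is worth building and comparing several variants
with different SUFFIX_DEF values.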


How to obtain HWC metrics for stencilprobe benchmarks:
------------------------------------------------------

 1. Compile each specific hwc version that you require:

    The following command,

    % make hwc OPTS="-DPLOT -DPAPITRACE -DFISSION_3LOOPS" SUFFIX_DEF=".3loops.xlc"

    will generate the hardware counter binaries for each algorithm, with the following
    new options set:

     +OPTS: Set specific options for the compilation step.
        -DPAPITRACE: Use the PAPI library directly to fetch hardware counters. You must
          enable PAPI support in the Makefile. In addition, you need to set the
          PAPI_COUNTERS environment variable to a comma-separated list of the PAPI
          counters that you want to obtain.
                          The alternatives are (the default is the PAPIEX tool):
                            -DSEQTRACE: use the EXTRAE package (useful on BG/P)
                            -DPAPIEX: use the PAPIEX command tool to obtain hw counters

     +SUFFIX_HWC: Add a suffix to the hwc binary names. Useful if you want to generate
        several binaries with different optimizations. By default, the ".hwc" suffix
        is added.

 2. Run the hwc tests on your architecture:
    Copy the $platform.params file generated by the scripts/generate-plots.pl script
    into the hwc/ directory. This file contains the best configuration for each
    algorithm, size, stencil length and timestep.

    Configure the run-best-hwc.pl script by setting the parameters file ($params), the
    output file ($outfile) and the hardware counters to obtain (@sets). Check beforehand
    that each set only contains hwc counters that are compatible at the PAPI layer.
    This script was designed to run with the -DPAPITRACE flag, but it can easily be
    adapted to other hwc tools.

    % ../scripts/run-best-hwc.pl

 3. Generate your sorted hwc data for further plotting:
    After running run-best-hwc.pl, a 'results.$platform.hwc' file will be generated.
    This file contains the hardware counters for each best execution. Move to the
    scripts/ directory and adapt generate-metrics.pl to your needs. This script will
    generate hwc tables for further plotting, taking into account stencil lengths,
    algorithms and different kinds of hwc metrics. You should add your platform
    parameters to the @plat, @pname and @size variables in the generate-metrics.pl
    script.

    % cd scripts
    % vi generate-metrics.pl (add one entry for your platform hwc)
    % cd ../hwc
    % ../scripts/generate-metrics.pl

    After running this Perl script, several files (ready for gnuplot) will be generated.
    These files (results.$platname.$timesteps.$hwcmetrics) will contain information
    about the different hwc counters for your platform across the algorithms, the
    stencil lengths and the timesteps.
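
For reference, the comma-separated PAPI_COUNTERS format used with -DPAPITRACE
(step 1) can be read along the following lines. This is only a sketch of one
possible way to split such a list in C; it is not the parsing code used by
stencilprobe, and the names split_counters, counters_from_env, MAX_COUNTERS and
MAX_NAME are hypothetical. A real program would then hand each name to the PAPI
library, which is left out here.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_COUNTERS 16   /* hypothetical upper bound on counters per run */
#define MAX_NAME     64   /* hypothetical upper bound on a counter name */

/* Split a comma-separated list into individual counter names.
 * Empty entries and names that do not fit in MAX_NAME are skipped.
 * Returns the number of names stored (at most max). */
int split_counters(const char *list, char names[][MAX_NAME], int max)
{
    int n = 0;
    while (list != NULL && *list != '\0' && n < max) {
        size_t len = strcspn(list, ",");   /* length up to next comma or end */
        if (len > 0 && len < MAX_NAME) {
            memcpy(names[n], list, len);
            names[n][len] = '\0';
            n++;
        }
        list += len;
        if (*list == ',')
            list++;                        /* step over the separator */
    }
    return n;
}

/* Read the list from the PAPI_COUNTERS environment variable.
 * Returns 0 if the variable is unset or empty. */
int counters_from_env(char names[][MAX_NAME], int max)
{
    return split_counters(getenv("PAPI_COUNTERS"), names, max);
}
```

With PAPI_COUNTERS set to PAPI_TOT_CYC,PAPI_L2_DCM, for example,
counters_from_env would return 2 and store the two names in order. Whether a
given pair of counters can be measured together still depends on the platform.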

