How To Get Performance From Commodity Processors?
Today’s processors can achieve high performance, but this requires extensive machine-specific hand tuning.
Routines have a large design space with many parameters
- Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
- Complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.
- Performance is very unstable
- Small changes to these parameters can have large effects on performance
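
To make one of these parameters concrete, here is a minimal sketch of a blocked matrix multiply in which the blocking size is an explicit tuning knob. The function names (`matmul_naive`, `matmul_blocked`) and the `block` parameter are illustrative assumptions and are not taken from any particular library. Pure Python will not show realistic cache effects; the point is only that the same routine admits many equivalent variants.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block):
    """Same product, computed tile by tile.

    `block` is the (hypothetical) blocking size: one point in the
    design space. Every value gives the same result, but on real
    hardware each value can give very different performance.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):            # tile over rows of A
        for kk in range(0, m, block):        # tile over the shared dimension
            for jj in range(0, p, block):    # tile over columns of B
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, m)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

Changing the order of the three tile loops, or the value of `block`, gives further variants with identical results, which is why choosing among them by hand is laborious.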
ATLAS - Automatically Tuned Linear Algebra Software
- Adapts to differing architectures via code generation + timing
- Related projects that take a similar approach:
- PHiPAC from Berkeley
- FFTW from MIT
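
The "code generation + timing" idea can be sketched in a few lines. The example below is an illustration under stated assumptions and is not ATLAS's actual generator or search: `generate_dot`, `time_kernel`, and `tune` are hypothetical names, the kernel is a simple dot product, and the only tuned parameter is the loop unrolling depth. It generates source code for each candidate, times each one on the current machine, and keeps the fastest.

```python
import time


def generate_dot(unroll):
    """Generate and compile a dot product unrolled `unroll` times."""
    terms = " + ".join(f"x[i + {u}] * y[i + {u}]" for u in range(unroll))
    src = (
        "def dot(x, y):\n"
        "    n = len(x)\n"
        f"    main = n - n % {unroll}\n"
        "    s = 0.0\n"
        f"    for i in range(0, main, {unroll}):\n"
        f"        s += {terms}\n"
        "    for i in range(main, n):\n"   # cleanup loop for leftover elements
        "        s += x[i] * y[i]\n"
        "    return s\n"
    )
    namespace = {}
    exec(src, namespace)  # compile the generated source into a function
    return namespace["dot"]


def time_kernel(kernel, x, y, repeats=5):
    """Return the best wall-clock time over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(x, y)
        best = min(best, time.perf_counter() - start)
    return best


def tune(candidates=(1, 2, 4, 8), n=10_000):
    """Time every candidate unrolling depth and return the fastest."""
    x = [float(i % 7) for i in range(n)]
    y = [float(i % 5) for i in range(n)]
    timings = {u: time_kernel(generate_dot(u), x, y) for u in candidates}
    best = min(timings, key=timings.get)
    return best, timings
```

Calling `tune()` returns the unrolling depth that ran fastest on the current machine together with the measured timings. The winner can differ from machine to machine, which is the reason for searching empirically instead of fixing one value by hand.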