Tools for Numerical Performance and Analysis
Automatically Tuned Linear Algebra Software - ATLAS
Clint Whaley
Antoine Petitet
Jack Dongarra
University of Tennessee
and
Oak Ridge National Laboratory
http://www.cs.utk.edu/~dongarra/

Chip Technology (Data from Top500 Report)
A list of the 500 most powerful computers.
Updated every 6 months
Microprocessor-based computers are the technology of choice
Result: HPC driven by commodity parts

Where Does the Performance Go? or
Why Should I Care About the Memory Hierarchy?

Memory Hierarchy
By taking advantage of the principle of locality:
Present the user with as much memory as is available in the cheapest technology.
Provide access at the speed offered by the fastest technology.

Performance
Software
Computing hardware doubles its speed every eighteen months.
Yet it often takes more than a year for software to be optimized or "tuned" for performance on a newly released CPU.
The job of optimizing software to exploit the special features of a given CPU has historically been an exercise in hand customization.

How To Get Performance From Commodity Processors?
Today’s processors can achieve high performance, but this requires extensive machine-specific hand tuning.
Hardware and software have a large design space with many parameters; performance is sensitive to:
Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules (one such transformation is sketched below).
Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
Performance instability
About a year ago there was no tuned BLAS for the Pentium under Linux.
Need for quick/dynamic deployment of optimized routines.
ATLAS - Automatically Tuned Linear Algebra Software
PHiPAC from Berkeley
FFTW from MIT (http://www.fftw.org)
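
As a flavor of what this hand customization looks like, the sketch below (illustrative only, not ATLAS code) shows one transformation from the list above: 4-way loop unrolling with multiple accumulators, which also breaks the false dependence of every iteration on a single running sum.

/* plain dot product: every iteration depends on the previous s */
double dot_plain(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* 4-way unrolled: four independent accumulators expose instruction-
   level parallelism; the best unrolling depth is machine-dependent */
double dot_unrolled4(int n, const double *x, const double *y)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i]   * y[i];
        s1 += x[i+1] * y[i+1];
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
    }
    for (; i < n; i++)          /* cleanup for leftover elements */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}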

ATLAS
An adaptive software architecture
High performance
Portability
Elegance
ATLAS is faster than all other portable BLAS implementations and is comparable with the machine-specific libraries provided by vendors.

ATLAS (DGEMM n = 500)

Why Is ATLAS Fast?
ATLAS does not implement a single fixed algorithm.
The code is generated by a program that tests, probes, and runs hundreds of experiments on the target software/hardware architecture.
During installation, the program generator determines an efficient implementation: it measures the speed of different code strategies and chooses the best using an adaptive procedure.
This leads to a new model of high-performance programming in which performance-critical code is machine generated using parameter optimization.

Code Generation Strategy
Code is iteratively generated and timed until the optimal case is found (a toy version of this search is sketched after this list).  We try:
Differing blocking factors (NB)
Breaking false dependencies
M, N and K loop unrolling
On-chip multiply optimizes for:
TLB access
L1 cache reuse
FP unit usage
Memory fetch
Register reuse
Loop overhead minimization
The search takes 30 minutes to an hour to run.
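
A toy version of this generate-and-time search, shrunk to a single parameter: try several blocking factors NB for a simple blocked matrix multiply, keep the fastest. All names and sizes here are illustrative; the real ATLAS search covers a far larger parameter space.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 512    /* problem size used for the timing experiments */

static double a[N*N], b[N*N], c[N*N];   /* column-major matrices */

/* C += A*B, blocked on the j and k loops with block size nb */
static void mm_blocked(int nb)
{
    for (int jj = 0; jj < N; jj += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int j = jj; j < jj + nb && j < N; j++)
                for (int k = kk; k < kk + nb && k < N; k++) {
                    double bkj = b[k + j*N];
                    for (int i = 0; i < N; i++)
                        c[i + j*N] += a[i + k*N] * bkj;
                }
}

int main(void)
{
    const int candidates[] = { 16, 24, 32, 40, 48, 64, 80 };
    int best_nb = 0;
    double best_t = 1e30;

    for (size_t t = 0; t < sizeof candidates / sizeof *candidates; t++) {
        memset(c, 0, sizeof c);          /* fresh C for each trial */
        clock_t t0 = clock();
        mm_blocked(candidates[t]);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("NB = %2d: %.3f s\n", candidates[t], secs);
        if (secs < best_t) { best_t = secs; best_nb = candidates[t]; }
    }
    printf("selected NB = %d\n", best_nb);  /* keep the winner */
    return 0;
}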

Ax = b, n = 500 (Right-Looking LU)

Recursive Approach for Other Level 3 BLAS
Recur down to L1 cache block size
Need kernel at bottom of recursion
Use gemm-based kernel for portability
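
A minimal sketch of that recursion for one such routine, SYRK (C := C - A*A^T on the lower triangle), assuming column-major storage; NB_L1 stands in for the L1 blocking factor ATLAS discovers at install time, most flops funnel into GEMM, and a plain-loop kernel handles the small diagonal block at the bottom. This illustrates the idea only; it is not the ATLAS source.

#include <cblas.h>

enum { NB_L1 = 40 };   /* assumed L1-cache blocking factor */

/* small diagonal block at the bottom of the recursion */
static void syrk_block(int n, int k, const double *a, int lda,
                       double *c, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = j; i < n; i++) {        /* lower triangle only */
            double s = 0.0;
            for (int p = 0; p < k; p++)
                s += a[i + p*lda] * a[j + p*lda];
            c[i + j*ldc] -= s;
        }
}

/* C (n x n, lower triangle) -= A (n x k) * A^T */
void rsyrk(int n, int k, const double *a, int lda, double *c, int ldc)
{
    if (n <= NB_L1) {                        /* kernel at the bottom */
        syrk_block(n, k, a, lda, c, ldc);
        return;
    }
    int nl = n / 2;
    rsyrk(nl, k, a, lda, c, ldc);            /* C11 -= A1*A1^T */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                n - nl, nl, k, -1.0, a + nl, lda, a, lda,
                1.0, c + nl, ldc);           /* C21 -= A2*A1^T via GEMM */
    rsyrk(n - nl, k, a + nl, lda,
          c + nl + nl*ldc, ldc);             /* C22 -= A2*A2^T */
}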

500x500 Recursive Level 3 BLAS on UltraSPARC 2 200

500x500 Level 2 BLAS DGEMV

Multi-Threaded DGEMM
Intel PIII 550 MHz

ATLAS
Needs a reasonable C compiler and is focused on superscalar (RISC) architectures.
Available today: www.netlib.org/atlas/
Keep a repository of kernels for specific machines.
Extend work to allow sparse matrix operations
Extend work to include arbitrary code segments

Gaussian Elimination

Gaussian Elimination via a Recursive Algorithm

Matlab Code (Without Pivoting)
function a=rlu(a)
[m,n]=size(a);
mn = min(m,n);
if mn > 1
    nl = bitshift(mn,-1);   % split point: nl = floor(mn/2)
% lu on left part
    a(1:m,1:nl)= rlu(a(1:m,1:nl));
% triangular solve with the unit lower triangular factor
    L11 = tril(a(1:nl,1:nl),-1) + eye(nl);   % unit lower triangular factor
    a(1:nl,nl+1:n) = L11\a(1:nl,nl+1:n);
% matrix multiply (Schur complement update)
    a(nl+1:m,nl+1:n) = a(nl+1:m,nl+1:n) - a(nl+1:m,1:nl)*a(1:nl,nl+1:n);
% lu on right part
    a(nl+1:m,nl+1:n) = rlu(a(nl+1:m,nl+1:n));
else
% m x 1 case: scale the column below the pivot
   a(2:m,:) = a(2:m,:)/a(1,1);
end
return

Recursive Factorizations
Larger matrix-matrix operations performed
Automatic Blocking
Drive the recursion down to 1
No blocksize needed
Applies to Cholesky and QR factorization
QR needs some care because of triangular matrix in block Householder
Unclear if the ideas apply to two-sided algorithms (reduction for the eigenvalue problem)
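
To make this concrete, here is a minimal sketch of recursive Cholesky (A = L*L^T, lower triangle, column-major), assuming the CBLAS interface that ATLAS supplies; the recursion goes down to a 1x1 block with no blocksize parameter, and the updates are the larger matrix-matrix operations. The released ATLAS routine will differ in detail.

#include <math.h>
#include <cblas.h>

/* factor the leading n x n of a (column-major, leading dimension lda) */
void rchol(int n, double *a, int lda)
{
    if (n == 1) {                       /* recursion bottoms out at 1x1 */
        a[0] = sqrt(a[0]);              /* (no check for non-SPD input) */
        return;
    }
    int nl = n / 2;
    double *a11 = a;                    /* top-left block      */
    double *a21 = a + nl;               /* block below it      */
    double *a22 = a + nl + nl*lda;      /* trailing submatrix  */

    rchol(nl, a11, lda);                               /* L11 = chol(A11) */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, /* L21 = A21*L11^-T */
                CblasTrans, CblasNonUnit,
                n - nl, nl, 1.0, a11, lda, a21, lda);
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                n - nl, nl, -1.0, a21, lda,
                1.0, a22, lda);                        /* A22 -= L21*L21^T */
    rchol(n - nl, a22, lda);                           /* L22 = chol(A22) */
}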

Next Release of ATLAS to Include Recursive LU and LLT
Will be part of the ATLAS distribution.
Will be created as a by-product of the ATLAS make.
Same calling sequence as LAPACK.
Will be callable from Fortran, but written in C.
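
In practice that combination looks like the hypothetical sketch below: a C routine taking LAPACK DGETRF's argument list, with everything passed by reference so Fortran can call it directly. The trailing underscore is the usual Fortran name-mangling convention and an assumption here, not a statement about the actual ATLAS symbols.

/* Fortran-callable C entry point with LAPACK's DGETRF calling sequence:
   CALL DGETRF( M, N, A, LDA, IPIV, INFO ) from Fortran lands here. */
void dgetrf_(const int *m, const int *n, double *a, const int *lda,
             int *ipiv, int *info)
{
    /* ... recursive LU factorization with partial pivoting ... */
}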

Recursive LU vs LAPACK/ATLAS

Recursive LLT vs LAPACK/ATLAS (Lower)

Future Plans for ATLAS
By December 24th, a software release including:
Level 1 and 2 BLAS implementations
Multi-threading
Recursive LU and LLT (QR to come)
See: www.netlib.org/atlas/
Longer Term:
Runtime adaptation
Sparsity analysis
Iterative code improvement
Adaptive libraries
Specialization for user applications
Optimize message passing system
Extend these ideas to Java directly
Java LAPACK

PAPI - Definition
Performance Application Programming Interface
The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
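
For example, a minimal sketch of counting two events around a section of code with PAPI's low-level event-set calls (exact signatures and available events vary across PAPI versions and platforms):

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);                           /* library/header mismatch */
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_CYC);   /* total cycles */
    PAPI_add_event(evset, PAPI_FP_INS);    /* floating-point instructions */

    PAPI_start(evset);
    /* ... code being measured ... */
    PAPI_stop(evset, counts);

    printf("cycles: %lld  fp instructions: %lld\n", counts[0], counts[1]);
    return 0;
}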

Performance Data (cont.)
I/D cache misses for different levels
Branch mispredictions
TLB misses
Pipeline stalls due to memory subsystem
Pipeline stalls due to resource conflicts
Cache invalidations
TLB invalidations
Load/store count
Instruction count
Cycle count
Floating point instruction count
Integer instruction count
Branch taken / not taken count

PAPI Platforms
Alpha
AIX/Power 3/604e/604
Linux/x86
IRIX/R10k,R12k
Unicos/21164
Soon
Solaris/x86,Ultra
Tru64/21x64

Show PAPI Demos

Contributors to These Ideas
Antoine Petitet, UTK - ATLAS
Clint Whaley, UTK - ATLAS
Phil Mucci, UTK - PAPI
Nathan Garner, UTK - PAPI
For additional information see…
icl.cs.utk.edu/
www.netlib.org/atlas/
www.netlib.org/utk/people/JackDongarra/
ATLAS received a 1999 R&D 100 Award