Tools for Numerical Performance and Analysis
Automatically Tuned Linear Algebra Software - ATLAS

Clint Whaley
Antoine Petitet
Jack Dongarra

University of Tennessee
and
Oak Ridge National Laboratory

http://www.cs.utk.edu/~dongarra/
|
Chip Technology (Data from the Top500 Report)

- List of the 500 most powerful computers.
- Updated every 6 months.
- Microprocessor-based computers are the technology of choice.
- Result: HPC is driven by commodity parts.
|
Where Does the Performance Go? or
Why Should I Care About the Memory Hierarchy?

Memory Hierarchy

By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
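To make locality concrete, the two loops below touch the same data in different orders. This is an illustrative Python sketch (the function names are mine); in interpreted Python the interpreter overhead masks the cache effect, but in compiled code the in-memory-order traversal is typically several times faster because consecutive accesses reuse the same cache line.

```python
import numpy as np

def sum_row_major(a):
    """Traverse in memory order (row-major): consecutive accesses fall in
    the same cache line, so spatial locality keeps the fast levels of the
    hierarchy supplied with data."""
    s = 0.0
    rows, cols = a.shape
    for i in range(rows):
        for j in range(cols):
            s += a[i, j]
    return s

def sum_col_major(a):
    """Traverse against memory order: each access jumps a full row ahead,
    so a cache line is often evicted before its neighbors are used."""
    s = 0.0
    rows, cols = a.shape
    for j in range(cols):
        for i in range(rows):
            s += a[i, j]
    return s
```

Both functions compute the same sum; only the memory access pattern differs, which is exactly the kind of difference the memory hierarchy rewards or punishes.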
Performance Software

- Computing hardware doubles its speed every eighteen months.
- Yet it often takes more than a year for software to be optimized or "tuned" for performance on a newly released CPU.
- The job of optimizing software to exploit the special features of a given CPU has historically been an exercise in hand customization.
How To Get Performance From Commodity Processors?

- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Hardware and software have a large design space with many parameters; performance is sensitive to:
  - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
- Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
- Performance instability:
  - About a year ago there were no tuned BLAS for the Pentium under Linux.
- Need for quick/dynamic deployment of optimized routines:
  - ATLAS - Automatically Tuned Linear Algebra Software
  - PHiPAC from Berkeley
  - FFTW from MIT (http://www.fftw.org)
ATLAS

- An adaptive software architecture:
  - High performance
  - Portability
  - Elegance
- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
ATLAS (DGEMM n = 500)

[Performance chart: DGEMM at n = 500 across platforms, comparing ATLAS with vendor and other portable BLAS libraries]
Why Is ATLAS Fast?

- ATLAS does not implement a single fixed algorithm.
- The code is generated by a program that tests, probes, and runs hundreds of experiments on the target software/hardware architecture.
- During installation the program generator determines an efficient implementation:
  - It measures the speed of different code strategies and chooses the best using an adaptive procedure.
- This leads to a new model of high-performance programming in which performance-critical code is machine generated using parameter optimization.
Code Generation Strategy

- Code is iteratively generated and timed until the best case is found. We try:
  - Differing NBs (blocking sizes)
  - Breaking false dependencies
  - M, N, and K loop unrolling
- The on-chip multiply optimizes for:
  - TLB access
  - L1 cache reuse
  - FP unit usage
  - Memory fetch
  - Register reuse
  - Loop overhead minimization
- Takes 30 minutes to an hour to run.
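The generate-time-select loop described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not ATLAS's actual generator; `matmul_blocked` and `tune_nb` are hypothetical names, and only the blocking size NB is searched here, whereas ATLAS also varies unrollings, schedules, and so on.

```python
import time
import numpy as np

def matmul_blocked(A, B, nb):
    """Blocked matrix multiply; nb plays the role of ATLAS's NB parameter."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, nb):
        for k in range(0, n, nb):
            for j in range(0, n, nb):
                # small "on-chip" multiply on nb x nb tiles
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C

def tune_nb(n=128, candidates=(16, 32, 64)):
    """Empirical search: time each candidate blocking size, keep the fastest."""
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    timings = {}
    for nb in candidates:
        t0 = time.perf_counter()
        matmul_blocked(A, B, nb)
        timings[nb] = time.perf_counter() - t0
    return min(timings, key=timings.get)
```

Run once at install time, a search like this adapts the library to whatever cache sizes and pipelines the machine actually has, with no hand tuning.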
Ax = b, n = 500 (LU Right-Looking)
Recursive Approach for Other Level 3 BLAS

- Recur down to the L1 cache block size.
- Need a kernel at the bottom of the recursion.
- Use a GEMM-based kernel for portability.
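As a sketch of this GEMM-based recursion, here is a recursive lower-triangular solve (TRSM is one of the Level 3 BLAS) in Python. The name `rtrsm_lower` and the cutoff `nb` are my own, and `np.linalg.solve` stands in for the small kernel at the bottom; the point is that all the off-diagonal work becomes one matrix multiply.

```python
import numpy as np

def rtrsm_lower(L, B, nb=64):
    """Solve L X = B with L lower triangular, by recursion.

    Split L = [[L11, 0], [L21, L22]]: solve the top half, fold L21 into the
    right-hand side with a GEMM, then solve the bottom half. Recur until the
    block fits the (assumed) L1 size nb, then call a small kernel."""
    n = L.shape[0]
    if n <= nb:
        return np.linalg.solve(L, B)          # kernel at the bottom
    h = n // 2
    X1 = rtrsm_lower(L[:h, :h], B[:h], nb)
    # B2 <- B2 - L21 @ X1 : this update is the GEMM call
    B2 = B[h:] - L[h:, :h] @ X1
    X2 = rtrsm_lower(L[h:, h:], B2, nb)
    return np.vstack([X1, X2])
```

Because almost all flops land in the GEMM update, only the multiply kernel needs machine-specific tuning; the rest of the Level 3 BLAS inherits its speed.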
500x500 Recursive Level 3 BLAS on UltraSparc 2/200

500x500 Level 2 BLAS DGEMV

Multi-Threaded DGEMM (Intel PIII 550 MHz)
ATLAS

- Needs a reasonable C compiler and is focused on superscalar (RISC) architectures.
- Available today: www.netlib.org/atlas/
- Keep a repository of kernels for specific machines.
- Extend work to allow sparse matrix operations.
- Extend work to include arbitrary code segments.
Gaussian Elimination

Gaussian Elimination via a Recursive Algorithm

Matlab Code (Without Pivoting)
function a = rlu(a)
  [m,n] = size(a);
  mn = min(m,n);
  if mn > 1
    nl = bitshift(mn,-1);
    % lu on left part
    a(1:m,1:nl) = rlu(a(1:m,1:nl));
    % triangular solve
    a(1:nl,nl+1:n) = (tril(a(1:nl,1:nl)) - diag(diag(a(1:nl,1:nl))) + eye(nl)) \ a(1:nl,nl+1:n);
    % matrix multiply
    a(nl+1:m,nl+1:n) = a(nl+1:m,nl+1:n) - a(nl+1:m,1:nl)*a(1:nl,nl+1:n);
    % lu on right part
    a(nl+1:m,nl+1:n) = rlu(a(nl+1:m,nl+1:n));
  else
    % mx1 case: scale below the pivot
    a(2:m,:) = a(2:m,:)/a(1,1);
  end
return
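For readers more comfortable with NumPy, the Matlab routine above translates nearly line for line. This port is my own illustration, not part of ATLAS; it works in place through array views and, like the original, assumes no pivoting is needed.

```python
import numpy as np

def rlu(a):
    """Recursive LU factorization without pivoting, in place.

    On return, a holds the unit-lower factor L strictly below the diagonal
    and the upper factor U on and above it."""
    m, n = a.shape
    mn = min(m, n)
    if mn > 1:
        nl = mn >> 1                      # split point, as in bitshift(mn,-1)
        rlu(a[:, :nl])                    # lu on left part (view: updates in place)
        L11 = np.tril(a[:nl, :nl], -1) + np.eye(nl)
        a[:nl, nl:] = np.linalg.solve(L11, a[:nl, nl:])      # triangular solve
        a[nl:, nl:] -= a[nl:, :nl] @ a[:nl, nl:]             # matrix multiply (GEMM)
        rlu(a[nl:, nl:])                  # lu on right part
    else:
        a[1:, :] /= a[0, 0]               # m x 1 base case
    return a
```

Without pivoting the factorization only exists when every leading minor is nonsingular, e.g. for diagonally dominant matrices; that caveat applies to the Matlab code above as well.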
Recursive Factorizations

- Larger matrix-matrix operations are performed.
- Automatic blocking:
  - Drive the recursion down to 1.
  - No block size needed.
- Applies to Cholesky and QR factorization:
  - QR needs some care because of the triangular matrix in the block Householder representation.
- Unclear if these ideas apply to two-sided algorithms (reductions for the eigenvalue problem).
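The same pattern carries over to Cholesky. A minimal sketch (my own illustration, with `np.linalg.solve` standing in for a triangular-solve kernel): the recursion goes all the way down to 1x1, so no block size appears anywhere, yet the bulk of the work still lands in one matrix multiply.

```python
import numpy as np

def rchol(a):
    """Recursive Cholesky factorization: A = L L^T, L lower triangular.

    Split A = [[A11, A21^T], [A21, A22]]: factor A11, solve for L21,
    update A22 with a GEMM, then recurse on the trailing block."""
    n = a.shape[0]
    if n == 1:
        return np.sqrt(a)                             # 1x1 base case
    h = n // 2
    L11 = rchol(a[:h, :h])
    L21 = np.linalg.solve(L11, a[h:, :h].T).T         # triangular solve
    L22 = rchol(a[h:, h:] - L21 @ L21.T)              # GEMM update, then recurse
    L = np.zeros_like(a, dtype=float)
    L[:h, :h] = L11
    L[h:, :h] = L21
    L[h:, h:] = L22
    return L
```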
Next Release of ATLAS to Include Recursive LU and LLT

- Will be part of the ATLAS distribution.
- Will be created as a by-product of the ATLAS make.
- Same calling sequence as LAPACK.
- Will be callable from Fortran, but written in C.
Recursive LU vs LAPACK/ATLAS

Recursive LLT vs LAPACK/ATLAS (Lower)
Future Plans for ATLAS

- By December 24th software release:
  - Level 1 and 2 BLAS implementations
  - Multi-threading
  - Recursive LU and LLT (QR to come)
  - See: www.netlib.org/atlas/
- Longer term:
  - Runtime adaptation
  - Sparsity analysis
  - Iterative code improvement
  - Adaptive libraries
  - Specialization for user applications
  - Optimize message passing system
  - Extend these ideas to Java directly
    - Java LAPACK
PAPI - Definition

Performance Application Programming Interface

- The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
Performance Data (cont.)

- I/D cache misses for different levels
- Branch mispredictions
- TLB misses
- Pipeline stalls due to the memory subsystem
- Pipeline stalls due to resource conflicts
- Cache invalidations
- TLB invalidations
- Load/store count
- Instruction count
- Cycle count
- Floating point instruction count
- Integer instruction count
- Branches taken / not taken count
PAPI Platforms

- Alpha
- AIX/Power 3/604e/604
- Linux/x86
- IRIX/R10k, R12k
- Unicos/21164

Soon:
- Solaris/x86, Ultra
- Tru64/21x64
Show PAPI Demos
Contributors to These Ideas

- Antoine Petitet, UTK - ATLAS
- Clint Whaley, UTK - ATLAS
- Phil Mucci, UTK - PAPI
- Nathan Garner, UTK - PAPI

For additional information see:
- icl.cs.utk.edu/
- www.netlib.org/atlas/
- www.netlib.org/utk/people/JackDongarra/

ATLAS received a 1999 R&D 100 Award.