|
|
|
Clint Whaley |
|
Antoine Petitet |
|
Jack Dongarra |
|
University of Tennessee |
|
and |
|
Oak Ridge National Laboratory |
|
http://www.cs.utk.edu/~dongarra/ |
|
|
|
|
|
|
500 most powerful computers.
|
Updated every 6 months |
|
Microprocessor-based computers are the technology of
choice.
|
Result: HPC driven by commodity parts |
|
|
|
|
|
|
|
|
By taking advantage of the principle of
locality, a memory hierarchy can:
|
Present the user with as much memory as is
available in the cheapest technology. |
|
Provide access at the speed offered by the
fastest technology. |
|
|
|
|
Computing hardware doubles its speed every
eighteen months. |
|
Yet it often takes more than a year for software
to be optimized or "tuned" for performance on a newly released
CPU. |
|
The job of optimizing software to exploit the
special features of a given CPU has historically been an exercise in hand
customization. |
|
|
|
|
|
Today’s processors can achieve high performance,
but this requires extensive machine-specific hand tuning.
|
Hardware and software have a large design space
with many parameters; performance is sensitive to:
|
Blocking sizes, loop nesting permutations, loop
unrolling depths, software pipelining strategies, register allocations, and
instruction schedules. |
|
Complicated interactions with the increasingly
sophisticated micro-architectures of new microprocessors. |
|
Performance instability |
|
About a year ago there was no tuned BLAS for the
Pentium under Linux.
|
Need for quick/dynamic deployment of optimized
routines. |
|
ATLAS - Automatically Tuned Linear Algebra Software
|
PHiPAC from Berkeley
|
FFTW from MIT (http://www.fftw.org) |
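To make these parameters concrete, here is a minimal sketch, in Python and purely for illustration (ATLAS itself generates C), of a cache-blocked matrix multiply. The blocking factor NB is one of the parameters such a tuning search varies; the function name and default are assumptions for this example.

```python
# Illustrative cache-blocked matrix multiply; NB is the blocking factor.
# In a real tuned kernel the innermost loops would also be unrolled and
# register-allocated -- two more of the parameters listed above.

def matmul_blocked(A, B, n, NB=2):
    """Return C = A*B for n-by-n matrices stored as lists of lists."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, NB):            # block row of C
        for jj in range(0, n, NB):        # block column of C
            for kk in range(0, n, NB):    # block of the inner dimension
                for i in range(ii, min(ii + NB, n)):
                    for k in range(kk, min(kk + NB, n)):
                        a = A[i][k]       # scalar reused across the j loop
                        for j in range(jj, min(jj + NB, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Blocking keeps an NB-by-NB working set in fast memory while it is reused; the best NB depends on the cache sizes of the target machine, which is why it is measured rather than fixed.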
|
|
|
|
|
An adaptive software architecture |
|
High-performance |
|
Portability |
|
Elegance |
|
|
|
|
|
|
|
|
|
ATLAS is faster than all other portable BLAS
implementations, and it is comparable to machine-specific libraries
provided by the vendor.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
ATLAS does not implement a single fixed
algorithm. |
|
The code is generated by a program that tests,
probes, and runs hundreds of experiments on the target software/hardware architecture.
|
During installation, the program generator
determines an efficient implementation: it measures the speed of different
code strategies and chooses the best using an adaptive procedure.
|
This leads to a new model of high performance
programming in which performance critical code is machine generated using
parameter optimization. |
|
|
|
|
|
Code is iteratively generated and timed until the
optimal case is found. We try:
|
Differing blocking factors (NBs)
|
Breaking false dependencies |
|
M, N and K loop unrolling |
|
On-chip multiply optimizes for: |
|
TLB access |
|
L1 cache reuse |
|
FP unit usage |
|
Memory fetch |
|
Register reuse |
|
Loop overhead minimization |
|
Takes 30 minutes to an hour to run.
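The timing loop itself can be sketched as follows — a toy Python stand-in for the install-time search, where the candidate NB values and the kernel body are illustrative assumptions, not ATLAS code:

```python
# Toy version of the install-time search: time every candidate blocking
# factor NB on this machine and keep the fastest one.
import time

def kernel(n, NB):
    """Blocked accumulate over an n-by-n problem (stand-in for the on-chip multiply)."""
    A = [[1.0] * n for _ in range(n)]
    B = [[1.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, NB):
        for kk in range(0, n, NB):
            for i in range(ii, min(ii + NB, n)):
                for k in range(kk, min(kk + NB, n)):
                    a = A[i][k]
                    for j in range(n):
                        C[i][j] += a * B[k][j]
    return C

def tune(n=64, candidates=(4, 8, 16, 32)):
    """Time each candidate NB and return (best_NB, best_seconds)."""
    results = []
    for NB in candidates:
        t0 = time.perf_counter()
        kernel(n, NB)
        results.append((time.perf_counter() - t0, NB))
    best_time, best_NB = min(results)
    return best_NB, best_time
```

ATLAS runs hundreds of such experiments, varying all of the parameters above at once, which is why the full search takes tens of minutes.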
|
|
|
|
|
|
Recur down to L1 cache block size |
|
Need kernel at bottom of recursion |
|
Use gemm-based kernel for portability |
|
|
|
|
|
|
|
Needs a reasonable C compiler and is focused on
superscalar (RISC) architectures.
|
Available today: www.netlib.org/atlas/ |
|
Keep a repository of kernels for specific
machines. |
|
Extend work to allow sparse matrix operations |
|
Extend work to include arbitrary code segments |
|
|
|
|
|
|
function a = rlu(a)
[m,n] = size(a);
mn = min(m,n);
if mn > 1
   nl = bitshift(mn,-1);
   % lu on left part
   a(1:m,1:nl) = rlu(a(1:m,1:nl));
   % triangular solve
   a(1:nl,nl+1:n) = (tril(a(1:nl,1:nl)) - diag(diag(a(1:nl,1:nl))) + eye(nl)) \ a(1:nl,nl+1:n);
   % matrix multiply
   a(nl+1:m,nl+1:n) = a(nl+1:m,nl+1:n) - a(nl+1:m,1:nl)*a(1:nl,nl+1:n);
   % lu on right part
   a(nl+1:m,nl+1:n) = rlu(a(nl+1:m,nl+1:n));
else
   % m-by-1 case
   a(2:m,:) = a(2:m,:)/a(1,1);
end
return
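The same recursive algorithm transcribed into Python, as an illustrative sketch: pure lists, no pivoting, and the matrix overwritten in place with the unit-lower L and U factors, just like the Matlab version above. The in-place block arguments (r0, c0, m, n) are an assumption of this transcription.

```python
# Recursive LU without pivoting, operating in place on the m-by-n block of
# `a` whose top-left corner is (r0, c0). Mirrors the Matlab rlu above.

def rlu(a, r0=0, c0=0, m=None, n=None):
    if m is None:
        m, n = len(a), len(a[0])
    mn = min(m, n)
    if mn > 1:
        nl = mn >> 1                      # same split as bitshift(mn,-1)
        # lu on left part
        rlu(a, r0, c0, m, nl)
        # triangular solve: A12 = L11 \ A12 (L11 is unit lower triangular,
        # so this is forward substitution with no divisions)
        for j in range(c0 + nl, c0 + n):
            for i in range(r0 + 1, r0 + nl):
                a[i][j] -= sum(a[i][c0 + k] * a[r0 + k][j] for k in range(i - r0))
        # matrix multiply: A22 = A22 - A21 * A12
        for i in range(r0 + nl, r0 + m):
            for j in range(c0 + nl, c0 + n):
                a[i][j] -= sum(a[i][c0 + k] * a[r0 + k][j] for k in range(nl))
        # lu on right part
        rlu(a, r0 + nl, c0 + nl, m - nl, n - nl)
    else:
        # m-by-1 case: scale the entries below the pivot
        for i in range(r0 + 1, r0 + m):
            a[i][c0] /= a[r0][c0]
```

The bulk of the work lands in the A22 update, a matrix multiply, which is exactly why a gemm-based kernel at the bottom of the recursion pays off.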
|
|
|
|
|
|
|
|
|
Larger matrix-matrix operations performed |
|
Automatic Blocking |
|
Drive the recursion down to 1 |
|
No blocksize needed |
|
Applies to Cholesky and QR factorization |
|
QR needs some care because of triangular matrix
in block Householder |
|
Unclear if the ideas apply to two-sided algorithms
(reduction for the eigenvalue problem)
|
|
|
|
|
Will be part of the ATLAS distribution. |
|
Will be created as a by-product of the ATLAS
make.
|
Same calling sequence as LAPACK. |
|
Will be callable from Fortran, but written in C. |
|
|
|
|
|
|
|
|
Software release by December 24th:
|
Level 1 and 2 BLAS implementations |
|
Multi-threading |
|
Recursive LU and LLT (QR to come) |
|
See: www.netlib.org/atlas/ |
|
Longer Term: |
|
Runtime adaptation |
|
Sparsity analysis |
|
Iterative code improvement |
|
Adaptive libraries |
|
Specialization for user applications |
|
Optimize message passing system |
|
Extend these ideas to Java directly |
|
Java LAPACK |
|
|
|
|
|
|
|
|
Performance Application Programming Interface |
|
|
|
|
|
The purpose of the PAPI project is to
design, standardize and implement a portable and efficient API to access
the hardware performance monitor counters found on most modern
microprocessors. |
|
|
|
|
|
I/D cache misses for different levels |
|
Branch mispredictions |
|
TLB misses |
|
Pipeline stalls due to memory subsystem |
|
Pipeline stalls due to resource conflicts |
|
Cache invalidations |
|
TLB invalidations |
|
Load/store count |
|
Instruction count |
|
Cycle count |
|
Floating point instruction count |
|
Integer instruction count |
|
Branch taken / not taken count |
|
|
|
|
|
|
|
Alpha |
|
AIX/Power 3/604e/604 |
|
Linux/x86 |
|
IRIX/R10k,R12k |
|
Unicos/21164 |
|
Soon |
|
Solaris/x86,Ultra |
|
Tru64/21x64 |
|
|
|
|
|
Antoine Petitet, UTK - ATLAS |
|
Clint Whaley, UTK - ATLAS |
|
Phil Mucci, UTK - PAPI |
|
Nathan Garner, UTK - PAPI |
|
|
|
For additional information see… |
|
icl.cs.utk.edu/ |
|
www.netlib.org/atlas/ |
|
www.netlib.org/utk/people/JackDongarra/ |
|
ATLAS received a 1999 R&D 100 Award |
|
|
|