Tools for Numerical Performance and Analysis
Automatically Tuned Linear Algebra Software - ATLAS

Clint Whaley
Antoine Petitet
Jack Dongarra

University of Tennessee
and
Oak Ridge National Laboratory

http://www.cs.utk.edu/~dongarra/
|
Chip Technology (Data from the Top500 Report)

- List of the 500 most powerful computers.
- Updated every 6 months.
- Microprocessor-based computers are the technology of choice.
- Result: HPC is driven by commodity parts.
|
Where Does the Performance Go? or
Why Should I Care About the Memory Hierarchy?

Memory Hierarchy

By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
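To make locality concrete, the two loops below touch the same data in different orders. This is an illustrative Python sketch (the function names are mine); in interpreted Python the interpreter overhead masks the cache effect, but in compiled code the in-memory-order traversal is typically several times faster because consecutive accesses reuse the same cache line.

```python
import numpy as np

def sum_row_major(a):
    """Traverse in memory order (row-major): consecutive accesses fall in
    the same cache line, so spatial locality keeps the fast levels of the
    hierarchy supplied with data."""
    s = 0.0
    rows, cols = a.shape
    for i in range(rows):
        for j in range(cols):
            s += a[i, j]
    return s

def sum_col_major(a):
    """Traverse against memory order: each access jumps a full row ahead,
    so a cache line is often evicted before its neighbors are used."""
    s = 0.0
    rows, cols = a.shape
    for j in range(cols):
        for i in range(rows):
            s += a[i, j]
    return s
```

Both functions compute the same sum; only the memory access pattern differs, which is exactly the kind of difference the memory hierarchy rewards or punishes.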
Performance Software

- Computing hardware doubles its speed every eighteen months.
- Yet it often takes more than a year for software to be optimized or "tuned" for performance on a newly released CPU.
- The job of optimizing software to exploit the special features of a given CPU has historically been an exercise in hand customization.
How To Get Performance From Commodity Processors?

- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Hardware and software have a large design space with many parameters; performance is sensitive to:
  - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
- Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
- Performance instability:
  - About a year ago there were no tuned BLAS for the Pentium under Linux.
- Need for quick/dynamic deployment of optimized routines:
  - ATLAS - Automatically Tuned Linear Algebra Software
  - PHiPAC from Berkeley
  - FFTW from MIT (http://www.fftw.org)
ATLAS

- An adaptive software architecture:
  - High performance
  - Portability
  - Elegance
- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
ATLAS (DGEMM n = 500)

[Performance chart: DGEMM at n = 500 across platforms, comparing ATLAS with vendor and other portable BLAS libraries]
Why Is ATLAS Fast?

- ATLAS does not implement a single fixed algorithm.
- The code is generated by a program that tests, probes, and runs hundreds of experiments on the target software/hardware architecture.
- During installation the program generator determines an efficient implementation:
  - It measures the speed of different code strategies and chooses the best using an adaptive procedure.
- This leads to a new model of high-performance programming in which performance-critical code is machine generated using parameter optimization.
Code Generation Strategy

- Code is iteratively generated and timed until the best case is found. We try:
  - Differing NBs (blocking sizes)
  - Breaking false dependencies
  - M, N, and K loop unrolling
- The on-chip multiply optimizes for:
  - TLB access
  - L1 cache reuse
  - FP unit usage
  - Memory fetch
  - Register reuse
  - Loop overhead minimization
- Takes 30 minutes to an hour to run.
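The generate-time-select loop described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not ATLAS's actual generator; `matmul_blocked` and `tune_nb` are hypothetical names, and only the blocking size NB is searched here, whereas ATLAS also varies unrollings, schedules, and so on.

```python
import time
import numpy as np

def matmul_blocked(A, B, nb):
    """Blocked matrix multiply; nb plays the role of ATLAS's NB parameter."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, nb):
        for k in range(0, n, nb):
            for j in range(0, n, nb):
                # small "on-chip" multiply on nb x nb tiles
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C

def tune_nb(n=128, candidates=(16, 32, 64)):
    """Empirical search: time each candidate blocking size, keep the fastest."""
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    timings = {}
    for nb in candidates:
        t0 = time.perf_counter()
        matmul_blocked(A, B, nb)
        timings[nb] = time.perf_counter() - t0
    return min(timings, key=timings.get)
```

Run once at install time, a search like this adapts the library to whatever cache sizes and pipelines the machine actually has, with no hand tuning.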
Ax = b, n = 500 (LU Right-Looking)
Recursive Approach for Other Level 3 BLAS

- Recur down to the L1 cache block size.
- Need a kernel at the bottom of the recursion.
- Use a GEMM-based kernel for portability.
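As a sketch of this GEMM-based recursion, here is a recursive lower-triangular solve (TRSM is one of the Level 3 BLAS) in Python. The name `rtrsm_lower` and the cutoff `nb` are my own, and `np.linalg.solve` stands in for the small kernel at the bottom; the point is that all the off-diagonal work becomes one matrix multiply.

```python
import numpy as np

def rtrsm_lower(L, B, nb=64):
    """Solve L X = B with L lower triangular, by recursion.

    Split L = [[L11, 0], [L21, L22]]: solve the top half, fold L21 into the
    right-hand side with a GEMM, then solve the bottom half. Recur until the
    block fits the (assumed) L1 size nb, then call a small kernel."""
    n = L.shape[0]
    if n <= nb:
        return np.linalg.solve(L, B)          # kernel at the bottom
    h = n // 2
    X1 = rtrsm_lower(L[:h, :h], B[:h], nb)
    # B2 <- B2 - L21 @ X1 : this update is the GEMM call
    B2 = B[h:] - L[h:, :h] @ X1
    X2 = rtrsm_lower(L[h:, h:], B2, nb)
    return np.vstack([X1, X2])
```

Because almost all flops land in the GEMM update, only the multiply kernel needs machine-specific tuning; the rest of the Level 3 BLAS inherits its speed.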
500x500 Recursive Level 3 BLAS on UltraSparc 2/200

500x500 Level 2 BLAS DGEMV

Multi-Threaded DGEMM (Intel PIII 550 MHz)
ATLAS

- Needs a reasonable C compiler and is focused on superscalar (RISC) architectures.
- Available today: www.netlib.org/atlas/
- Keep a repository of kernels for specific machines.
- Extend work to allow sparse matrix operations.
- Extend work to include arbitrary code segments.
Gaussian Elimination

Gaussian Elimination via a Recursive Algorithm

Matlab Code (Without Pivoting)
function a = rlu(a)
  [m,n] = size(a);
  mn = min(m,n);
  if mn > 1
    nl = bitshift(mn,-1);
    % lu on left part
    a(1:m,1:nl) = rlu(a(1:m,1:nl));
    % triangular solve
    a(1:nl,nl+1:n) = (tril(a(1:nl,1:nl)) - diag(diag(a(1:nl,1:nl))) + eye(nl)) \ a(1:nl,nl+1:n);
    % matrix multiply
    a(nl+1:m,nl+1:n) = a(nl+1:m,nl+1:n) - a(nl+1:m,1:nl)*a(1:nl,nl+1:n);
    % lu on right part
    a(nl+1:m,nl+1:n) = rlu(a(nl+1:m,nl+1:n));
  else
    % mx1 case: scale below the pivot
    a(2:m,:) = a(2:m,:)/a(1,1);
  end
return
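For readers more comfortable with NumPy, the Matlab routine above translates nearly line for line. This port is my own illustration, not part of ATLAS; it works in place through array views and, like the original, assumes no pivoting is needed.

```python
import numpy as np

def rlu(a):
    """Recursive LU factorization without pivoting, in place.

    On return, a holds the unit-lower factor L strictly below the diagonal
    and the upper factor U on and above it."""
    m, n = a.shape
    mn = min(m, n)
    if mn > 1:
        nl = mn >> 1                      # split point, as in bitshift(mn,-1)
        rlu(a[:, :nl])                    # lu on left part (view: updates in place)
        L11 = np.tril(a[:nl, :nl], -1) + np.eye(nl)
        a[:nl, nl:] = np.linalg.solve(L11, a[:nl, nl:])      # triangular solve
        a[nl:, nl:] -= a[nl:, :nl] @ a[:nl, nl:]             # matrix multiply (GEMM)
        rlu(a[nl:, nl:])                  # lu on right part
    else:
        a[1:, :] /= a[0, 0]               # m x 1 base case
    return a
```

Without pivoting the factorization only exists when every leading minor is nonsingular, e.g. for diagonally dominant matrices; that caveat applies to the Matlab code above as well.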
Recursive Factorizations

- Larger matrix-matrix operations are performed.
- Automatic blocking:
  - Drive the recursion down to 1.
  - No block size needed.
- Applies to Cholesky and QR factorization:
  - QR needs some care because of the triangular matrix in the block Householder representation.
- Unclear if these ideas apply to two-sided algorithms (reductions for the eigenvalue problem).
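The same pattern carries over to Cholesky. A minimal sketch (my own illustration, with `np.linalg.solve` standing in for a triangular-solve kernel): the recursion goes all the way down to 1x1, so no block size appears anywhere, yet the bulk of the work still lands in one matrix multiply.

```python
import numpy as np

def rchol(a):
    """Recursive Cholesky factorization: A = L L^T, L lower triangular.

    Split A = [[A11, A21^T], [A21, A22]]: factor A11, solve for L21,
    update A22 with a GEMM, then recurse on the trailing block."""
    n = a.shape[0]
    if n == 1:
        return np.sqrt(a)                             # 1x1 base case
    h = n // 2
    L11 = rchol(a[:h, :h])
    L21 = np.linalg.solve(L11, a[h:, :h].T).T         # triangular solve
    L22 = rchol(a[h:, h:] - L21 @ L21.T)              # GEMM update, then recurse
    L = np.zeros_like(a, dtype=float)
    L[:h, :h] = L11
    L[h:, :h] = L21
    L[h:, h:] = L22
    return L
```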
Next Release of ATLAS to Include Recursive LU and LLT

- Will be part of the ATLAS distribution.
- Will be created as a by-product of the ATLAS make.
- Same calling sequence as LAPACK.
- Will be callable from Fortran, but written in C.
Recursive LU vs LAPACK/ATLAS

Recursive LLT vs LAPACK/ATLAS (Lower)
Future Plans for ATLAS

- By December 24th software release:
  - Level 1 and 2 BLAS implementations
  - Multi-threading
  - Recursive LU and LLT (QR to come)
  - See: www.netlib.org/atlas/
- Longer term:
  - Runtime adaptation
  - Sparsity analysis
  - Iterative code improvement
  - Adaptive libraries
  - Specialization for user applications
  - Optimize message passing system
  - Extend these ideas to Java directly
    - Java LAPACK
PAPI - Definition

Performance Application Programming Interface

- The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
Performance Data (cont.)

- I/D cache misses for different levels
- Branch mispredictions
- TLB misses
- Pipeline stalls due to the memory subsystem
- Pipeline stalls due to resource conflicts
- Cache invalidations
- TLB invalidations
- Load/store count
- Instruction count
- Cycle count
- Floating point instruction count
- Integer instruction count
- Branches taken / not taken count
PAPI Platforms

- Alpha
- AIX/Power 3/604e/604
- Linux/x86
- IRIX/R10k, R12k
- Unicos/21164

Soon:
- Solaris/x86, Ultra
- Tru64/21x64
Show PAPI Demos
Contributors to These Ideas

- Antoine Petitet, UTK - ATLAS
- Clint Whaley, UTK - ATLAS
- Phil Mucci, UTK - PAPI
- Nathan Garner, UTK - PAPI

For additional information see:
- icl.cs.utk.edu/
- www.netlib.org/atlas/
- www.netlib.org/utk/people/JackDongarra/

ATLAS received a 1999 R&D 100 Award.