Table of Contents
ScaLAPACK Tutorial
Outline
Outline continued
Introduction
High-Performance Computing Today
Growth of Microprocessor Performance
Scalable Multiprocessors
Performance Numbers on RISC Processors
The Maturation of Highly Parallel Technology
Architecture Alternatives
Directions
Challenges in Developing Distributed Memory Libraries
ScaLAPACK Project Overview
ScaLAPACK Team
Scalable Parallel Library for Numerical Linear Algebra
NLA - Software Development
NLA - ScaLAPACK
Goals - Port LAPACK to Distributed-Memory Environments.
ScaLAPACK Team
Programming Style
Overall Structure of Software
PBLAS
ScaLAPACK Structure
Parallelism in ScaLAPACK
Heterogeneous Computing
Prototype Codes
ATLAS & PhiPAC Projects (Automatically Tuned Linear Algebra Software)
ATLAS -- References
Out of Core Software Approach
Out-of-Core Performance
HPF Version
HPF Version
ScaLAPACK - Ongoing Work
Direct Sparse Solvers
Sparse Gaussian Elimination
SuperLU
Parallel Sparse Eigenvalue Solvers
Netlib downloads for ScaLAPACK material
Java
Java
LAPACK to Java
Parameterized Libraries
Motivation for Network Enabled Solvers
NetSolve -- References
User Applications
Impact -- Applications
ScaLAPACK in ASCI Application Amounts to a Savings of $1.1M-$5.4M
Interaction with ASCI at Caltech
Ocean Circulation Model
ScaLAPACK Software Hierarchy
Basic Linear Algebra Subprograms (BLAS)
BLAS -- Introduction
Memory Hierarchy
Level 1, 2 and 3 BLAS
Why Higher Level BLAS?
BLAS for Performance
BLAS -- References
BLAS Papers
BLAS Technical Forum http://www.netlib.org/utk/papers/blast-forum.html
Linear Algebra PACKage (LAPACK)
EISPACK and LINPACK
History of Block-Partitioned Algorithms
Block-Partitioned Algorithms
LAPACK
LAPACK
Derivation of Blocked Algorithms: Cholesky Factorization A = U^T U
LINPACK Implementation
LAPACK Implementation
Derivation of Blocked Algorithms
LAPACK Blocked Algorithms
LAPACK Contents
LAPACK -- Motivations
LAPACK -- Release 3.0
LAPACK Ongoing Work
LAPACK -- Summary
LAPACK -- Summary contd
LAPACK -- References
Basic Linear Algebra Communication Subprograms (BLACS)
BLACS -- Introduction
BLACS -- Intro contd.
BLACS -- Basics
BLACS -- Basics
BLACS -- Basics
BLACS -- Basics
BLACS -- Basics
BLACS -- Communication Routines
BLACS -- Point to Point
BLACS -- Communication Routines
BLACS -- Broadcast
BLACS -- Combine Operations
BLACS -- Combine Operations
BLACS -- Combine
BLACS -- Combine
BLACS -- Advanced Topics
BLACS -- Advanced Topics
BLACS -- Advanced Topics
BLACS -- Advanced Topics
BLACS -- Advanced Topics
BLACS -- Advanced Topics
BLACS -- Advanced Topics
BLACS -- Advanced Topics
BLACS -- Advanced Topics
BLACS -- Example Programs
BLACS -- References
Parallel Basic Linear Algebra Subprograms (PBLAS)
PBLAS -- Introduction
Scope of the PBLAS
Scope of the PBLAS
PBLAS -- Naming Conventions
PBLAS -- Naming Conventions
PBLAS
PBLAS -- Syntax
Data Distributions
PBLAS -- Storage Conventions
Distribution and Storage
PBLAS -- Storage Conventions
PBLAS -- Auxiliary Subprograms
PBLAS -- Auxiliary Subprograms
PBLAS -- Rationale
PBLAS -- Rationale contd
PBLAS -- Examples
PBLAS -- Examples
PBLAS -- Example Programs
PBLAS -- Example Programs
Features of PBLAS V2 ALPHA
Features of PBLAS V2 ALPHA
Features of PBLAS V2 ALPHA
Performance of PBLAS V2 ALPHA
Performance of PBLAS V2 ALPHA
PBLAS -- References
Design of ScaLAPACK
ScaLAPACK Structure
Goals - Port LAPACK to Distributed-Memory Environments.
Object-Based Design in Fortran77
Array Descriptors
Choosing a Data Distribution
Possible Data Layouts
Two-dimensional Block-Cyclic Distribution
Two-dimensional Block-Cyclic Distribution
Array descriptor for Dense Matrices
Narrow Band and Tridiagonal Matrices
Array descriptor for Narrow Band Matrices
Array descriptor for Right Hand Sides for Narrow Band Linear Solvers
Error Handling
Application Debugging Hints
ScaLAPACK Implementation
Functionality
Functionality continued
Parallelism in ScaLAPACK
Documentation, Test Suites, Example Programs, ...
Commercial Use
ScaLAPACK Performance
Target Machines for ScaLAPACK
Scalability -- Introduction
Scalability -- Introduction
Scalability
Achieving High Performance
Achieving High Performance on a Distributed-Memory Computer
Achieving High Performance on a Distributed-Memory Computer
Achieving High Performance on a Network of Workstations
Achieving High Performance on a Network of Workstations
Obtaining High Performance
Details of Cluster timings
Performance
Details of SP2 timings
Performance
Details of Paragon timings
Performance
LU Performance (Mflop/s) on a 32-node Intel XP/S MP Paragon
Performance of LU fact. + solve on the Intel XP/S MP Paragon (Mflop/s) (2 computational processors per node)
ScaLAPACK Example Programs
ScaLAPACK Example Program #1
ScaLAPACK Example Program #2
Issues of Heterogeneous Computing
Heterogeneous Computing
Heterogeneous Computing
Homogeneous Versus Heterogeneous
Homogeneous Versus Heterogeneous
Homogeneous Versus Heterogeneous
Heterogeneous Computing Issues
Communicating on IEEE Machines
Machine Precision
Heterogeneous Machine Precision
Other Machine Parameters
Heterogeneous Networks -- Arithmetic Issues
Algorithmic Integrity -- Examples
QR Algorithm for a Tridiagonal Matrix
Heterogeneous Conclusions
HPF Interface to ScaLAPACK
HPF Version
HPF Version
HPF Interface -- Note
HPF -- Redistribution
HPF -- Redistribution
Determining Distribution
Determining Distribution
Calling Fortran77 From HPF
Calling Fortran77 From HPF
Calling Fortran77 From HPF
HPF Interface -- Summary
HPF Interface -- Summary
HPF Interface -- Summary
HPF Performance for LU on 12-node Cluster Sun Ultra using PGI Compiler
Future Directions
ScaLAPACK -- Ongoing Work
Conclusions
ScaLAPACK Summary
ScaLAPACK Summary
ScaLAPACK Team
ScaLAPACK -- References