ScaLAPACK Tutorial

2/17/98



Table of Contents

ScaLAPACK Tutorial

Outline

Outline continued

Introduction

High-Performance Computing Today

Growth of Microprocessor Performance

Scalable Multiprocessors

Performance Numbers on RISC Processors

The Maturation of Highly Parallel Technology

Architecture Alternatives

Directions

Challenges in Developing Distributed Memory Libraries

ScaLAPACK Project Overview

ScaLAPACK Team

Scalable Parallel Library for Numerical Linear Algebra

NLA - Software Development

NLA - ScaLAPACK

Goals - Port LAPACK to Distributed-Memory Environments.

ScaLAPACK Team

Programming Style

Overall Structure of Software

PBLAS

ScaLAPACK Structure

Parallelism in ScaLAPACK

Heterogeneous Computing

Prototype Codes

ATLAS & PHiPAC Projects (Automatically Tuned Linear Algebra Software)

ATLAS -- References

Out of Core Software Approach

Out-of-Core Performance

HPF Version

HPF Version

ScaLAPACK - Ongoing Work

Direct Sparse Solvers

Sparse Gaussian Elimination

SuperLU

Parallel Sparse Eigenvalue Solvers

Netlib downloads for ScaLAPACK material

Java

Java

LAPACK to Java

Parameterized Libraries

Motivation for Network Enabled Solvers

NetSolve -- References

User Applications

Impact -- Applications

ScaLAPACK in ASCI Application Amounts to Savings of $1.1M - $5.4M

Interaction with ASCI at Caltech

Ocean Circulation Model

ScaLAPACK Software Hierarchy

Basic Linear Algebra Subprograms (BLAS)

BLAS -- Introduction

Memory Hierarchy

Level 1, 2 and 3 BLAS

Why Higher Level BLAS?

BLAS for Performance

BLAS -- References

BLAS Papers

BLAS Technical Forum (http://www.netlib.org/utk/papers/blast-forum.html)

Linear Algebra PACKage (LAPACK)

EISPACK and LINPACK

History of Block-Partitioned Algorithms

Block-Partitioned Algorithms

LAPACK

LAPACK

Derivation of Blocked Algorithms: Cholesky Factorization A = UᵀU

LINPACK Implementation

LAPACK Implementation

Derivation of Blocked Algorithms

LAPACK Blocked Algorithms

LAPACK Contents

LAPACK -- Motivations

LAPACK -- Release 3.0

LAPACK Ongoing Work

LAPACK -- Summary

LAPACK -- Summary contd

LAPACK -- References

Basic Linear Algebra Communication Subprograms (BLACS)

BLACS -- Introduction

BLACS -- Intro contd.

BLACS -- Basics

BLACS -- Basics

BLACS -- Basics

BLACS -- Basics

BLACS -- Basics

BLACS -- Communication Routines

BLACS -- Point to Point

BLACS -- Communication Routines

BLACS -- Broadcast

BLACS -- Combine Operations

BLACS -- Combine operations

BLACS -- Combine

BLACS -- Combine

BLACS -- Advanced Topics

BLACS -- Advanced Topics

BLACS -- Advanced Topics

BLACS -- Advanced Topics

BLACS -- Advanced Topics

BLACS -- Advanced Topics

BLACS -- Advanced Topics

BLACS -- Advanced Topics

BLACS -- Advanced Topics

BLACS -- Example Programs

BLACS -- References

Parallel Basic Linear Algebra Subprograms (PBLAS)

PBLAS -- Introduction

Scope of the PBLAS

Scope of the PBLAS

PBLAS -- Naming Conventions

PBLAS -- Naming Conventions

PBLAS

PBLAS -- Syntax

Data Distributions

PBLAS -- Storage Conventions

Distribution and Storage

PBLAS -- Storage Conventions

PBLAS -- Auxiliary Subprograms

PBLAS -- Auxiliary Subprograms

PBLAS -- Rationale

PBLAS -- Rationale contd

PBLAS -- Examples

PBLAS -- Examples

PBLAS -- Example Programs

PBLAS -- Example Programs

Features of PBLAS V2 ALPHA

Features of PBLAS V2 ALPHA

Features of PBLAS V2 ALPHA

Performance of PBLAS V2 ALPHA

Performance of PBLAS V2 ALPHA

PBLAS -- References

Design of ScaLAPACK

ScaLAPACK Structure

Goals - Port LAPACK to Distributed-Memory Environments.

Object-Based Design in Fortran77

Array Descriptors

Choosing a Data Distribution

Possible Data Layouts

Two-dimensional Block-Cyclic Distribution

Two-dimensional Block-Cyclic Distribution

Array descriptor for Dense Matrices

Narrow Band and Tridiagonal Matrices

Array descriptor for Narrow Band Matrices

Array descriptor for Right Hand Sides for Narrow Band Linear Solvers

Error Handling

Application Debugging Hints

ScaLAPACK Implementation

Functionality

Functionality continued

Parallelism in ScaLAPACK

Documentation, Test Suites, Example Programs, ...

Commercial Use

ScaLAPACK Performance

Target Machines for ScaLAPACK

Scalability -- Introduction

Scalability -- Introduction

Scalability

Achieving High Performance

Achieving High Performance on a Distributed-Memory Computer

Achieving High Performance on a Distributed-Memory Computer

Achieving High Performance on a Network of Workstations

Achieving High Performance on a Network of Workstations

Obtaining High Performance

Details of Cluster timings

Performance

Details of SP2 timings

Performance

Details of Paragon timings

Performance

LU Performance (Mflop/s) on 32 nodes Intel XP/S MP Paragon

Performance of LU fact. + solve on the Intel XP/S MP Paragon (Mflop/s) (2 computational processors per node)

ScaLAPACK Example Programs

ScaLAPACK Example Program #1

ScaLAPACK Example Program #2

Issues of Heterogeneous Computing

Heterogeneous Computing

Heterogeneous Computing

Homogeneous Versus Heterogeneous

Homogeneous Versus Heterogeneous

Homogeneous Versus Heterogeneous

Heterogeneous Computing Issues

Communicating on IEEE Machines

Machine Precision

Heterogeneous Machine Precision

Other Machine Parameters

Heterogeneous Networks -- Arithmetic Issues

Algorithmic Integrity -- Examples

QR Algorithm for a Tridiagonal Matrix

Heterogeneous Conclusions

HPF Interface to ScaLAPACK

HPF Version

HPF Version

HPF Interface -- Note

HPF -- Redistribution

HPF -- Redistribution

Determining Distribution

Determining Distribution

Calling Fortran77 From HPF

Calling Fortran77 From HPF

Calling Fortran77 From HPF

HPF Interface -- Summary

HPF Interface -- Summary

HPF Interface -- Summary

HPF Performance for LU on 12-node Cluster Sun Ultra using PGI Compiler

Future Directions

ScaLAPACK -- Ongoing Work

Conclusions

ScaLAPACK Summary

ScaLAPACK Summary

ScaLAPACK Team

ScaLAPACK -- References

Author: Susan Blackford

Email: scalapack@cs.utk.edu

Home Page: http://www.netlib.org/scalapack/