At least three factors affect the performance of portable Fortran code.
- Vectorization. Designing vectorizable algorithms in linear algebra is usually straightforward.
Indeed, for many computations there are several variants, all vectorizable, but
with different characteristics in performance (see, for example,
[15]).
Linear algebra algorithms can approach the peak performance of
many machines--principally because peak performance depends on some form of
chaining of vector addition and multiplication operations, and this is just
what the algorithms require.
However, when the algorithms are realized in straightforward Fortran 77 code,
the performance may fall well short of the expected level, usually because
vectorizing Fortran compilers fail to minimize the number of memory
references--that is, the number of vector load and store operations (the sketch following this list illustrates this point).
- Data movement.
What often limits the actual performance of a vector, or scalar,
floating-point unit is the rate of transfer of data between different levels
of memory in the machine. Examples include the transfer of vector operands
in and out of vector registers, the transfer of
scalar operands in and out of a
high-speed scalar processor, the movement of data between main memory and a
high-speed cache or local memory, paging between
actual memory and disk storage in a virtual memory system, and
interprocessor communication on a distributed memory concurrent computer.
- Parallelism.
The nested loop structure of most linear algebra algorithms offers
considerable scope for loop-based parallelism.
This is the principal type of parallelism that LAPACK and ScaLAPACK presently
aim to exploit. On shared memory concurrent computers, this type of
parallelism can sometimes be generated automatically
by a compiler, but often requires the insertion of compiler directives.
On distributed memory concurrent computers, data must be moved between
processors. This is usually done by explicit calls to message passing routines,
although parallel language extensions such as Coherent Parallel C
[29] and Split-C [13] do the message passing
implicitly.
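As a concrete illustration of the point about memory references made under Vectorization, here is a minimal sketch (not a fragment of LAPACK) of two Fortran 77 codings of the update y <- y + Ax. The names A, X, Y, N, and TEMP are illustrative; their declarations and initialization are assumed to be supplied by the surrounding program, and the two variants are alternatives rather than code to be executed in sequence.

*     Two vectorizable variants of the matrix-vector update
*     y <- y + A*x for an N-by-N array A (declarations assumed
*     elsewhere; TEMP is a double precision scalar).
*
*     Variant 1 ("ij", inner-product form): the inner loop is a dot
*     product over a row of A (accessed with stride N) and the
*     vector X; the result is accumulated in a scalar, so the inner
*     loop needs no vector stores.
      DO 20 I = 1, N
         TEMP = Y( I )
         DO 10 J = 1, N
            TEMP = TEMP + A( I, J )*X( J )
   10    CONTINUE
         Y( I ) = TEMP
   20 CONTINUE
*
*     Variant 2 ("ji", SAXPY form): the inner loop sweeps down a
*     column of A with stride 1, but unless the compiler keeps Y in
*     vector registers across the J loop, Y is loaded and stored once
*     for every column of A, adding vector memory traffic that the
*     inner-product form avoids.
      DO 40 J = 1, N
         DO 30 I = 1, N
            Y( I ) = Y( I ) + A( I, J )*X( J )
   30    CONTINUE
   40 CONTINUE

Both variants perform the same $2n^2$ floating-point operations; what differs is the number of vector load and store operations in the compiled code, which is exactly the effect described above.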
The question arises, "How can we achieve sufficient control over these three factors to obtain the levels of performance that machines can offer?" The answer is through use of the BLAS.
There are now three levels of BLAS:
- Level 1 BLAS [40]: for vector operations, such as $y \leftarrow \alpha x + y$,
- Level 2 BLAS [18]: for matrix-vector operations, such as $y \leftarrow \alpha A x + \beta y$,
- Level 3 BLAS [17]: for matrix-matrix operations, such as $C \leftarrow \alpha A B + \beta C$.
Here, A, B and C are matrices, x and y are vectors, and $\alpha$ and $\beta$ are scalars.
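For concreteness, one representative double-precision routine from each level might be called as follows. This is an illustrative sketch rather than an excerpt from LAPACK: the scalars N, ALPHA, and BETA, the arrays A, B, C, X, and Y, and the leading dimensions LDA, LDB, and LDC are assumed to be declared and set up by the calling program.

*     Level 1:  y <- alpha*x + y
      CALL DAXPY( N, ALPHA, X, 1, Y, 1 )
*
*     Level 2:  y <- alpha*A*x + beta*y
      CALL DGEMV( 'N', N, N, ALPHA, A, LDA, X, 1, BETA, Y, 1 )
*
*     Level 3:  C <- alpha*A*B + beta*C
      CALL DGEMM( 'N', 'N', N, N, N, ALPHA, A, LDA, B, LDB,
     $            BETA, C, LDC )

Moving from the first call to the third, more and more of the computation is handed to a single subprogram, which is what allows an optimized implementation to manage vectorization, data movement, and parallelism internally.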
The Level 1 BLAS are used in LAPACK,
but for convenience rather than
for performance: they perform an insignificant fraction of the computation,
and they cannot achieve high efficiency on most modern supercomputers.
The Level 2 BLAS can achieve near-peak performance
on many vector processors,
such as a single processor of a CRAY X-MP or Y-MP, or Convex C-2 machine.
However, on other vector processors such as a CRAY-2 or an IBM 3090 VF,
the performance of the Level 2 BLAS is limited by the rate of data movement between different
levels of memory.
The Level 3 BLAS overcome this
limitation.
This third level of BLAS performs $O(n^3)$ floating-point operations on $O(n^2)$ data, whereas the Level 2 BLAS perform only $O(n^2)$ operations on $O(n^2)$ data.
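As a rough illustration (a back-of-the-envelope count, not a figure taken from Table 1): for matrices of order $n$, a matrix-matrix multiply performs about $2n^3$ floating-point operations on only about $3n^2$ matrix elements, so the work per data item grows like $n$, whereas a matrix-vector multiply performs about $2n^2$ operations on about $n^2$ data, only a couple of operations per item. This surplus of computation over data traffic is what allows a Level 3 implementation to keep blocks of the matrices in registers, cache, or local memory and reuse them.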
The Level 3 BLAS also allow us to exploit parallelism in a way that is transparent
to the software that calls them.
While the Level 2 BLAS offer some scope for exploiting parallelism, greater
scope is provided by the Level 3 BLAS, as Table 1 illustrates.
Table 1: Speed (Megaflops) of Level 2 and Level 3 BLAS Operations on a
CRAY Y-MP. All matrices are of order 500; U is upper triangular.