At least three factors affect the performance of portable Fortran code.
- Vectorization. Designing vectorizable algorithms in linear algebra is usually straightforward.
Indeed, for many computations there are several variants, all vectorizable, but
with different characteristics in performance (see, for example,
[15]).
Linear algebra algorithms can approach the peak performance of
many machines--principally because peak performance depends on some form of
chaining of vector addition and multiplication operations, and this is just
what the algorithms require.
However, when the algorithms are realized in straightforward Fortran 77 code,
the performance may fall well short of the expected level, usually because
vectorizing Fortran compilers fail to minimize the number of memory
references--that is, the number of vector load and store operations (the sketch following this list illustrates this point).
- Data movement.
What often limits the actual performance of a vector, or scalar,
floating-point unit is the rate of transfer of data between different levels
of memory in the machine. Examples include the transfer of vector operands
in and out of vector registers, the transfer of
scalar operands in and out of a
high-speed scalar processor, the movement of data between main memory and a
high-speed cache or local memory, paging between
actual memory and disk storage in a virtual memory system, and
interprocessor communication on a distributed memory concurrent computer.
- Parallelism.
The nested loop structure of most linear algebra algorithms offers
considerable scope for loop-based parallelism.
This is the principal type of parallelism that LAPACK and ScaLAPACK presently
aim to exploit. On shared memory concurrent computers, this type of
parallelism can sometimes be generated automatically
by a compiler, but often requires the insertion of compiler directives.
On distributed memory concurrent computers, data must be moved between
processors. This is usually done by explicit calls to message passing routines,
although parallel language extensions such as Coherent Parallel C
[29] and Split-C [13] do the message passing
implicitly.
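As a concrete illustration of the point about memory references made under Vectorization, here is a minimal sketch (not a fragment of LAPACK) of two Fortran 77 codings of the update y <- y + Ax. The names A, X, Y, N, and TEMP are illustrative; their declarations and initialization are assumed to be supplied by the surrounding program, and the two variants are alternatives rather than code to be executed in sequence.

*     Two vectorizable variants of the matrix-vector update
*     y <- y + A*x for an N-by-N array A (declarations assumed
*     elsewhere; TEMP is a double precision scalar).
*
*     Variant 1 ("ij", inner-product form): the inner loop is a dot
*     product over a row of A (accessed with stride N) and the
*     vector X; the result is accumulated in a scalar, so the inner
*     loop needs no vector stores.
      DO 20 I = 1, N
         TEMP = Y( I )
         DO 10 J = 1, N
            TEMP = TEMP + A( I, J )*X( J )
   10    CONTINUE
         Y( I ) = TEMP
   20 CONTINUE
*
*     Variant 2 ("ji", SAXPY form): the inner loop sweeps down a
*     column of A with stride 1, but unless the compiler keeps Y in
*     vector registers across the J loop, Y is loaded and stored once
*     for every column of A, adding vector memory traffic that the
*     inner-product form avoids.
      DO 40 J = 1, N
         DO 30 I = 1, N
            Y( I ) = Y( I ) + A( I, J )*X( J )
   30    CONTINUE
   40 CONTINUE

Both variants perform the same $2n^2$ floating-point operations; what differs is the number of vector load and store operations in the compiled code, which is exactly the effect described above.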
The question arises, "How can we achieve sufficient control over these three factors to obtain the levels of performance that machines can offer?" The answer is through use of the BLAS.
There are now three levels of BLAS:
- Level 1 BLAS [40]: for vector operations, such as $y \leftarrow \alpha x + y$,
- Level 2 BLAS [18]: for matrix-vector operations, such as $y \leftarrow \alpha A x + \beta y$,
- Level 3 BLAS [17]: for matrix-matrix operations, such as $C \leftarrow \alpha A B + \beta C$.
Here, A, B and C are matrices, x and y are vectors, and $\alpha$ and $\beta$ are scalars.
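For concreteness, one representative double-precision routine from each level might be called as follows. This is an illustrative sketch rather than an excerpt from LAPACK: the scalars N, ALPHA, and BETA, the arrays A, B, C, X, and Y, and the leading dimensions LDA, LDB, and LDC are assumed to be declared and set up by the calling program.

*     Level 1:  y <- alpha*x + y
      CALL DAXPY( N, ALPHA, X, 1, Y, 1 )
*
*     Level 2:  y <- alpha*A*x + beta*y
      CALL DGEMV( 'N', N, N, ALPHA, A, LDA, X, 1, BETA, Y, 1 )
*
*     Level 3:  C <- alpha*A*B + beta*C
      CALL DGEMM( 'N', 'N', N, N, N, ALPHA, A, LDA, B, LDB,
     $            BETA, C, LDC )

Moving from the first call to the third, more and more of the computation is handed to a single subprogram, which is what allows an optimized implementation to manage vectorization, data movement, and parallelism internally.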
The Level 1 BLAS are used in LAPACK,
but for convenience rather than
for performance: they perform an insignificant fraction of the computation,
and they cannot achieve high efficiency on most modern supercomputers.
The Level 2 BLAS can achieve near-peak performance
on many vector processors,
such as a single processor of a CRAY X-MP or Y-MP, or Convex C-2 machine.
However, on other vector processors such as a CRAY-2 or an IBM 3090 VF,
the performance of the Level 2 BLAS is limited by the rate of data movement between different
levels of memory.
The Level 3 BLAS overcome this
limitation.
This third level of BLAS performs $O(n^3)$ floating-point operations on $O(n^2)$ data, whereas the Level 2 BLAS perform only $O(n^2)$ operations on $O(n^2)$ data.
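As a rough illustration (a back-of-the-envelope count, not a figure taken from Table 1): for matrices of order $n$, a matrix-matrix multiply performs about $2n^3$ floating-point operations on only about $3n^2$ matrix elements, so the work per data item grows like $n$, whereas a matrix-vector multiply performs about $2n^2$ operations on about $n^2$ data, only a couple of operations per item. This surplus of computation over data traffic is what allows a Level 3 implementation to keep blocks of the matrices in registers, cache, or local memory and reuse them.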
The Level 3 BLAS also allow us to exploit parallelism in a way that is transparent
to the software that calls them.
While the Level 2 BLAS offer some scope for exploiting parallelism, greater
scope is provided by the Level 3 BLAS, as Table 1 illustrates.
Table 1: Speed (Megaflops) of Level 2 and Level 3 BLAS Operations on a
CRAY Y-MP. All matrices are of order 500; U is upper triangular.