Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. Users who cannot obtain an efficient BLAS for their architecture may be able to create one by using a set of BLAS that requires only an efficient implementation of the matrix-matrix multiply BLAS routine xGEMM [35, 90], combined with an automatically generated, machine-specific, efficient implementation of xGEMM [16].
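To illustrate the idea of building other BLAS operations on top of xGEMM, the following minimal Python sketch expresses a symmetric rank-k update (the operation performed by xSYRK, restricted here to the lower triangle) purely as calls to a general matrix-matrix multiply on blocks. It is not the code of [35, 90] or [16]. The names gemm, transpose, and syrk_lower_via_gemm, the block size nb, and the list-of-lists storage are all assumptions made for illustration, and the plain-Python gemm stands in for whatever tuned xGEMM is available.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha*A*B + beta*C for matrices stored as lists of rows.

    Hypothetical stand-in for an optimized xGEMM; a real one would be
    supplied by the vendor or generated for the machine.
    """
    m, k, n = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)]
            for i in range(m)]


def transpose(M):
    """Return the transpose of M as a new list of rows."""
    return [list(col) for col in zip(*M)]


def syrk_lower_via_gemm(alpha, A, beta, C, nb=2):
    """Lower triangle of alpha*A*A^T + beta*C, computed block by block.

    Every block product is a gemm call. Entries strictly above the
    diagonal are copied from C unchanged. nb is an illustrative block size.
    """
    n = len(A)
    out = [row[:] for row in C]
    for j0 in range(0, n, nb):              # block column of C
        j1 = min(j0 + nb, n)
        Aj_t = transpose(A[j0:j1])          # k-by-(j1-j0) factor
        for i0 in range(j0, n, nb):         # block rows on or below the diagonal
            i1 = min(i0 + nb, n)
            c_blk = [row[j0:j1] for row in C[i0:i1]]
            blk = gemm(alpha, A[i0:i1], Aj_t, beta, c_blk)
            for i in range(i0, i1):
                for j in range(j0, j1):
                    if j <= i:              # keep only the lower triangle
                        out[i][j] = blk[i - i0][j - j0]
    return out
```

In this sketch all the arithmetic happens inside gemm, so replacing that one function with a fast implementation speeds up the whole update. A more careful version would avoid the redundant work on the diagonal blocks, where only half of each block product is kept.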
Users of one of the computers listed in this chapter should refer to Tables 5.2 and 5.3 to see which library we used for timing. For other machines, the computer vendor may be able to provide information about optimized BLAS for a specific computer.
A reference Fortran 77 implementation of the BLAS is available from the blas directory on netlib.
http://www.netlib.org/blas/blas.shar