As mentioned before, an efficient implementation of the BLAS masks the
effects of the processor memory hierarchy and frees the programmer from
tuning these basic kernels locally. The performance of the BLAS depends
heavily on the number of memory references per floating point operation.
This ratio naturally sorts the BLAS into three levels, where routines
belonging to the same level usually reach similar execution rates.
Consequently, as far as performance analysis is concerned, the BLAS
processes can be regarded as performing only three instructions, one for
each of the three BLAS levels. The execution times per floating point operation of
each of these instructions are then denoted by , with
.
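
To make the ratio of memory references per floating point operation concrete, the following minimal sketch evaluates it for one representative routine per BLAS level. The choice of routines (axpy, gemv, gemm), the square problem size n, and the function name `refs_per_flop` are assumptions made for illustration and are not taken from the text. The operation counts are the usual textbook leading-order estimates, not measurements.

```python
def refs_per_flop(n):
    """Return (level1, level2, level3) memory references per flop.

    Illustrative counts for square problems of order n (assumed, textbook
    estimates): one representative routine is used for each BLAS level.
    """
    # Level 1, axpy (y <- alpha*x + y): read x, read y, write y
    # -> 3n references for 2n flops
    level1 = (3 * n) / (2 * n)
    # Level 2, gemv (y <- A*x + y): read A, read x, read y, write y
    # -> n^2 + 3n references for 2n^2 flops
    level2 = (n * n + 3 * n) / (2 * n * n)
    # Level 3, gemm (C <- A*B + C): read A, B, C, write C
    # -> 4n^2 references for 2n^3 flops
    level3 = (4 * n * n) / (2 * n ** 3)
    return level1, level2, level3


if __name__ == "__main__":
    for n in (100, 1000):
        print(n, refs_per_flop(n))
```

Under these assumptions the ratio stays near 3/2 for Level 1, approaches 1/2 for Level 2, and decreases like 2/n for Level 3, which is consistent with routines of the same level reaching similar execution rates.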