As mentioned before, an efficient implementation of the BLAS hides the effects of the processor memory hierarchy and frees the programmer from tuning these basic kernels locally. The performance of a BLAS routine depends heavily on its number of memory references per floating-point operation. This ratio naturally sorts the BLAS into three levels, and routines belonging to the same level usually reach similar execution rates. Consequently, as far as performance analysis is concerned, the BLAS processes can be regarded as able to perform only three instructions, one for each of the three BLAS levels. The execution times per floating-point operation of each of these instructions are then denoted by , with .
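To illustrate how the ratio of memory references to floating-point operations separates the three levels, the following minimal Python sketch counts both quantities for one representative routine per level (axpy, gemv and gemm) on operands of order n. The choice of routines, the counting convention (each operand element is read once and each result element is written once, with no cache reuse modelled) and the function name `mem_refs_per_flop` are assumptions made for illustration and are not taken from the text.

```python
from fractions import Fraction


def mem_refs_per_flop(n):
    """Memory references per flop for one representative routine per BLAS level.

    Hypothetical helper: the counts assume square operands of order n and
    count every element read and every element written exactly once.
    """
    # Level 1, axpy (y <- alpha*x + y): read x, read y, write y; 2n flops.
    axpy = Fraction(3 * n, 2 * n)
    # Level 2, gemv (y <- A*x + y): read A, x, y and write y; 2n^2 flops.
    gemv = Fraction(n * n + 3 * n, 2 * n * n)
    # Level 3, gemm (C <- A*B + C): read A, B, C and write C; 2n^3 flops.
    gemm = Fraction(4 * n * n, 2 * n ** 3)
    return {"level1_axpy": axpy, "level2_gemv": gemv, "level3_gemm": gemm}


ratios = mem_refs_per_flop(1000)
for name, ratio in ratios.items():
    print(f"{name}: {float(ratio):.4f} memory references per flop")
```

Under these assumptions the ratio stays constant at 3/2 for Level 1, approaches 1/2 for Level 2, and decays as 2/n for Level 3. This is consistent with treating each level as a single instruction with its own execution time per floating-point operation.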