The performance kernel for the entire Level 3 BLAS is matrix multiply. Matrix multiply is written in terms of a lower-level building block that we call the L1 matmul. The L1 matmul is a special matrix multiply where the input dimensions are fixed at M = N = K = N_B, where the blocking factor N_B is chosen in order to maximize L1 cache reuse.
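For concreteness, the sketch below shows the basic shape of such a kernel in C. This is not ATLAS's generated code: the blocking factor NB is a hypothetical value (ATLAS selects it empirically per machine), the operand layout assumed here (A stored transposed so the inner loop is a contiguous dot product) is only one plausible choice, and real kernels layer register blocking, unrolling, and prefetching on top of this loop nest.

```c
/* Minimal sketch of an L1 matmul kernel: C += A * B, where A is stored
 * transposed and all three NB x NB blocks are contiguous in memory.
 * NB is a hypothetical blocking factor; in practice it is chosen so the
 * working set fits in the L1 cache. */
#define NB 40

static void l1_matmul(const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; j++)            /* column of C */
        for (int i = 0; i < NB; i++) {      /* row of C    */
            double cij = C[i + j*NB];
            for (int k = 0; k < NB; k++)    /* dot product over K */
                cij += A[k + i*NB] * B[k + j*NB];
            C[i + j*NB] = cij;
        }
}
```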
ATLAS actually has two different L1 matmul kernels: one for copied matrices, and one that operates directly on the user's matrices. For matrices of sufficient size, ATLAS copies the input matrix into block-major storage, in which the blocks operated on by the L1 matmul are contiguous in memory. This copy avoids unnecessary cache misses, cache conflicts, and TLB problems. For sufficiently small matrices, however, the cost of the data copy is prohibitive, so ATLAS also has kernels that operate on non-copied data. Without the copy to simplify the process, there are multiple non-copy kernels (for example, differing kernels for differing transpose settings). Since the non-copy kernels are typically used only for very small problems, and they are much more complex, ATLAS presently accepts contributed code only for the copy L1 matmul. For most problems, well over 98% of ATLAS's time is spent in the copy L1 matmul, so this restriction should not be much of a problem.
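As an illustration of the copy step (again a sketch rather than ATLAS's actual routine), a block-major copy might look like the following. It assumes the dimensions are multiples of NB purely to keep the example short; real code must also handle the partial blocks at the matrix edges.

```c
/* Sketch of copying a column-major matrix (leading dimension lda) into
 * block-major storage, where each NB x NB block is stored contiguously.
 * NB is the same hypothetical blocking factor as in the previous sketch. */
#define NB 40

static void copy_block_major(int M, int N, const double *A, int lda,
                             double *Ablk)
{
    for (int j = 0; j < N; j += NB)          /* walk the block columns */
        for (int i = 0; i < M; i += NB)      /* walk the block rows    */
            for (int jj = 0; jj < NB; jj++)  /* copy one NB x NB block */
                for (int ii = 0; ii < NB; ii++)
                    *Ablk++ = A[(i + ii) + (j + jj) * lda];
}
```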