The most important file in ATLAS/tune/blas/gemv/CASES is the primitive description file, <pre>cases.dsc. Each precision has its own description file (as indicated by the precision prefix <pre>), and this file describes all of the routines to time in order to find the best one. For instance, for double precision, we see:
speedy. cat CASES/dcases.dsc
9
8 0 0 ATL_gemvN_mm.c "R. Clint Whaley"
0 1 1 ATL_gemvN_1x1_1.c "R. Clint Whaley"
16 32 1 ATL_gemvN_1x1_1a.c "R. Clint Whaley"
0 4 2 ATL_gemvN_4x2_0.c "R. Clint Whaley"
0 4 4 ATL_gemvN_4x4_1.c "R. Clint Whaley"
0 8 4 ATL_gemvN_8x4_1.c "R. Clint Whaley"
0 16 2 ATL_gemvN_16x2_1.c "R. Clint Whaley"
0 16 4 ATL_gemvN_16x4_1.c "R. Clint Whaley"
16 32 4 ATL_gemvN_32x4_1.c "R. Clint Whaley"
6
8 0 0 ATL_gemvT_mm.c "R. Clint Whaley"
0 2 8 ATL_gemvT_2x8_0.c "R. Clint Whaley"
0 4 8 ATL_gemvT_4x8_1.c "R. Clint Whaley"
0 4 16 ATL_gemvT_4x16_1.c "R. Clint Whaley"
0 2 16 ATL_gemvT_2x16_1.c "R. Clint Whaley"
0 1 1 ATL_gemvT_1x1_1.c "R. Clint Whaley"
The first number (in this case 9) is the number of NoTranspose primitives to time. This is followed by that number of primitive lines describing those NoTrans primitives, and then we supply the number of Transpose primitives to time (in this example, 6), followed by that number of primitive lines describing the Transpose primitives.
As you can see, each line supplies three integers and a filename to the search routine. The filename is the filename of the primitive to time. The three integers supply information necessary in order for the higher level routines to do blocking.
This is the first piece of important information about these primitive routines: no blocking should be done in them. The appropriate blocking is done by higher level ATLAS routines. Most primitives employ some kind of loop unrolling, and when these higher level routines block in order to reuse vectors or matrices, it is important that this blocking does not conflict with the primitives' unrolling factors (for instance, if the primitive unrolls a given dimension by 8, but ATLAS blocks that dimension to 3, ATLAS would always call the cleanup code). So this is the information conveyed by these three integers.
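The conflict between blocking and unrolling described above can be made concrete with a small sketch in C. The two helper names below are hypothetical and are not part of ATLAS; they only illustrate the arithmetic, assuming a positive unrolling factor.

```c
#include <assert.h>

/* Number of iterations handled by the unrolled loop when a dimension of
 * length n is processed by a kernel that unrolls it by `unroll`.
 * The remaining n % unroll iterations fall to the cleanup code. */
int unrolled_iters(int n, int unroll)
{
    assert(n >= 0 && unroll > 0);
    return (n / unroll) * unroll;
}

/* Hypothetical helper: adjust a proposed block size so that it is a
 * multiple of the unrolling factor, and never smaller than one full
 * unrolled step. */
int align_block_to_unroll(int block, int unroll)
{
    assert(block > 0 && unroll > 0);
    if (block < unroll)
        return unroll;
    return (block / unroll) * unroll;
}
```

With an unrolling of 8 and a block of 3, `unrolled_iters(3, 8)` is 0: every iteration lands in the cleanup code, which is the situation the text warns about. A block chosen as a multiple of the unrolling factor keeps all the work in the unrolled loop.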
The form of a GEMV primitive line is:
<flag> <Yunroll> <Xunroll> <filename> "<author(s)>"
As mentioned previously, <filename> is the primitive source file. <Yunroll> is the unrolling used for the loop that loops over the Y vector, and <Xunroll> is the unrolling used for the loop that loops over the X vector. <flag> is a less obvious parameter, which is used to tell the search script about special properties of a kernel.
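As an illustration of the line format, the following sketch reads one primitive line into a record using `sscanf`. The struct, its field names, and the function name are hypothetical; this is not the parser ATLAS itself uses, and the 127-character limits are an arbitrary assumption.

```c
#include <stdio.h>

/* Illustrative record for one primitive line; names are hypothetical. */
typedef struct {
    int  flag;           /* <flag>    */
    int  yunroll;        /* <Yunroll> */
    int  xunroll;        /* <Xunroll> */
    char filename[128];  /* <filename> */
    char author[128];    /* "<author(s)>", stored without the quotes */
} PrimitiveLine;

/* Parse one line of the form
 *     <flag> <Yunroll> <Xunroll> <filename> "<author(s)>"
 * Returns 1 if all five fields were read, 0 otherwise (for example on
 * a line that holds only the primitive count). */
int parse_primitive_line(const char *line, PrimitiveLine *out)
{
    int n = sscanf(line, "%d %d %d %127s \"%127[^\"]\"",
                   &out->flag, &out->yunroll, &out->xunroll,
                   out->filename, out->author);
    return n == 5;
}
```

Applied to the line `16 32 4 ATL_gemvN_32x4_1.c "R. Clint Whaley"`, this yields a flag of 16, a Y unrolling of 32, and an X unrolling of 4. A line containing only a count, such as `9`, is rejected, so a caller can read the count separately and then parse that many primitive lines.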
It is assumed that the user has supplied an "inner-product" based GEMV implementation (i.e., an implementation which basically does <Yunroll> simultaneous dot products). This default state is expressed to the search by a <flag> value of 0. However, since the inner-product formulation of NoTranspose GEMV loops across the non-contiguous dimension of the matrix, some architectures need to employ an "outer-product" based NoTranspose GEMV (i.e., a GEMV which is performed by doing <Xunroll> simultaneous axpys). This is indicated by a <flag> value of 16. Finally, since ATLAS's GEMM has a code generator which allows it to achieve very good portable performance, it is always worth seeing how good a GEMV can be obtained by simply making the appropriate call to GEMM. A <flag> value of 8 indicates that this is what the kernel is doing.
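The difference between the two formulations is easiest to see in code. The sketch below computes y += A*x for a NoTranspose GEMV, assuming column-major storage with leading dimension lda (the usual BLAS convention). The function names and signatures are hypothetical and much simpler than a real ATLAS primitive; each version uses an unrolling of 2 only to keep the example short.

```c
#include <stddef.h>

/* Inner-product form: two simultaneous dot products (a Y unrolling of 2).
 * The inner loop walks along a row of A, so it strides through memory
 * by lda elements, i.e. across the non-contiguous dimension. */
void gemvN_inner_2(int M, int N, const double *A, int lda,
                   const double *x, double *y)
{
    int i, j;
    for (i = 0; i + 1 < M; i += 2) {
        double dot0 = 0.0, dot1 = 0.0;
        for (j = 0; j < N; j++) {
            const double xj = x[j];
            dot0 += A[i     + (size_t)j * lda] * xj;
            dot1 += A[i + 1 + (size_t)j * lda] * xj;
        }
        y[i]     += dot0;
        y[i + 1] += dot1;
    }
    for (; i < M; i++) {               /* cleanup for odd M */
        double dot = 0.0;
        for (j = 0; j < N; j++)
            dot += A[i + (size_t)j * lda] * x[j];
        y[i] += dot;
    }
}

/* Outer-product (axpy) form: two simultaneous axpys (an X unrolling of 2).
 * The inner loop walks down columns of A, which are contiguous. */
void gemvN_axpy_2(int M, int N, const double *A, int lda,
                  const double *x, double *y)
{
    int i, j;
    for (j = 0; j + 1 < N; j += 2) {
        const double *a0 = A + (size_t)j * lda;
        const double *a1 = a0 + lda;
        const double x0 = x[j], x1 = x[j + 1];
        for (i = 0; i < M; i++)
            y[i] += a0[i] * x0 + a1[i] * x1;
    }
    for (; j < N; j++) {               /* cleanup for odd N */
        const double *a0 = A + (size_t)j * lda;
        const double x0 = x[j];
        for (i = 0; i < M; i++)
            y[i] += a0[i] * x0;
    }
}
```

Both routines produce the same result; they differ only in which loop is innermost, and therefore in whether A is accessed contiguously. The axpy form reads and writes y once per pair of columns, while the inner-product form keeps its partial sums in scalars, which is one reason the better choice can depend on the architecture.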
In summary:
FLAG | MEANING |
0 | Normal (inner-product based primitive) |
8 | GEMM-based primitive |
16 | Outer-product or AXPY-based primitive (only valid for NoTranspose GEMV) |
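A tool that reads the description file might encode this table as follows. This is only a sketch: the enum and function names are hypothetical, and rejecting any value other than the three listed is an assumption, since the text does not say whether other values or combinations are accepted.

```c
#include <stddef.h>

/* Flag values taken from the table above; the names are illustrative. */
enum GemvFlag {
    GEMV_FLAG_NORMAL = 0,   /* inner-product based primitive */
    GEMV_FLAG_GEMM   = 8,   /* GEMM-based primitive */
    GEMV_FLAG_AXPY   = 16   /* outer-product (axpy-based) primitive */
};

/* Returns a short description of the flag, or NULL if the flag is not
 * one of the listed values or is not valid for the given variant.
 * is_transpose is nonzero for a Transpose GEMV primitive. */
const char *gemv_flag_meaning(int flag, int is_transpose)
{
    switch (flag) {
    case GEMV_FLAG_NORMAL:
        return "inner-product primitive";
    case GEMV_FLAG_GEMM:
        return "GEMM-based primitive";
    case GEMV_FLAG_AXPY:
        /* Only valid for NoTranspose GEMV. */
        return is_transpose ? NULL : "outer-product (axpy-based) primitive";
    default:
        return NULL;
    }
}
```

For example, `gemv_flag_meaning(16, 0)` describes an axpy-based NoTranspose primitive, while `gemv_flag_meaning(16, 1)` returns NULL because the table marks that flag as valid only for NoTranspose GEMV.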