The first line of each file is a comment line, and is ignored. The next
line indicates the number of user-contributed codes to search, and
each subsequent line supplies information about a given user-supplied
L1 matmul. The form of these lines is:
<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<author>"
<rout> and <author> are strings, and the rest of the
parameters are signed integers.
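As a rough illustration of this layout, the following Python sketch reads such a file: it skips the comment line, reads the count, and splits each descriptor line into its twelve fields. The function name `parse_index`, the field keys, and the use of `shlex` to keep the quoted author together are assumptions made for this sketch; they are not part of ATLAS, whose own search reads these files itself.

```python
import shlex

# Names of the ten integer fields, in the order given by the line format.
INT_FIELDS = ("id", "flag", "mb", "nb", "kb", "muladd", "lat", "mu", "nu", "ku")


def parse_index(text):
    """Parse the text of an index file into a list of descriptor dicts.

    Hypothetical helper for illustration only; not part of ATLAS.
    """
    lines = text.splitlines()
    # lines[0] is the comment line and is ignored.
    count = int(lines[1].split()[0])
    body = [ln for ln in lines[2:] if ln.strip()]
    if len(body) < count:
        raise ValueError(f"file promises {count} descriptors, found {len(body)}")

    descriptors = []
    for line in body[:count]:
        tokens = shlex.split(line)  # keeps the quoted author as one token
        if len(tokens) != len(INT_FIELDS) + 2:
            raise ValueError(f"expected 12 fields, got {len(tokens)}: {line!r}")
        entry = {name: int(tok) for name, tok in zip(INT_FIELDS, tokens)}
        entry["rout"] = tokens[10]
        entry["author"] = tokens[11]
        descriptors.append(entry)
    return descriptors


# Sample text reusing the three descriptors from the example in this section.
SAMPLE = """\
<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<Contributer>"
3
1 0 0 0 0 1 1 1 1 1 ATL_mm1x1x1.c "R. Clint Whaley"
2 0 1 1 1 1 1 1 1 1 ATL_mm1x1x1b.c "R. Clint Whaley"
3 0 1 1 8 1 1 1 1 4 ATL_mm2.c "R. Clint Whaley"
"""

if __name__ == "__main__":
    for d in parse_index(SAMPLE):
        print(d["id"], d["rout"], d["author"])
```

Splitting with `shlex` rather than plain whitespace is a design choice of the sketch: it is the simplest way to treat an author name containing spaces as a single field.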
The meanings of these parameters are:
<ID>
: Strictly positive integer that uniquely identifies this
descriptor line. ID must be unique only within a precision.
<flag>
: flag indicating special conditions. See table below.
<mb>, <nb>, <kb>
: Used to indicate a restriction on the input parameter M (N, K, resp.)
and its associated blocking factor MB (NB, KB, resp.).
If the value is zero, the internal routine handles any M; i.e.,
the loop limit is a runtime variable. If the value is negative, then
M = MB = -<mb> (i.e., the blocking factor cannot be
varied using a macro). If the value is positive, the blocking factor
can be varied by setting the appropriate macro
(MB, NB, KB, resp.), but the blocking factor must be
a multiple of the value. Therefore, setting <mb> = 4 indicates
that MB must be a multiple of 4, while setting it to 1 indicates
that MB is an arbitrary compile-time constant.
<muladd>
:
Set to zero if you are using separate multiply and add instructions, 1
otherwise. If you don't know the answer, put 1.
<lat>
:
Set to the latency you use between floating point instructions.
If you don't know the answer, put 1.
<mu>
:
Unrolling you are using for the M loop.
<nu>
:
Unrolling you are using for the N loop.
<ku>
:
Unrolling you are using for the K loop.
<rout>
:
The filename of the user-contributed routine, relative to the path
ATLAS/tune/blas/gemm/CASES. Maximum length 64 chars.
<author>
:
The name of the author or authors, enclosed in quotes.
Maximum length 64 chars.
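The sign convention for <mb>, <nb>, and <kb> can be restated as a small predicate. The Python sketch below is only an illustration of the rule as described above; the function name `blocking_allowed` is hypothetical and nothing like it is claimed to exist in ATLAS.

```python
def blocking_allowed(restriction, block):
    """Return True if blocking factor `block` is consistent with a
    <mb>/<nb>/<kb> value of `restriction`.

    Hypothetical helper restating the documented rule; not part of ATLAS.
    """
    if block <= 0:
        raise ValueError("blocking factor must be positive")
    if restriction == 0:
        # The loop limit is a runtime variable: any size is handled.
        return True
    if restriction < 0:
        # The blocking factor is fixed and cannot be varied by a macro.
        return block == -restriction
    # Positive: the macro may vary, but only in multiples of the value.
    return block % restriction == 0


if __name__ == "__main__":
    print(blocking_allowed(4, 40))    # multiple of 4
    print(blocking_allowed(4, 42))    # not a multiple of 4
    print(blocking_allowed(-32, 32))  # matches the fixed factor
```

Note that a value of 1 falls out of the positive case without special handling, since every positive integer is a multiple of 1.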
Table 1 summarizes the presently defined flag values.
Here's an example:
<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<Contributer>"
3
1 0 0 0 0 1 1 1 1 1 ATL_mm1x1x1.c "R. Clint Whaley"
2 0 1 1 1 1 1 1 1 1 ATL_mm1x1x1b.c "R. Clint Whaley"
3 0 1 1 8 1 1 1 1 4 ATL_mm2.c "R. Clint Whaley"
So, we have 3 user-supplied routines, all written by me. The first loops over the runtime parameters M, N, and K, while the following two routines loop over the cpp macros MB, NB, and KB. The third routine insists that KB be a multiple of 8. The first two routines don't unroll any of the loops, while the third unrolls the K loop to a depth of 4. They all use a combined muladd style of programming, and don't worry about latency.
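To make this reading of the example mechanical, the Python sketch below turns each of the three descriptors into a one-line summary. The records are typed in by hand from the example, and the names `EXAMPLE`, `describe_dim`, and `describe` are hypothetical helpers for this illustration only.

```python
# The three example descriptors, written out as already-parsed records.
EXAMPLE = [
    dict(id=1, flag=0, mb=0, nb=0, kb=0, muladd=1, lat=1,
         mu=1, nu=1, ku=1, rout="ATL_mm1x1x1.c"),
    dict(id=2, flag=0, mb=1, nb=1, kb=1, muladd=1, lat=1,
         mu=1, nu=1, ku=1, rout="ATL_mm1x1x1b.c"),
    dict(id=3, flag=0, mb=1, nb=1, kb=8, muladd=1, lat=1,
         mu=1, nu=1, ku=4, rout="ATL_mm2.c"),
]


def describe_dim(macro, value):
    """Summarize one of <mb>, <nb>, <kb>; `macro` is "MB", "NB", or "KB"."""
    dim = macro[0]  # "M", "N", or "K"
    if value == 0:
        return f"{dim} is a runtime loop limit"
    if value < 0:
        return f"{dim} = {macro} = {-value}, fixed"
    if value == 1:
        return f"{macro} is any compile-time constant"
    return f"{macro} must be a multiple of {value}"


def describe(d):
    """Build a one-line summary of a descriptor record."""
    dims = ", ".join(describe_dim(m, d[m.lower()]) for m in ("MB", "NB", "KB"))
    style = "combined muladd" if d["muladd"] else "separate multiply and add"
    return (f"{d['rout']}: {dims}; "
            f"unrolling mu={d['mu']} nu={d['nu']} ku={d['ku']}; "
            f"{style}, latency {d['lat']}")


if __name__ == "__main__":
    for record in EXAMPLE:
        print(describe(record))
```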