
Re: developer release 3.1.2



>dger only shows about +25%, due to the extreme cache
>pollution, I suppose.

I'd say it's a natural result of writes as opposed to reads: the main cost of
Level 2 operations is the matrix cost.  For GEMV, the matrix is only read,
which means it must be fetched from main memory into L1.  For GER, on the other
hand, it must be fetched, but then written back out as well, thus doubling
the traffic on the various levels of memory, which corresponds nicely to
your 50 -> 25 . . .

>You had mentioned trying a gemv based gemm for the complex in an
>earlier message.  As a lark, I just tried that for the single
>precision.  I seem to get about as good as the standard atlas gemm
>(~350 MFLOPS, sgemv was ~ 250 MFLOPS), but the mmsearch did not pick
>my routine. You had also indicated that this strategy was not the best
>way to go, most likely.  Could you elaborate a bit on what would
>likely be needed beyond a loop over gemv?  It seems as though one
>cannot count on longer contiguous vectors than kb no matter what one
>does. 

The usual trick is to use register blocking to perform multiple dot products
at once.  For instance, on a machine with 32 registers, ATLAS might use
a 4x4 section of C (requiring 16 registers), loading 4 regs with elts of A,
4 with elts of B.  You then get to reuse each elt of A against all four elts
of B that you have in registers, and vice versa.  With only the 8 registers of
the x86, ATLAS uses 2 regs for elts of C, 2 for elts of A, and one for elts of
B; if you have more, you should be able to do something similar . . .

If you want to see how this works in practice, you can scope the matrix
generator.  In ATLAS/tune/blas/gemm/<arch>, you can issue a command like:
   make mmcase pre=d muladd=0 lat=4 nb=40 mu=4 nu=4 ku=1
and then scope the generated file dmm.c.  muladd=0 means the machine has
a separate multiply and add unit, lat=4 means it takes 4 cycles for a
multiply or add to complete, nb=40 is the blocking factor, mu is the number of
registers used for A (also the rows of C), and nu is the number of registers
used for B (also the cols of C).  For muladd=0, lat registers are used for
keeping the floating point pipe full, so the total number of registers used
by this code is:  4 (B) + 4 (A) + 16 (C) + 4 (lat) = 28 . . .

Anyway, keeping ku=1 makes things easier to read, and you can vary mu and
nu to get an idea of various register blockings . . .

Cheers,
Clint