
Re: worth a flier



Julian,

I'm cc'ing atlas-comm; I think other developers might be interested in
our discussions on gemm . . .

Most of what I say is discussed in more detail in ATLAS/doc/atlas_over.ps,
so I will supplement my description with a page reference such as (pp 11-13)
for interested parties who don't understand my quick and dirty
explanations . . .

>Correct - IPC = (fp) instructions per cycle

Cool.  I needed to be sure you weren't also including the integer
component of gemm (loops and pointer arithmetic) in the IPC . . .

>OK, the prefetches for gemm can be done with a simple calculation within mm. But is it also
>true when using the mm kernel in other routines than gemm? 

Yes.  The other Level 3 BLAS all call the full gemm, not the kernel, and they
will therefore have the full gemm's access patterns . . .

>Interesting that ATLAS does not copy C! Without having taken a look at the ATLAS documentation,
>I ask myself whether ATLAS does not create worse cacheline conflicts with this strategy, exactly
>in the matrix whose blocks must stay in the caches for the longest time (blocks of C).
>But I am sure that you have tested whether the cost of copying C dominates the possible loss caused
>by cacheline conflicts...

I think we need some common vocabulary to discuss this.  Gemm needs to
be written with 3 loops.  If C is MxN, and A and B are MxK and KxN, then
let us call the loops over a particular dimension the M, N, and K loops.
In all ATLAS gemm implementations (both kernel and full), K is the
innermost loop.  For the kernel, where M=N=K=NB, this means that the
accesses of C total only 2*NB*NB, while the accesses of A and B are
NB*NB*NB each, so C is a low-order term in the cost.  Also, LAPACK is a
big user of the BLAS, and it calls gemm a lot as a rank-K update (think
of M=1000, N=1000, K=NB).  In this configuration, the cost of reading
and writing C is one of the main costs of GEMM, and if you copy, you
double it, which is a killer.  These two factors together (C being a
low-order term in the kernel, and copying killing the performance of the
rank-NB update) are why ATLAS does not guarantee copying C.  ATLAS
actually *does* buffer C into a contiguous NBxNB workspace when K is
sufficiently large . . .
     (pp 11-13)
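
To make the loop vocabulary concrete, here is a minimal C sketch of the
NB x NB block multiply with K innermost (illustrative only, not ATLAS's
actual generated kernel).  It computes C = A^T*B + C in column-major
storage, so both inputs are walked contiguously in the K loop; note that
C gets one load and one store per element (2*NB*NB accesses), while A
and B are each hit NB*NB*NB times:

   #define NB 30
   void mm_kernel(const double *A, const double *B, double *C)
   {
      int i, j, k;
      for (j=0; j < NB; j++)          /* N loop: cols of C and B */
         for (i=0; i < NB; i++)       /* M loop: rows of C, cols of A */
         {
            register double c0 = C[j*NB+i];   /* one load of C */
            for (k=0; k < NB; k++)            /* K loop innermost */
               c0 += A[i*NB+k] * B[j*NB+k];   /* contiguous in k */
            C[j*NB+i] = c0;                   /* one store of C */
         }
   }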

There are two L1 caching strategies that can yield good performance:
(1) one input matrix block plus a few rows/cols of the other input
matrix fill the cache, or (2) all 3 blocks (A, B, C) fill the cache
(pp. 18).  On the Athlon, ATLAS presently gets better performance with
the first of these, so C is essentially kept in L2 . . .
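
Just to put rough numbers on that (assuming the Athlon's 64KB L1 data
cache, i.e. 8192 doubles, and ignoring associativity/conflict effects):
strategy (2) requires 3*NB*NB <= 8192, capping NB at about 52, while
strategy (1) requires only NB*NB plus a handful of rows/cols, allowing
NB up into the 80s.  The actual choice of NB depends on many other
factors as well, of course . . .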

>Now something serious: I am not sure if I can adapt our algorithm to C=At*B+C (Fortran column-major) as
>required by ATLAS. Our algorithm was created for C=A*B+C (C row-major).

I assume all matrices are row-major?  If so, then in col-major terms, you
are doing C^T += B^T A^T, so we have a mismatch on access of C and the
second input matrix.  See below for more details . . .
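
For the bookkeeping: a row-major MxN matrix occupies the same memory as
its col-major NxM transpose, so your row-major C = A*B + C reads, in
col-major terms, as

   C^T = (A*B + C)^T = B^T*A^T + C^T,   i.e.   C^T += B^T A^T,

which transposes the output and the second input relative to the
C = A^T*B + C form the ATLAS kernel expects.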

>If you wonder why - we
>are crunching six scalar products simultaneously, which is perfectly suitable for the x86 fp stack,
>reduces L1 bandwidth pressure (~2 dploads/cyc -> ~1.2 dploads/cyc), yields shorter instructions, and brings us row-access on B
>(at least for six iterations. Copying B in a slightly different layout results in full row access of B).

OK, just to be sure, when you say scalar products, this is the Level 1 BLAS
operation DDOT?  If so, this is exactly what ATLAS's generated kernel is
doing: 5 simultaneous dot products (6 actually gives almost the same
performance).

For any gemm with K as the inner loop, you do some number of simultaneous
dot products.  If M or N is the inner loop, you do some number of
simultaneous AXPYs (the Level 1 operation AXPY).
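
In toy C (col-major, both routines computing C = A^T*B + C; illustrative
only, not ATLAS code), the two shapes look like this:

   /* K innermost: each c(i,j) is accumulated as a dot product (DDOT) */
   void kloop(int M, int N, int K, const double *A, const double *B,
              double *C)
   {
      int i, j, k;
      for (j=0; j < N; j++)
         for (i=0; i < M; i++)
         {
            register double c0 = C[j*M+i];
            for (k=0; k < K; k++) c0 += A[i*K+k]*B[j*K+k];
            C[j*M+i] = c0;
         }
   }

   /* M innermost: column j of C is updated by one AXPY per step of K
    * (note A is accessed at stride K in the inner loop here) */
   void mloop(int M, int N, int K, const double *A, const double *B,
              double *C)
   {
      int i, j, k;
      for (j=0; j < N; j++)
         for (k=0; k < K; k++)
         {
            register double b0 = B[j*K+k];
            for (i=0; i < M; i++) C[j*M+i] += A[i*K+k]*b0;
         }
   }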

So, with K as the inner loop, and matrices column major, I chose 
C = A^T B + C as the transpose settings because it means that the innermost
loop (K) accesses contiguous memory.  I take it you chose B^T (in col-major
terms) 'cause it puts your 6 elts of the dot product contiguous instead?
Do you think having to grab the elts of the dot product from
*A, A[NB], A[2*NB], A[3*NB], A[4*NB], A[5*NB] will prove a performance
killer (with the proviso that the next step of K will use the
subsequent elts)?
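
To be sure I have the picture right, here is a hypothetical C sketch (my
guess, not your code) of 6 simultaneous dot products under the ATLAS
settings: each step of K does one load of B and six loads of A at stride
NB, and with the roles of A and B swapped it reproduces the 222-access
count you quote below:

   #define NB 30
   void mm_6dot(const double *A, const double *B, double *C)
   {
      int i, j, k;
      for (j=0; j < NB; j++)
         for (i=0; i < NB; i += 6)       /* 6 dot products at once */
         {
            register double c0=C[j*NB+i],   c1=C[j*NB+i+1],
                            c2=C[j*NB+i+2], c3=C[j*NB+i+3],
                            c4=C[j*NB+i+4], c5=C[j*NB+i+5];
            const double *a = A + i*NB;   /* 6 adjacent cols of A */
            for (k=0; k < NB; k++)
            {
               register double b0 = B[j*NB+k];   /* 1 load of B per step */
               c0 += a[k]*b0;       c1 += a[k+NB]*b0;    /* 6 loads of A */
               c2 += a[k+2*NB]*b0;  c3 += a[k+3*NB]*b0;  /* at stride NB */
               c4 += a[k+4*NB]*b0;  c5 += a[k+5*NB]*b0;
            }
            C[j*NB+i  ] = c0;  C[j*NB+i+1] = c1;  C[j*NB+i+2] = c2;
            C[j*NB+i+3] = c3;  C[j*NB+i+4] = c4;  C[j*NB+i+5] = c5;
         }
   }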

>Six scalar products sequentially (30x30 matrices): 30*2(a+b)*6+6(c) loads + 6(c) stores => 372 mem accesses
>Six scalar products simultaneously               : 30(a)+30*6(b)+6(c) loads + 6(c) stores => 222 mem accesses

NB=30, eh?  ATLAS might be happier with NB=28, since that keeps it a
multiple of the cacheline size; on the other hand, with your prefetching,
perhaps it will not matter . . .

>In the case I don't find a solution to adapt the algorithms to ATLAS' requirements, is there a way that I can
>use C=A*B+C (row major) as kernel?

I'm afraid not, at least not without me rewriting quite a bit of the
ATLAS internals, and I'm not sure when I'd have the time to do that . . .
Even if you don't achieve the full speed of your row-major kernel, you
can surely improve on ATLAS's paltry 1.4; your 1.9 is maybe the upper
bound, but any improvement to ATLAS is still an improvement . . .

>At the end of this mail you find some results of Alex's full DGEMM

Very good indeed.  The impressive thing is that it looks like it would
make the Athlon beat the P4 (for ATLAS) clock-for-clock, in a fairly
decisive manner.  I'm sure you've seen my messages about it, but the P4
is presently ATLAS's flops king, with a roughly 2Gflop DGEMM on a
1.5GHz P4 . . .

Regards,
Clint