
To: ruheejih@calvados.zrz.tu-berlin.de
Subject: Re: worth a flier
From: R Clint Whaley <rwhaley@cs.utk.edu>
Date: Thu, 31 May 2001 12:01:32 -0400 (EDT)
Cc: atlas-comm@cs.utk.edu

Julian, I cc atlas-comm; I think other developers might be interested in our discussions on gemm. Most of what I say is discussed in more detail in ATLAS/doc/atlas_over.ps, so I will supplement my description with a page reference such as (pp 11-13) for interested parties who don't understand my quick and dirty explanations.

>Correct - IPC = (fp) instructions per cycle

Cool. Needed to be sure you weren't also including the integer component of gemm (loops and pointer arithmetic) in the IPC.

>OK, the prefetches for gemm can be done with a simple calculation within mm. But is it also
>true when using the mm kernel in other routines than gemm?

Yes. The other Level 3 BLAS all call the full gemm, not the kernel, and they will therefore have the full gemm's access patterns.

>Interesting that ATLAS does not copy C! Without having taken a look at the ATLAS documentation,
>I ask myself whether ATLAS does not create worse cacheline conflicts with this strategy, exactly
>in that matrix whose blocks must stay in the caches for the longest time (blocks of C).
>But I am sure that you have tested whether the cost of copying C dominates the possible loss caused
>by cacheline conflicts...

I think we need to get some common vocabulary to discuss this. Gemm needs to be written with 3 loops. If C is MxN, and A and B are MxK and KxN, then let us call the loops over a particular dimension the M, N, and K loops. In all ATLAS gemm implementations (both kernel and full), K is the innermost loop. For the kernel, where M=N=K=NB, this means that the access of C is only 2*NB*NB, while the accesses of A and B are NB*NB*NB each, so C is a low-order term in the cost. Also, LAPACK is a big user of the BLAS, and it calls gemm a lot as a rank-K update (think of M=1000, N=1000, K=NB).
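Just to pin down the vocabulary, here is a minimal C sketch of that kernel loop order (my own illustration, not ATLAS's generated code; NB is fixed at 4 purely for brevity, and the C = A^T B + C transpose setting is assumed so the K loop walks contiguous memory):

```c
enum { NB = 4 };  /* illustrative only; ATLAS tunes NB per platform */

/* Sketch of the kernel loop order: C += A^T * B on NBxNB column-major
 * blocks, with K as the innermost loop.  Each C(i,j) is read and
 * written once (2*NB*NB accesses total), while every element of A and
 * B is read NB times (NB^3 accesses each), so C is a low-order term.
 * Column-major indexing: X(i,j) = X[i + j*NB]. */
static void mm_kernel(const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; j++)             /* N loop */
        for (int i = 0; i < NB; i++) {       /* M loop */
            double sum = C[i + j*NB];
            for (int k = 0; k < NB; k++)     /* K loop: one dot product */
                sum += A[k + i*NB] * B[k + j*NB];  /* both contiguous in k */
            C[i + j*NB] = sum;
        }
}
```

Because A enters transposed, both inner-loop operands stream through memory with unit stride, which is the point of the transpose choice discussed further on.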
In this configuration, the cost of reading and writing C is one of the main costs of GEMM, and if you copy, you double it, which is a killer. These two factors together (C being a low-order term, and copying killing the performance of the rank-NB update) are why ATLAS does not guarantee copying C. ATLAS actually *does* buffer C into a contiguous NBxNB workspace when K is sufficiently large (pp 11-13).

There are two L1 caching strategies that can yield good performance: either one input matrix block plus a few rows/cols of the other input matrix fills the cache, or all 3 blocks (A, B, C) fill the cache (pp 18). On the Athlon, ATLAS presently gets better performance with the first of these, so C is essentially kept in L2.

>Now something serious: I am not sure if I can adapt our algorithm to C=At*B+C (Fortran column-major) as
>required by ATLAS. Our algorithm was created for C=A*B+C (C row-major).

I assume all matrices are row-major? If so, then in column-major terms you are doing C^T += B^T A^T, so we have a mismatch on the access of C and the second input matrix. See below for more details.

>If you wonder why - we
>are crunching six scalar products simultaneously, which is perfectly suitable for the x86 fp stack,
>reduces L1 bandwidth pressure (~2 dploads/cyc -> ~1.2 dploads/cyc), causes shorter instructions, and brings us row access on B
>(at least for six iterations. Copying B in a slightly different way results in full row access of B).

OK, just to be sure: when you say scalar products, is this the Level 1 BLAS operation DDOT? If so, this is exactly what ATLAS's generated kernel is doing: 5 simultaneous dot products (6 actually gives almost the same performance). For any gemm with K as the inner loop, you do some number of simultaneous dot products; if M or N is the inner loop, you do some number of simultaneous AXPYs (Level 1 operation AXPY).
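For concreteness, a sketch of the "six simultaneous dot products" idea in C (my own illustration of the technique, not Julian's or ATLAS's actual kernel; the block dimension of 30 is an assumption for the example). Six accumulators advance together through K, so the shared vector element is loaded once and reused six times, which is where the reduced L1 bandwidth pressure comes from:

```c
enum { KDIM = 30 };  /* assumed K (block) dimension */

/* c[j] += dot(a, column j of B) for j = 0..5, all six at once.
 * Each a[k] is loaded once per iteration and reused by all six
 * accumulators, so loads drop from 2 per multiply-add (one a, one b)
 * to roughly 7/6 per multiply-add versus running the six dot
 * products one after another.  B is column-major: B(k,j) = B[k + j*KDIM]. */
static void six_dot_products(const double *a, const double *B, double *c)
{
    for (int k = 0; k < KDIM; k++) {
        double ak = a[k];                /* one load, six reuses */
        c[0] += ak * B[k + 0*KDIM];
        c[1] += ak * B[k + 1*KDIM];
        c[2] += ak * B[k + 2*KDIM];
        c[3] += ak * B[k + 3*KDIM];
        c[4] += ak * B[k + 4*KDIM];
        c[5] += ak * B[k + 5*KDIM];
    }
}
```

Counting accesses for KDIM=30: 30 loads of a, 30*6 loads of B, plus 6 loads and 6 stores of c, i.e. 222 memory accesses, versus 372 for six sequential dot products.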
So, with K as the inner loop and matrices column-major, I chose C = A^T B + C as the transpose setting because it means that the innermost loop (K) accesses contiguous memory. I take it you chose B^T (in column-major terms) because it puts your 6 elements of the dot product contiguous instead? Do you think having to grab the elements of the dot product from *A, A[NB], A[2*NB], A[3*NB], A[4*NB], A[5*NB] will prove a performance killer (with the proviso that the next step of K will use the subsequent elements)?

>Six scalar products sequentially (30x30 matrices): 30*2(a+b)*6+6(c) loads + 6(c) stores => 372 mem accesses
>Six scalar products simultaneously: 30(a)+30*6(b)+6(c) loads + 6(c) stores => 222 mem accesses

NB=30, eh? ATLAS might be happier with NB=28, since that keeps it a multiple of the cacheline boundary; on the other hand, with your prefetching, perhaps it will not matter.

>In the case I don't find a solution to adapt the algorithm to ATLAS's requirements, is there a way that I can
>use C=A*B+C (row-major) as the kernel?

I'm afraid not, at least without me rewriting quite a bit of the ATLAS internals, and I'm not sure when I'd have the time to do that. Even if you don't achieve the full speed of your row-major kernel, you can surely improve on ATLAS's paltry 1.4; your 1.9 is maybe the upper bound, but any improvement to ATLAS is still an improvement.

>At the end of this mail you find some results of Alex's full DGEMM

Very good indeed. The impressive thing is that it looks like it would make the Athlon beat the P4 (for ATLAS) clock-for-clock, in a fairly decisive manner. I'm sure you've seen my messages about it, but presently the P4 is ATLAS's flops king, with a roughly 2 Gflop DGEMM on a 1.5 GHz P4.

Regards,
Clint

