[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: worth a flier


>Oh, I thought all the time that ATLAS' mm uses the straightforward "split one dot product into five and
>add the results" strategy which AMD recommended when we started writing our DGEMM. Can
>I ask you how you came to the idea to crunch five dot products simultaneously? Or does
>ATLAS find this solution automatically? Does ATLAS do it on all architectures in this way? 

ATLAS varies the number of dot products is does on differing platforms
(eg, 5 for an Athlon, 16 for a alpha, etc), but it always does it this way.
I've never written gemm to split one dot into five, 'cause this would
effectively reduce your K dimension by 5, thus raising all your pipeline
start/stop costs . . .

>I personally don't think one method is superior to the other. But if so, I can't say
>why because it is a hardware question. While I now a little about SDRAMs I don't know
>anything about SRAMs and which access pattern they like. I think the performance killer
>of ATLAS on Athlon is just the impossibility to create perfectly optimized kernel code from
>a higher programming language. One MUST use assembly to achieve near peak results.

I also think this access pattern should be fine (after all, I chose it thinking
it was the best).  If the number of dot products > than the number of
outstanding cache misses your L1 can serve, you may experience a decline;
otherwise, I think it should be faster . . .

>BTW, if both methods above are performance killers, there is one solution for it: One can
>create a perfect access pattern with a modified copying routine. Who says that the nb x nb
>blocks must look like they were cut out of the source matrix? One can bring the elements
>of one block into any wanted order to create a perfect access pattern...But this seems
>not suitable for ATLAS. 

This is what I thought ATLAS already did.  Again, to my pea-brain, 
C += A^T B *is* the optimal access pattern, which is why I copy to it . . .
> NB=30, eh?  ATLAS might be happier with NB=28, since that keeps it a multiple
> of the cacheline boundary; on the other hand, with your prefetching, perhaps
> it will not matter . . .
> >In the case I don't find a solution to adapt the algorithms to ATLAS' requirements, is there a way that I can
> >use C=A*B+C (row major) as kernel?

>I think I'll do it with 30, because I use 6 simultaneous dot products (sorry that I
>used the word "scalar product" in my last letter. In German we call it "Standard Skalarprodukt"
>or "Euklidisches Skalarprodukt" and I sometimes tranfer some German words into English).
>If using 28/6 I would waste some cycles (maybe I'll try both)

Right.  How about 36?  It's a multiple of 4 and 6, and should fit in cache.
Did you guys find that NB > 30 causes does not stay in cache (and if you did,
was it using contiguous A & B)?

>Programming the SSE2 unit is no fun. As long as Intel does not tell us how many
>uops each instruction generates, it is impossible to write good code. Once I have hit

I'm willing to conceed that SSE doesn't look like a time on the town, but I
would be hard pressed to find the assembly language programming technique
I would describe as "fun".  One of the many reasons I put in the user
contribution stuff was because I realized I just wasn't man enough to
face assembler coding on any scale :)

>Just to make sure that I did not think wrong (row-major vs col-major confuses me everytime).
>The code I will write must look like this:

How about we remove the ambiguity.  Here's essentially the code that ATLAS
uses, with some loop editing to make it readable:

   for (j=0; j < NB; j++)
      for (i=0; i < MB; i++)
         rC0_0 = *pC0;   rC1_0 = pC0[1]; rC2_0 = pC0[2];
         rC3_0 = pC0[3]; rC4_0 = pC0[4];
         for (k=0; k < KB; k++)
            rA0 = *pA0; rB0 = *pB0;
            rA1 = pA0[KB]; rA2 = pA0[KB2]; rA3 = pA0[KB3]; rA4 = pA0[KB4];
            pA0++; pB0++;

            rC0_0 += rA0 * rB0;
            rC1_0 += rA1 * rB0;
            rC2_0 += rA2 * rB0;
            rC3_0 += rA3 * rB0;
            rC4_0 += rA4 * rB0;
         *pC0 = rC0_0; pC0[1] = rC1_0; pC0[2] = rC2_0;
         pC0[3] = rC3_0; pC0[4] = rC4_0;
         pC0 += 5;
         pA0 += KB*MB - KB;
         pB0 -= KB;
      pC0 += ldc - MB;
      pA0 -= MB*NB;
      pB0 += KB;

Notice that, since I am operating on a column-major C, I choose the
N-M-K loop ordering, and grab multiple elts of A instead of B.  I think the
one you describe is M-N-K, with mult elts of B.  I think the one you describe
is better for row-major, and the one above is better for col-major, but that
they are exactly equivalent to produce . . .

>(1) can't prefetch "lost" cachelines of C
>(2) exchanging c's requires massive pointer arithmetics

(2) goes away for sure if you use the algorithm shown above.  I'm not exactly
sure what you mean by (1), so I don't know about it . . .