Re: worth a flier
Julian,
>Oh, I always thought that ATLAS' mm uses the straightforward "split one dot product
>into five and add the results" strategy which AMD recommended when we started writing
>our DGEMM. Can I ask how you came up with the idea of crunching five dot products
>simultaneously? Or does ATLAS find this solution automatically? Does ATLAS do it this
>way on all architectures?
ATLAS varies the number of dot products it does on differing platforms
(eg, 5 for an Athlon, 16 for an Alpha, etc), but it always does it this way.
I've never written gemm to split one dot product into five, 'cause this would
effectively reduce your K dimension by a factor of 5, thus raising all your
pipeline start/stop costs . . .
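To make the difference concrete, here's a quick sketch of the two strategies
(throwaway names of mine; this is not code lifted from ATLAS or from AMD's docs):
   /* (a) "split one dot product into five": one element of C, with the K
    *     loop divided across 5 accumulators to break the FPU dependency
    *     chain; each accumulator sees only KB/5 of the adds.  Assumes KB
    *     is a multiple of 5. */
   double dot_split5(const double *A, const double *B, const int KB)
   {
      double s0=0.0, s1=0.0, s2=0.0, s3=0.0, s4=0.0;
      int k;
      for (k=0; k < KB; k += 5)
      {
         s0 += A[k]   * B[k];
         s1 += A[k+1] * B[k+1];
         s2 += A[k+2] * B[k+2];
         s3 += A[k+3] * B[k+3];
         s4 += A[k+4] * B[k+4];
      }
      return s0 + s1 + s2 + s3 + s4;
   }

   /* (b) five whole dot products in flight at once, each keeping the
    *     full-length K loop, all sharing the one loaded B element
    *     (cf. the full loop nest further down; rows of the copied A
    *     block are contiguous with stride KB). */
   void dot_5way(const double *A, const double *B, const int KB, double *C)
   {
      double c0=0.0, c1=0.0, c2=0.0, c3=0.0, c4=0.0, b;
      int k;
      for (k=0; k < KB; k++)
      {
         b = B[k];
         c0 += A[k]      * b;
         c1 += A[k+KB]   * b;
         c2 += A[k+2*KB] * b;
         c3 += A[k+3*KB] * b;
         c4 += A[k+4*KB] * b;
      }
      C[0] += c0; C[1] += c1; C[2] += c2; C[3] += c3; C[4] += c4;
   }
Note that (b) also loads B only once for every five multiplies, where (a)
pays a load for every operand.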
>I personally don't think one method is superior to the other. But if one is, I can't
>say why, because it is a hardware question. While I know a little about SDRAMs, I don't
>know anything about SRAMs and which access patterns they like. I think the performance
>killer for ATLAS on the Athlon is simply the impossibility of creating perfectly
>optimized kernel code from a higher-level programming language. One MUST use assembly
>to achieve near-peak results.
I also think this access pattern should be fine (after all, I chose it thinking
it was the best). If the number of dot products is greater than the number of
outstanding cache misses your L1 can serve, you may see a decline;
otherwise, I think it should be faster . . .
>BTW, if both methods above are performance killers, there is one solution: one can
>create a perfect access pattern with a modified copying routine. Who says that the
>nb x nb blocks must look like they were cut out of the source matrix? One can bring
>the elements of one block into any order you want, to create a perfect access
>pattern . . . But this does not seem suitable for ATLAS.
This is what I thought ATLAS already did. Again, to my pea-brain,
C += A^T B *is* the optimal access pattern, which is why I copy to it . . .
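For reference, that copy amounts to something like the following (a sketch, not
ATLAS's actual copy routine; the name is mine):
   /* Copy an MB x KB block of column-major A (leading dimension lda) into
    * contiguous transposed workspace, so a(i,k) lands at pA[i*KB + k]:
    * exactly the stride-KB row access the kernel shown below wants. */
   void copy_block_AT(const int MB, const int KB,
                      const double *A, const int lda, double *pA)
   {
      int i, k;
      for (i=0; i < MB; i++)
         for (k=0; k < KB; k++)
            pA[i*KB + k] = A[i + k*lda];
   }
Once you're paying for the copy anyway, it is free to lay the elements out in
whatever order the kernel likes best.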
> > NB=30, eh? ATLAS might be happier with NB=28, since that keeps it a multiple
> > of the cacheline boundary; on the other hand, with your prefetching, perhaps
> > it will not matter . . .
> >
> > >In the case I don't find a solution to adapt the algorithms to ATLAS'
> > >requirements, is there a way that I can use C=A*B+C (row major) as kernel?
>I think I'll do it with 30, because I use 6 simultaneous dot products (sorry that I
>used the word "scalar product" in my last letter; in German we call it "Standard
>Skalarprodukt" or "Euklidisches Skalarprodukt", and I sometimes carry German terms
>over into English). If using 28/6 I would waste some cycles (maybe I'll try both).
Right. How about 36? It's a multiple of both 4 and 6, and should fit in cache.
Did you guys find that NB > 30 does not stay in cache (and if you did,
was that with contiguous A & B)?
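(For what it's worth, if all three NB x NB double precision blocks are to sit
in L1 at once: 3*36*36*8 = 31104 bytes, still under the Athlon's 64KB L1 data
cache, versus 3*30*30*8 = 21600 for NB=30 and 3*28*28*8 = 18816 for NB=28.)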
>Programming the SSE2 unit is no fun. As long as Intel does not tell us how many
>uops each instruction generates, it is impossible to write good code. Once I have hit
I'm willing to concede that SSE doesn't look like a night on the town, but I
would be hard pressed to name an assembly language programming technique
I would describe as "fun". One of the many reasons I put in the user
contribution stuff was that I realized I just wasn't man enough to
face assembler coding on any scale :)
>Just to make sure I'm not getting this backwards (row-major vs. col-major confuses
>me every time), the code I will write must look like this:
How about we remove the ambiguity? Here's essentially the code that ATLAS
uses, with some loop editing to make it readable:
   /* KB2 = 2*KB, KB3 = 3*KB, KB4 = 4*KB.  C is MB x NB, column-major, with
    * leading dimension ldc; the copied A block is stored transposed (a(i,k)
    * at pA0[i*KB + k]) and the copied B block column-major (b(k,j) at
    * pB0[j*KB + k]).  MB is assumed to be a multiple of 5. */
   for (j=0; j < NB; j++)
   {
      for (i=0; i < MB; i += 5)       /* 5 dot products per pass */
      {
         rC0_0 = *pC0; rC1_0 = pC0[1]; rC2_0 = pC0[2];
         rC3_0 = pC0[3]; rC4_0 = pC0[4];
         for (k=0; k < KB; k++)
         {
            rA0 = *pA0; rB0 = *pB0;
            rA1 = pA0[KB]; rA2 = pA0[KB2]; rA3 = pA0[KB3]; rA4 = pA0[KB4];
            pA0++; pB0++;
            rC0_0 += rA0 * rB0;
            rC1_0 += rA1 * rB0;
            rC2_0 += rA2 * rB0;
            rC3_0 += rA3 * rB0;
            rC4_0 += rA4 * rB0;
         }
         *pC0 = rC0_0; pC0[1] = rC1_0; pC0[2] = rC2_0;
         pC0[3] = rC3_0; pC0[4] = rC4_0;
         pC0 += 5;
         pA0 += KB4;                  /* k loop walked row i; skip rows i+1..i+4 */
         pB0 -= KB;                   /* rewind B to the top of column j */
      }
      pC0 += ldc - MB;                /* next column of C */
      pA0 -= MB*KB;                   /* rewind A to its start */
      pB0 += KB;                      /* next column of B */
   }
Notice that, since I am operating on a column-major C, I choose the
N-M-K loop ordering and grab multiple elements of A instead of B. The one
you describe is M-N-K, with multiple elements of B. I think yours is
better for row-major and the one above is better for col-major, but that
they produce exactly equivalent results . . .
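If it helps nail it down, here's my guess at that M-N-K variant (again a
sketch using my names, not your actual code); the copied B block is laid out
with b(k,j) at pB0[j*KB + k], the mirror image of the A copy above:
   for (i=0; i < MB; i++)
   {
      for (j=0; j < NB; j += 5)       /* 5 elts of one row of C */
      {
         rC0 = *pC0; rC1 = pC0[1]; rC2 = pC0[2]; rC3 = pC0[3]; rC4 = pC0[4];
         for (k=0; k < KB; k++)
         {
            rA0 = pA0[k];             /* one elt of A ... */
            rC0 += rA0 * *pB0;        /* ... times 5 elts of B */
            rC1 += rA0 * pB0[KB];
            rC2 += rA0 * pB0[KB2];
            rC3 += rA0 * pB0[KB3];
            rC4 += rA0 * pB0[KB4];
            pB0++;
         }
         *pC0 = rC0; pC0[1] = rC1; pC0[2] = rC2; pC0[3] = rC3; pC0[4] = rC4;
         pC0 += 5;
         pB0 += KB4;                  /* k loop walked column j; jump to column j+5 */
      }
      pC0 += ldc - NB;                /* next row of row-major C */
      pA0 += KB;                      /* next row of A */
      pB0 -= KB*NB;                   /* rewind B to its start */
   }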
>(1) can't prefetch "lost" cachelines of C
>(2) exchanging c's requires massive pointer arithmetic
(2) goes away for sure if you use the algorithm shown above. I'm not exactly
sure what you mean by (1), so I don't know about it . . .
Cheers,
Clint