
Re: Intel SSE and Atlas.. some questions.


>An easy question to start the ball rolling... is there something
>special I have to do to to get the top level make install to search
>the code I have specified in the scases.dsc file? I tried an install
>but it seems to have ignored my code. How can I check that my code is
>being timed during the installation?

Modification of scases.dsc should be enough.  However, the install process
runs the tester, so if the kernel doesn't pass all tests (for instance, if
you supply only the BETA=1 case for now), it will be rejected.  To see if it
runs your case, scope the output file in 
where # is the entry number, starting from 1, in scases.dsc (i.e., if your
scases.dsc contains only one kernel, there should be a file 
which has the speed, in MFLOPS, that your kernel clocked in at).

Perhaps the problem is that ATLAS is calling with a crappy block size for you
(say 56), and ATLAS's sgemm is faster than yours for such a block size?
Right now, the search runs as normal, finds the best generated case, and then
tries the user cases using an NB near ATLAS's (unless you insist on an NB in
the file).  If your code beats ATLAS's, it will then try more blocking
factors.  I don't think this is the case with your code's speed, but I include
the explanation just in case . . .

Anyway, the matmul user kernel search is far from well-tested, so there is
certainly the chance I've screwed it up.  Let me know what you find . . .

>> cleanup exclusively), and I never found a platform where larger blocking
>> factors were significantly better . . .
>Ahh.. well... my timings so far look like:
>make ummcase nb=64 pre=s mmrout=../CASES/emmerald/matmult.c beta=0 
>sNB=64, ldc=64, mu=4, nu=4, ku=1, lat=4: time=0.210000, mflop=696.554057
>make ummcase nb=160 pre=s mmrout=../CASES/emmerald/matmult.c beta=0 
>sNB=160, ldc=160, mu=4, nu=4, ku=1, lat=4: time=0.170000, mflop=819.200000 

>make ummcase nb=320 pre=s mmrout=../CASES/emmerald/matmult.c beta=0
>sNB=320, ldc=320, mu=4, nu=4, ku=1, lat=4: time=0.160000, mflop=819.200000

>So I would make a case for allowing larger kernels... perhaps a
>question asked as part of the search set up?

I agree we may need some larger blocking factors, but 320 is clearly too
large.  The majority of calls to gemm would be in the cleanup for such a
code (i.e., even gemm calls from a large factorization, say on a problem of
size 2000, would still be in cleanup for the vast majority of the algorithm).
Also, you can slow down the copy quite a bit if the number of columns in
your block does not fit into the TLB . . .

>> At the moment, ATLAS will use its own cleanup.  However, I defined
>> ....
>> door, and I figured if no one ever contributed any GEMM kernels, or
>> using ATLAS's cleanup wasn't too bad, there was no need for me to
>> spend any more time in ATLAS install hell . . .
>Good point. I think having SSE clean up would improve things a lot in
>my case though.

I agree, and will add the required functionality as soon as I can.  Once
your code is ready, that will provide me with a good test case . . .

>> Can you explain a little bit about your data format?  I thought Transpose,
>> Notranspose block-major was about as pointer-efficient storage as you
>> could do for any matmul with K as the inner loop.  You aren't, by chance,
>> using M or N as your innermost loop?
>You asked for it....
>K is the inner loop. 
>The data format is very close to Transpose, Notranspose block-major.
>In fact it is, but with small, non-square, blocks.
>The A matrix is used in the same way (so I am not re-buffering it),
>but the B matrix is rebuffered as follows:
>A (k X 5) block of B is sucked out of the notrans column-major block
>that ATLAS supplies to the sse matmult. In memory, (using B{row}{column}
>notation) this looks like:
>B00 B10 B20 B30 B40 B50 B60 B70 ... Bk0 
>B01 B11 B21 B31 B41 B51 B61 B71 ... Bk1 
>B02 B12 B22 B32 B42 B52 B62 B72 ... Bk2 
>B03 B13 B23 B33 B43 B53 B63 B73 ... Bk3 
>B04 B14 B24 B34 B44 B54 B64 B74 ... Bk4 
>I rebuffer this to be optimal for pointer arithmetic when doing 5 
>simultaneous inner products along the k dimension:
>B00 B10 B20 B30 
>B01 B11 B21 B31 
>B02 B12 B22 B32 
>B03 B13 B23 B33 
>B04 B14 B24 B34
>B40 B50 B60 B70 
>B41 B51 B61 B71 
>B42 B52 B62 B72 
>B43 B53 B63 B73 
>B44 B54 B64 B74
>This is a re-blocking for registers in a block-major NoTrans 4X5
>format.  The K dimension (4) is determined by the 
>degree of parallelisation in the SSE instructions. The M dimension (5)
>is determined by the number of SSE registers (8). The innermost 
>assembler loop takes 4 of these blocks (4X unrolling along K), and a
>1X16 row of the A matrix. So the little k above used in the
>rebuffering is a multiple of 16.
>The best situation would be ATLAS directly calling this assembler
>kernel with contiguous 1X16 panels of A, and contiguous 4X5
>panels of B. I had to write the code to form a square panel of
>dimension NB to the format above.
>I can almost optimally re-write the asm to use NBX5 panels of B
>and avoid re-buffering B, but this blows out the limit of 
>8-bit indexing instructions in the Intel ISA (actually an NB=16 block
>will work, but 32+ will not). Basically the rebuffering
>is all for the want of a bit!
>Anyway, I hope that description of the blood and guts of Emmerald
>makes some sense and you can see the problem with the rebuffering. I
>will see how it goes with NB=16 after re-writing the asm core. That
>might be sufficient..

Interesting.  I will need to reread this several times to fully
understand it (I will try to take the required time as soon as I
finish the gigantic programming problem I'm currently in the middle of).
My guess is that the SSE instructions themselves operate on these
small vectors, necessitating the contiguous storage of B's dot product
contribution?  I guess you realize you don't need extra pointers to access 
separate columns once in ATLAS's block-major format?  I.e., B[NB] gives
you column 2, where NB is a macro that resolves, for instance, to 64 (so
B[NB] is B[64]); the additional costs have to do with the elements not being
contiguous (which is usually not a big deal, since you are about to use the
elements in a few moments).

My own temptation is to offer the two choices: kernel only, or drop-in gemm.
The mixed case of changing some internal routines and providing the kernel
seems problematic, though I wouldn't rule it out.