
Re: sgemm questions

>>OK, I need to get a new developer release out with the cleanup stuff in.
>>That is where this need for MB != NB != KB is coming from.  The kernel you
>>already have is fine for the non-cleanup case, and I would recommend leaving
>>it alone for that.  If you have the time, I think it would be worth doing to
>>produce a second kernel, modified from the first, so that M and N are passed
>>in as parameters to the routine, rather than fixed at MB and NB.  This has
>>not caused serious slowdown on any platform I've tested so far (since these
>>dimensions do not affect lda/ldb and the innermost loop), and it allows the
>>routine to be used for M and N loop cleanup without compiling NB different
>>instantiations of the routine (leading to code bloat, and reducing performance
>>through repetitive instruction load).  For the K-cleanup, it *is* often necessary
>>to use compile-time KB, since it controls lda and ldb, as well as the inner
>>loop, especially on Intels, where the inner loop needs heavy unrolling.  So,
>>a second kernel taking M & N as input parameters, and then probably fixing K to
>>KB would be a good cleanup (obviously, if it didn't kill performance, taking
>>K as an input parameter would be great, but I don't think it is doable).
>>The idea would be to use the input file's flag variable to indicate your first
>>routine is to be used for kernel only, and the second to be used for cleanup

>OK, it seems as though if we can insist that KB be a multiple of 4 (2
>for complex), we can even input kb without too much trouble.  

The first thing to know is that there is no point in making K an input
parameter unless you also use lda and ldb as input parameters (since they
are set to KB).  In my own Pentium codes, not unrolling the K loop and/or not
having a known lda/ldb seems to be a performance killer, such that it is
worthwhile compiling NB different K-cleanup cases.  If this is not true
for you, then taking K, lda, and ldb as run-time inputs is really the
way to go . . .
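
A minimal sketch of the cleanup kernel described above, with M and N taken
at run time while K, lda, and ldb stay fixed at a compile-time KB.  The
function name, the KB value, and the C = A^T * B + beta*C convention are
assumptions made for illustration, not the actual ATLAS interface:

```c
#include <assert.h>

#ifndef KB
#define KB 8   /* assumed compile-time K block size (illustrative value) */
#endif

/*
 * Hypothetical M/N cleanup kernel: C (M x N) = A^T * B + beta * C.
 * A holds M columns and B holds N columns, each of length KB, so
 * lda == ldb == KB is known at compile time and the K loop has a fixed
 * trip count the compiler can unroll.  Only M and N vary at run time.
 */
static void cleanup_mn(int M, int N, const float *A, const float *B,
                       float beta, float *C, int ldc)
{
    assert(M >= 0 && N >= 0 && ldc >= M);
    for (int j = 0; j < N; j++) {
        const float *b = B + j * KB;          /* column j of B */
        for (int i = 0; i < M; i++) {
            const float *a = A + i * KB;      /* column i of A */
            float sum = 0.0f;
            for (int k = 0; k < KB; k++)      /* fixed-length inner loop */
                sum += a[k] * b[k];
            C[i + j * ldc] = beta * C[i + j * ldc] + sum;
        }
    }
}
```

Because M and N only bound the two outer loops, the same routine can serve
both M-cleanup and N-cleanup without a separate instantiation per size.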

>What I'm unsure of is whether to
>write a 1x1xkb cleanup kernel, or something that can branch from
>2x1xkb, 1x2xkb, to 1x1xkb.  Do you think this is worth it?  

I always wrestle with this very question, but in this case I think a 1x1xkb
kernel is enough for M-loop cleanup, since it is not going to be a big part
of the computation for anything other than M=2.  Note that the user cleanup
only has *one*
non-fixed dimension at a time.  Cases where two or more dimensions are
less than NB are always handled by ATLAS (this is essentially needed to
avoid having NB^3 user contributed kernels to test and time, which have very
little effect on performance).  So for each cleanup, remember that 
if MB != NB, KB==NB, for instance.

If you find it easier, you can write two different routines, one for M-cleanup
and one for K-cleanup.  N-cleanup could be handled by the M-cleanup if you
take N as an input parameter (since it is not unrolled anyway), or you
could have N as an input parameter to your present full kernel . . .

>How will
>kb%4!=0 work?  It can be done of course, but the normal fpu needs to
>be used in this case, and there may be issues of getting into and out
>of xmm mode.

Well, if your K-cleanup code works only for kb%4 == 0, the generated code will
be used for all other cases.  In this case, 3/4 of K-cleanup will be handled
by slow generated code, assuming input size is random . . .

As for making your code handle K%4 != 0, this should not be that difficult
(says the man with no understanding of what is really happening :); It
seems to me from your comments above that the problem is you foresee needing
to switch from SSE to regular mode in the inner loop, a definite no-no.
I see two ways around this: most simply, load your SSE vectors with zeros
in order to pad the K cleanup up to a multiple of 4, and do what you are
presently doing with SSE.  The extra flops shouldn't matter for crap until
the K-loop is
very short, where you are unlikely to get any win with SSE using K as the
inner loop anyway.  Alternative number two is do your present code, leaving
a 0-3 K remainder, and call a crap implementation to finish off the K loop.
This will require two passes over C, so it is not as good as the first option,
but probably still beats using the 1.8 times slower generated cleanup . . .
If necessary, you can supply two K cleanups: a fast one handling K % 4 == 0, and
a slower one handling truly arbitrary cleanups; the cleanup detector is
smart enough to do that . . .

Anyway, unless SSE is different than normal fpu, I think we'll need to take
KB as a compile-time variable, in which case the padding of the vector
with zeros should be trivial and have no runtime overhead (no need to do
if, 'cause you can do #if), and then the cleanup stuff will compile your
K-cleanup NB times . . .
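
The zero-padding idea with a compile-time KB might look roughly like the
sketch below.  It is plain C that emulates a 4-wide SSE register with a
float[4] rather than real SSE code, and the names dot_kb and KB_REM are
made up for illustration:

```c
#ifndef KB
#define KB 6   /* assumed compile-time K, deliberately not a multiple of 4 */
#endif
#define KB_REM (KB % 4)   /* leftover K iterations after the 4-wide loop */

/*
 * Dot product of two length-KB columns, four lanes at a time.
 * The tail is padded with zeros so the last step is still a full
 * 4-wide multiply-add; the padded lanes contribute 0 to the sum.
 */
static float dot_kb(const float *a, const float *b)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int k;

    for (k = 0; k + 4 <= KB; k += 4)          /* full 4-wide steps */
        for (int l = 0; l < 4; l++)
            acc[l] += a[k + l] * b[k + l];

#if KB_REM != 0   /* decided by the preprocessor: no run-time "if" */
    {
        float va[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        float vb[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int l = 0; l < KB_REM; l++) {    /* load the 1-3 real values */
            va[l] = a[k + l];
            vb[l] = b[k + l];
        }
        for (int l = 0; l < 4; l++)           /* same 4-wide operation */
            acc[l] += va[l] * vb[l];
    }
#endif

    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

When KB is a multiple of 4 the padded tail is not compiled at all, which is
the point of using #if instead of if.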

I think all of this will be clearer when I get a developer release out with
this stuff in it, so you can play with it directly, and I am hoping to do
that this week . . .

>> Normally, you can leave the cleanup to ATLAS's generated cleanup, but your
>> kernel is 1.8 times faster than the generated code, so cleanup could really
>> hurt your performance . . .
>OK.  What percent of peak is this, BTW?

OK, I'm getting around 760Mflop for your kernel.  The generated kernel
gets around 420Mflop.  Both these numbers are for the kernel only (i.e.,
this is not the speed of the full gemm, which may be less due to overheads,
or greater due to CacheEdge's L2 blocking), on my 500MHz PIII laptop.
So, you are getting 152% of theoretical peak if you were using the normal
FPU, but only 38% of SSE peak.  Of course, I doubt the theoretical peak of
memory fetch will support the SSE peak MFLOP rate, so even a perfect flop code
will not get the 2Gflop peak . . .
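
The percentages above follow from the figures given, assuming a peak of
1 flop per cycle for the normal FPU (500 Mflop at 500MHz) and 4 flops per
cycle for SSE (the 2Gflop peak).  A small sketch of the arithmetic, with a
helper name chosen only for illustration:

```c
/*
 * Percentage of theoretical peak, where
 * peak (Mflop) = clock (MHz) * flops per cycle.
 * The flops-per-cycle values passed in are assumptions, see above.
 */
static double percent_of_peak(double mflops, double clock_mhz,
                              double flops_per_cycle)
{
    return 100.0 * mflops / (clock_mhz * flops_per_cycle);
}
```

With these assumptions, percent_of_peak(760.0, 500.0, 1.0) is 152 and
percent_of_peak(760.0, 500.0, 4.0) is 38.  The same figures give
760/420, roughly the 1.8 times speedup over the generated kernel.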

However, your code appears to be fpu-bound.  I ran your kernel with
no cache flushing (this means A will be pretty much kept in L1, and B and
C will be either in L1 or L2), and the code only went up to 800Mflop . . .

If you want to see what happens with no cache flushing, simply add:
to your ummcase call . . .
I use this trick to find out how good my pipelining is, since it should
drastically reduce memory costs.  Of course, it is then not a good indicator
of true GEMM performance, so that's why ATLAS normally flushes the caches
in the way it does (a routine optimized for no cache flushing is not
necessarily optimized for real use) . . .