Re: sgemm questions



Camm,

>lda=ldb=KB, yes, that's right.  It appears that I've also assumed that
>all dimensions are KB, i.e. MB=NB=KB, as described in the doc for the
>L1 kernel case.  It would be trivial to separate out MB and NB if
>needed -- just two macros need to be changed.  Please let me know if
>you want this edit.  The input array dimension args are ignored.

OK, I need to get a new developer release out with the cleanup stuff in.
That is where this need for MB != NB != KB is coming from.  The kernel you
already have is fine for the non-cleanup case, and I would recommend leaving
it alone for that.  If you have the time, I think it would be worthwhile to
produce a second kernel, modified from the first, so that M and N are passed
in as parameters to the routine, rather than fixed at MB and NB.  This has
not caused serious slowdown on any platform I've tested so far (since these
dimensions do not affect lda/ldb or the innermost loop), and it allows the
routine to be used for M and N loop cleanup without compiling NB different
instantiations of the routine (leading to code bloat, and reducing performance
through repetitive instruction load).  For the K-cleanup, it *is* often necessary
to use compile-time KB, since it controls lda and ldb, as well as the inner
loop, especially on Intels, where the inner loop needs heavy unrolling.  So,
a second kernel taking M & N as input parameters, and then probably fixing K to
KB would be a good cleanup (obviously, if it didn't kill performance, taking
K as an input parameter would be great, but I don't think it is doable).
The idea would be to use the input file's flag variable to indicate your first
routine is to be used for kernel only, and the second to be used for cleanup
only.
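
As a rough sketch of the shape such a cleanup kernel could take (this is
not the actual ATLAS kernel interface; the function name, argument order,
the small KB value, and the C = A^T*B + beta*C convention are all assumptions
made for illustration), M and N arrive at runtime while KB stays a
compile-time constant that fixes lda, ldb, and the inner loop's trip count:

```c
#include <stddef.h>

/* Assumption: KB is normally fixed at compile time (e.g. -DKB=...).
   It is tiny here purely for illustration. */
#ifndef KB
#define KB 4
#endif

/* Hypothetical cleanup kernel: C = A^T * B + beta * C.
   A holds M columns of length KB (lda = KB, fixed at compile time).
   B holds N columns of length KB (ldb = KB, fixed at compile time).
   C is M x N, column-major, with runtime leading dimension ldc.
   Only the loop bounds M and N are runtime values. */
void cleanup_mn_kernel(int M, int N, const float *A, const float *B,
                       float beta, float *C, int ldc)
{
    for (int j = 0; j < N; j++) {
        const float *b = B + (size_t)j * KB;      /* column j of B */
        for (int i = 0; i < M; i++) {
            const float *a = A + (size_t)i * KB;  /* column i of A */
            float sum = 0.0f;
            /* Trip count is a compile-time constant, so the compiler
               (or a hand-written version) can fully unroll this loop. */
            for (int k = 0; k < KB; k++)
                sum += a[k] * b[k];
            C[i + (size_t)j * ldc] = beta * C[i + (size_t)j * ldc] + sum;
        }
    }
}
```

Because M and N only bound the two outer loops, one compiled copy of this
routine can serve every M and N remainder, while the stride and the
innermost loop remain constants.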

You can still insist that M be a multiple of 2, for instance, though this
will mean that your cleanup will only be called when M%2 == 0, and the 
generated cleanup will be called otherwise . . .
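
The selection this implies can be pictured roughly as below.  This is only
an illustration of the rule; the enum, the function name, and the
m_multiple parameter are hypothetical, and ATLAS's real dispatch is driven
by its own build machinery rather than by a function like this:

```c
/* Hypothetical names, for illustration only. */
typedef enum { USER_CLEANUP, GENERATED_CLEANUP } cleanup_choice;

/* The user-supplied cleanup handles only those M that are a multiple
   of m_multiple (2 in the example above); any other M falls back to
   the generated cleanup. */
cleanup_choice pick_m_cleanup(int M, int m_multiple)
{
    return (M % m_multiple == 0) ? USER_CLEANUP : GENERATED_CLEANUP;
}
```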

Normally, you can leave the cleanup to ATLAS's generated cleanup, but your
kernel is 1.8 times faster than the generated code, so cleanup could really
hurt your performance . . .

Thanks,
Clint