
Re: sgemm questions

Hi Clint!

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Camm,
> >lda=ldb=KB, yes, that's right.  It appears that I've also assumed that
> >all dimensions are KB, i.e. MB=NB=KB, as described in the doc for the
> >L1 kernel case.  It would be trivial to separate out MB and NB if
> >needed -- just two macros need to be changed.  Please let me know if
> >you want this edit.  The input array dimension args are ignored.
> OK, I need to get a new developer release out with the cleanup stuff in.
> That is where this need for MB != NB != KB is coming from.  The kernel you
> already have is fine for the non-cleanup case, and I would recommend leaving
> it alone for that.  If you have the time, I think it would be worth doing to
> produce a second kernel, modified from the first, so that M and N are passed
> in as parameters to the routine, rather than fixed at MB and NB.  This has
> not caused serious slowdown on any platform I've tested so far (since these
> dimensions do not affect lda/ldb or the innermost loop), and it allows the
> routine to be used for M and N loop cleanup without compiling NB different
> instantiations of the routine (leading to code bloat, and reducing performance
> through repetitive instruction load).  For the K-cleanup, it *is* often necessary
> to use compile-time KB, since it controls lda and ldb, as well as the inner
> loop, especially on Intels, where the inner loop needs heavy unrolling.  So,
> a second kernel taking M & N as input parameters, and then probably fixing K to
> KB would be a good cleanup (obviously, if it didn't kill performance, taking
> K as an input parameter would be great, but I don't think it is doable).
> The idea would be to use the input file's flag variable to indicate your first
> routine is to be used for kernel only, and the second to be used for cleanup
> only.
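
For reference, the cleanup kernel proposed above (M and N as runtime
arguments, K fixed at compile time) might look roughly like the plain C
sketch below.  This is only an illustration under stated assumptions, not
ATLAS's actual interface: the name cleanup_mn, the argument order, and the
default KB of 8 are all hypothetical, and both operands are assumed to be
stored contiguously along K with lda = ldb = KB, as discussed in this thread.

```c
#include <stddef.h>

#ifndef KB
#define KB 8   /* hypothetical compile-time K block size */
#endif

/*
 * Sketch only: C = beta*C + alpha * A^T * B.
 *   A holds M columns of length KB (lda == KB), contiguous along K.
 *   B holds N columns of length KB (ldb == KB), contiguous along K.
 *   C is M x N, column-major, with leading dimension ldc.
 * M and N are runtime values; KB is a compile-time constant, so the
 * innermost loop has a fixed trip count that a compiler can fully unroll.
 */
static void cleanup_mn(int M, int N, float alpha,
                       const float *A, const float *B,
                       float beta, float *C, int ldc)
{
    for (int j = 0; j < N; j++) {
        const float *b = B + (size_t)j * KB;
        for (int i = 0; i < M; i++) {
            const float *a = A + (size_t)i * KB;
            float acc = 0.0f;
            for (int k = 0; k < KB; k++)   /* fixed-length inner loop */
                acc += a[k] * b[k];
            C[i + (size_t)j * ldc] = beta * C[i + (size_t)j * ldc]
                                     + alpha * acc;
        }
    }
}
```

Because only the two outer loop bounds vary, one compiled instance can
serve every M and N remainder, which is the code-bloat argument made above.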

OK, it seems as though if we can insist that KB be a multiple of 4 (2
for complex), we can even input kb without too much trouble.  Please
let me know if this is workable.  What I'm unsure of is whether to
write a 1x1xkb cleanup kernel, or something that can branch from
2x1xkb, 1x2xkb, to 1x1xkb.  Do you think this is worth it?  How will
kb%4!=0 work?  It can be done of course, but the normal fpu needs to
be used in this case, and there may be issues of getting into and out
of xmm mode.
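
To make the kb%4 question concrete, here is a rough sketch of a 1x1xkb
inner product with kb taken at run time.  Plain C stands in for the SSE
code: the four partial sums play the role of the four lanes of one xmm
register, and the trailing scalar loop is the part that would have to fall
back to the normal fpu when kb is not a multiple of 4.  The name dot_k is
hypothetical, and this is not the actual kernel.

```c
/*
 * Sketch only: dot product of a[0..kb-1] and b[0..kb-1].
 * The main loop consumes four elements per iteration (one per
 * "lane"); any kb % 4 leftover elements go through the scalar tail.
 */
static float dot_k(const float *a, const float *b, int kb)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int k4 = kb & ~3;            /* largest multiple of 4 <= kb */
    int k;

    for (k = 0; k < k4; k += 4) {
        s0 += a[k]     * b[k];
        s1 += a[k + 1] * b[k + 1];
        s2 += a[k + 2] * b[k + 2];
        s3 += a[k + 3] * b[k + 3];
    }

    float sum = (s0 + s1) + (s2 + s3);   /* horizontal add of the lanes */

    for (; k < kb; k++)                  /* scalar tail, runs only if kb % 4 != 0 */
        sum += a[k] * b[k];

    return sum;
}
```

If kb is required to be a multiple of 4, the tail loop never executes and
can simply be dropped, which is the restriction suggested above.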

> You can still insist that M be a multiple of 2, for instance, though this
> will mean that your cleanup will only be called when M%2 == 0, and the 
> generated cleanup will be called otherwise . . .
> Normally, you can leave the cleanup to ATLAS's generated cleanup, but your
> kernel is 1.8 times faster than the generated code, so cleanup could really
> hurt your performance . . .
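
The M%2 rule described above could be pictured as the dispatch sketched
below.  This is not ATLAS's dispatch code; the names hand_cleanup_m2,
generated_cleanup and run_cleanup, and the default KB of 8, are
hypothetical.  The hand-written path computes two rows of C per pass and
so requires an even M; anything else goes to the generic path.

```c
#include <stddef.h>

#ifndef KB
#define KB 8   /* hypothetical compile-time K block size */
#endif

/* Generic path, sketch only: C += A^T * B, one element of C at a time. */
static void generated_cleanup(int M, int N, const float *A,
                              const float *B, float *C, int ldc)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++) {
            float acc = 0.0f;
            for (int k = 0; k < KB; k++)
                acc += A[(size_t)i * KB + k] * B[(size_t)j * KB + k];
            C[i + (size_t)j * ldc] += acc;
        }
}

/* Hand-written path, sketch only: two rows of C per pass, M must be even. */
static void hand_cleanup_m2(int M, int N, const float *A,
                            const float *B, float *C, int ldc)
{
    for (int j = 0; j < N; j++) {
        const float *b = B + (size_t)j * KB;
        for (int i = 0; i < M; i += 2) {
            const float *a0 = A + (size_t)i * KB;
            const float *a1 = a0 + KB;
            float c0 = 0.0f, c1 = 0.0f;
            for (int k = 0; k < KB; k++) {
                c0 += a0[k] * b[k];   /* b[k] is loaded once, used twice */
                c1 += a1[k] * b[k];
            }
            C[i     + (size_t)j * ldc] += c0;
            C[i + 1 + (size_t)j * ldc] += c1;
        }
    }
}

/* Returns 1 if the hand-written path was used, 0 for the generic path. */
static int run_cleanup(int M, int N, const float *A,
                       const float *B, float *C, int ldc)
{
    if (M % 2 == 0) {
        hand_cleanup_m2(M, N, A, B, C, ldc);
        return 1;
    }
    generated_cleanup(M, N, A, B, C, ldc);
    return 0;
}
```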

OK.  What percent of peak is this, BTW?

> Thanks,
> Clint
Take care,

Camm Maguire			     			camm@enhanced.com
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah