
Re: SSE Level 3 drop in gemm


R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Doug,
> >I've (finally) found the time to finish adding my SSE sgemm into ATLAS
> >as a drop in kernel. Atlas timing says it runs up to 2.39 times faster
> >than ATLAS when it's computing the cross over points. Two questions:

This is great!  Thanks for all the work, you guys!  Congratulations!

I had been working on an L3 sgemm *kernel* (not a complete
implementation), which I'm sure won't perform as well as Doug's, and
which is thus now obsolete.  The sgemm I had been working on gives
about 550 MFLOPS so far with 'make ummcase' -- the best that ATLAS
previously found was 223 (res/sMMRES).  These are compiled with -g, so
I don't know what the real speedup would be.  PIII 450 MHz.  xl3blastst
from a previous optimized (i.e. no -g) build gives around 370.

In any case, I thought I might turn this into a complex gemm
contribution.  Reading the docs, it seems one only needs to double ldc?
Will ATLAS call the kernel repeatedly for all real/imaginary matrix

A few thoughts:

1)  One ought to be able to do better with a true complex kernel than
    calling the routines 4 times, no?
2)  The xsmmtst always doubles ldc, even with single real precision.
    This makes it difficult to fully capitalize on the compile-time
    constant nature of the dimensions (i.e. one must read ldc at runtime
    if one wants a routine that will pass both the tester and the
3)  I found it useful to also define NB4, MB4, and KB4 in emit_mm.c,
    for obvious (in the case of SSE) reasons.
4)  Believe it or not, prefetch added about 50-80 MFLOPS on a base of
    450.  Still, I don't imagine that would warrant double precision
5)  xmmsearch still reports the old ATLAS kernel as the best to
    stdout, at 223, but adds mine at the bottom of res/sMMRES.
    Haven't tried installing the whole library yet, but I had doubts
    on whether this would result in my kernel being selected.
6)  These are a *lot* easier to write, IMHO, than the L2 stuff.
7)  I remember reading that the AMD 3dNow! had the same KNI throughput
    as the PIII, even though its mm registers were half as big.
    Something else was doubled, but I can't find it now.  I know there
    are still only 8 mm regs.  Anyone know the answer?  Should be easy
    to make Athlon stuff from what we have.
8)  Sure would be nice, since a copy is being done anyway, to align
    data to 16 bytes.  Anywhere I can change this locally just to see
    what it adds to the performance?
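
To make point 1 concrete, here is a minimal sketch of a complex multiply
built from four calls to a real kernel with a doubled ldc.  Everything in
it is an assumption made for illustration: real_kernel_stride2 and
complex_gemm_4calls are hypothetical names, not the ATLAS kernel
interface, and the operands are assumed to have already been copied into
separate real and imaginary blocks.

```c
#include <assert.h>

/* Hypothetical real kernel (NOT the ATLAS interface): C += alpha * A * B.
 * A is M x K and B is K x N, both contiguous and column-major.  C is
 * accessed with an element stride of 2 and a column stride of ldc2 floats,
 * so it can address one half of an interleaved complex matrix. */
static void real_kernel_stride2(int M, int N, int K, float alpha,
                                const float *A, const float *B,
                                float *C, int ldc2)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++) {
            float sum = 0.0f;
            for (int k = 0; k < K; k++)
                sum += A[i + k * M] * B[k + j * K];
            C[2 * i + j * ldc2] += alpha * sum;
        }
}

/* Complex C += A * B from four real calls.  Ar/Ai and Br/Bi hold the split
 * real and imaginary parts.  C is interleaved (re, im) with a leading
 * dimension of ldc complex elements, i.e. 2*ldc floats per column. */
static void complex_gemm_4calls(int M, int N, int K,
                                const float *Ar, const float *Ai,
                                const float *Br, const float *Bi,
                                float *C, int ldc)
{
    assert(ldc >= M);
    int ldc2 = 2 * ldc;
    real_kernel_stride2(M, N, K,  1.0f, Ar, Br, C,     ldc2); /* Cr += Ar*Br */
    real_kernel_stride2(M, N, K, -1.0f, Ai, Bi, C,     ldc2); /* Cr -= Ai*Bi */
    real_kernel_stride2(M, N, K,  1.0f, Ar, Bi, C + 1, ldc2); /* Ci += Ar*Bi */
    real_kernel_stride2(M, N, K,  1.0f, Ai, Br, C + 1, ldc2); /* Ci += Ai*Br */
}
```

Each of the four calls walks all of C's real or imaginary half again,
which is the overhead a true complex kernel might avoid by updating both
halves in one pass.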

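For point 8, a rough sketch of what copying a block into 16-byte-aligned
storage could look like, so that an SSE kernel may use aligned loads.
This is not how ATLAS does its copy; copy_block_aligned16, is_aligned16
and dot_aligned16 are hypothetical helpers, the block height M is assumed
to be a multiple of 4, and C11 aligned_alloc is assumed to be available.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Returns nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Copy an M x N column-major block (leading dimension lda) into a fresh
 * contiguous buffer starting on a 16-byte boundary.  M must be a multiple
 * of 4 so every column also starts on a 16-byte boundary.  The caller
 * releases the result with free(). */
static float *copy_block_aligned16(const float *A, int lda, int M, int N)
{
    assert(M % 4 == 0 && lda >= M);
    size_t bytes = (size_t)M * (size_t)N * sizeof(float); /* multiple of 16 */
    float *buf = aligned_alloc(16, bytes);
    if (buf == NULL)
        return NULL;
    for (int j = 0; j < N; j++)
        memcpy(buf + (size_t)j * M, A + (size_t)j * lda,
               (size_t)M * sizeof(float));
    return buf;
}

/* Dot product of two 16-byte-aligned vectors; n must be a multiple of 4.
 * Uses aligned SSE loads when SSE is available, scalar code otherwise. */
static float dot_aligned16(const float *x, const float *y, int n)
{
    assert(n % 4 == 0 && is_aligned16(x) && is_aligned16(y));
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(x + i),
                                         _mm_load_ps(y + i)));
    float lane[4];
    _mm_storeu_ps(lane, acc);
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
#else
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
#endif
}
```

Timing a kernel on buffers produced this way against the same kernel
using unaligned loads would be one way to estimate what alignment adds.
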
Take care,

> Great news!  I was hoping we'd have some L3 SSE stuff before release . . .
> Is it a kernel or a complete GEMM implementation?  I'm not sure from the
> info below . . .
> >It compiles fine using the documented instructions for forcing
> >compilation, but it doesn't seem to automatically detect it during a
> >normal compilation. For this to work I am guessing all I need to do is 
> >add the correct UMMdir definition to ATLAS/Make.<arch> before starting the 
> >./make arch=<arch> install? There is an ATLAS/makes/Make.goto. Do I
> >need one of these?
> Depends on whether you've got a kernel or a GEMM replacement.  For a kernel,
> you shouldn't need to fool with all this stuff. . . .
> >2) What's the best way to send in the changes? Complete tar file, tar
> >file with the changes, patch file?
> I like a tarfile with just your codes best . . .
> Thanks,
> Clint

Camm Maguire			     			camm@enhanced.com
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah