
Re: SSE Level 3 drop in gemm



>I had been working on a L3 sgemm *kernel*, (not a complete
>implementation), which I'm sure won't perform as well as Doug's, and
>which is thus now obsolete.  The sgemm I had been working on gives so
>far about ~550 MFLOPS with the 'make ummcase' -- the best that atlas
>previously found was 223 (res/sMMRES).  These are compiled with -g, so
>I don't know what the real speedup would be.  PIII 450Mhz.  xl3blastst
>from a previous optimized (i.e. no -g) build gives around 370.  

If you wind up with a kernel, turn it in anyway.  Kernels are preferable to
complete gemm implementations in several ways.  One important difference is
that they are selectable at install time, rather than needing to be hardwired
into config for their use: this means forthcoming chips can use them
automatically, whereas a complete gemm will not be used until after I have
gotten access to the new platform, personally verified its superiority on
that platform, programmed that into config, and had a new release.  On the 
flip side of this coin, complete GEMMs can also be a bad idea.  Suppose someday
down the road gcc starts generating SSE instructions, and ATLAS's
code generator takes a big jump in performance, overtaking the complete
SSE GEMM: I would have to wait until the next release
of ATLAS to turn off the use of the SSE SGEMM, rather than having ATLAS do it
automagically.  I added the complete GEMM technique for essentially two cases:
(1) The optimizations needed are not compatible with a kernel strategy
(2) The user has a pre-existing ATLAS code, which would be too much of a
    pain to adapt

>In any case, I thought I might turn this into a complex gemm
>contribution.  Reading the docs, it seems one only needs double ldc?

You double ldc, and access C with stride 2 in columns (atlas_contrib.ps shows
an example of this) . . .

>Will atlas call the kernel repeatedly for all real/imaginary matrix
>combos?  

Yes.

>1)  One ought to be able to do better with a true complex kernel than
>    calling the routines 4 times, no?

Actually, not really.  You might think that complex's register reuse would
be superior to real's, but this turns out not to be the case.  With the
A and B matrices copied to contiguous arrays, the only real cost of this
approach is the stride-2 access of C, which is a very low order
term (O(N^2)) . . .

If we had stride-2 access on A and B (O(N^3) access), as you would get if
you didn't have the data copy, this approach would blow . . .

It might suffer slightly on the L1 reuse front, but I don't think it's a big
deal (essentially, I think you hit L1 twice instead of once, but that is
pretty much washed out by the cost of loading from L2/main anyway).

The idea is to make it trivial to take a real code, and make a complex code
from it (only change in access of C).  This is another reason not to give up
on your real kernel: an optimized real kernel is essentially an optimized
complex kernel . . .
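
As a hedged illustration of both points above (the four real/imaginary calls, and "only change in access of C"), here is a toy sketch; the names, the scalar kernel body, and the tiny fixed dimension are mine, not ATLAS's:

```c
#include <assert.h>

enum { N = 2 };   /* tiny fixed dimension, just for the sketch */

/* Toy stand-in for a real kernel: C(i,j) += sign * sum_k A(i,k)*B(k,j).
 * A and B are contiguous N x N column-major blocks (as after ATLAS's
 * data copy); C is written with element stride 2, so the same kernel
 * can target either the real or the imaginary plane of an interleaved
 * complex matrix. */
static void rmm(const float *A, const float *B, float *C,
                int ldc, float sign)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) {
            float t = 0.0f;
            for (int k = 0; k < N; k++)
                t += A[i + k*N] * B[k + j*N];
            C[2*i + j*ldc] += sign * t;   /* stride-2 access of C */
        }
}

/* Complex C += A*B from four calls of the real kernel:
 *   Cr += Ar*Br - Ai*Bi,   Ci += Ar*Bi + Ai*Br
 * C is interleaved (r,i) pairs, accessed with a doubled ldc. */
static void cmm(const float *Ar, const float *Ai,
                const float *Br, const float *Bi,
                float *C, int ldc)
{
    const int ldc2 = 2*ldc;            /* the doubled ldc */
    rmm(Ar, Br, C,     ldc2,  1.0f);   /* real plane      */
    rmm(Ai, Bi, C,     ldc2, -1.0f);
    rmm(Ar, Bi, C + 1, ldc2,  1.0f);   /* imaginary plane */
    rmm(Ai, Br, C + 1, ldc2,  1.0f);
}
```

Note the real kernel itself never changes between the four calls; only the C base pointer and the sign differ.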

>2)  The xsmmtst always doubles ldc, even with single real precision.
>    This makes it difficult to fully capitalize on he compile-time
>    constant nature of the dimensions (i.e. one must read ldc runtime
>    if one wants a routine that will past both the tester and the
>    timer.) 

That's why the macro NB2 exists: it is just NB*2 as a constant . . .
Is this what you are talking about?
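
For what it's worth, a minimal sketch of how NB2 keeps the doubled ldc a compile-time constant (NB=40 and the kernel fragment are just example values of mine, not what emit_mm.c produces):

```c
/* NB is the blocking factor fixed at install time; NB2 is emitted
 * alongside it so code written against a doubled ldc still sees a
 * compile-time constant.  NB = 40 is an arbitrary example value. */
#define NB  40
#define NB2 (NB*2)

/* toy kernel fragment: touch column j of C with element stride 2,
 * with ldc == NB2 folded into the addressing at compile time --
 * no runtime load of ldc needed */
static void copy_col(const float *c_in, float *c_out, int j)
{
    for (int i = 0; i < NB; i++)
        c_out[2*i + j*NB2] = c_in[2*i + j*NB2];
}
```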

>3)  I found it useful to also define NB4,MB4, and KB4 in emit_mm.c,
>    for obvious (In the case of SSE) reasons.

What are these macros?  NB*4?

>4)  Believe it or not, prefetch added about 50-80 MFLOPS on a base of
>    450.  Still, I don't imagine that would warrant double precision
>    kernels?  

Well, potentially prefetch _could_ be more of a boon for double than single
(since the fetch cost is twice that of single), so I would not say that.
Particularly if you could do something relatively easy like taking the
generated code and adding some instructions (I'm not saying this would work,
mind you) . . .  That's already a ~10% improvement, which is nothing to
sneeze at . . .
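
As a rough picture of the "take the generated code and add some instructions" idea, GCC's __builtin_prefetch can be dropped into a scalar loop like this; the lookahead of 16 elements is an untuned guess of mine, not a measured value:

```c
/* Toy inner loop with software prefetch added after the fact.  On x86,
 * GCC/Clang lower __builtin_prefetch to a prefetch instruction; it is
 * a hint only, so running a few iterations past the end of the arrays
 * does not fault.  The distance of 16 floats (64 bytes) is a guess. */
static float dot_pf(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0, 3);  /* read, high locality */
        __builtin_prefetch(&b[i + 16], 0, 3);
        s += a[i] * b[i];
    }
    return s;
}
```

For doubles the same distance in elements covers twice the bytes, which is part of why prefetch may help double precision even more.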

>5)  xmmsearch still reports the old atlas kernel as the best to
>    stdout, at 223, but adds mine at the bottom of res/sMMRES.
>    Haven't tried installing the whole library yet, but I had doubts
>    on whether this would result in my kernel being selected.

It should, but I also have doubts as to the robustness of the present code.
I am changing that right now anyway, in order to add the ability for user
cleanup, and user selection of compiler/flags, so rest assured if you provide
a kernel that beats the generated kernel, by the release I'll make sure ATLAS
uses it :)

>7)  I remember reading that the AMD 3dNow! had the same kni throughput
>    as the PIII, even though its mm registers were half as big.
>    Something else was doubled, but I can't find it now.  I know there
>    are still only 8 mm regs.  Anyone know the answer?  Should be easy
>    to make Athlon stuff from what we have.

I would also be interested in this answer.  We have a student visiting from
Denmark here, and he has been looking into 3DNow!, and he had the same idea:
defining macros that allowed the same code to be compiled for 3DNow or SSE . . .
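
For the record, a sketch of that macro idea; all names here are invented, and a scalar body stands in for the intrinsics/asm so the sketch compiles anywhere:

```c
/* One kernel body, two vector ISAs: select the vector length (and, in
 * a real version, the load/mul/add macro bodies) at compile time. */
#if defined(USE_3DNOW)
#  define VLEN 2   /* 3DNow!: 64-bit mm registers hold 2 floats  */
#else
#  define VLEN 4   /* SSE: 128-bit xmm registers hold 4 floats   */
#endif

/* scalar stand-in for a "vector add" macro */
static void vadd(const float *a, const float *b, float *c)
{
    for (int i = 0; i < VLEN; i++)
        c[i] = a[i] + b[i];
}
```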

>8)  Sure would be nice, since a copy is being done anyway, to align
>    data to 16 bytes.  Anywhere I can change this locally just to see
>    what it adds to the performance?

Yep, in ATLAS/include/atlas_misc.h, change 
#define ATL_Cachelen 32
to:
#define ATL_Cachelen 128
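
If you want to check that the copy buffer actually lands on the stronger boundary, the rounding it implies looks like this (align_up is my own illustration of the idea, not ATLAS's exact macro):

```c
#include <stdint.h>

#define ATL_Cachelen 128   /* the proposed larger value */

/* Round a pointer up to the next ATL_Cachelen boundary.  This mirrors
 * the idea behind ATLAS's copy-buffer alignment, not its exact code. */
static void *align_up(void *p)
{
    uintptr_t u = (uintptr_t)p;
    return (void *)((u + ATL_Cachelen - 1) & ~(uintptr_t)(ATL_Cachelen - 1));
}
```

With ATL_Cachelen at 128 the copied blocks are 16-byte aligned a fortiori, so aligned SSE loads become safe on them.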

Thanks for all the work,
Clint