
Re: SSE Level 3 drop in gemm



Greetings!

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> If you wind up with a kernel, turn it in anyway.  Kernels are preferable to

OK

> >2)  The xsmmtst always doubles ldc, even with single real precision.
> >    This makes it difficult to fully capitalize on the compile-time
> >    constant nature of the dimensions (i.e. one must read ldc at runtime
> >    if one wants a routine that will pass both the tester and the
> >    timer.) 
> 
> That's why the macro NB2 exists: it is just NB*2 as a constant . . .
> Is this what you are talking about?
> 

Actually, I was in error here.  For some reason, I thought that C was
copied as well to "block-major" storage, but ldc really is a runtime
parameter.

> >3)  I found it useful to also define NB4,MB4, and KB4 in emit_mm.c,
> >    for obvious (In the case of SSE) reasons. 
> What are these macros?  NB*4?
> 

Yes.  Currently I define these, and KB8..., in emit_mm, which I think
is a bit excessive.  I can work around it with my (very ugly) cpp
arithmetic hack if necessary.  Does anyone know a better way of doing
simple arithmetic in cpp?  The result cannot be an expression; it must
be an actual numeric token fit for an assembler string.  Currently, I
do the following:

#define P_1008_252   1260   /* table entry: 1008+252 */
#define P_1008_256   1264   /* table entry: 1008+256 */
...

#define XS(a_,b_)     P_ ## b_ ## _ ## a_
#define S(a_,b_)      XS(a_,b_)   /* S(252,1008) -> P_1008_252 -> 1260 */


In this particular case, I'm trying to avoid storing lda (for example)
in a register, relying instead on its being a compile-time constant.
Otherwise, I could do something like "nn(%eax,%ecx,4)".


> >8)  Sure would be nice, since a copy is being done anyway, to align
> >    data to 16 bytes.  Anywhere I can change this locally just to see
> >    what it adds to the performance?
> 
> Yep, in ATLAS/include/atlas_misc.h, change 
> #define ATL_Cachelen 32
> to:
> #define ATL_Cachelen 128
> 

OK.  I did this, but it doesn't seem to affect the mmtst.c and fc.c
programs used for testing and timing, respectively.  fc.c already aligns
things quite nicely, but so far I've added the following ugliness to
mmtst.c (at line 528):

   if (((int)C0)%16)                 /* C0 not yet 16-byte aligned */
     C0+=4-(((int)C0)%16)/4;         /* bump to next 16-byte boundary */

It turns out that alignment helps *a lot* in this case.  The kernel is
up to 666 MFLOPS, with most of the gain over the previous 550 coming
from alignment (and the simplifications it allows).

A few other items:

1) Taking a working sgemm and testing with pre=c fails to compile:
   the build cannot find the b1 and bX routines.  The headers added by
   emit_mm define ATLAS_USERMM to the b0 routine; how are the others
   supposed to link in?  I'll temporarily get around this by changing
   the name of the routine according to BETA.
2) I currently have a very small, but frustrating, kludge in the
   kernel.  For some reason, calling my assembler with the __asm__
   __inline__ (... :::"ax","bx",...); construct does not end up
   saving the registers that fc.c is using, leading to a segfault
   unless I add an arbitrary "push %ebx\n\t"/"pop %ebx\n\t" pair
   around the kernel.
3) Currently, the search algorithm selects 56 for nb, apparently due
   to the size of my cache.  How variable is this?  I'd like to define
   macros that unroll the k loop maximally according to KB.  What is a
   reasonable upper bound on KB?  Would such variable unrolling
   confuse the search process, which reads ku from scases.dsc?

> Thanks for all the work,
> Clint
-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah