
Re: Intel SSE and Atlas.. some questions.

Doug,

>Thanks for spending 3 weeks in hell getting Atlas to the stage it is
>with drop in user matmul kernels :)

Good, good.  Shields down, phasers not locked on . . . 

>Anyway, enough for now... I guess I should ask if anyone has attempted
>integrating Intel SIMD to ATLAS already?

As far as I know, no one is doing it for SGEMM.  Someone (I have not been
able to browbeat him into signing up for the mailing list) has already
contributed SGEMV and SGER SSE kernels, however.  I think he got started
on that by scoping your emmerald stuff . . .

>Last year I wrote Emmerald, a GEMM for Intel PIII chips that uses 
>the S.I.M.D single prec. instructions. I'm currently trying to
>incorporate it into ATLAS.

Excellent!  I remember reading your paper . . .

>I have a basic kernel working for the beta=0 case. Before moving onto
>bigger and better things, I have a few questions and comments...
>
>1) The Makefile in Atlas/tune/blas/gemm/<arch>/Makefile was broken.
>It didn't compile the user kernels in <pre>mm.c before attempting to 
>link the x<pre>mmtst program.
>
>Changing line 582 from 
>mmtstcase0: $(SYSdir)/time.o $(INCdir)/atlas_type.h
>
>to 
>
>mmtstcase0: $(SYSdir)/time.o $(INCdir)/atlas_type.h $(mmobjs)
>  
>2) the macro ATL_USERMM  only seems to work for the beta=0 case, which
>contradicts the example in section 4.3 of the "User Contribution to
>Atlas" document.

*Ahem*  Just a slight error, not compiling the needed code, *Ahem*.
I must have screwed this up after making the paper (ah, the beauty of last-
minute "optimizations").  I've included a corrected Makefile below, save it
over ATLAS/makes/Make.mmtune and ATLAS/tune/blas/gemm/<arch>/Makefile
to fix both problems . . .
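
Just so we're talking about the same interface: below is a rough C skeleton
of a user kernel the way the tester builds it, with the same source compiled
once per beta case.  The BETA0/BETA1 defines and the alpha==1 assumption are
from memory, so double-check them against the user contribution note before
relying on them.

#define TYPE float   /* SGEMM kernel */

void ATL_USERMM(const int M, const int N, const int K, const TYPE alpha,
                const TYPE *A, const int lda, const TYPE *B, const int ldb,
                const TYPE beta, TYPE *C, const int ldc)
{
   int i, j, k;
   /*
    * A is the copied K x M panel and B the copied K x N panel, so the k
    * loop runs at unit stride through both; this reference triple loop just
    * shows the data access your SSE code would replace.
    */
   for (j=0; j != N; j++)
   {
      for (i=0; i != M; i++)
      {
         register TYPE c0 = 0.0f;
         for (k=0; k != K; k++)
            c0 += A[k+i*lda] * B[k+j*ldb];
#if defined(BETA0)
         C[i+j*ldc] = c0;                    /* beta == 0: overwrite C */
#elif defined(BETA1)
         C[i+j*ldc] += c0;                   /* beta == 1: accumulate  */
#else
         C[i+j*ldc] = beta*C[i+j*ldc] + c0;  /* general beta           */
#endif
      }
   }
}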

>If I supply a kernel with restrictions <MB>=1 <NB>=5 <KB>=4 (which can also
>handle multiples thereof), will ATLAS search for the optimal 
>NB with those restrictions? 

Yes, though ATLAS currently restricts itself to NB <= 64.  It does this mainly
to cut down on time spent in cleanup (e.g., NB=128 might have many users calling
cleanup exclusively), and I never found a platform where larger blocking
factors were significantly better . . .
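
Just to make that concrete (toy code, not the actual search): with your
MB=1, NB=5, KB=4 restrictions, a square blocking factor has to be a multiple
of lcm(1,5,4) = 20, so under the 64 cap the search would only consider 20,
40, and 60.

#include <stdio.h>

int main(void)
{
   /* hypothetical sketch of the candidate filter, not ATLAS's search code */
   const int mu=1, nu=5, ku=4, maxNB=64;
   int nb;
   for (nb=1; nb <= maxNB; nb++)
      if (nb%mu == 0 && nb%nu == 0 && nb%ku == 0)
         printf("candidate NB = %d\n", nb);   /* prints 20, 40, 60 */
   return 0;
}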

>Secondly, assuming it's using the kernel as 
>its main matmul, will it also try and use the user kernel to generate 
>some of the cleanup code or will it only use its internal stuff 
>(therefore needing to be independently re-written to use SIMD instructions).

At the moment, ATLAS will use its own cleanup.  However, I defined
the syntax of the gemm/CASES/?cases.dsc so that I could use the
provided kernels for cleanup as well.  So the answer to your question
is that the code on the developer page will not presently use your
code for cleanup, but, provided with a contributed kernel that kicks
ATLAS's cleanup's butt, I will add the required functionality.
I wanted to get a first draft of something developers could use out the
door, and I figured that if no one ever contributed any GEMM kernels, or if
using ATLAS's cleanup wasn't too bad, there was no need for me to
spend any more time in ATLAS install hell . . .
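
To make the split concrete, here's a toy sketch of the dispatch I'm
describing (the two functions are stand-ins, not ATLAS's real driver): full
NB blocks go through the contributed kernel, and the ragged edges in M, N,
or K go through ATLAS's own cleanup codes.

#include <stdio.h>

#define NB 40

static void user_kernel(int m, int n, int k)    /* stand-in for ATL_USERMM    */
{ printf("kernel : %2d x %2d x %2d\n", m, n, k); }

static void atlas_cleanup(int m, int n, int k)  /* stand-in for ATLAS cleanup */
{ printf("cleanup: %2d x %2d x %2d\n", m, n, k); }

static void do_gemm(int M, int N, int K)
{
   int i, j, k;
   for (j=0; j < N; j += NB)
      for (i=0; i < M; i += NB)
         for (k=0; k < K; k += NB)
         {
            const int mb = (M-i < NB) ? M-i : NB;
            const int nb = (N-j < NB) ? N-j : NB;
            const int kb = (K-k < NB) ? K-k : NB;
            if (mb == NB && nb == NB && kb == NB)
               user_kernel(mb, nb, kb);
            else
               atlas_cleanup(mb, nb, kb);
         }
}

int main(void)
{
   do_gemm(100, 100, 100);   /* 100 = 2*NB + 20, so both paths get exercised */
   return 0;
}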

>4) Emmerald copies one of the input matrices to a buffer (in a similar
>way to the ATLAS panel copy). Rather than a block-major format of
>column-major panels, it uses a strange format which specifically
>allows the floating point pipeline to avoid excessive pointer 
>arithmetic. It also aligns the data to allow aligned mov instructions to
>be used (a small performance boost). To get performance I am having to 
>do a copy to my format within the user kernel matmul; presumably this will
>happen on top of the ATLAS higher-level panel copying, an obvious
>performance hit. Is it possible to hack Atlas to modify the panel copy
>of either the A or B matrix to do the reformatting and data alignment?

The answer is yes, but not easily with the tarfile you have.  I am still
working on extending the contributability, if you will, of ATLAS.  I just
added a new way of contributing to ATLAS: users can now contribute an entire
GEMM routine, which effectively winds up being a drop-in replacement for
ATLAS's copy matmul.  I.e., the contributor must have a full GEMM to do this
(he has to handle cleanup himself, handle all the transpose cases, etc.).
The threshold for inclusion in the tarfile is far higher for this way of
contribution, because, due to time constraints, ATLAS blindly replaces its
own copy matmul with the user's matmul, so the user's gemm must beat ATLAS
soundly for all problem shapes, and be at least as good as ATLAS when used
as a kernel for the other Level 3 BLAS.  ATLAS still uses its own non-copy
code, and determines the crossover points, so efficient handling of small
problems or degenerate cases like the true inner product is not required.
Using this method, I have just finished incorporating Kazushige Goto's
excellent Compaq/DEC Alpha GEMM into ATLAS (hello, 92% of peak on an EV6).
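
To give an idea of what that entails, a drop-in GEMM contribution has to
present roughly the full interface below and handle everything behind it on
its own; the names here are illustrative only, so check the developer
release for the exact prototype it expects.

/* Illustrative prototype only -- not the exact interface in the developer
 * release.  A full-GEMM contribution must cover every transpose combination,
 * every alpha/beta, and every problem size itself, since ATLAS blindly
 * substitutes it for its own copy matmul. */
enum USER_TRANS { UserNoTrans, UserTrans };

void user_sgemm(const enum USER_TRANS TA, const enum USER_TRANS TB,
                const int M, const int N, const int K,
                const float alpha, const float *A, const int lda,
                const float *B, const int ldb,
                const float beta, float *C, const int ldc);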

Again, I strongly prefer the kernel method, but this does supply a way for
codes that need control over the data format and caches to get it.  So
if necessary, we can go this route.  I hope to put out a new developer release
with this stuff soon.  I was planning on getting the GEMV/GER SSE stuff
incorporated, and updating the user contribution working note first, though.

Can you explain a little bit about your data format?  I thought Transpose,
Notranspose block-major was about as pointer-efficient a storage scheme as you
could get for any matmul with K as the inner loop.  You aren't, by chance,
using M or N as your innermost loop?
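
For reference, the access pattern I have in mind is the one below
(simplified C, not anyone's actual kernel): with A copied in transpose and B
in notranspose block-major, the k loop is unit stride through both operands
and the pointer updates all hoist outside it.

static void ref_mm(const int MB, const int NB, const int KB,
                   const float *A,   /* KB x MB block of A^T, column-major */
                   const float *B,   /* KB x NB block of B,   column-major */
                   float *C, const int ldc)
{
   int i, j, k;
   for (j=0; j != NB; j++)
   {
      const float *Bj = B + j*KB;        /* column j of the B block      */
      for (i=0; i != MB; i++)
      {
         const float *Ai = A + i*KB;     /* column i of the A^T block    */
         register float c0 = 0.0f;
         for (k=0; k != KB; k++)         /* unit stride in both operands */
            c0 += Ai[k] * Bj[k];
         C[i+j*ldc] += c0;
      }
   }
}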

Cheers,
Clint