
Intel SSE and Atlas.. some questions.




Thanks for spending 3 weeks in hell getting ATLAS to the stage it is
at, with drop-in user matmul kernels :)

Last year I wrote Emmerald, a GEMM for Intel PIII chips that uses
the SIMD single-precision instructions. I'm currently trying to
incorporate it into ATLAS.

I have a basic kernel working for the beta=0 case. Before moving on to
bigger and better things, I have a few questions and comments...

1) The Makefile in Atlas/tune/blas/gemm/<arch>/Makefile was broken.
It didn't compile the user kernels in <pre>mm.c before attempting to 
link the x<pre>mmtst program.

I fixed this by changing line 582 from
mmtstcase0: $(SYSdir)/time.o $(INCdir)/atlas_type.h

to 

mmtstcase0: $(SYSdir)/time.o $(INCdir)/atlas_type.h $(mmobjs)
  
2) The macro ATL_USERMM only seems to work for the beta=0 case, which
contradicts the example in section 4.3 of the "User Contribution to
Atlas" document.
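
For readers unfamiliar with the beta cases mentioned above, here is a minimal
reference sketch of the usual GEMM update C = A*B + beta*C in plain C. It is
not ATLAS code and does not use ATL_USERMM; the function name ref_gemm_beta,
the column-major layout and the argument order are all assumptions made for
illustration. The point is only that beta=0 must overwrite C without reading
it, while other beta values must scale and accumulate.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical reference kernel (not part of ATLAS):
 * computes C = A*B + beta*C for column-major single-precision matrices.
 * A is m x k, B is k x n, C is m x n; lda/ldb/ldc are leading dimensions.
 */
static void ref_gemm_beta(size_t m, size_t n, size_t k,
                          const float *a, size_t lda,
                          const float *b, size_t ldb,
                          float beta, float *c, size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i + p * lda] * b[p + j * ldb];
            if (beta == 0.0f)
                c[i + j * ldc] = acc;            /* overwrite, never read C */
            else if (beta == 1.0f)
                c[i + j * ldc] += acc;           /* pure accumulate */
            else
                c[i + j * ldc] = beta * c[i + j * ldc] + acc;
        }
    }
}
```

A kernel that only supports beta=0 corresponds to the first branch alone; the
other two branches are what a complete kernel would additionally need.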

3)... okay.. those are my only gripes... more positive stuff...

I immediately got 580 Mflops out of ATLAS on a PIII@450MHz with NB=80,
without too much trouble.

If I supply a kernel with restrictions <MB>=1 <NB>=5 <KB>=4 (which can also
handle multiples thereof), will ATLAS search for the optimal
NB within those restrictions? Secondly, assuming it's using the kernel as
its main matmul, will it also try to use the user kernel to generate
some of the cleanup code, or will it only use its internal code
(which would then need to be independently rewritten to use SIMD
instructions)?
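
To make the restriction above concrete, here is a small sketch in C that lists
the block sizes such a search could consider. It assumes a single square block
size NB is used for all three dimensions, so a candidate must be a multiple of
each of the three factors (1, 5 and 4 here, giving multiples of 20). The
function names are hypothetical and this is not how ATLAS itself is known to
implement its search.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor (Euclid). */
static int gcd_int(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple of two positive ints. */
static int lcm_int(int a, int b)
{
    assert(a > 0 && b > 0);
    return a / gcd_int(a, b) * b;
}

/*
 * Hypothetical helper: fill out[] with every block size in [lo, hi] that is
 * a multiple of all three kernel factors mu, nu and ku.
 * Returns the number of candidates written (at most cap).
 */
static size_t candidate_block_sizes(int mu, int nu, int ku,
                                    int lo, int hi,
                                    int *out, size_t cap)
{
    assert(lo > 0 && lo <= hi);
    int step = lcm_int(lcm_int(mu, nu), ku);
    int first = (lo + step - 1) / step * step;  /* round lo up to a multiple */
    size_t n = 0;
    for (int nb = first; nb <= hi && n < cap; nb += step)
        out[n++] = nb;
    return n;
}
```

Calling candidate_block_sizes(1, 5, 4, 16, 100, out, 8) would yield 20, 40,
60, 80 and 100, which includes the NB=80 mentioned above.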

4) Emmerald copies one of the input matrices to a buffer (in a similar
way to the ATLAS panel copy). Rather than a block-major format of
column-major panels, it uses an unusual format designed to let the
floating-point pipeline avoid excessive pointer arithmetic. It also aligns
the data so that aligned mov instructions can be used (a small performance
boost). To get performance I currently have to copy to my format inside
the user matmul kernel. Presumably this happens on top of the ATLAS
higher-level panel copying, which is an obvious performance hit. Is it
possible to hack ATLAS so that the panel copy of either the A or B matrix
does the reformatting and data alignment?
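
As a rough illustration of the kind of copy-with-alignment step described
above, the sketch below copies a column-major block into a freshly allocated
16-byte-aligned buffer, padding each column to a multiple of four floats so
that every column starts on an aligned address. This is not Emmerald's actual
format, which the post does not spell out; the layout, the function names and
the use of C11 aligned_alloc are assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SSE_ALIGN  16   /* byte alignment needed for aligned SSE loads */
#define SSE_FLOATS 4    /* single-precision floats per 128-bit register */

/* Round n up to the next multiple of SSE_FLOATS. */
static size_t padded_ld(size_t n)
{
    return (n + SSE_FLOATS - 1) / SSE_FLOATS * SSE_FLOATS;
}

/*
 * Hypothetical helper: copy a rows x cols column-major block (leading
 * dimension lda) into a new 16-byte-aligned buffer. Each column is
 * zero-padded to padded_ld(rows) floats, so every column is aligned.
 * Stores the new leading dimension in *ld_out.
 * Returns NULL on allocation failure; the caller releases with free().
 */
static float *copy_block_aligned(const float *a, size_t lda,
                                 size_t rows, size_t cols, size_t *ld_out)
{
    assert(a != NULL && ld_out != NULL);
    assert(rows > 0 && cols > 0 && lda >= rows);
    size_t ld = padded_ld(rows);
    /* ld is a multiple of 4 floats, so the size is a multiple of 16 bytes. */
    float *buf = aligned_alloc(SSE_ALIGN, ld * cols * sizeof(float));
    if (buf == NULL)
        return NULL;
    for (size_t j = 0; j < cols; j++) {
        memcpy(buf + j * ld, a + j * lda, rows * sizeof(float));
        memset(buf + j * ld + rows, 0, (ld - rows) * sizeof(float));
    }
    *ld_out = ld;
    return buf;
}
```

Doing this inside the kernel means the data is copied twice whenever the
calling library has already made its own panel copy, which is the overhead
the question above is trying to avoid.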

Anyway, enough for now... I guess I should ask if anyone has already
attempted integrating Intel SIMD into ATLAS?


-- 
-Doug  -- http://csl.anu.edu.au/~daa, Ph:(02) 6279-8608, Fax:(02) 6279-8651