
Re: ATLAS developer release 3.1.1



Camm,

>About the dgemv, I can get 20% or so over the best atlas routine
>(_mm.c) using prefetch.  But when redefining prefetch to nop in my
>code, the performance improvement from prefetch is more like 100%.
>Which makes me think that the best strategy would be to take atlas' mm
>code and just add a few prefetch commands where necessary.
>Unfortunately, I haven't yet spent the time to know where the real
>core routines are, there being so many includes and all.  Can you give
>me a pointer here?

I'm not sure I understand the question properly, but the GEMV kernels
supported by ATLAS are in ATLAS/tune/blas/gemv/CASES.  You can see what
code was used by ATLAS with cat ATLAS/tune/blas/gemv/<arch>/res/dMVRES,
which, on my machine, is unfortunately:
1 0 0 0.51 93.09 ATL_gemvN_mm.c
1 0 0 0.51 91.24 ATL_gemvT_mm.c

Which means that GEMV is calling GEMM :)  You can see the performance of
each routine listed in the index file (dcases.dsc) by
cat res/dgemvN_? (res/dgemvT_? for transpose).  On my machine, however, the
performance of the non-GEMM codes is much lower than that of GEMM
(matmul gets 93 Mflops, the next closest routine gets 70), so I guess you could
add some prefetch to the GEMM kernel if you wanted, but this would take some
twiddling by hand to move the GEMM kernel into the GEMV directory and
make it directly callable as a GEMV kernel.  Probably more pain than you
want to go through, is my guess . . .
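
Just to make the kind of edit concrete, here is a minimal sketch of inserting
software prefetch into an axpy-style inner loop, which is roughly what adding
prefetch to a kernel amounts to.  Everything here is an assumption for
illustration: the routine name, the PF_DIST distance, and the use of GCC's
__builtin_prefetch are not from the actual ATLAS kernels, which are far more
involved than this.

   /* Hypothetical sketch only: prefetch ahead in an axpy-style inner loop.
    * PF_DIST (how far ahead to prefetch, in elements) would need tuning
    * per machine; this is not the ATLAS kernel code. */
   #define PF_DIST 16

   static void daxpy_pf(const int N, const double alpha,
                        const double *x, double *y)
   {
      int i;
      for (i = 0; i < N; i++)
      {
         /* hint to start fetching data needed soon; __builtin_prefetch is a
          * GCC extension (args: address, read(0)/write(1), temporal locality) */
         __builtin_prefetch(x + i + PF_DIST, 0, 0);
         y[i] += alpha * x[i];
      }
   }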

I'm not sure why the GEMM kernel is so much better than any of my hand-written
kernels; the most likely suspect is that the GEMM kernel unrolls the inner
loop to a depth of 40; Pentiums love extremely high inner-loop unrollings . . .
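
For what an inner-loop unrolling looks like, here is a hedged sketch (not the
ATLAS code) of a dot-product loop unrolled to a depth of 4; the GEMM kernel
mentioned above does the same sort of thing, only much deeper.  The routine
name and the assumption that N is a multiple of 4 are mine, just to keep the
example short.

   /* Hypothetical sketch of inner-loop unrolling (depth 4 here; the ATLAS
    * GEMM kernel unrolls to 40).  Assumes N is a multiple of 4; real code
    * needs a cleanup loop for the remainder. */
   static double ddot_unroll4(const int N, const double *x, const double *y)
   {
      double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
      int i;
      for (i = 0; i < N; i += 4)
      {
         s0 += x[i]   * y[i];     /* separate accumulators shorten the   */
         s1 += x[i+1] * y[i+1];   /* dependence chain and give the chip  */
         s2 += x[i+2] * y[i+2];   /* room to overlap the multiplies/adds */
         s3 += x[i+3] * y[i+3];
      }
      return (s0 + s1) + (s2 + s3);
   }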

Hopefully this has provided a starting point.  Let me know if you need more
info . . .

Cheers,
Clint