>2) The sse gemv/ger work great.  I noticed you included only the cases
>   that compiled best on your hardware.  Is it plausible that some of
>   the other unrolling options would be better on different
>   incarnations of the PIII, and that the timer should try them all
>   out when building the library?

I would hate to carry 20 cases for a system-specific implementation.  When I
add a general case, my thinking is that the extra install time is justified:
not only will it help the arch it was written for, it may well turn out to be
the best for an arch I have never heard of.  For a system-specific
implementation, I am more grudging about install time.  Obviously, if we find
a case where the parameters I have chosen are bad (say they are the best for
a PIII with on-chip L2, but 10% slower than the best for a PIII without), I
would add the extra case.  But keeping 10 or 20 cases that only work on Intel
SSE-enabled architectures is something I'd like to avoid (think of all the
system-specific code we hope to have: 3DNow!, AltiVec, etc.).

If you look in the package, you will see that I have actually written a
transpose GEMV generator.  Right now I don't use it, because the install time
is too high (the package also includes a search routine, which I gave up on
for a while) . . .
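
To make the install-time trade-off concrete, here is a minimal sketch of what "trying the cases out" amounts to: time each candidate kernel on the target machine and keep the fastest.  It is not the package's search routine.  The names dot_u1, dot_u4, time_candidate and pick_fastest are hypothetical, and a plain dot product stands in for the real GEMV/GER kernels.  Every extra candidate adds one more timing run to the install, which is the cost weighed above.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical candidate signature: a dot product stands in for a real kernel. */
typedef float (*dot_fn)(const float *, const float *, size_t);

/* Candidate 1: no unrolling. */
float dot_u1(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Candidate 2: unrolled by 4 with separate accumulators. */
float dot_u4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* cleanup for n not a multiple of 4 */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

/* CPU seconds taken by reps calls of one candidate. */
double time_candidate(dot_fn f, const float *a, const float *b,
                      size_t n, int reps)
{
    volatile float sink = 0.0f; /* keeps the calls from being optimized away */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        sink += f(a, b, n);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Index of the fastest candidate; ncand must be at least 1. */
size_t pick_fastest(const dot_fn *cands, size_t ncand,
                    const float *a, const float *b, size_t n, int reps)
{
    size_t best = 0;
    double best_t = time_candidate(cands[0], a, b, n, reps);
    for (size_t k = 1; k < ncand; k++) {
        double t = time_candidate(cands[k], a, b, n, reps);
        if (t < best_t) {
            best_t = t;
            best = k;
        }
    }
    return best;
}
```

With 20 candidates instead of 2, the loop in pick_fastest runs 20 timings per routine and precision, so install time grows linearly with the number of cases kept.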

>3) prefetcht0 -> prefetchnta = +20% !  I'll be forwarding you some new
>   headers soon.

Yow!  You mean another 20% from what we've already got for SGEMV/SGER?
If so, sweet . . .
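
For readers following along: in C, the prefetcht0 -> prefetchnta change corresponds to switching the hint passed to the _mm_prefetch intrinsic from _MM_HINT_T0 to _MM_HINT_NTA.  The sketch below is not the SSE kernel under discussion.  It is a plain row-major GEMV with a hypothetical name (sgemv_rowmajor_pf) and a hypothetical, untuned prefetch distance (PF_DIST).  One plausible reason the non-temporal hint helps is that GEMV reads each element of A exactly once, so there is little point in keeping A in the caches.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
/* Non-temporal prefetch: fetch the line while minimizing cache pollution. */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)0)   /* no-op where SSE is unavailable */
#endif

#define PF_DIST 64  /* hypothetical prefetch distance in floats; needs tuning */

/* y += alpha * A * x for a row-major M x N matrix A with row stride lda. */
void sgemv_rowmajor_pf(size_t M, size_t N, float alpha,
                       const float *A, size_t lda,
                       const float *x, float *y)
{
    /* One past the last element of A that is actually read. */
    size_t end = (M > 0) ? (M - 1) * lda + N : 0;

    for (size_t i = 0; i < M; i++) {
        const float *row = A + i * lda;
        float acc = 0.0f;
        for (size_t j = 0; j < N; j++) {
            /* Roughly one prefetch per 64-byte line (16 floats). */
            if ((j & 15) == 0) {
                size_t ahead = i * lda + j + PF_DIST;
                if (ahead < end)
                    PREFETCH_NTA(A + ahead);
            }
            acc += row[j] * x[j];
        }
        y[i] += alpha * acc;
    }
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing changes, and the best distance and hint have to be found by measurement on each machine.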

>4) The complex case is about done, and looks very good, as you
>   expected.  I'm having trouble tuning these as the timer results
>   jump around *a lot*, even when I use -DWALL on time.c

Excellent!  Have you tried setting -DPentiumCPS=<Mhz>?  That will give you
access to the cycle-accurate assembly language timer . . .
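
As a rough illustration of why a cycle timer needs the clock rate: a cycle counter returns ticks, and the MHz figure is what converts ticks to seconds.  The sketch below is not the package's timer.  It assumes GCC or Clang on x86 (for __rdtsc from <x86intrin.h>), and the names read_cycles, cycles_to_seconds and time_call are hypothetical.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the processor's time-stamp counter. */
uint64_t read_cycles(void) { return (uint64_t)__rdtsc(); }
#else
#include <time.h>
/* Placeholder for non-x86 builds: clock() ticks, NOT processor cycles. */
uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

/* Convert a cycle count to seconds, given the clock rate in MHz. */
double cycles_to_seconds(uint64_t cycles, double mhz)
{
    return (double)cycles / (mhz * 1.0e6);
}

/* Time one call of fn and return the elapsed time in seconds. */
double time_call(void (*fn)(void), double mhz)
{
    uint64_t t0 = read_cycles();
    fn();
    uint64_t t1 = read_cycles();
    return cycles_to_seconds(t1 - t0, mhz);
}
```

For example, 500,000,000 cycles on a 500 MHz part is exactly one second.  If the MHz value supplied does not match the real clock rate, every reported time is off by the same factor.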

>5) Next step on dgemv is to try to unravel your _mm.c and add
>   prefetch.  

Thanks for all the work.  This really sounds great . . .