
Re: ATLAS



Hello again!

Two related questions:

1) You mention in the docs that 'the idea is that X comes from L1'.
   In your opinion, does this mean that accesses to X are virtually
   free, or just somewhat faster?  Is it worth copying X into
   registers?  

2) This is only a preliminary guess, but I suspect that the transpose
   case could also benefit from an axpy implementation, at least in
   the complex case.  The reason is that in such an implementation you
   can load real(x) into one xmm register and imag(x) into another,
   fix the signs *once* for the whole loop in the latter register, and
   run a loop like (load, mul, add, load, mul, shuf, add).  The other
   way, you have to add at least one extra multiplication and several
   unpck/shuf's per iteration.  I count 7 vs. 14 instructions per
   iteration, though I may be able to lower the latter somewhat.  This
   may not pay for the data separation, but then again it may.
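
To make the loop in (2) concrete, here is a rough sketch in portable C
rather than SSE assembly.  The two-element arrays stand in for lanes of
an xmm register, the name caxpy_sketch is made up for illustration (it
is not an ATLAS routine), and interleaved (re, im) storage is assumed.
It only shows the ordering of the operations, not a tuned kernel.

```c
#include <assert.h>
#include <stddef.h>

/* y += x * a for n single-precision complex elements stored
 * interleaved as (re, im, re, im, ...).  x = xr + i*xi is one
 * complex scalar.  Hypothetical helper, for illustration only. */
static void caxpy_sketch(size_t n, float xr, float xi,
                         const float *a, float *y)
{
    assert(a != NULL && y != NULL);

    /* Stand-ins for the two registers, set up once before the loop. */
    const float xr_reg[2] = { xr, xr };   /* real(x) in both lanes      */
    const float xi_reg[2] = { xi, -xi };  /* imag(x), signs fixed once  */

    for (size_t k = 0; k < n; k++) {
        const float re = a[2 * k];
        const float im = a[2 * k + 1];

        /* load, mul, add: a times real(x), lane for lane */
        y[2 * k]     += re * xr_reg[0];
        y[2 * k + 1] += im * xr_reg[1];

        /* load, mul: a times the signed imag(x) register */
        const float p[2] = { re * xi_reg[0], im * xi_reg[1] };

        /* shuf, add: swap the two lanes, then accumulate */
        y[2 * k]     += p[1];   /* -xi * im */
        y[2 * k + 1] += p[0];   /*  xi * re */
    }
}
```

With x = 1+2i, a = {3+4i, 5+6i} and y = {1+i, 0}, this leaves
y = {-4+11i, -7+16i}, which matches the products worked out by hand.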

Take care,

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Camm,
> 
> >2) The sse gemv/ger work great.  I noticed you included only the cases
> >   that compiled best on your hardware.  Is it plausible that some of
> >   the other unrolling options would be better on different
> >   incarnations of the PIII, and that the timer should try them all
> >   out when building the library?
> 
> I hate to have 20 cases for system-specific implementations.  When I add
> a general case, my thought is that the extra install time is justified because
> not only will it help the arch it was written for, but it may well be the best
> for an arch I have never heard of.  For a system-specific implementation, I
> am more begrudging on install time.  Obviously, if we find a case where the
> parameters I have chosen are bad (say it is the best for PIII with on-chip
> L2, but 10% slower than the best for PIII without), I would add the extra
> case, but keeping 10 or 20 cases that only work for Intel SSE-enabled
> architectures is something I'd like to avoid (think of all the system-
> specific code we hope to have: 3DNow!, Altivec, etc.).
> 
> If you look in the package, I have actually written a transpose GEMV
> generator.  Right now, I don't use it because install time is too high
> (the package also includes a search routine, which I gave up on for a
> while) . . .
> 
> >3) prefetcht0 -> prefetchnta = +20% !  I'll be forwarding you some new
> >   headers soon.
> 
> Yow!  You mean another 20% from what we've already got for SGEMV/SGER?
> If so, sweet . . .
> 
> >4) The complex case is about done, and looks very good, as you
> >   expected.  I'm having trouble tuning these as the timer results
> >   jump around *a lot*, even when I use -DWALL on time.c
> 
> Excellent!  Have you tried setting -DPentiumCPS=<Mhz>?  That will give you
> access to the cycle-accurate assembly language timer . . .
> 
> >5) Next step on dgemv is to try to unravel your _mm.c and add
> >   prefetch.  
> 
> Thanks for all the work.  This really sounds great . . .
> 
> Cheers,
> Clint
> 
> 

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah