
Re: ATLAS developer release 3.3.1 is out



Julian,

>(I don't know how you calculate MFLOPS in a copy routine, but I
>think 920MB/s=58 MFLOPS).

Yeah, for routines like copy, we're calling it 1 (2 for cplx) flop per element,
though the correct number is, of course, 0.
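
For reference, the arithmetic behind the 58 MFLOPS figure can be sketched
as below. This assumes the 920MB/s counts both the load of X and the store
of Y, i.e. 16 bytes moved per double-precision element; the function name
copy_mflops is made up for illustration.

```c
#include <assert.h>

/* Convert a copy bandwidth into a nominal MFLOPS rate.
 * Assumption: mb_per_sec counts bytes read plus bytes written, so a
 * double-precision copy moves 16 bytes per element.
 * flops_per_elem is the nominal count: 1 for real, 2 for complex. */
double copy_mflops(double mb_per_sec, double bytes_per_elem,
                   double flops_per_elem)
{
    assert(bytes_per_elem > 0.0);
    /* millions of elements copied per second */
    double melems_per_sec = mb_per_sec / bytes_per_elem;
    return melems_per_sec * flops_per_elem;
}
```

With these assumptions, copy_mflops(920.0, 16.0, 1.0) gives 57.5, which
rounds to the 58 MFLOPS quoted above.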

>Maybe you recall that I have sent you a mail with an Athlon optimized
>STREAM some weeks ago.

>In the sources you find a vector copy routine (dassign.asm) that
>copies a vector with ~920MB/s on my Athlon classic 600/PC133

Great!  I really couldn't get prefetch to do anything for me on the
copy the way I was doing things.  I knew there had to be a better way.
Just so you know, if you are hinting I should grab your stuff for the
Level 1, it will be a long time before I'm ready to get to this level
of detail.  I took the one day to proof the tools, but I'll be busy
adding level 1 ops and getting ATLAS CVS-ready for quite a bit of time . . .

As I said before, I mainly wanted to get something out so others had the
option of playing with the Level 1 (a couple of people have asked about
tuning the Level 1) while I did this boring infrastructure stuff in parallel,
thus possibly leaving me with less level 1 work to do once I'm ready to
start.  Also, I must admit that I'm still looking for some applications
to motivate me to get real excited about tuning the level 1 . . . 

>It uses MMX/3dnow instructions and bypasses the caches via movntq. 

Hmm.  This is an interesting point.  I must say that when I use dcopy,
I usually expect my output vector to be in cache for reuse, but obviously
skipping the caches for a copy is the way to go, and will kick butt for
those cases you do not plan to immediately reuse Y . . .
Can you cache the output vector and not the input?  That would be best
for most of my operations . . .
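
To make the cache-bypassing idea concrete, here is a minimal sketch of a
copy whose stores skip the caches. It is not the dassign.asm routine: it
swaps in the SSE2 intrinsic _mm_stream_pd in place of the MMX movntq
mentioned above (so it would not run on an Athlon classic), and the name
copy_nt is hypothetical. It assumes y has the usual 8-byte alignment of a
double and falls back to a plain loop when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy n doubles from x to y, writing y with
 * non-temporal (cache-bypassing) stores where SSE2 is available. */
void copy_nt(size_t n, const double *x, double *y)
{
    size_t i = 0;
    assert(n == 0 || (x != NULL && y != NULL));
#if defined(__SSE2__)
    /* _mm_stream_pd needs a 16-byte aligned destination; peel one
     * element if y is only 8-byte aligned. */
    if (n > 0 && ((uintptr_t)y & 15u) != 0) {
        y[0] = x[0];
        i = 1;
    }
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(x + i); /* x may be unaligned */
        _mm_stream_pd(y + i, v);         /* store without caching y */
    }
    _mm_sfence(); /* order the streamed stores before returning */
#endif
    /* Remainder, or the whole vector when SSE2 is not available. */
    for (; i < n; i++)
        y[i] = x[i];
}
```

Because the stores bypass the caches, y is not left in cache afterwards,
which is exactly the trade-off raised above for callers that reuse Y
right away.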

>> Along the same lines, I'm already considering adding support for
>> atlas_set (set a vector to a constant)
>
>dfill() of my STREAM fills a vector with zeros with amazing 1020MB/s on
>my machine. It can be easily modified for all precisions.

Great.  I just finished the first hack at ATL_set tuning, and the only
special case for alpha that I allow is for 0.  I had heard of special
instructions for zeroing memory; glad to know we'll have it for x86 . . .
ATL_set with alpha=0 is used by ATLAS itself in various places, so this
should be nice indeed . . .
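
A rough sketch of what an alpha=0 special case in a set routine can look
like is given below. This is only an illustration, not the ATLAS code:
the name set_vec and its argument list are hypothetical, and it uses
memset as a stand-in for whatever tuned zeroing kernel is available. It
assumes IEEE 754 doubles, where all-zero bytes represent +0.0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical set routine: x[i*incx] = alpha for i = 0..n-1. */
void set_vec(size_t n, double alpha, double *x, size_t incx)
{
    assert(incx > 0);
    if (alpha == 0.0 && incx == 1) {
        /* Special case: contiguous zero fill.  Assumes all-zero bytes
         * are +0.0 (true for IEEE 754).  Note that alpha = -0.0 also
         * takes this path and is stored as +0.0. */
        memset(x, 0, n * sizeof *x);
        return;
    }
    /* General case: strided store of an arbitrary constant. */
    for (size_t i = 0; i < n; i++)
        x[i * incx] = alpha;
}
```

The point of the special case is that a zero fill needs no load of alpha
into a register and can be handed to the fastest memory-clearing path the
platform offers.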

Thanks for the info,
Clint