
P4 & Athlon



Guys,

>Just wondering if in principle the p4 can be made this good.  My
>understanding is definitely no with the normal x87 cpu, as the Athlon
>can do 2x here but the p4 1x.  Theoretical peaks for p4 SSE2 and
>athlon x87 cpu should be the same, right?  Those SSE2 instructions are
>big, and no consideration of instruction decoding has at least entered
>my mind in writing anything, but I would still be very surprised if
>such considerations could produce an extra 30%.  My suspicion is that
>the P4 SSE2 attainable peak is simply less than that of the Athlon x87
>cpu primarily due to the instruction size and the load this puts on
>the decoder, but that's just a guess.  Anyone have some real
>information here?

As Peter said, I think the trace cache should alleviate decoding problems.  I
believe there is certainly more to be gotten out of SSE2.  With no decoding
problems, and greater bandwidth to memory, I would be surprised if SSE2
can't do a decent amount better when compared to the Athlon.
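
(For what it is worth, the way I count the peaks: the Athlon can retire one
x87 add and one x87 mul per cycle, i.e. 2 flops/cycle, while the P4 can
retire one 2-wide ADDPD and one 2-wide MULPD every two cycles, i.e.
4 flops / 2 cycles = 2 flops/cycle as well.  So yes, the theoretical peaks
should be the same, assuming I have the P4 throughputs right.)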

Historically, it usually happens that we reach a plateau of performance
because no one is beating us.  MKL will release the next version, blow our
doors off, and suddenly someone will find an extra 10%.  This has happened
a lot for me in the past, anyway.  Once you are the fastest in town, you
kind of get fat and happy, and you suspect there is no more to be found.
Then, someone comes along and beats you like a drum, and that spurs you
into the proper motivation to find the next performance trick :)

I do think prefetch ought to help.  As a matter of fact, I took one of Peter's
SSE2 kernels and hacked in some prefetch, for a massive 3% speedup (kernel
timing only).  I didn't have much time to spend on it, so I didn't pursue it.
One thing to be aware of here is that the P4 has a **128**-byte cache line.
That is right: if you want to avoid issuing repeated prefetches of the same
line, you need to unroll the double precision loop by *16*!
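
To make the unroll-by-16 concrete, here is a rough sketch of what one
prefetch per 128-byte line looks like with SSE2 intrinsics (untested and
purely illustrative, not one of the actual kernels; the routine name and the
prefetch distance of 64 doubles ahead are made up for the example):

#include <emmintrin.h>  /* SSE2 intrinsics; _mm_prefetch comes via xmmintrin.h */

/* One unrolled step of y += alpha*x: two doubles at offset off */
#define STEP(off) _mm_storeu_pd(y+i+(off), \
           _mm_add_pd(_mm_loadu_pd(y+i+(off)), \
                      _mm_mul_pd(valpha, _mm_loadu_pd(x+i+(off)))))

/* y += alpha*x over N doubles, N assumed to be a multiple of 16, so each
 * trip through the loop touches exactly one 128-byte P4 cache line of x
 * and therefore needs only one prefetch per iteration */
void axpy_unroll16(int N, double alpha, const double *x, double *y)
{
   const __m128d valpha = _mm_set1_pd(alpha);
   int i;
   for (i=0; i < N; i += 16)   /* 16 doubles == one 128-byte line */
   {
      /* grab the line we will need a few iterations from now */
      _mm_prefetch((const char*)(x+i+64), _MM_HINT_T0);
      STEP(0);  STEP(2);  STEP(4);  STEP(6);
      STEP(8);  STEP(10); STEP(12); STEP(14);
   }
}

The point is just that the prefetch gets issued once per 16 doubles instead
of once per 2.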

I would think that, minimally, fetching the next cols of B while you operate
on the current ones would be beneficial.  If you still have prefetch room,
repeatedly prefetching A as you use it can keep cache-line eviction costs
down.  Then, you unroll the last iteration of the N loop, so you can prefetch
the next mu/nu cols of A/B during that iteration, and so on . . .

With all the x86 archs, you really have to be careful with prefetch, since
it can hurt as well as help (I've found you can be much more lavish with
prefetch on the ev6, for instance).  This means you wind up needing to
unroll a lot of things to avoid unnecessary prefetch (unroll the last
iteration of the K loop to get the next cols of A, unroll the last iteration
of the M loop to get the next cols of B, unroll the last iteration of the N
loop to get the next kernel call's cols, etc); the sketch below shows the
general shape . . .
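
Here is a toy C version of the last-iteration unrolling I mean (hypothetical
helper, the real kernels do this in assembler): the K loop is peeled so the
prefetch of the next column gets issued exactly once, rather than cluttering
every iteration:

#include <emmintrin.h>

/* Toy dot product over a column of length K (K >= 1 assumed): do K-1 plain
 * iterations, then a peeled final iteration that also prefetches the next
 * column (a_next), one prefetch per 128-byte (16-double) line */
static double dotK_prefetch_next(int K, const double *a, const double *b,
                                 const double *a_next)
{
   double sum = 0.0;
   int k;
   for (k=0; k < K-1; k++)     /* normal iterations: no prefetch at all */
      sum += a[k] * b[k];
   for (k=0; k < K; k += 16)   /* peeled last iteration issues the prefetch */
      _mm_prefetch((const char*)(a_next+k), _MM_HINT_T0);
   sum += a[K-1] * b[K-1];     /* last iteration's flops */
   return sum;
}

The same pattern applies one level up: unroll the last M/N iteration and let
it fetch the next cols of B or the next kernel call's operands.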

>Greetings!  Does this mean that all kernels can be compiled from
>source, be it .c or assembler, now?  This is going to be a *lot*
>easier to deal with in a distributed atlas package, such as that in
>Debian, than shipping .o files.

Yes, everything is now being compiled; there are no object files.  Julian
submitted some improved kernels, but I was not able to get big improvements
in the few minutes I thought I could spare at this point, so I decided to go
with what we already had working.  There is not room for dramatic
improvements, so keeping the stable release using the already-incorporated
compilable files seemed to be the way to go . . .

Cheers,
Clint