
prefetch II



Guys,

The main point here is to highly recommend the technique Julian suggested
last time: time your code with no prefetch instructions using an in-cache
timing, and then make sure that the in-cache performance does not go down as
you add prefetch.  You then have a pretty good idea that the prefetch is not
adding overhead, even if the prefetch is useless.  Of course, you still do
out-of-cache timings in order to see what prefetch you need . . .

The amazing thing to me is that I was too bone-headed to apply this to prefetch,
since I have used this technique with other stuff (e.g., register prefetch).
For those wanting to employ it, if you pass moves="" in your ummcase line
of the kernel timer, the timer will leave all operands in place, and thus
time again and again on the in-cache data (assuming your nb is small enough
to keep the operands there) . . .
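
For anyone who wants to see the idea outside the kernel timer, here is a
rough sketch.  It is NOT the ATLAS timer and NOT a gemm kernel: dot_plain,
dot_prefetch, time_in_cache and PF_DIST are names made up for illustration,
and __builtin_prefetch assumes GCC or Clang.  It just times the same small
operands again and again, so you can check that the prefetched variant is
no slower than the plain one when everything is already in cache.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Toy stand-in for a kernel: plain dot product, no prefetch. */
double dot_plain(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Hypothetical prefetch distance in elements; a tuning knob, not a
 * recommended value. */
#define PF_DIST 64

/* Same arithmetic in the same order, plus prefetch hints PF_DIST
 * elements ahead.  The main loop stops early so every prefetched
 * address lies inside the arrays; the tail loop finishes the rest. */
double dot_prefetch(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    size_t i = 0;
    size_t main_end = (n > PF_DIST) ? n - PF_DIST : 0;

    for (; i < main_end; i++) {
        __builtin_prefetch(&x[i + PF_DIST], 0, 0);
        __builtin_prefetch(&y[i + PF_DIST], 0, 0);
        sum += x[i] * y[i];
    }
    for (; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

typedef double (*kernel_fn)(const double *, const double *, size_t);

/* Call fn reps times on the same operands and return the average
 * seconds per call.  The operands are never flushed or moved, so if
 * n is small enough they stay in cache after the warm-up call. */
double time_in_cache(kernel_fn fn, const double *x, const double *y,
                     size_t n, int reps)
{
    assert(reps > 0);
    volatile double sink = fn(x, y, n);   /* warm-up: load the cache */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        sink += fn(x, y, n);              /* volatile keeps the calls alive */
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}
```

Compare time_in_cache(dot_plain, ...) against time_in_cache(dot_prefetch, ...)
with a large reps; clock() is coarse, so a single call tells you nothing.
If the prefetched time is clearly higher, the prefetch is costing you
something even before you look at out-of-cache behavior.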

I applied this technique to two previously written kernels, with modest results.
On the ev6, the kernel runs at 94.5% of peak when in cache.  I managed to
get the prefetched kernel to clock in at 91% of peak when the kernel timer
flushed 10 times the actual cache size (more like 93% if I just flush the
cache size).

However, this "best" case according to the kernel timer only got around 86%
of peak for the full gemm.  Taking a smaller nb, which the kernel timer claimed
got roughly 89-90% of peak, got my full gemm up to 88% of peak.  I would now
take a bow, if Goto's GEMM hadn't been achieving 92-93% of peak for several
years :)

That seemed about as far as I could push my ev6 performance, so I returned
to the site of my former humiliation on the PIII.  If you remember, I had
a kernel that got great performance according to the kernel timer, but ran
slower in full gemm than no prefetch at all.

The new kernel (with all no-overhead prefetch) clocked in at 76% of peak, 
whereas the generated kernel got a puny 70%.  The full gemm based on the
generated kernel peaked around 70.6% of peak.  The full gemm based on my
mighty new kernel peaked at . . . 71% of peak.  Wow, what a difference.
One second before hurling my laptop across the room, I timed LU, and
found a 3-10% performance advantage for the new kernel over generated for
LU, so it seems that the gemm timer is not telling the whole story . . .

My guess is that CacheEdge helps full gemm out quite a bit, but LU's gemm-calls
don't always have a good shape for CacheEdge, and then the prefetch can help
matters (if CacheEdge is rolling, you don't really need prefetch as much, since
the operands are already L2-contained).

Anyway, if you mess around with prefetch, do not forget to redo CacheEdge.
When I first timed my new full gemm, it was slower than the old, until
I adjusted this quantity . . .

I still am not happy that the kernel timer does not always predict the
best NB, and that it seems to be pretty inaccurate when prefetch comes in,
but it is not clear to me what to do about it . . .

The kernel timing does not include the low order overheads (data copy, outside
loop costs, movement of C, etc).  My guess is that in the past the use of
CacheEdge offset these losses so the predicted rates were fairly accurate.
With prefetch, you are doing something CacheEdge will do, so you look a lot
better in the kernel, but in practice the improvement is not so great.  Since
you don't have CacheEdge offsetting the overheads, your kernel timings wind up
higher than your full gemm.  Or at least that is one fairly random guess :)
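
To put a rough number on that guess, here is a back-of-the-envelope sketch.
It assumes (and this is only an assumption) that the whole gap between the
kernel timer and the full gemm is overhead the kernel timer never sees, and
that this overhead can be written as a fraction of kernel time.  The
function names are made up for illustration.

```c
#include <assert.h>

/* Fraction of peak the full gemm would reach if the kernel runs at
 * kernel_fraction of peak and the untimed overheads (copy, outer
 * loops, moving C) add overhead_ratio times the kernel's run time.
 * Same flops, (1 + overhead_ratio) times the time. */
double full_gemm_fraction(double kernel_fraction, double overhead_ratio)
{
    return kernel_fraction / (1.0 + overhead_ratio);
}

/* Inverse of the above: the overhead ratio implied by a measured
 * kernel fraction and a measured full gemm fraction. */
double implied_overhead_ratio(double kernel_fraction, double full_fraction)
{
    assert(full_fraction > 0.0);
    return kernel_fraction / full_fraction - 1.0;
}
```

Plugging in the PIII numbers above, implied_overhead_ratio(0.76, 0.71) comes
out at about 0.07, i.e. under this simple model the untimed overheads would
be costing roughly 7% on top of the kernel time.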

Cheers,
Clint