[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

prefetch ravings


This is just some musings on prefetch in level 3 kernels.  If you are not
interested, just ignore.  

I've been confused by various aspects of prefetch, and have done a few
experiments that leave me with a little knowledge, and a lot of superstitions
and suppositions.  Seeing other people's kernels, hearing their
explanations of what they do, has often spurred me to try new things,
and I thought exposure to some stuff that didn't work well enough to become
embodied into kernels might nonetheless help other people with their own
thinking, even if all my speculations are in error . . .  Obviously, I
also hope that someone may figure out what I don't have the oomph for
myself . . .

So, this is just a rambling series of observations, but perhaps useful in
that sense.  All of this is for Level 3 gemm kernel only: things are much
more clear for Level 2/1 (at least I believe they are :) . . .

In most of my prefetch success stories, I've used it on the current block.
Mostly, what I'm doing here is prefetching the forthcoming panels of B,
and repetitively fetching blocks of A and B that should already be in cache,
but may have been knocked out due to cache-line conflicts.  This approach
gave modest speedups on the alpha and IA64 kernels, and relatively large
speedups on the AltiVec . . .
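For concreteness, here is roughly the shape of that current-block scheme in
portable C, using gcc's __builtin_prefetch and a made-up block size; the
real kernels are tuned and written in assembly, so take this as a sketch of
the idea only (prefetch the forthcoming column of B while computing with
the current one):

```c
#define NB 60   /* hypothetical block size; real kernels tune this */

/* C += A*B on NB x NB column-major blocks.  While computing with the
 * current column of B, issue prefetches for the next column, so its
 * loads are (with luck) resident before their first use.  A sketch,
 * not ATLAS's actual kernel. */
static void mm_prefB(const double *A, const double *B, double *C)
{
   int i, j, k;
   for (j = 0; j < NB; j++)
   {
      const double *Bcol = B + j*NB;
      if (j < NB-1)                        /* forthcoming panel of B  */
         for (k = 0; k < NB; k += 8)       /* one fetch per 64B line  */
            __builtin_prefetch(Bcol + NB + k, 0, 3);
      for (k = 0; k < NB; k++)
      {
         const double b = Bcol[k];
         const double *Acol = A + k*NB;
         for (i = 0; i < NB; i++)
            C[j*NB+i] += Acol[i] * b;
      }
   }
}
```

The prefetch arguments (read, maximal temporal locality) are gcc's; on a
machine with no prefetch instruction the builtin simply compiles to nothing.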

Julian used a different idea: reducing the block size so you can keep 
extra stuff around, and prefetching the next iteration's block during
this iteration.  This is obviously going to work better on machines with
large caches, such as Athlon or alpha, than otherwise.  On a practical
note, I've since tried this scheme in some fashion or another on UltraSparcs
(see below) and PIII.  I have not tried it on IA64 or alpha, where it 
probably has a much better chance . . .
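A sketch of how I understand Julian's scheme, again with gcc's
__builtin_prefetch and invented sizes; the point is only the shape of the
outer loop (start pulling in iteration blk+1's operands while the multiply
for blk is in flight), not his actual kernel:

```c
#define NB 40   /* reduced block size, so two blocks can coexist in cache */

/* C += sum over blk of A_blk * B_blk, each block NB x NB column-major,
 * blocks stored contiguously.  While computing with this iteration's
 * blocks, prefetch the next iteration's, so they arrive before use. */
void mm_nextblk(const double *A, const double *B, double *C, int nblks)
{
   int blk, i, j, k;
   for (blk = 0; blk < nblks; blk++)
   {
      const double *Ab = A + (long)blk*NB*NB;
      const double *Bb = B + (long)blk*NB*NB;
      if (blk < nblks-1)                  /* start on next iter's data */
         for (i = 0; i < NB*NB; i += 8)   /* one fetch per 64B line    */
         {
            __builtin_prefetch(Ab + NB*NB + i, 0, 3);
            __builtin_prefetch(Bb + NB*NB + i, 0, 3);
         }
      for (j = 0; j < NB; j++)
         for (k = 0; k < NB; k++)
            for (i = 0; i < NB; i++)
               C[j*NB+i] += Ab[k*NB+i] * Bb[j*NB+k];
   }
}
```

Note NB has to shrink enough that this block, the next block, and the rest
of the working set all fit, which is why big caches help this scheme.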

Since I hadn't thought to try it, it spurred me to try something
similar for the PIII.  My idea was, even if the 16K L1 was too small to
do the exact same trick, it might help to fetch the next iteration's data
into the L2 (you can do this with SSE).
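For the record, the sort of thing I was trying, written with gcc's
__builtin_prefetch rather than the SSE instruction itself (a locality hint
of 2 should come out as SSE's prefetcht1, i.e. a fetch that stops at the
L2; the chunk size here is invented):

```c
#define CHUNK 2048   /* hypothetical per-iteration working set, in doubles */

/* While working through this iteration's chunk, ask for the NEXT
 * chunk with a reduced temporal-locality hint, which on SSE targets
 * should lower the data only as far as the L2.  A sketch of the
 * (failed) experiment, not the PIII kernel. */
double sum_prefL2(const double *x, int nchunks)
{
   double s = 0.0;
   int c, i;
   for (c = 0; c < nchunks; c++)
   {
      const double *p = x + c*CHUNK;
      if (c < nchunks-1)
         for (i = 0; i < CHUNK; i += 8)
            __builtin_prefetch(p + CHUNK + i, 0, 2);  /* L2-only hint */
      for (i = 0; i < CHUNK; i++)
         s += p[i];
   }
   return s;
}
```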

This didn't work at all.  What did work was to unroll the loops so I was
never doing any of the repetitive prefetching, but to prefetch this
iteration's data before its first use.  Using this technique, I got a 5%
faster kernel for the PIII!!

I was really excited, and so I quickly built it into a full gemm, only to
get results 10% slower than the non-prefetched kernel.  How can the prefetch
get a speedup for the kernel timer, and a slowdown for the actual code?

Well, looks like the answer is explicit and implicit L2 blocking.  ATLAS
does explicit L2 blocking with the CacheEdge parameter, and the algorithm
itself does it implicitly in many situations.  Prefetch is a pure win when
everything is in main memory (as the kernel timer makes it), but when
a lot of stuff is already in L2, it appears to be an overhead, and you
get a practical slowdown.

I'm not saying prefetching cannot be used to speed up the PIII; I suspect that
it can, but I have not yet found the right way to go about it.  I do believe
that keeping in mind the full gemm may already be holding things in L2 for
you can help explain some effects . . .

Another arena of mystery for me is the UltraSparc II processor.  It has a
prefetch command that I figured out how to use with gcc (but not with
Sun's cc).  I know the command works, 'cause I can speed up the Level 1
using it.  However, no matter what I do with it in the Level 3 BLAS, I get
no difference above clock resolution.

Let me restate that: I can do any kind of prefetch I want, prefetching
anything or nothing, with no measurable change.  No noticeable overhead,
no perceptible speedup.  It's maddening.  I think I might have written a
kernel that was
5 Mflop faster when the prefetching was on, but this may have been fevered
imagination only.

I originally thought the explanation of this was that the US II prefetch
was to the L2 instead of the L1.  If this were the case, I really expected a
big advantage from using Julian's approach of prefetching the next block while
operating on this one.  Can you guess the result?  No perceptible difference . . .

If anyone knows any inside stuff on how the UltraSparc II prefetch works, I'd
certainly like to hear it.  The documentation is incredibly sparse (it took
weeks of digging for me to find the assembly commands to make it happen) . . .

I have not tried much on a USIII, since I don't have good access, and I hear
that prefetch is turned off in hardware on a lot of systems.