Re: prefetch II
>> The main point here is to highly recommend the technique Julian suggested
>> last time: time your code with no prefetch instructions on an in-cache
>> timing, and then make sure that in-cache number does not go down as you add
>Do you think this would apply to l2 and l1 as well?
Well, depends on what you mean. In Level 3, I eliminated most non-free
prefetches. My hunch is that for Level 2 and 1, I would use this technique
to find the cheapest way to do prefetch, but probably leave more of it in.
The methodology for telling the ATLAS timer not to flush the cache differs
between the levels. Perhaps the easiest thing is to edit Make.inc and set
L2SIZE to 1K or something like that. It's kind of cumbersome when switching
back and forth between in- and out-of-cache timings, but it will work for
all timers . . .
>What does it mean to 'redo' cacheedge?
I chose another CacheEdge. I did it by hand-timings, because xdfindCE
lacked enough precision . . .
>It also would
>be nice to be able to come up with some theory as to the proper
>prefetch distance in each circumstance.
Yeah, that would indeed be nice. Some theory suggesting how to achieve world
peace would be handy as well :)
Some archs have some theory on prefetch (AMD gives a formula that does a
decent, though not perfect, job, for instance), but while I tune for particular
archs, I want the kernels to work everywhere . . .
Speaking of which, I recommend you use the prefetch macros defined in
atlas_prefetch.h whenever possible. This makes the prefetch portable across
architectures; I'm also experimenting with turning off prefetch when I know
it is not needed (eg., second GEMV call in SYMV) by manipulating these
symbols . . .
>> I still am not happy that the kernel timer does not always predict the
>> best NB, and that it seems to be pretty inaccurate when prefetch comes in,
>> but it is not clear to me what to do about it . . .
>> The kernel does not include the low order overheads (data copy, outside loop
>> costs, movement of C, etc). My guess is that in the past the use of CacheEdge
>> offset these losses so the predicted rates were fairly accurate. With prefetch,
>> you are doing something CacheEdge will do, so you look a lot better in the
>> kernel, but in practice the improvement is not so great. Since you don't
>> have CacheEdge offsetting the overheads, your kernel timings wind up higher
>> than your full gemm. Or at least that's one fairly random guess :)
>The only thing about this theory is that it would require the
>non-kernel overhead to be small, but measurable. Your 70 -> 70.6
>example above doesn't seem to bear this out :-)
I think you may misunderstand what I'm suggesting. The numbers you point to
are what caused me to take this guess as to what is going on; they are not a
weakening of the argument. The idea is that usually, CacheEdge counteracts
the overheads of the non-kernel code. Prefetch, however, steals the thunder
of CacheEdge (CacheEdge is not used in kernel timing), so that you get a real
noticeable difference between full and kernel timings.
To be perhaps clearer, what I am suggesting is that the number you see in
the kernel is:
kernel time : kernel
full time   : kernel + overhead - CacheEdge
(where overhead adds time, and CacheEdge is the time it wins back).
Now, without prefetch, the win with CacheEdge is substantial (2-20%). With
prefetch, the kernel itself contains most of the goodness of CacheEdge,
reducing the CacheEdge term in the above equations to almost 0, thus making
the kernel time much less accurate (the kernel would be perfectly predictive
in this scenario if CacheEdge == overhead) . . .
>1) Cannot get any prefetch benefit for N=1000000 on nrm2 or scal on
> torc. Previously, I had found prefetch to account for roughly half
> the l2 speedup on the PIII. Completely mystified here. It seems as
> though the P4 is already doing some kind of 'autoprefetch', is that
Most machines do some version of hardware prefetch, but usually software
prefetch is still helpful. The first thing is to make sure you aren't
swapping . . .
>2) In cache - 1.3Gflops double nrm2, 2-3Gflops single nrm2, 1.5Gflops
> single scal. Out of cache - 450 Mflops double nrm2, 880 single
> nrm2, 250 single scal. Compared with x0 non-SSE routines: 550,
> 580, 1 Mflop, 360, 516, 250. In cache scal for x0 has some real
> weirdness going on. All of this speedup is due to SSE and not
> prefetch, as far as I can tell. Rough first attempt numbers.
I have no idea at all what this means. My experience on AltiVec was that
it gave me no speedup for out-of-cache L1 for real, but did provide speedup
for complex. This is obviously because memory was the complete bottleneck,
so doubling the computational power is of no benefit. The G4, however, does
2 flops/cycle instead of one, and doesn't have near the bandwidth to memory
that the P4 has, so I would guess the P4 would show some speedup, particularly
for read-only ops (nrm2, iamax, etc) . . .
>3) No way to align 'dot', right?
Not practically. In theory, you can skew the various loops, with a version
for each way they can be mutually misaligned (X one element ahead of Y, two
ahead, etc.), but I would rather die than actually do it, and it would
require extra space or lots of regs to support . . .
>4) In cache numbers never see reality, right?
A user calling the op on in-cache operands would see it. I'm always on the
fence about whether the Level 2 and 1 BLAS should assume their operands are
in L2 cache or not. Calling them out-of-L2 is going to be bad news, but that
doesn't mean the user doesn't do it . . . There is no way I know of to detect
at the beginning of the op whether the operands are already in cache, so I
tend to take the bad case as the one we should optimize for . . .
>5) How do I figure out the max throughput from main memory with no
> cache misses? Is it possible that torc is so fast that it can
> basically process all of cache in l1 type operations in less time
> then it minimally takes to fill it?
I'm not sure what you are asking here. Is it possible for a computation
to be completely memory bound? Sure. Some machines are, even on L2 blas.
On the P4, I've had good luck with prefetched L1 read-only ops not being
memory bound, but who knows . . .
>6) what l1 increase is worth it? Should be attainable? does MKL
Nothing shows. The way it usually works is that you work real hard, think
you've got the best, and then someone beats you, and you suddenly figure out
a way to improve :)
>Sorry, one other item, the sse code fails the last two nrm2 tests for
>overflow and underflow. I take it by some comments that I've seen
>that this is do to the lack of extended precision, yes? Is it
>possible to fix this in the testers?
No. The error is in your routines. atlas_contrib.ps points out that this
routine is in the BLAS primarily for accuracy, not speed; scope that little
blurb out. If you can't meet the accuracy requirements, you need a different
implementation. When you have only native precision (64-bit output using
64-bit computation (SSE2)), you need to use the sum-of-squares algorithm, as
in nrm2_ssq1_x1.c. My guess is a properly done SSE2 version cannot run as
fast as an x87 algorithm (which uses 80-bit accuracy on 64-bit operands), so
I wouldn't
do it if I were you. For single precision, you'd either need to use SSE2
(double computations with single input, as in sdnrm216p_x1.c) or an
SSE-enabled sum of squares. Of these, SSE2 would have my money for ease of