


>You are of course right here, and I think, contrary to my earlier
>guess, the complex case shows this to be the case even more.  But I'm
>a bit confused:

I am guessing these timings are for the 450MHz PIII you talked about
earlier?  If so, I am indeed gratified to see CGEMV outperforming DGEMM . . .

>What appears to be going on here is that the extra writes in the N
>case pollute the L1 cache in an erratic fashion.  Apparently the nta
>doesn't guarantee that the data is in all levels of cache, making
>this disruption more evident.  The single precision appears to be
>entirely ram bandwidth limited, but then why does the axpy in the N
>case do *better*?  At least this seems to indicate that the complex N
>case could profit from a ddot implementation, no?

For the NoTranspose case, the axpy code accesses the array contiguously,
which is much more efficient than the lda-strided access dictated by a
dot-based implementation.  You can get around some of this by doing
at least a cache line's worth of dot products at once (thus using the
entire cache line that is fetched, as you naturally do with the daxpy
code), but this is extremely sensitive to the lda (then you must use your
big stride trick to handle the lda % cachelen case), which makes for a
poor kernel.  Also, if the hardware is doing further prefetch, it always
guesses in the contiguous dimension.  Finally, for large matrices, strided
access along the array causes TLB misses, and this is a killer, so ATLAS
has to block to avoid it, which in turn means the kernel operates on shorter
ddots, thus reducing the ddot implementation's effectiveness . . .
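
To make the access patterns above concrete, here is a minimal sketch in C.
It is not ATLAS code: the function names and ROWS_PER_LINE are made up for
illustration, and it assumes a column-major A with leading dimension lda,
unit-stride x and y, and a 32-byte cache line (4 doubles).  All three
routines compute y += A*x for the NoTranspose case; they differ only in how
they walk A.

```c
#include <stddef.h>

/* Hypothetical names throughout; A is M x N, column-major, leading dim lda. */

/* axpy-based: for each column j, y += x[j] * A(:,j).
 * The inner loop walks A contiguously, but reads and writes y every pass. */
void gemv_n_axpy(size_t M, size_t N, const double *A, size_t lda,
                 const double *x, double *y)
{
    for (size_t j = 0; j < N; j++) {
        const double xj = x[j];
        const double *Aj = A + j * lda;
        for (size_t i = 0; i < M; i++)
            y[i] += Aj[i] * xj;
    }
}

/* dot-based: for each row i, y[i] += dot(A(i,:), x).
 * The sum stays in a register, but the inner loop strides through A by lda. */
void gemv_n_dot(size_t M, size_t N, const double *A, size_t lda,
                const double *x, double *y)
{
    for (size_t i = 0; i < M; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < N; j++)
            sum += A[i + j * lda] * x[j];
        y[i] += sum;
    }
}

/* Assumption: 32-byte cache line, i.e. 4 doubles per line. */
#define ROWS_PER_LINE 4

/* dot-based, ROWS_PER_LINE dot products at once, so that each fetched line
 * of A is fully used (only if A + i happens to be line-aligned, which
 * depends on lda; that sensitivity is not handled here). */
void gemv_n_dot_blocked(size_t M, size_t N, const double *A, size_t lda,
                        const double *x, double *y)
{
    size_t i = 0;
    for (; i + ROWS_PER_LINE <= M; i += ROWS_PER_LINE) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t j = 0; j < N; j++) {
            const double *a = A + i + j * lda;
            const double xj = x[j];
            s0 += a[0] * xj;
            s1 += a[1] * xj;
            s2 += a[2] * xj;
            s3 += a[3] * xj;
        }
        y[i]     += s0;
        y[i + 1] += s1;
        y[i + 2] += s2;
        y[i + 3] += s3;
    }
    /* Remainder rows: plain dot products. */
    for (; i < M; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < N; j++)
            sum += A[i + j * lda] * x[j];
        y[i] += sum;
    }
}
```

Note that even the blocked variant still jumps lda elements between
consecutive fetches of A, so for large lda it touches a new page on nearly
every step of the inner loop, which is the TLB problem described above.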

Since fetching A is the dominant cost of any GEMV implementation, getting
this contiguous access on A overwhelms the natural advantage of dot-based
implementations in almost all cases where the L1 cache is
non-write-through . . .