Re: Testing ATLAS with user contributed code.



Peter,

>One of the reasons I wanted to get this testing working is to experiment
>with the non-temporal move instructions that are present in both AMD and
>Intel chips. They can move data without polluting the caches, but can this
>be of any benefit with ATLAS' way of copying matrices? Does it matter how the
>kernel accesses memory or is it only relevant in the copying code? Is there
>any difference in the way that the A and B matrices are handled by the
>copy code and the code that calls the kernels?

In general, the copy is a low-order term: O(N^2) data movement against
O(N^3) computation.  Since ATLAS does not use the copy kernels for very
small problems, the copy does not have a large impact in general.  You'll
get some advantage by not having the copy of the column panel of B flush
things from the cache, but I'm not sure you will see it performance-wise.
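
For reference, a cache-bypassing copy of the kind asked about might look
like the sketch below.  It is only an illustration, assuming an x86 compiler
with SSE2 intrinsics; copy_panel_nt is a hypothetical helper and not an
ATLAS routine, and ATLAS' real copy code also reorders data into block-major
storage, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_pd, _mm_stream_pd, _mm_sfence */
#endif

/* Hypothetical helper (not part of ATLAS): copy n doubles from src to dst.
 * With SSE2, the bulk of the data is written with non-temporal stores so
 * the destination lines are not pulled into the cache.  Without SSE2 the
 * function degrades to an ordinary element-by-element copy. */
void copy_panel_nt(double *dst, const double *src, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    /* _mm_stream_pd needs a 16-byte aligned destination, so copy leading
     * elements normally until dst + i is aligned. */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0) {
        dst[i] = src[i];
        i++;
    }
    /* Two doubles per iteration: unaligned load, non-temporal store. */
    for (; i + 2 <= n; i += 2)
        _mm_stream_pd(dst + i, _mm_loadu_pd(src + i));
    /* Make the streamed stores globally visible before returning. */
    _mm_sfence();
#endif
    /* Remaining tail (or the whole copy when SSE2 is unavailable). */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Note that streaming the destination only makes sense when the copied data
is not needed in cache right away, so whether this helps depends on which
copy it replaces.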

The case where the copy has the best chance of affecting things is very
large problems for which ATLAS cannot malloc enough workspace.  In this case,
the copy is not purely N^2 anymore (see page 12, atlas_over.ps for more info),
and the copy may become significant.  Purely to demonstrate the win, you might
try setting ATL_MaxMalloc in ATLAS/include/atlas_lvl3.h to 1024 or
something like that (that way ATLAS will always be in the case described
above: not able to copy all of one matrix).  The gap between ATLAS performance
before the change and after should be your theoretical peak speedup . . .
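
To put rough numbers on the two regimes, here is a small cost model for a
square N x N x N multiply.  The full-copy count is straightforward; the
limited-workspace count is an assumed simplification (one operand copied
once, the other re-copied for every panel of width nb), not a description
of ATLAS' actual loop ordering, for which atlas_over.ps is the reference.
The function names and the block size nb are hypothetical.

```c
/* Floating point operations in an n x n x n matrix multiply:
 * one multiply and one add per innermost step. */
double gemm_flops(double n)
{
    return 2.0 * n * n * n;
}

/* Elements copied when both operands are copied exactly once. */
double copy_elems_full(double n)
{
    return 2.0 * n * n;
}

/* ASSUMED model for limited workspace: one operand is copied once
 * (n*n elements), and the other is re-copied in full for each of the
 * n/nb panels, giving an O(n^3 / nb) copy term. */
double copy_elems_limited(double n, double nb)
{
    return n * n + n * n * (n / nb);
}

/* Copied elements per flop; smaller means the copy matters less. */
double copy_ratio_full(double n)
{
    return copy_elems_full(n) / gemm_flops(n);
}

double copy_ratio_limited(double n, double nb)
{
    return copy_elems_limited(n, nb) / gemm_flops(n);
}
```

Under this model the full-copy ratio is 1/N and vanishes as N grows, while
the limited-workspace ratio is 1/(2N) + 1/(2 nb) and levels off near
1/(2 nb), which is why the copy can become visible in that case.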

Cheers,
Clint