
Alignment of cleanup routines



Camm,

>I may be misgrokking something, but it seems to me that all this tells
>us is the relative alignment between subsequent rows of a and b.  The
>starting a and b themselves need not have the same alignment, right?
>(what I mean by this is a%16 =/!= b%16)  

OK, so my understanding is that A%16 == B%16 == 0 for all routines except
the K-cleanup cases.  So, it's clear why the K-cleanup cases don't have
the required alignment, but I guess the M & N cleanup is less so.  Here
is how I think it works: ATLAS now has a macro called ATL_MinMMAlign, which
is the minimum alignment A and B must have in order to use the copy-kernel.
ATLAS also dictates that (NB*sizeof()) % ATL_MinMMAlign == 0.  When ATLAS
copies the matrices, the inner matrix is always copied into a panel of size
NB * K, and the outer matrix is in a workspace varying between 
NB*K <= worksize <= [M,N]*K, depending on whether the outer matrix is A (M)
or B (N).  So, the inner panel is malloced with the correct alignment,
and the outer workspace has the correct alignment because NB*sizeof() is a
multiple of ATL_MinMMAlign, and so any multiple of NB (in this case, K*NB)
is aligned as well.  For non-K cleanup, all blocks inside the panels are separated
by NB*something blocks, so alignment between blocks is maintained.  For
everything except K-cleanup, each column is of length NB, so all columns
are aligned as well . . .

All these tricks break down for K-cleanup because of internal block alignment:
each block begins on a multiple of ATL_MinMMAlign (since m=NB for A and n=NB
for B), but the internal columns are misaligned according to k.  The cases
where more than one dimension is less than NB are not handled by
user-supplied cleanup anyway, so user-contributed cleanup never sees a case
where blocks of A or B do not start on a multiple of ATL_MinMMAlign (in this
case, 16).
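
To make the column arithmetic concrete, here is a minimal sketch in C.  It
assumes the block itself starts on an aligned address and that columns are
packed back to back.  MIN_MM_ALIGN and column_misalign are names made up for
this illustration, not ATLAS identifiers; the value 16 is the one used in
this thread.

```c
#include <stddef.h>

/* Stand-in for ATL_MinMMAlign; 16 is the value assumed in this thread. */
#define MIN_MM_ALIGN 16

/* Byte misalignment of column j inside a packed block whose columns each
 * hold len elements of elem_size bytes.  The block start is assumed to be
 * aligned already, so only the offset from the block start matters. */
size_t column_misalign(size_t j, size_t len, size_t elem_size)
{
    return (j * len * elem_size) % MIN_MM_ALIGN;
}
```

With 4-byte elements and a hypothetical column length of 40 (standing in
for NB), every column comes out at misalignment 0.  With a K-cleanup column
length of 3, columns 0 through 4 come out at 0, 12, 8, 4, 0 bytes, which is
the "misaligned according to k" effect.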

>So I guess the best strategy is to 
>1) use a 1x4

Hmm, don't know.

>2) give up on b alignment
>3) increment by 1 until a is aligned
>4) stride a by multiples of 4,2, or 1 depending on KB.
>Is this your understanding?  I was wondering basically if I could get
>b alignment as well.

Only for K-cleanup. M and N cleanup have the same alignment as full kernel.
For K-cleanup, A and B start with same alignment, but this doesn't help
you, since every column of A is applied to every column of B, and they
may all be misaligned, so concentrating on getting one aligned as
you say is probably the trick.  Essentially, you can then use aligned
loads of A, and unaligned loads of B.
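
A rough sketch of that strategy for one column of A against one column of
B, in portable scalar C.  The function name dot_peel and the 16-byte
boundary are assumptions for illustration; this is not code from ATLAS or
from anyone's contributed kernel.  In an SSE version the grouped loop is
where an aligned load such as _mm_load_ps would be used for A and an
unaligned load such as _mm_loadu_ps for B.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of k floats: step one element at a time until a is on a
 * 16-byte boundary, then take groups of 4.  Nothing is assumed about b. */
float dot_peel(const float *a, const float *b, size_t k)
{
    float sum = 0.0f;
    size_t i = 0;

    /* Peel: scalar steps until a+i sits on a 16-byte boundary. */
    while (i < k && ((uintptr_t)(a + i) % 16) != 0) {
        sum += a[i] * b[i];
        i++;
    }

    /* Main loop: a+i is 16-byte aligned here; b+i may or may not be. */
    for (; i + 4 <= k; i += 4) {
        sum += a[i]     * b[i]     + a[i + 1] * b[i + 1]
             + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    }

    /* Tail: whatever is left after the last full group of 4. */
    for (; i < k; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The peel loop runs at most three times for 4-byte floats, so its cost is
small next to the grouped loop whenever k is reasonably large.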

What Peter has done, and what I think makes a lot of sense, is having
cleanup for KB%4==0 cases which assumes alignment (and thus gets good
performance), and the non-aligned approach you discuss above for the KB%4 != 0
cases . . .
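
As a sketch of that split, a selector along these lines would pick between
the two cleanup flavours.  The enum and function names are hypothetical,
and the KB%4 test lines up with 16-byte columns only if elements are 4
bytes (single precision), which is an assumption here; Peter's actual code
may be organized quite differently.

```c
#include <stddef.h>

/* Hypothetical labels for the two K-cleanup flavours discussed above. */
typedef enum { KCLEAN_ALIGNED, KCLEAN_UNALIGNED } kclean_kind;

/* With 4-byte elements, KB%4 == 0 means each column spans a multiple of
 * 16 bytes, so every column of an aligned block is itself aligned. */
kclean_kind pick_kcleanup(size_t kb)
{
    return (kb % 4 == 0) ? KCLEAN_ALIGNED : KCLEAN_UNALIGNED;
}
```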

Cheers,
Clint