
Re: Alignment of cleanup routines



Hi Clint!

OK, I think I understand now.  I'm a little confused as to why we
call NB that instead of KB, but I think I get the drift.  At

http://people.debian.org/~camm/sgemm_20001123.tgz

there is a stab at a M/N cleanup (ATL_sgemm_SSE_1x1xkb), assuming only
KB%4=0, KB=ldb=lda=NB, M and N runtime parameters.  Also included is
another update to the main sgemm, which alters the handling of beta
and noticeably improves the complex and beta=x cases.  New timings
are below.  I also tried 1x5 and 1x6 strategies (i.e. 5 and 6
columns of a at a time), but these produced no clear advantage, and
they require that KB be divisible by 5 or 6 in addition to 4 (required
for alignment), effectively limiting KB to 60 (which happened to be
about the best anyway).  The code was also bigger, and in the 6 case
there aren't enough registers to handle beta well.  I therefore didn't
include these at the URL above, but I cite the timings below anyway
for general reference.
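
As a rough illustration of the divisibility constraint above, here is a
small C sketch.  It is not part of the tarball: kb_ok and
smallest_common_kb are hypothetical helpers, and the only rule they
encode is the one stated above (KB a multiple of 4 for alignment, and
also a multiple of the number of columns handled at a time).

```c
#include <assert.h>

/* Nonzero if kb is usable by a scheme that handles `cols` columns at a
 * time: kb must be a multiple of 4 (alignment) and of cols (unrolling).
 * Hypothetical helper, for illustration only. */
static int kb_ok(int kb, int cols)
{
    assert(kb > 0 && cols > 0);
    return kb % 4 == 0 && kb % cols == 0;
}

/* Smallest kb <= limit that works for both the 1x5 and the 1x6 scheme,
 * or 0 if there is none in range. */
static int smallest_common_kb(int limit)
{
    for (int kb = 4; kb <= limit; kb += 4)
        if (kb_ok(kb, 5) && kb_ok(kb, 6))
            return kb;
    return 0;
}
```

Under these assumptions the 1x5 candidates are the multiples of 20 and
the 1x6 candidates the multiples of 12, and 60 is the smallest block
size that satisfies both.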

=============================================================================
ATL_sgemm_SSE  (1x4, KB%4=0, KB=NB=MB=M=N=compile time)

1870.535758 +- 6.330379 ATL_sgemm_SSE 60 s 0 moves=
1521.126667 +- 5.950514 ATL_sgemm_SSE 60 s 0 
1868.306061 +- 6.258576 ATL_sgemm_SSE 60 s 1 moves=
1499.363636 +- 5.456918 ATL_sgemm_SSE 60 s 1 
1852.698182 +- 4.940221 ATL_sgemm_SSE 60 s 2 moves=
1487.004242 +- 6.715148 ATL_sgemm_SSE 60 s 2 
1836.563333 +- 2.189387 ATL_sgemm_SSE 60 c 0 moves=
1614.678182 +- 4.751711 ATL_sgemm_SSE 60 c 0 
1830.222424 +- 2.821958 ATL_sgemm_SSE 60 c 1 moves=
1609.694545 +- 4.590589 ATL_sgemm_SSE 60 c 1 
1813.752121 +- 5.435238 ATL_sgemm_SSE 60 c 2 moves=
1601.388485 +- 3.901250 ATL_sgemm_SSE 60 c 2 

=============================================================================
ATL_sgemm_SSE_1x1xkb (KB%4=0, KB=lda=ldb, M,N runtime parameters)
(This can still be improved, just a relatively quick attempt)


1777.392727 +- 3.408984 ATL_sgemm_SSE_1x1xkb 60 s 0 moves=
1431.476061 +- 5.571626 ATL_sgemm_SSE_1x1xkb 60 s 0 
1787.713939 +- 5.081814 ATL_sgemm_SSE_1x1xkb 60 s 1 moves=
1426.155455 +- 4.964096 ATL_sgemm_SSE_1x1xkb 60 s 1 
1769.283030 +- 1.887701 ATL_sgemm_SSE_1x1xkb 60 s 2 moves=
1430.022727 +- 4.897580 ATL_sgemm_SSE_1x1xkb 60 s 2 
1758.752727 +- 3.584418 ATL_sgemm_SSE_1x1xkb 60 c 0 moves=
1554.019394 +- 4.102429 ATL_sgemm_SSE_1x1xkb 60 c 0 
1747.281818 +- 5.047235 ATL_sgemm_SSE_1x1xkb 60 c 1 moves=
1540.130909 +- 2.620997 ATL_sgemm_SSE_1x1xkb 60 c 1 
1756.987879 +- 4.846072 ATL_sgemm_SSE_1x1xkb 60 c 2 moves=
1544.695758 +- 2.913440 ATL_sgemm_SSE_1x1xkb 60 c 2 

=============================================================================
1x5 strategy

1852.698182 +- 4.940221 1x5 60 s 0 moves=  
1495.246364 +- 6.011740 1x5 60 s 0         
1863.846667 +- 6.038048 1x5 60 s 1 moves=  
1490.245758 +- 8.228506 1x5 60 s 1         
1846.009091 +- 3.682223 1x5 60 s 2 moves=  
1486.569394 +- 5.048818 1x5 60 s 2         
1854.350000 +- 5.688195 1x5 60 c 0 moves=  
1621.323030 +- 4.716382 1x5 60 c 0         
1843.233333 +- 4.168469 1x5 60 c 1 moves=  
1613.016970 +- 4.716382 1x5 60 c 1         
1830.222424 +- 2.821958 1x5 60 c 2 moves=  
1609.813030 +- 5.220205 1x5 60 c 2         

=============================================================================
1x6 strategy  

1861.616970 +- 5.886441 1x6 60 s 0 moves=   
1506.960000 +- 7.139862 1x6 60 s 0          
1890.788485 +- 6.760493 1x6 60 s 1 moves=   
1511.156970 +- 6.693307 1x6 60 s 1          
1808.356364 +- 5.904534 1x6 60 s 2 moves=   
1505.596364 +- 7.393735 1x6 60 s 2          
1836.563333 +- 2.189387 1x6 60 c 0 moves=   
1624.645455 +- 4.590589 1x6 60 c 0          
1843.233333 +- 4.168469 1x6 60 c 1 moves=   
1623.102727 +- 5.274856 1x6 60 c 1          
1834.340000 +- 0.000004 1x6 60 c 2 moves=   
1616.339394 +- 4.769277 1x6 60 c 2          

=============================================================================

Take care,



R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Camm,
> 
> >I may be misgrokking something, but it seems to me that all this tells
> >us is the relative alignment between subsequent rows of a and b.  The
> >starting a and b themselves need not have the same alignment, right?
> >(what I mean by this is a%16 =/!= b%16)  
> 
> OK, so my understanding is that A%16 == B%16 == 0 forall routines, excepting
> the K-cleanup cases.  So, it's clear why the K-cleanup cases don't have
> the required alignment, but I guess the M & N cleanup is less so.  Here
> is how I think it works: ATLAS now has a macro called ATL_MinMMAlign, which
> is the minimum alignment A and B must have in order to use the copy-kernel.
> ATLAS also dictates that (NB*sizeof()) % ATL_MinMMAlign == 0.  When ATLAS
> copies the matrices, the inner matrix is always copied into a panel of size
> NB * K, and the outer matrix is in a workspace varying between 
> NB*K <= worksize <= [M,N]*K, depending on whether outer matrix is A (M)
> or B (N).  So, the inner panel is malloced with the correct alignment,
> and the outer workspace has the correct alignment because NB is a multiple
> of ATL_MinMMAlign, and so any multiple of NB (in this case, K*NB) is aligned
> as well.  For non-K cleanup, all blocks inside the panels are separated
> by NB*something blocks, so alignment between blocks is maintained.  For
> everything except K-cleanup, each column is of length NB, so all columns
> are aligned as well . . .
> 
> All these tricks break down for K-cleanup because of internal block alignment;
> Each block begins on a multiple (since m=NB for A and n=NB for B), but the
> internal columns are misaligned according to k.  The cases where more than
> one dimension is less than NB are not handled by user-supplied cleanup anyway,
> so user-contributed cleanup never sees a case where blocks of A or B are
> not multiples of ATL_MinMMAlign (in this case, 16).
> 
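To make the alignment argument above concrete, here is a small C sketch
of the offset arithmetic.  It is only an illustration: MIN_MM_ALIGN
stands in for ATL_MinMMAlign with the value 16 cited in this thread, and
col_offset_bytes and all_cols_aligned are hypothetical helpers, not
ATLAS routines.  It assumes single precision and a block that itself
starts on an aligned address.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for ATL_MinMMAlign; 16 is the value mentioned in this thread. */
#define MIN_MM_ALIGN 16u

/* Byte offset of column j from the start of a packed block whose
 * columns are each `collen` floats long. */
static size_t col_offset_bytes(size_t collen, size_t j)
{
    return j * collen * sizeof(float);
}

/* Returns 1 if every one of the first `ncols` columns starts on a
 * MIN_MM_ALIGN boundary, given that the block itself does; else 0. */
static int all_cols_aligned(size_t collen, size_t ncols)
{
    assert(sizeof(float) == 4);  /* the arithmetic assumes 4-byte floats */
    for (size_t j = 0; j < ncols; ++j)
        if (col_offset_bytes(collen, j) % MIN_MM_ALIGN != 0)
            return 0;
    return 1;
}
```

With a column length of 60 (or any other multiple of 4) every column
offset is a multiple of 16 bytes.  With a column length of 58, as could
happen in K-cleanup, column 1 already sits 8 bytes past a boundary.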
> >So I guess the best strategy is to 
> >1) use a 1x4
> 
> Hmm, don't know.
> 
> >2) give up on b alignment
> >3) increment by 1 until a is aligned
> >4) stride a by multiples of 4,2, or 1 depending on KB.
> >Is this your understanding?  I was wondering basically if I could get
> >b alignment as well.
> 
> Only for K-cleanup. M and N cleanup have the same alignment as full kernel.
> For K-cleanup, A and B start with same alignment, but this doesn't help
> you, since every column of A is applied to every column of B, and they
> are all possibly misaligned, so concentrating on getting one aligned as
> you say is probably the trick.  Essentially, you can then use aligned
> loads of A, and unaligned loads of B.
> 
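A minimal sketch of the "step by 1 until a is aligned, then stride by 4"
idea, in portable C rather than SSE so the control flow is easy to
follow.  The names peel_count and dot_peel_a are hypothetical, and this
is not the code in the tarball.  The four-element acc array stands in
for one SSE register; in a real kernel the strided loop body would
presumably be an aligned load of a (e.g. _mm_load_ps) and an unaligned
load of b (e.g. _mm_loadu_ps).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of leading elements to handle one at a time before `a` reaches
 * a 16-byte boundary (never more than n).  Assumes 4-byte floats and
 * that `a` is at least 4-byte aligned. */
static size_t peel_count(const float *a, size_t n)
{
    size_t mis = (size_t)((uintptr_t)a % 16u) / sizeof(float); /* 0..3 */
    size_t peel = (4u - mis) % 4u;
    return peel < n ? peel : n;
}

/* Dot product of one column of A with one column of B (hypothetical). */
static float dot_peel_a(const float *a, const float *b, size_t n)
{
    size_t i = 0, peel = peel_count(a, n);
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f}; /* models one SSE register */
    float sum = 0.0f;

    /* Increment by 1 until a is aligned. */
    for (; i < peel; ++i)
        sum += a[i] * b[i];

    /* Stride by 4: a + i is 16-byte aligned here, b + i need not be. */
    for (; i + 4 <= n; i += 4) {
        assert((uintptr_t)(a + i) % 16u == 0);
        for (int l = 0; l < 4; ++l)
            acc[l] += a[i + l] * b[i + l];
    }

    /* Leftover tail when the remaining length is not a multiple of 4. */
    for (; i < n; ++i)
        sum += a[i] * b[i];

    return sum + acc[0] + acc[1] + acc[2] + acc[3];
}
```

The peel is computed from a only, which matches giving up on b
alignment: whatever b's offset is, it is simply read as-is.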
> What Peter has done, and what I think makes a lot of sense, is having
> cleanup for KB%4==0 cases which assumes alignment (and thus gets good
> performance), and the non-aligned you discuss above for the KB%4 != 0
> cases . . .
> 
> Cheers,
> Clint
> 
> 

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah