
*To*: R Clint Whaley <rwhaley@cs.utk.edu>
*Subject*: Re: Alignment of cleanup routines
*From*: Camm Maguire <camm@enhanced.com>
*Date*: 23 Nov 2000 01:17:54 -0500
*Cc*: atlas-comm@cs.utk.edu
*In-Reply-To*: R Clint Whaley's message of "Wed, 22 Nov 2000 18:07:11 -0500 (EST)"
*References*: <200011222307.SAA22406@enterprise.cs.utk.edu>

Hi Clint!

OK, I think I understand now.  I'm a little confused as to why we call
NB that instead of KB, but I think I get the drift.  At

http://people.debian.org/~camm/sgemm_20001123.tgz

there is a stab at an M/N cleanup (ATL_sgemm_SSE_1x1xkb), assuming only
KB%4=0 and KB=ldb=lda=NB, with M and N as runtime parameters.  Also
included is another update to the main sgemm, which alters the handling
of beta and thereby noticeably improves the complex and beta=x cases.
New timings are below.

I also tried 1x5 and 1x6 strategies (i.e. 5 and 6 columns of a at a
time), but these produced no clear advantage, and they require that KB
be divisible by 5 or 6 in addition to 4 (required for alignment),
effectively limiting KB to 60 (which happened to be about the best
anyway).  The code was also bigger, and in the 6 case there aren't
enough registers to handle beta well.  I therefore didn't include these
at the URL above, but I cite the timings below anyway for general
reference.

=============================================================================
ATL_sgemm_SSE (1x4, KB%4=0, KB=NB=MB=M=N=compile time)

        1870.535758 +- 6.330379 ATL_sgemm_SSE 60 s 0
moves=  1521.126667 +- 5.950514 ATL_sgemm_SSE 60 s 0
        1868.306061 +- 6.258576 ATL_sgemm_SSE 60 s 1
moves=  1499.363636 +- 5.456918 ATL_sgemm_SSE 60 s 1
        1852.698182 +- 4.940221 ATL_sgemm_SSE 60 s 2
moves=  1487.004242 +- 6.715148 ATL_sgemm_SSE 60 s 2
        1836.563333 +- 2.189387 ATL_sgemm_SSE 60 c 0
moves=  1614.678182 +- 4.751711 ATL_sgemm_SSE 60 c 0
        1830.222424 +- 2.821958 ATL_sgemm_SSE 60 c 1
moves=  1609.694545 +- 4.590589 ATL_sgemm_SSE 60 c 1
        1813.752121 +- 5.435238 ATL_sgemm_SSE 60 c 2
moves=  1601.388485 +- 3.901250 ATL_sgemm_SSE 60 c 2
=============================================================================
ATL_sgemm_SSE_1x1xkb (KB%4=0, KB=lda=ldb, M,N runtime parameters)
(This can still be improved, just a relatively quick attempt)

        1777.392727 +- 3.408984 ATL_sgemm_SSE_1x1xkb 60 s 0
moves=  1431.476061 +- 5.571626 ATL_sgemm_SSE_1x1xkb 60 s 0
        1787.713939 +- 5.081814 ATL_sgemm_SSE_1x1xkb 60 s 1
moves=  1426.155455 +- 4.964096 ATL_sgemm_SSE_1x1xkb 60 s 1
        1769.283030 +- 1.887701 ATL_sgemm_SSE_1x1xkb 60 s 2
moves=  1430.022727 +- 4.897580 ATL_sgemm_SSE_1x1xkb 60 s 2
        1758.752727 +- 3.584418 ATL_sgemm_SSE_1x1xkb 60 c 0
moves=  1554.019394 +- 4.102429 ATL_sgemm_SSE_1x1xkb 60 c 0
        1747.281818 +- 5.047235 ATL_sgemm_SSE_1x1xkb 60 c 1
moves=  1540.130909 +- 2.620997 ATL_sgemm_SSE_1x1xkb 60 c 1
        1756.987879 +- 4.846072 ATL_sgemm_SSE_1x1xkb 60 c 2
moves=  1544.695758 +- 2.913440 ATL_sgemm_SSE_1x1xkb 60 c 2
=============================================================================
1x5 strategy

        1852.698182 +- 4.940221 1x5 60 s 0
moves=  1495.246364 +- 6.011740 1x5 60 s 0
        1863.846667 +- 6.038048 1x5 60 s 1
moves=  1490.245758 +- 8.228506 1x5 60 s 1
        1846.009091 +- 3.682223 1x5 60 s 2
moves=  1486.569394 +- 5.048818 1x5 60 s 2
        1854.350000 +- 5.688195 1x5 60 c 0
moves=  1621.323030 +- 4.716382 1x5 60 c 0
        1843.233333 +- 4.168469 1x5 60 c 1
moves=  1613.016970 +- 4.716382 1x5 60 c 1
        1830.222424 +- 2.821958 1x5 60 c 2
moves=  1609.813030 +- 5.220205 1x5 60 c 2
=============================================================================
1x6 strategy

        1861.616970 +- 5.886441 1x6 60 s 0
moves=  1506.960000 +- 7.139862 1x6 60 s 0
        1890.788485 +- 6.760493 1x6 60 s 1
moves=  1511.156970 +- 6.693307 1x6 60 s 1
        1808.356364 +- 5.904534 1x6 60 s 2
moves=  1505.596364 +- 7.393735 1x6 60 s 2
        1836.563333 +- 2.189387 1x6 60 c 0
moves=  1624.645455 +- 4.590589 1x6 60 c 0
        1843.233333 +- 4.168469 1x6 60 c 1
moves=  1623.102727 +- 5.274856 1x6 60 c 1
        1834.340000 +- 0.000004 1x6 60 c 2
moves=  1616.339394 +- 4.769277 1x6 60 c 2
=============================================================================

Take care,

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Camm,
>
> >I may be misgrokking something, but it seems to me that all this tells
> >us is the relative alignment between subsequent rows of a and b.  The
> >starting a and b themselves need not have the same alignment, right?
> >(what I mean by this is a%16 =/!= b%16)
>
> OK, so my understanding is that A%16 == B%16 == 0 for all routines, excepting
> the K-cleanup cases.  So, it's clear why the K-cleanup cases don't have
> the required alignment, but I guess the M & N cleanup is less so.  Here
> is how I think it works: ATLAS now has a macro called ATL_MinMMAlign, which
> is the minimum alignment A and B must have in order to use the copy-kernel.
> ATLAS also dictates that (NB*sizeof()) % ATL_MinMMAlign == 0.  When ATLAS
> copies the matrices, the inner matrix is always copied into a panel of size
> NB * K, and the outer matrix is in a workspace varying between
> NB*K <= worksize <= [M,N]*K, depending on whether outer matrix is A (M)
> or B (N).  So, the inner panel is malloced with the correct alignment,
> and the outer workspace has the correct alignment because NB is a multiple
> of ATL_MinMMAlign, and so any multiple of NB (in this case, K*NB) is aligned
> as well.  For non-K cleanup, all blocks inside the panels are separated
> by NB*something blocks, so alignment between blocks is maintained.  For
> everything except K-cleanup, each column is of length NB, so all columns
> are aligned as well . . .
>
> All these tricks break down for K-cleanup because of internal block alignment;
> each block begins on a multiple (since m=NB for A and n=NB for B), but the
> internal columns are misaligned according to k.  The cases where more than
> one dimension is less than NB are not handled by user-supplied cleanup anyway,
> so user-contributed cleanup never sees a case where blocks of A or B are
> not multiples of ATL_MinMMAlign (in this case, 16).
>
> >So I guess the best strategy is to
> >1) use a 1x4
>
> Hmm, don't know.
>
> >2) give up on b alignment
> >3) increment by 1 until a is aligned
> >4) stride a by multiples of 4,2, or 1 depending on KB.
> >Is this your understanding?  I was wondering basically if I could get
> >b alignment as well.
>
> Only for K-cleanup.  M and N cleanup have the same alignment as full kernel.
> For K-cleanup, A and B start with same alignment, but this doesn't help
> you, since every column of A is applied to every column of B, and they
> are all possibly misaligned, so concentrating on getting one aligned as
> you say is probably the trick.  Essentially, you can then use aligned
> loads of A, and unaligned loads of B.
>
> What Peter has done, and what I think makes a lot of sense, is having
> cleanup for KB%4==0 cases which assumes alignment (and thus gets good
> performance), and the non-aligned you discuss above for the KB%4 != 0
> cases . . .
>
> Cheers,
> Clint
>

-- 
Camm Maguire                                            camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah
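
The alignment argument in the quoted message comes down to simple modular arithmetic, and it can be checked with a few lines of C. What follows is only a sketch under stated assumptions: blocks are column-major arrays of float whose base address is already 16-byte aligned, `MIN_ALIGN` stands in for ATL_MinMMAlign (16 in this thread), and the helper names `col_offset_bytes`, `all_cols_aligned` and `peel_count` are hypothetical, not part of ATLAS.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ATL_MinMMAlign; the thread uses 16 bytes (one SSE register). */
#define MIN_ALIGN 16

/* Byte offset of column j from the start of a column-major float block
 * whose columns are ld elements long (hypothetical helper). */
static size_t col_offset_bytes(size_t ld, size_t j)
{
    return j * ld * sizeof(float);
}

/* Returns 1 if all ncols columns start on a MIN_ALIGN boundary, assuming
 * the block itself starts on one; returns 0 otherwise. */
static int all_cols_aligned(size_t ld, size_t ncols)
{
    for (size_t j = 0; j < ncols; j++)
        if (col_offset_bytes(ld, j) % MIN_ALIGN != 0)
            return 0;
    return 1;
}

/* Number of leading floats to process one at a time before an address
 * reaches a MIN_ALIGN boundary ("increment by 1 until a is aligned").
 * Assumes addr is at least float-aligned (a multiple of 4). */
static size_t peel_count(uintptr_t addr)
{
    size_t mis = (size_t)(addr % MIN_ALIGN);
    return mis ? (MIN_ALIGN - mis) / sizeof(float) : 0;
}
```

With a column length of 60 (60*4 = 240 bytes, a multiple of 16) every column is aligned, which matches the full-kernel and M/N-cleanup case. With a K-cleanup length such as 58 (232 bytes), the second column already falls 8 bytes off a boundary, so only one operand can be brought into alignment by peeling and the other needs unaligned loads.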

**Follow-Ups**:**Re: Alignment of cleanup routines***From:*Camm Maguire <camm@enhanced.com>

**Re: Alignment of cleanup routines***From:*Camm Maguire <camm@enhanced.com>

**References**:**Alignment of cleanup routines***From:*R Clint Whaley <rwhaley@cs.utk.edu>
