
Re: [Math-atlas-results] SSE warnings, Band matrix request feature



Hi Clint!

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> >1) so I take it the level 3 proposal was for an extension of the blas
> >   spec?
> 
> Yep, if you have banded and packed routines take a leading dimension and some
> starting/stopping criteria, you can write level-3 based kernels (which are
> slightly modified dense kernels for packed), and then use the same recursive
> algorithms as in dense.  This gives major speedups.
> 

Interesting.  Too bad about the proposal.  I'm sure it is a lot of
work, yes?

> >2) My comment was that the existing kernels would of course not work.
> >   Why can't (different) kernels be used with narrow band cases?
> 
> What I meant by this is that narrow-band guys are essentially Level 1 ops,
> which means optimization is not as good as 2 or 3, and that reusing kernels
> is difficult, because there are no low-order costs you can ignore (think
> matrix copy for Level 3, vector copy Level 2) . . .
> 

True.  Is there any way for the banded Level 2 routines, for example, to
fall back to small loops over Level 1 routines for narrow matrices,
something like the small-case test in gemm?

> >a) a[i]*=b[i]; (should be a ?sbmv with k=0)
> >b) a[i]+=const.
> 
> These would be additional Level 1 ops, not banded or packed, surely?  Adding
> them as additional Level 1 ops would not be hard with the templates already
> in place . . .

OK, but a) is also sbmv with k=0, no?  Seems to work, anyway.

You'd mentioned before that ATLAS treats these as reference BLAS.  Then
why this? (SSE1, 850MHz P3)

/usr/lib/atlas/xsl2blastst  -R tbmv -n 1000 -F 200

----------------------------- TBMV ------------------------------
TST# UPLO TRAN DIAG    N    K  LDA INCX   TIME  MFLOP  SpUp  TEST
==== ==== ==== ==== ==== ==== ==== ==== ====== ====== ===== =====
   0    L    N    N 1000    1    2    1   0.00  158.7  1.00 -----
   0    L    N    N 1000    1    2    1   0.00   36.6  0.23 PASS 

1 tests run, 1 passed

/usr/lib/atlas/xsl2blastst  -R gemv -n 1000 -F 200

------------------------------- GEMV --------------------------------
TST# TR    M    N ALPHA  LDA INCX  BETA INCY   TIME MFLOP  SpUp  TEST
==== == ==== ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
   0  N 1000 1000   1.0 1000    1   1.0    1   0.03  67.4  1.00 -----
   0  N 1000 1000   1.0 1000    1   1.0    1   0.01 238.7  3.54 PASS 

1 tests run, 1 passed

> 
> >c) a[i][j]-=b[i]+b[j]-const
> 
> What the hell kind of operation is this?
> 

Actually, a syr2k with one vector all ones.  In fact, it seems many of
these routines get invoked on a vector of ones somewhere, or is that
just me?


One other item regarding possible precision.  I've written a
quadratic program using basically a loop over gemv and ger.  I've run
with both double and single precision, using both atlas and ref blas.
And it seems that atlas is losing a lot more precision than ref blas
for the single float case.  Here are my results (all these vectors
should be the same to within rounding):

=============================================================================
Ref blas, double precision:
=============================================================================
$3 = {0, 0, 0, 0.018940107898226873, 0.0066021159141143645, 0, 
  0.0098790949108835055, 0.002352625132218658, 0, 0, 0.043024871496419696, 0, 
  0, 0.003431704885730046, 0, 0, 0, 0.0028504331146380468, 0, 0, 
  0.0072203088464554126, 9.88852015697785e-05, 0, 0.0023821475938574058, 0, 0, 
  0, 0, 0, 0.013236189261078805, 0.020158668979830381, 0, 
  0.0061762775312050426, 0.0037328201000592612, 0.0089936656569935425, 
  0.0069173895768482232, 0, 0.0021318399813026172, 0, 0.014128699897621055, 
  0.004155492234810312, 0, 0, 0, 0.0053944443352520266, 0.011233760977118805, 
  0, 0.015775110574074071, 0.039780460298061364, 0, 0, 0, 0, 
  0.013780690365307539, 0.00013085617793068323, 0, 0.0056143347174945415, 
  0.0030021166895419344, 0.00017110725612916425, 0, 0, 0, 
  0.0022233377673914652, 0.0078112045315271994, 0.008062977599026782, 0, 
  0.0047622593146473503, 0, 0.0092151662037889982, 0, 0.013641357424147373, 
  0 <repeats 11 times>, 0.0023007066920399419, 0, 0, 0, 0.019783023745609019, 
  0, 0.0030098138926147877, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.004376432858798472}
=============================================================================
Ref blas, single precision
=============================================================================
$4 = {0, 0, 0, 0.0193224791, 0.00681497343, 0, 0.00986788794, 0.00191707187, 
  0, 0, 0.0434913486, 0, 0, 0.00334887533, 0, 0, 0, 0.00334869302, 0, 0, 
  0.00728768529, 1.79629933e-05, 0, 0.00284249382, 0, 0, 0, 0, 0, 
  0.0132361548, 0.0201619491, 0, 0.00567072676, 0.00403881213, 0.00907292403, 
  0.00605385005, 0, 0.00213182555, 0, 0.0145751694, 0.00393279782, 0, 0, 0, 
  0.00539445179, 0.0113568529, 0, 0.0154621108, 0.0397804491, 0, 0, 0, 0, 
  0.0135530541, 0.000135249895, 0, 0.00546122249, 0.00274914224, 
  0.000139613199, 0, 0, 0, 0.00207114615, 0.0076302751, 0.00818008091, 0, 
  0.00498831039, 0, 0.00935398508, 0, 0.0133951297, 0 <repeats 11 times>, 
  0.00230074138, 0, 0, 0, 0.0200425033, 0, 0.00290917116, 0, 0, 0, 0, 0, 0, 0, 
  0, 0, 0, 0.00403231615}
=============================================================================
atlas double precision
=============================================================================
$5 = {0, 0, 0, 0.018940107895699877, 0.0066021227977551469, 0, 
  0.0098791040760392978, 0.0023526251311276414, 0, 0, 0.043024871498807404, 0, 
  0, 0.0034317022092371905, 0, 0, 0, 0.0028504331110518455, 0, 0, 
  0.0072202984477313667, 9.888495348369022e-05, 0, 0.0023821475909385102, 0, 
  0, 0, 0, 0, 0.013236189261078857, 0.020158668979602598, 0, 
  0.0061762775267177909, 0.0037328174594609014, 0.0089936670203928423, 
  0.0069173895745572051, 0, 0.0021318399813026138, 0, 0.014128699899851334, 
  0.0041554912282962958, 0, 0, 0, 0.00539444433525209, 0.011233760973877238, 
  0, 0.015775110571483737, 0.039780460298061412, 0, 0, 0, 0, 
  0.013780692105380066, 0.00013085769806731604, 0, 0.0056143347156559715, 
  0.0030021166882750992, 0.00017110455923383634, 0, 0, 0, 
  0.002223337767391506, 0.0078112045288354984, 0.0080629767943949, 0, 
  0.0047622593115835016, 0, 0.0092151583019473293, 0, 0.013641357421016581, 
  0 <repeats 11 times>, 0.0023007066920400221, 0, 0, 0, 0.01978302374266707, 
  0, 0.0030098096180152987, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
  0.0043764328584068165}
=============================================================================
atlas single precision
=============================================================================
$6 = {0, 0, 0, 0.0198229793, 0.00542137539, 0, 0.00876359735, 0.00156316732, 
  0, 0, 0.0425283462, 0, 0, 0.00367816561, 0, 0, 0, 0.00305529614, 0, 0, 
  0.00622634217, 0, 0, 0.000910904317, 0, 0, 0, 0, 0, 0.0132361939, 
  0.021575097, 0, 0.00552129187, 0.0016044986, 0.00860607717, 0.00690866355, 
  0, 0.00213182229, 0, 0.0151669253, 0.00489541329, 0, 0, 0, 0.00539442617, 
  0.014026829, 0, 0.0175433327, 0.0397804528, 0, 0, 0, 0, 0.0139623322, 0, 0, 
  0.00566139678, 0.00161994447, 0.00017111942, 0, 0, 0, 0.00235463725, 
  0.00996533688, 0.00697260955, 0, 0.00575390365, 0, 0.0080448091, 0, 
  0.0121865831, 0 <repeats 11 times>, 0.00230072462, 0, 0, 0, 0.020749582, 0, 
  0.00329418154, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00432143174}
=============================================================================

Does this look normal?  All libs pass all testers/timers.  Is it
possible that the SSE1 FPU has fewer bits internally?  The x87 FPU
does everything at 80 bits internally, if I recall correctly?

> >d) ffts
> Kind of a big topic.  What about fftw (http://www.fftw.org/)?  I've never
> used it myself (I can't spell fft), but have heard good things about it.
> Uses some of the same kinds of ideas as ATLAS, as I understand it . . .
> 

Thanks for the tip!

> By the way, you mentioned optimization of non-unit stride vectors.  The
> speedups to be had are pretty meager, even in the best case (read only).
> My feeling is that 5% would be heroic.  Probably not worth thinking about
> except in exceptional cases.  For Level 1 ops, memory bandwidth is the big
> constraint most of the time, and prefetch is the only real analgesic.  From
> my limited experience, writing to the vector tends to kill the advantage of
> prefetch a great deal (I guess the bus is too busy to prefetch), so your
> big wins come on scalar-output routines like nrm2, iamax, ddot, etc.
> 

Agreed.  Have you looked into writing behind the read point at various
distances, and the different flavours of prefetch?

Take care,

> Cheers,
> Clint
> 
> 

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah