
RE: binary installation issues (cont'd)



Carl,

>I think I will also poke around ATLAS to see what sorts of things
>you measure and whether it makes sense to add lmbench-3 equivalents.

ATLAS does only a very minimum amount of such general timings.  The reason
is that such information is only rarely globally applicable . . .

>By the way, I think you might be interested in the lmbench-2
>or lmbench-3 timing harness.  It does some work to determine
>the minimum amount of work necessary to obtain accurate results,
>Does adding this sort of functionality to ATLAS interest you?

So, ATLAS already has this kind of thing in a very crude way.  During the
FPU probe, loop counts are increased until a run takes long enough for us
to guess the timings will be accurate (somewhere around a second or .5 second),
and only then are the real timings done.  The FPU is assumed to run at peak, and
the probe sets an ATLAS header file value telling all other timers how many flops
to do in order to ensure good timings.  This breaks down in a lot of ways
(if the FPU does not match peak, the flop count is too low, and if the
operation in question is much slower than the Level 3, it may be too high),
but this gives at least a rough estimate . . .
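
As a rough illustration of the calibration idea described above (this is
not ATLAS's actual code; the function names, the doubling strategy, and
the default thresholds are all assumptions made for the sketch), the loop
count can be grown until a single timed run lasts long enough, and a flop
budget can be derived from an assumed peak rate:

```python
import time


def calibrate_reps(kernel, min_seconds=0.5, start_reps=1,
                   max_reps=1 << 30, clock=time.perf_counter):
    """Double the repetition count until one timed run of `kernel`
    lasts at least `min_seconds` (or `max_reps` is reached).

    Returns (reps, elapsed) for the first run that was long enough.
    Hypothetical helper, not part of ATLAS.
    """
    reps = start_reps
    while True:
        t0 = clock()
        for _ in range(reps):
            kernel()
        elapsed = clock() - t0
        if elapsed >= min_seconds or reps >= max_reps:
            return reps, elapsed
        reps *= 2


def flops_for_good_timing(assumed_mflops, min_seconds=0.5):
    """Flop count other timers should perform, assuming the machine
    really runs at `assumed_mflops` (the weak point noted above)."""
    return int(assumed_mflops * 1e6 * min_seconds)
```

If the assumed rate is wrong, the budget is wrong in the same proportion:
a kernel running at a tenth of the assumed rate takes ten times longer
than intended for the same flop count.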

>Thanks for the detailed response.  I think I will have
>to give up on binary-only releases.  I will still think
>about how to speed up ATLAS' install time.  

Remember that the ATLAS install has to do a decent job on machines ranging
from my Mom's Pentium 133 up to IBM's multiple-proc on a chip, handling
CISC, RISC, VLIW, EPIC, etc.  Until you have tested it on 20 or so different
architectures, monkeying with the search heuristic can be risky.

>Some ideas for speeding up the ATLAS install time:
>	- shorten timing intervals in a controlled fashion
>	  (e.g. using a variant of the lmbench timing harness)
I think this is feasible at some point, particularly in association with
getting some very accurate, system-specific timers, but it requires some
legwork, and who knows when it'll happen.

>	- figure out some benchmarks which might help 
>	  predict the best configuration, to reduce the
>	  number of experiments that need to be done to
>	  identify the best configuration

Ah, yes, go back to a priori techniques rather than AEOS ones.  Not likely.

>	- figure out how to do compilations in parallel (e.g.
>	  using "make -jN" and then test serially), although
>	  I think this will require a painful rewrite/integration
>	  of the various search algorithms since the search
>	  process is serial so you would need to conduct several
>	  searches in parallel
>	- ???

Never going to work.  In parallel, the timings will affect each other, as
they fight over memory bandwidth, skewing the serial optimizations.

>If I have time, I will try to create a prototype of the
>first idea and integrate it with one of the search 
>algorithms to show you how it would work.  Since I am
>leaving tonight on a two week business trip, it will have
>to wait until I return...

You can if you like.  However, understand that conservative does not even
begin to describe my approach to changing the heuristic.  The system we
have now is far from perfect, but I have a decent grasp of its shortcomings.
New changes affect various architectures in hard-to-predict ways.  If I could
remember the number of man-hours spent backing out "improvements" I've made
to this search routine you might begin to grok my extreme reluctance to
mess much with this aspect of ATLAS . . .

The primary place I'm fairly confident the present mmsearch can be sped up
is in the register blocking search, where I think theory can give us a 
smaller subset of combinations to search.  Someday, when I have time to
test the resulting code on a battery of 20 or so archs, I'll put the change
in.
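
To make the register blocking point concrete, here is a small sketch of
how a register-count argument could prune the search.  The model is an
assumption for illustration only, not the rule ATLAS uses: an mu x nu
register block is taken to need mu*nu accumulators plus mu values of A
and nu values of B, and anything that does not fit in the register file
is dropped before timing.

```python
def register_block_candidates(num_regs, max_dim=None):
    """Return the (mu, nu) register blockings that fit in `num_regs`
    floating point registers under the assumed model
        mu*nu + mu + nu <= num_regs.

    `max_dim` bounds each dimension of the unpruned grid and defaults
    to `num_regs`.  Hypothetical helper, not part of ATLAS.
    """
    limit = max_dim if max_dim is not None else num_regs
    candidates = []
    for mu in range(1, limit + 1):
        for nu in range(1, limit + 1):
            if mu * nu + mu + nu <= num_regs:
                candidates.append((mu, nu))
    return candidates
```

With 8 registers, this model keeps 6 of the 64 combinations in an 8 x 8
grid; whether the survivors always include the best performer on real
hardware is exactly what would need testing across architectures.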

>I just looked at the new P4 documentation on Intel's website
>double precision data too.  They call it SSE 2, and you can
>	ftp://download.intel.com/pentium4/download/netburst.pdf

Carl, Carl, Carl.  You are talking to the man with:
   A) A high fiber intake
and
   B) An internet connection in the john 
      (http://www.cs.utk.edu/~rwhaley/Pers/equip.html)
If it's about architecture, and mentioned on Ars Technica or Slashdot, I've
probably seen it :)


>I think you might be able to take the SSE 1 single precision
>kernel and quickly modify it to also generate SSE 2 double
>precision stuff.  (If I can get a Linux P4 box, I think I would
>be willing to help.)

Yep, it's looking like that may be critical, since some of the reviews
seem to be saying the x87 FPU is not up to snuff even with the one present
in the PIII . . .  So, I am very interested in SSE2 code; AMD's forthcoming
Sledgehammer is supposed to use SSE2 as well as Intel's product line . . .

Cheers,
Clint