
RE: binary installation issues (cont'd)

> >	- Instead of using "-Aa -D_INCLUDE_POSIX_SOURCE"
> >	  or simply "-Aa" for HP flags, you should probably use
> >	  "-Ae" instead.
> 
> The man page on my decrepit HP-UX box does not discuss -Ae, 
> though cc seems to take it without warnings.  What does -Ae 
> mean?

"-Ae" is basically equivalent to "-Aa -D_INCLUDE_POSIX_SOURCE"
except that it is much more succinct...

> >	- Add HP-UX specific code to discover number of CPUs
> 
> Thanks a lot for this code!  It will be in the next developer release.
> Can syscall also give the size of various caches, or other arch info?

"syscall()" is basically the generic means for doing a system 
call, so we are using an essentially undocumented system call
for extracting the number of CPUs...  
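
Incidentally, if you would rather avoid the undocumented path, I
believe HP-UX also exposes the processor count through the documented
pstat interface.  A minimal sketch (assuming pstat_getdynamic() and
the psd_proc_cnt field work the way I remember the man page
describing them):

	#include <sys/pstat.h>
	#include <stdio.h>

	int main(void)
	{
		struct pst_dynamic psd;

		/* fetch the single pst_dynamic record for the system */
		if (pstat_getdynamic(&psd, sizeof(psd), 1, 0) != -1)
			printf("%d CPUs online\n", (int) psd.psd_proc_cnt);
		return 0;
	}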

Looking through the various header files, I can see the data 
structures in the kernel that contain the interesting information, 
such as cache size, but there appears to be no interface to 
user-land for that data.  Sorry...

> >	- HP-PA machines generally only have an L1 cache,
> >	  they don't have an L2/L3 cache.  However, the L1
> >	  cache is usually between 256KB and 2MB for many
> >	  machines.  tune/sysinfo/L1CacheSize.c assumes
> >	  that L1 caches are at most 256KB.  I think a better
> >	  algorithm might be to use binary search where the
> >	  two end points are very small (say 1K) and very
> >	  large (say 2x MaxL2CacheSize).  The binary search
> >	  would be on the log2() size of the prospective cache
> >	  sizes.
> 
> OK, the first thing to know is that at the present time, the 
> Level 3 BLAS will get no advantage from pumping L1CacheSize 
> larger than 64K.  That's not to say it won't use larger L1 
> caches, but just that the L1CacheSize setting will not effect 
> performance if cranked above 64K.  It may have a good effect 
> on the Level 2 BLAS performance, though. I don't have time at 
> the moment to scope the changes you've made to the L1 cache 
> detector, but I'll try to scope it out at some point . . .

Even if it is not (currently) utilized by ATLAS, it is reassuring
to users to see the right numbers pop out of the system.  Also,
if ATLAS can make use of the larger L1 caches for Level 1 or
Level 2, then all the better...

On my HP780, which has a 1MB cache, the new L1CacheSize detector
reports 256KB.  This is not as bad as it sounds: when you
look at the memory latency curves, the latency jumps up at
256KB and plateaus until 1MB, where it takes another jump.  I
suspect that the HP780 may have a 4-way set associative cache,
and the jump in average latency is due to cache collisions once
you start using "too much" of the cache.  At any rate, the new
code still gives the right answer for the PII I have available.
So the new detector is accurately reporting the amount of cache
that is really available and fast, and ignoring that part of 
the cache where you start getting degraded performance.
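
To make the approach concrete, here is a rough sketch of the kind
of probe involved (illustrative only: a real detector, lmbench
included, randomizes the pointer chain to defeat prefetching and
uses a much more careful timer than clock()):

	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	/*
	 * Rough ns-per-load for a pointer chase over a buffer of
	 * 'size' bytes; the latency jumps each time 'size' crosses
	 * a cache boundary.
	 */
	static double chase_ns(size_t size, long iters)
	{
		size_t n = size / sizeof(void *), i;
		void **buf = malloc(n * sizeof(void *));
		void **p;
		clock_t t0, t1;
		long count = iters;

		if (buf == NULL)
			return -1.0;
		for (i = 0; i < n - 1; i++)	/* in-order chain */
			buf[i] = &buf[i + 1];
		buf[n - 1] = &buf[0];		/* wrap around */

		p = buf;
		t0 = clock();
		while (count-- > 0)
			p = (void **) *p;
		t1 = clock();
		if (p == NULL)			/* keep 'p' live */
			puts("");
		free(buf);
		return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / iters;
	}

	int main(void)
	{
		size_t size;

		/* binary searching on log2(size), as suggested above,
		 * amounts to walking these power-of-two sizes */
		for (size = 1024; size <= 8 * 1024 * 1024; size *= 2)
			printf("%6luKB: %.2f ns\n",
			    (unsigned long)(size / 1024),
			    chase_ns(size, 10000000L));
		return 0;
	}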

As part of lmbench version 3 I have been working on developing
a micro-benchmark that probes the cache and memory hierarchy
to figure out various things, such as the number, size, line
size, and access parallelism of the various caches, as well
as the parallelism of main memory and the TLB size.  Some
parts of the benchmark work better than others, but the single
biggest pain right now is discerning the number of caches and
identifying their 'edges' so as to infer their size.  One of
the things I found interesting was the substantial amount of
parallelism in modern memory subsystems, especially at the
cache levels.

lmbench version 3 also includes prototype benchmarks that attempt
to measure the latency of individual operations, such as {int,
uint64, float, and double} x {add, multiply, divide} and where
appropriate {div, rem}, as well as memory bandwidth for read
and copy, and a variety of other features you might find useful.
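
To give a flavor of the approach (this is just the idea, not
lmbench's actual code): operation latency, as opposed to throughput,
is measured with a chain of dependent operations, so the hardware
cannot overlap them:

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		double x = 1.0;
		long i, iters = 100000000L;
		clock_t t0, t1;

		t0 = clock();
		for (i = 0; i < iters; i++)
			x += 1e-9;	/* each add depends on the last */
		t1 = clock();
		/* loop overhead is included here; a real harness
		 * would measure and subtract it */
		printf("double add: ~%.2f ns (x=%g)\n",
		    1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / iters, x);
		return 0;
	}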

I think I will also poke around ATLAS to see what sorts of things
you measure and whether it makes sense to add lmbench-3 equivalents.

By the way, I think you might be interested in the lmbench-2
or lmbench-3 timing harness.  It determines the minimum
amount of work necessary to obtain accurate results,
and then it automatically and dynamically controls the looping
variable based on that information.  It looks like at least some
of the stuff in ATLAS uses fixed-size loops, which need to be
large enough so that fast machines generate accurate results,
and yet small enough so that slow machines don't run forever.
If you are interested I can look at creating an ATLAS version of
the timing harness which builds with your clocks.  This might also
save time since many machines only need timing intervals of
5,000 microseconds, and I think your timing intervals are often
longer.  Also, the lmbench timing harnesses run the benchmark
several times and report either the median or minimum result,
which provides some robustness in the face of experimental error
or spurious system activity.

For example, a simple benchmark in lmbench-2 would look like:
	#include "bench.h"
	int main(int ac, char* av[])
	{
		/* run getppid() repeatedly; BENCH sizes the loop
		 * automatically and repeats the measurement */
		BENCH(getppid(), 0);
		/* report usec/call using the chosen iteration count */
		micro("getppid", get_n());
		return 0;
	}
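
Under the hood the idea is roughly this (a sketch of the concept,
not the actual macro internals):

	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	#define TRIALS	 5
	#define MIN_USEC 5000.0

	static double time_op(long n)	/* usec for n getppid() calls */
	{
		clock_t t0 = clock();
		long i;

		for (i = 0; i < n; i++)
			(void) getppid();
		return 1e6 * (double)(clock() - t0) / CLOCKS_PER_SEC;
	}

	int main(void)
	{
		long n = 1;
		double t, best = 1e30;
		int trial;

		/* grow n until one run lasts at least ~5000 usec */
		while ((t = time_op(n)) < MIN_USEC)
			n *= 2;
		/* repeat and keep the minimum for robustness */
		for (trial = 0; trial < TRIALS; trial++) {
			t = time_op(n);
			if (t < best)
				best = t;
		}
		printf("getppid: %.3f usec\n", best / n);
		return 0;
	}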

Does adding this sort of functionality to ATLAS interest you?

> >I apologize --- I looked a bit more carefully at the search code,
> >and it looks like you minimize search time by using binary
> >search instead of a full grid search.  
> 
> It's not precisely a binary search, rather it is a heuristic that
> uses a variety of techniques in differing parts of the search.  An
> exhaustive search of the code generator's parameter space alone 
> would take weeks if not months . . .
> 
> >Given that a significant fraction of the time spent tuning ATLAS
> >is spent compiling, and given that you already use binary 
> >search, I think that building binaries for the full grid search 
> >would likely be prohibitively expensive.  Also, I think the
> >resulting installation package would be huge, since the 
> >aggregate size of all .o files for the full search space would
> >be overwhelming.
> >
> >If the package were distributed by CD-ROM, size wouldn't
> >matter so much, and since the build only happens once,
> >the huge build time would not be a showstopper.  However,
> >it feels a bit like a kludge to me...
> 
> I think you'd better change "CD-ROM" to DVD, at the minimum.
> I am willing to bet ATLAS in source + gccs for all known 
> platforms would be significantly smaller than the object 
> code for a full parameter space of the code generator . . .
> 
> >So, do you have any suggestions as to whether it might be
> >possible to extend/enhance/modify ATLAS to enable
> >distributors to build/redistribute binary packages of ATLAS
> >which could tune themselves with minimal delay and
> >recompilation on target machines?
> 
> I really don't think compilation is that big a part of the 
> cost in general.  I can't speak too much to HP-UX, since I 
> now have access to exactly one HP-UX machine, an ancient 
> HP9000/735, that does indeed take overnight *at least* to 
> do an ATLAS install.  The other over-nighter install is our
> SGI machine, which seems to spend most of its time in "ar".  
> I always wonder how much of these slow-ass machines' problems 
> are caused by NFS.  Have you installed in /tmp or somewhere 
> like that?

Yes, I put things on local disk and it didn't make much
difference.  Watching the system via 'top' during an ATLAS
run, it seems that things are mostly CPU-bound (on my 
machines). 
 
> Unfortunately, I don't believe there is any way to do a 
> binary-only package of ATLAS with any real architecture 
> independence.  However, one targeted to a series of 
> related architectures might possibly be viable.  Again,
> I can't speak to HP-UX, but Intel provides an example of 
> this.  There is presently only a CacheEdge (L2-blocking) 
> difference between PII/PIII-external cache and PIII-on-chip 
> cache, so both these archs could be supported by 1 lib, and 
> if you have a PIII with on-chip cache, you replace 
> something like 2 routines that have CacheEdge.  However, 
> this quickly breaks down; for instance, we now have SSE 
> stuff in the developer releases, and SSE will work or not 
> depending on your kernel, so we have a new lib to get, and 
> rapidly we see the precompiled libs becoming very specific, 
> with less and less object overlap.  At this point, it makes 
> sense to just precompile libs for every conceivable arch, 
> and save yourself the complexity of building partial libs 
> with insane interdependencies.
> 
> So, for binary-only vendors, that is what I think I would do: 
> a binary-only distribution is by necessity more restricted 
> than a source-only, so I would be tempted to try to build the 
> complete, specific libs for all expected archs, and include 
> them all, with an install script that uncompresses only the
> correct one for your arch.  Obviously, this is not as flexible 
> as installing from source, but I don't see how to do so without 
> putting in an incredible amount of work that would need to be 
> updated in lockstep with ATLAS changes . . .
> 
> Install time continues to rise as we add more and more 
> functionality to ATLAS (i.e., now we search Level 2 and 
> 3, and we plan to add Level 1, as well as extending 
> already existing stuff);  as this happens, I think we
> will need to allow users to guide how we optimize: i.e., 
> they indicate which routines to optimize, so we don't 
> spend 2 hours optimizing level 1 for a guy who doesn't 
> use it.  

Thanks for the detailed response.  I think I will have
to give up on binary-only releases.  I will still think
about how to speed up ATLAS' install time.  

Some ideas for speeding up the ATLAS install time:
	- shorten timing intervals in a controlled fashion
	  (e.g. using a variant of the lmbench timing harness)
	- figure out some benchmarks which might help 
	  predict the best configuration, to reduce the
	  number of experiments that need to be done to
	  identify the best configuration
	- figure out how to do compilations in parallel (e.g.
	  using "make -jN" and then testing serially), although
	  I think this will require a painful rewrite/integration
	  of the various search algorithms, since the search
	  process is serial and you would need to conduct several
	  searches in parallel
	- ???
If I have time, I will try to create a prototype of the
first idea and integrate it with one of the search 
algorithms to show you how it would work.  Since I am
leaving tonight on a two-week business trip, it will have
to wait until I return...

Cheers,

Carl