
Re: developer release 3.1.2



Camm Maguire wrote:

> Greetings!
>
> 1) It looks as though the prefetch assisted double precision level2
>    will max out at about 50% + standard atlas.  Transpose: 94 ->140,
>    Notrans: 67 -> 97.  dger remains to be completed.  Basically, I
>    just looked at the atlas compiled assembler, and added prefetch.
>
>    So the rule of thumb appears to be SIMD +50%, prefetch +50%,
>    both +100%.

Cool!

> 2) If anyone has a handy reference for the Athlon SIMD instructions, I
>    think these routines will port over to that platform with minimal
>    change. We even have an Athlon that I can try out :-)

No reference, but you might want to check out the University of Kentucky's "KLAT2",
which, though a relatively small installation as Beowulfs go (64 Athlons), has two
impressive tools.  The relevant one is a set of (I forget how many) Athlon asm blas
routines; they claim a 3x improvement over non-asm, but I don't know what the non-asm
baseline was.  (For comparison, I get a 3-8x improvement on a Celeron 333 going from
blas1 to atlas!)

The other, less relevant, tool is a cool genetic algorithm (free source download)
that calculates an optimal network topology, so that every node on your network sees
every other node through only one switch.  (Multiple NICs/box.)

http://www.ars-technica.com/cpu/2q00/klat2/klat2-1.html

> 3) I do hope we can find a solution for distributed atlas binaries.  I
>    know the idea is for the user to build atlas on each platform they
>    will use, and that the current tree will skip any routines which
>    fail to compile on a given platform, (i.e. if there is no SIMD
>    support).  Serious users will do this no doubt.

Oops.  Thought I was a serious user.  Oh well. :-)

Why not just include multiple optimized binaries for different arches in separate
dirs, e.g. /usr/lib/atlas-athlon etc., then use a postinst script to scan
/proc/cpuinfo, move the appropriate libs into /usr/lib, and remove the dirs?  It
would increase the package download size (and unpack time), but not installed
size, and everyone would get optimized local libs.  You'd have to make the source
package assemble multiple binaries with different options, and get it right for
different arches, e.g. 603e/604e/750/G4 PPC, ev4/ev5/ev56/ev6 alpha, etc.
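
Something like the following rough sketch is what I have in mind; it's untested,
and the atlas-athlon/atlas-686/atlas-generic directory names and the "model name"
matching are just made-up examples of the idea:

#!/bin/sh
# postinst sketch: promote the lib dir that matches the local CPU into
# /usr/lib, then drop the per-arch copies.
set -e
model=$(grep '^model name' /proc/cpuinfo | head -n 1)
case "$model" in
  *Athlon*)        dir=atlas-athlon ;;
  *"Pentium III"*) dir=atlas-686 ;;
  *)               dir=atlas-generic ;;
esac
cp -a /usr/lib/$dir/* /usr/lib/
rm -rf /usr/lib/atlas-*
ldconfig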

Of course, this would be Linux-specific, and would only work for the Hurd if it has a
Linux-compatible (-identical?) /proc/cpuinfo; that's beyond my knowledge.  If there's
ever a Debian FreeBSD, you'd have to do something else. :-)

>    But I do maintain an atlas package for Debian, and I've found that,
>    while the distributed library is obviously not completely optimal,
>    it is very frequently significantly better than the reference blas,
>    and gives new users a quick way to try atlas out to see if its
>    worth their while. (The Debian package provides an atlas drop-in
>    shared library replacement for the standard blas, so one can
>    compare performance gains at runtime simply by setting the
>    LD_LIBRARY_PATH environment variable.)

Oh, so that's why you do that. :-)

Here's my autoconf configure.in section to check for atlas, then ordinary blas
(hope my mailer doesn't wrap it too badly), for the autoconf users out there:

dnl Check for (ATLAS) BLAS library: try the ATLAS-tuned f77blas first,
dnl then fall back to a plain -lblas.

aLIBS="$LIBS"
LIBS="$aLIBS -latlas $MATH_LIBS $FLIBS"
AC_CHECK_LIB(f77blas, dgemm_, BLAS_LIBS="-lf77blas -latlas $MATH_LIBS $FLIBS",[
  LIBS="$aLIBS $MATH_LIBS $FLIBS"
  AC_CHECK_LIB(blas, dgemm_, BLAS_LIBS="-lblas $MATH_LIBS $FLIBS",
    AC_MSG_ERROR([BLAS basic linear algebra subroutines are not installed on this platform.  Please see the documentation for download locations.]))])
AC_SUBST(BLAS_LIBS)
LIBS="$aLIBS"

>    One solution is to compile several versions covering the most
>    common platforms.  Perhaps this is best.  However, this strategy
>    has the potential for confusion, both at compile time, and for
>    users inadvertently installing the wrong binary, getting a crash,
>    and filing a bug.

You'd have to automate the installation; see above.  Parsing /proc/cpuinfo could be
tricky if you want something that works on all CPUs.  But you could announce it on
lists like this one, get people to test and patch it as widely as possible, then put
it in woody so only the daring will try it, to minimize your headaches along the way
to an everywhere-optimized binary.
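
To make the detection less fragile than matching model-name strings, you could
key off the feature flags instead.  Another untested sketch; it assumes the kernel
reports "sse" and "3dnow" in the flags line, which I believe recent kernels do:

#!/bin/sh
# Sketch: choose an atlas variant from CPU feature flags rather than
# model names, falling back to a generic build.
flags=$(grep '^flags' /proc/cpuinfo | head -n 1)
case "$flags" in
  *3dnow*) variant=athlon ;;   # 3DNow!-capable AMD
  *sse*)   variant=p3 ;;       # SSE-capable Intel
  *mmx*)   variant=mmx ;;      # MMX only
  *)       variant=generic ;;  # no SIMD at all
esac
echo "would install the $variant build"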

I'm sure atlas is not the only package interested in this capability; you'd probably
find such a script being used pretty widely.  It might even become part of the Debian
package format, to let anyone "easily" generate locally-optimized packages.  Maybe it
could even be used for kernels: a single package could carry optimizations for
multiple CPUs, and for SMP and non-SMP, and just install the right modules!

Okay, maybe doing the kernel this way would be a touch impractical.  Doing this
Debian-wide might involve subarch-specific packages, e.g.
xxx_ver-debver_i386-crusoe.deb or _alpha-ev6-smp.deb, which would eliminate the
big downloads and unpacks but bloat the mirrors quite a bit, especially for
kernels.

Of course all of this is idle speculation which probably belongs on one of the
Debian policy lists.  But I don't think anyone would mind if you did multiple
arch-specific atlas libs in a single package, with a postinst to select the right
one, especially with public testing before the first upload.

> Another option is to have a flag somewhere in
>    the build process specifying "compiled code only" or some such.
>    Then we could leave the SIMD gains to the serious users and maybe
>    provide a note to this effect in the docs accompanying the package.
>    I really don't know what to do, I just thought I'd mention it.

Does anyone know how Microsoft/Adobe/etc. ship software optimized for the latest
chips that doesn't break on old ones?  E.g. the Altivec instructions in Photoshop,
which are only exercised on G4s...

> 4) Do we have an idea as to when we might want to release a
>    SIMD-enhanced atlas, say in Debian?  Is there any word on the most
>    important level3 front?

Sorry, don't know what a "level3 front" is. :-)

Some food for thought...

-Adam P.