
ATLAS on PIII



Hi,

Antoine forwarded some mail he'd exchanged with you, regarding the poor
performance you were getting on a PIII using ATLAS, to the ATLAS
mailing list, atlas@cs.utk.edu.  Essentially, it looked like you
were getting 68% of peak, rather than the expected 72% (off-chip L2)
or 76% (on-chip L2).  I think I may have an idea what is going wrong.
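
To compare a measured rate against those percentages, divide it by the
machine's peak.  A minimal shell sketch, assuming a peak of 800 Mflops for
the 800 MHz PIII described at the end of this message (one flop per cycle;
that is an assumption inferred from the numbers in this thread, not
something stated in it).  pct_of_peak is a made-up helper name, not part
of ATLAS:

```shell
# Hypothetical helper, not part of ATLAS.
# Usage: pct_of_peak <measured Mflops> <peak Mflops>
# Prints the measured rate as a percentage of peak, to one decimal place.
pct_of_peak() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", 100 * m / p }'
}

# Assumed peak: 800 Mflops (800 MHz, one flop per cycle).
pct_of_peak 543.5 800   # the N=500 figure from the quoted report
pct_of_peak 600 800     # the predicted figure
```

Use a decimal point rather than a decimal comma when passing in the
numbers, since awk will not parse "543,5" as 543.5.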

The version of ATLAS on netlib does not know about PIII's, and this is
bad news for the PIII's which have an on-chip cache 1/2 the size
of the PII's (which is what the present release of ATLAS thinks a PIII is).
So what you want to do is tell ATLAS to reexamine its Level 2 cache
blocking, which is controlled by CacheEdge.

To do this, go to your ATLAS/tune/blas/gemm/<arch>, and issue:
   make xdfindCE
   ./xdfindCE 

This program should spit out a bunch of output, ending in something like:
>Best CE=160KB, mflop=396.04

It's saying the best CacheEdge setting for my machine is 160KB; my guess
is yours will say 160 or something close to that.  Edit your
ATLAS/include/<arch>/atlas_cacheedge.h, and you'll probably see something
like:
#ifndef ATLAS_CACHEEDGE_H
   #define ATLAS_CACHEEDGE_H
   #define CacheEdge 262144
#endif

Change this to:
#ifndef ATLAS_CACHEEDGE_H
   #define ATLAS_CACHEEDGE_H
   #define CacheEdge 163840
#endif
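
If xdfindCE reports a different value on your machine, the number for the
header is the KB figure times 1024.  A rough shell sketch of that
conversion follows.  It assumes the last "Best CE" line has the same shape
as the sample output above, and the function names are hypothetical
(nothing here ships with ATLAS).  It only prints the header text; it does
not touch atlas_cacheedge.h:

```shell
# Hypothetical helpers; none of these names are part of ATLAS.

# Read text on stdin and print the KB figure from the last line shaped
# like ">Best CE=160KB, mflop=396.04" (assumed output format).
parse_best_ce() {
    sed -n 's/.*Best CE=\([0-9][0-9]*\)KB.*/\1/p' | tail -n 1
}

# Convert a CacheEdge value in KB to bytes.
cache_edge_bytes() {
    echo $(( $1 * 1024 ))
}

# Print the header contents for a given KB value to stdout.
cache_edge_header() {
    printf '#ifndef ATLAS_CACHEEDGE_H\n   #define ATLAS_CACHEEDGE_H\n   #define CacheEdge %s\n#endif\n' \
        "$(cache_edge_bytes "$1")"
}

# Example with the sample line from this message.
kb=$(echo '>Best CE=160KB, mflop=396.04' | parse_best_ce)
cache_edge_header "$kb"
```

In practice you would pipe the real output of ./xdfindCE into
parse_best_ce and check the result by eye before editing the header.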

Notice this is with my setting of 160*1024 for CacheEdge.  Now recompile
all needed files by going to ATLAS/bin/<arch>, and issuing:
   make xdmmtst xsmmtst xcmmtst xzmmtst

Then, let's see if you get the predicted 600 Mflops now:
   ./xdmmtst -F 500

Send this output, or any questions, to atlas@cs.utk.edu.

Cheers,
Clint

> From Matthias Pester <m.pester@mathematik.tu-chemnitz.de>

> We are testing a 528 node Linux-Cluster with Pentium III-800 MHz, 
> 512 MB RAM each, and 100-Mbit-FastEthernet with high-performance 
> switches.

> So I took the opportunity to run  xdmmtst, where you guessed 
> 600 Mflops. I saw 
>    543,5 Mflops for N= 500, 
>    536,2 Mflops for N=1000,
>    529,8 Mflops for N=1500,
>    519,0 Mflops for N=2000, 
> (always a speedup of 11 against the simple BLAS version)