IA64 explained, 3.3.5
Guys,
OK, trying to set a record for number of releases, I've just posted 3.3.5.
This removes trtri from lapack, improves IA64 complex performance,
and fixes a bug in the complex Cholesky tester.
I have figured out why I got no speedup with my new kernel on the IA64.
If you recall, 3.3.3 (which started all this quick release madness) was
supposed to be an IA64-improving release, due to IA64 prefetch, but when
I timed it on machines I wasn't NDA'd on, I got no performance
improvement. Even though it used the same compiler as my NDA'd machine,
I got strange compiler problems as well.
Turns out the problem is that on the TestDrive machine, they have two different
compilers, and my 3.3.3 build was using a mixture of RedHat's baaaad gcc, and
the much better gcc 3.0.
So, this is the first performance hint for IA64: make sure you use gcc 3.0
everywhere in your ATLAS install: change all C compilers defined in your
Make.<arch> to explicitly reference it, and change all gcc refs in
ATLAS/tune/blas/gemm/CASES/?cases.flg as well.
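One way to check for leftover references is a small script along these lines. It is only a sketch: the compiler path in WANTED is a made-up placeholder for wherever gcc 3.0 lives on your machine, and the function name is hypothetical, not part of ATLAS.

```python
import re

# Hypothetical location of the gcc 3.0 binary; substitute the real path.
WANTED = "/usr/local/gcc-3.0/bin/gcc"

def stray_gcc_refs(text, wanted=WANTED):
    """Return (line_number, line) pairs that mention a gcc other than `wanted`."""
    hits = []
    for num, line in enumerate(text.splitlines(), start=1):
        # Remove every occurrence of the wanted compiler, then look for
        # any gcc token that is still left on the line.
        rest = line.replace(wanted, "")
        if re.search(r"\bgcc\b", rest):
            hits.append((num, line))
    return hits
```

Feeding it the contents of Make.<arch> and of each ?cases.flg file should come back with an empty list once every compiler reference points at the same gcc.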
Once this was done, I got the performance shown below. What we see is that
prefetch does not make a big performance improvement (3.3.2 and 3.3.4 are
almost the same speed asymptotically), but that the improved cleanup code
I wrote definitely helps small problems.
Prefetch definitely helps the Level 1 and 2 BLAS performance; the bad news
is that even the new performance is signally poor. This is because we have
no IA64-specific kernels for Level 1/2; the improvement comes simply from
using the best general kernel with prefetching enabled . . .
The timings on an 800MHz IA64 are included below, all for double precision.
I do not have access to non-NDAd MKL; if anyone does, I'd love to see some
comparisons . . .
Cheers,
Clint
Timings for double precision, comparing ATLAS 3.3.2 vs. 3.3.4, all on an
800MHz IA64. The performance of 3.3.4 is the same as 3.3.5 for double
precision (3.3.5 is faster for complex; complex timings are not shown).
              100     200     300     400     500     600     700     800     900    1000
           ======  ======  ======  ======  ======  ======  ======  ======  ======  ======
3.3.2 dMM  1024.0  1512.4  1783.7  1846.1  1896.3  2076.8  1973.2  2084.6  2102.8  2104.8
3.3.4 dMM  1061.1  1524.1  1803.1  1927.5  1969.2  2029.2  2081.6  2072.3  2126.8  2135.6

             1200    1400    1600    1800    2000    2200    2400    2600    2800    3000
           ======  ======  ======  ======  ======  ======  ======  ======  ======  ======
3.3.2 dMM  2112.8  2129.5  2192.0  2222.1  2180.5  2136.3  2189.1  2159.2  2236.1  2218.9
3.3.4 dMM  2155.3  2144.9  2171.5  2206.1  2205.7  2194.5  2220.9  2218.9  2223.6  2229.9

               GEMM    SYMM    SYRK   SYR2K    TRMM    TRSM
             ======  ======  ======  ======  ======  ======
3.3.2 d100    967.9   962.4   627.4   862.9   677.2   490.4
3.3.4 d100   1019.9  1153.2   710.1   891.6   732.5   636.8
3.3.2 d500   1889.3  1723.9  1452.0  1777.8  1514.8  1245.7
3.3.4 d500   1939.4  1729.7  1590.0  1718.1  1501.5  1402.7
3.3.2 d1000  2117.9  1917.6  1653.3  1935.7  1790.2  1526.1
3.3.4 d1000  2155.8  1823.7  1677.6  1932.1  1701.0  1528.4
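
To put a number on the "cleanup helps small problems" remark, the Level 3 rows can be turned into 3.3.4/3.3.2 ratios. This is just a sketch over the figures in the table; it assumes d100 and d1000 mean double precision at problem size 100 and 1000, and the helper names are made up for illustration.

```python
# Level 3 BLAS rates copied from the d100 and d1000 rows of the table.
ROUTINES = ["GEMM", "SYMM", "SYRK", "SYR2K", "TRMM", "TRSM"]
OLD = {100: [967.9, 962.4, 627.4, 862.9, 677.2, 490.4],          # 3.3.2
       1000: [2117.9, 1917.6, 1653.3, 1935.7, 1790.2, 1526.1]}
NEW = {100: [1019.9, 1153.2, 710.1, 891.6, 732.5, 636.8],        # 3.3.4
       1000: [2155.8, 1823.7, 1677.6, 1932.1, 1701.0, 1528.4]}

def ratios(n):
    """Map each routine to its 3.3.4 / 3.3.2 rate ratio at size n."""
    return {name: new / old
            for name, old, new in zip(ROUTINES, OLD[n], NEW[n])}

def mean_ratio(n):
    """Arithmetic mean of the per-routine ratios at size n."""
    values = list(ratios(n).values())
    return sum(values) / len(values)

for n in (100, 1000):
    print(n, round(mean_ratio(n), 3))
```

On these numbers every routine is faster in 3.3.4 at size 100, while at size 1000 the ratios sit within about 5% of 1.0 in either direction.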

               GEMV    SYMV    TRMV    TRSV     GER     SYR    SYR2
             ======  ======  ======  ======  ======  ======  ======
3.3.2 d500    122.4   225.6   113.4   109.4    39.2    47.3    61.0
3.3.4 d500    130.1   245.2   170.1   151.5   160.1   107.1   156.9
3.3.2 d1000   166.0   231.3   101.0    97.3    37.3    37.4    52.1
3.3.4 d1000   214.1   208.7   194.3   180.0   172.2   115.3   165.7

                 ROTM    SWAP    SCAL    COPY    AXPY     DOT    NRM2    ASUM    AMAX
               ======  ======  ======  ======  ======  ======  ======  ======  ======
3.3.2 d500000    72.5    28.6    18.6    51.3    29.5    35.1    18.3    47.5    49.3
3.3.4 d500000    77.8    33.6    39.4    50.8   133.0    82.3    96.1   180.9   120.8
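
For the Level 1 row, a quick way to see which routines gained most from enabling prefetch is to rank them by the 3.3.4/3.3.2 ratio. Again this is only a sketch over the numbers in the table, with made-up names.

```python
# Level 1 BLAS rates copied from the d500000 rows of the table.
NAMES = ["ROTM", "SWAP", "SCAL", "COPY", "AXPY", "DOT", "NRM2", "ASUM", "AMAX"]
V332 = [72.5, 28.6, 18.6, 51.3, 29.5, 35.1, 18.3, 47.5, 49.3]
V334 = [77.8, 33.6, 39.4, 50.8, 133.0, 82.3, 96.1, 180.9, 120.8]

def ranked_speedups():
    """Return (routine, ratio) pairs sorted from largest to smallest gain."""
    pairs = [(name, new / old) for name, old, new in zip(NAMES, V332, V334)]
    return sorted(pairs, key=lambda item: item[1], reverse=True)

for name, ratio in ranked_speedups():
    print(f"{name:5s} {ratio:5.2f}x")
```

NRM2 comes out on top at a bit over 5x, and COPY is the only routine that does not improve.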