
Re: UltraSPARC dgemm user contribution



Peter,

First, a general announcement for everyone: all atlas-comm mail is now being
archived at:
   www.netlib.org/atlas/atlas-comm

>These kernels are C codes best compiled with:
>	gcc -mcpu=ultrasparc -O -fomit-frame-pointer -mtune=ultrasparc ... 

Ah, this points to a problem in my kernel timer: the user needs to be able
to vary the MCC & MMFLAGS on a kernel-specific basis, and you will certainly
not be the only one wanting a particular compiler and flags.  I will have to
change the index file to do it, but I think I can fix this . . .

>and were primarily written by a Viet Nguyen, who worked with me last year
>on (mainly complex) UltraSPARC BLAS (and did an excellent job too).
>The kernels use `lookahead over the level 1 cache' (equivalent to prefetching)
>so they can perform well for large blocksizes (eg 60-90).
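
Since the list is now archived, a quick aside for the record on what that
`lookahead' amounts to, at least as I read it.  The routine below is purely
illustrative (made-up name and interface, and a 2x2 register block instead
of the mu=nu=4 the real kernels use), not Viet's code: while the k-loop
works on the current blocks, it also issues ordinary loads against the
block of A that the *next* kernel call will use, so that block is already
sitting in L1 by the time it is needed -- the same effect as an explicit
prefetch.

#include <stddef.h>

/*
 * Illustration only: 2x2 register-blocked C += A*B on nb x nb
 * column-major blocks (nb assumed even).  Each trip through the
 * k-loop also touches Anext, the A block the next call will use,
 * stepping ahead roughly a cache line at a time.
 */
void mm_lookahead(int nb, const double *A, const double *B, double *C,
                  const double *Anext, double *sink)
{
   int i, j, k;
   size_t t = 0, tend = (size_t)nb * (size_t)nb;
   double touch = 0.0;

   for (j = 0; j < nb; j += 2)
      for (i = 0; i < nb; i += 2)
      {
         double c00 = 0.0, c10 = 0.0, c01 = 0.0, c11 = 0.0;

         for (k = 0; k < nb; k++)
         {
            const double a0 = A[i   + k*nb], a1 = A[i+1 + k*nb];
            const double b0 = B[k + j*nb],   b1 = B[k + (j+1)*nb];

            c00 += a0*b0;  c10 += a1*b0;
            c01 += a0*b1;  c11 += a1*b1;

            if (t < tend)        /* the lookahead load itself */
            {
               touch += Anext[t];
               t += 4;           /* roughly one cache line of doubles */
            }
         }
         C[i   +  j   *nb] += c00;
         C[i+1 +  j   *nb] += c10;
         C[i   + (j+1)*nb] += c01;
         C[i+1 + (j+1)*nb] += c11;
      }
   *sink = touch;   /* keeps the lookahead loads from being optimized away */
}

Done that way, the block being worked on no longer has to live entirely in
L1, which fits with the 60-90 range you quote.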

I just hastily scoped out your kernel, and found the best performance
at NB=80.  More surprisingly, ATLAS itself does slightly better for
NB > 44 (the last size that completely fits in the 16KB L1: 44*44*8 =
15488 bytes); my guess is this is due to the L1 being 2-way associative;
some stuff gets knocked out anyway due to conflicts, etc., so overflowing
the cache is not a big deal when you have associativity and its corollary,
departure from true LRU.  Still, that does not explain why your kernel
likes 80 so much.  Any ideas?

>Their performance, when run on a 170 MHz Ultra, does not look terribly 
>fast when run from ATLAS:
>
>The L1 kernel:
>peter@kaffa make -e ummcase pre=d nb=40 mmrout=../CASES/ALT_ANUUltraL1mm.c beta=1
>...
>dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.510000, mflop=245.458824
>dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.510000, mflop=245.458824
>dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.530000, mflop=236.196226
>
>but for the equivalent test from my own test program (in which 
>everything is warm in cache and there is no copying), it gets over ~300
>mflops. 

It looks pretty darn good to me.  The ATLAS timers flush all the caches
and try to call the kernel in an "expected" way (there is a rough sketch
of the flushing further down, after the numbers).  I compared your kernel
versus the best generated code on a 300MHz Ultra2:

NB   MFLOP (UMM/ATL)   SPEEDUP
40      409/369          1.10
56      446/376          1.19
80      467/375          1.25

This is just the kernel, but a 20% speedup looks pretty sweet to me . . .
You know, as a child, no matter what the size of the piece of pie given
to me, I always checked the pan for any remainder; do you have a kernel
for other precisions as well? :)
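
And for anyone on the list wondering what "flush all the caches" means in
practice: before each timed call the operands are made cold by sweeping a
buffer bigger than any of the caches, roughly like the sketch below
(made-up size constant and routine name; the real timer is more careful
than this):

#include <stddef.h>

/* FLUSHBYTES is a made-up size; it just has to exceed the biggest cache */
#define FLUSHBYTES (8<<20)

static double flushbuf[FLUSHBYTES / sizeof(double)];

/* Read every element of a large buffer so that A, B and C have been
 * evicted from the caches before the kernel is timed. */
double flush_caches(void)
{
   size_t i;
   double sum = 0.0;

   for (i = 0; i < FLUSHBYTES / sizeof(double); i++)
      sum += flushbuf[i];
   return sum;   /* returned so the sweep is not optimized away */
}

That, plus calling the kernel the way the full gemm would, is where the
gap from your warm-cache ~300 Mflop figure comes from.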

>The full kernel generally shows `speed ups' of > 1.0, except for small
>matrices, e.g.

Did you find this faster than simply building an ATLAS with your kernel?
I'll do the experiments myself ASAP, but I'm curious . . .

Anyway, the next release is not yet out the door, so you are definitely
not too late.  I'll have this guy in the next developer release . . .

Thanks,
Clint