
Re: sgemm questions



Hi Clint!

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Camm,
> 
> >http://people.debian.org/~camm/gemm_20001115.tgz
> 
> Gottem.  I'll let you know when I have results.
> 
> >BTW, why 713, down from earlier 760?
> 
> The 760 was what I got using the kernel timer, and 713 is what I got
> timing the full SGEMM built on top of it.  I'm not sure I can explain the
> full 50 mflop difference; the data copy shouldn't continue to kill us as
> matrices get large.  However, the kernel timer rarely exactly predicts
> full-gemm performance (it over- or underestimates depending on the arch) . . .
> CGEMM also peaks around 711Mflop for a 1120x1120 problem.
> 
> >Great!  Please let me know if you have this under control and no
> >longer need the kernels I've been submitting.  Then I'd have more time
> >for things like chasing down atlas compile errors on odd platforms
> 
> Peter has only given me one SSE kernel, and it was roughly the same
> speed as yours (maybe slightly slower, but not appreciably different).
> So, at the moment, my plan is to use your stuff for SSE and his for
> 3DNow!, as you both originally signed up for; it wasn't until you
> had each created your own kernel that you both apparently decided to
> produce the complement . . .  My thought on duplicate submissions is that
> first one wins unless performance or some other mitigating factor
> intervenes . . .
> 

As far as I'm concerned, I'm for whatever is best for atlas.  I
certainly won't have hurt feelings or anything if the kernels I
submitted don't get in.  I'm just trying to avoid duplication of
effort and to conserve person-hours :-)! 

I've looked at the assembly produced by compiling Peter's generated C,
and it looks very good!  It's giving me a few ideas, but above all it
raises one important question: why is SSE so much worse than 3DNow!?
It makes me think that we're missing something on the SSE front.  In
fact, I'm a little surprised that Peter's SSE code didn't do
better than what I submitted, as the pipelining certainly seems
better.  My guess is that the Athlon really wants a mul followed
immediately by a different add (which reportedly can be done in one
cycle), whereas SSE prefers some non-fpu instruction(s) between these
two.  (I'm assuming here that Peter's SSE has the same instruction
order as the 3DNow!, though this could be false, of course.)  Or,
though I doubt it, it could be due to the way the kernel supports SSE,
as 3DNow!, unlike SSE, does not need kernel support.  Anyway, it sure
would be great if we could figure out a way to bring SSE up to the
2.5*clock reportedly being achieved on the Athlon.
Otherwise, it looks like we made the wrong CPU choice for our upcoming
Beowulf upgrade.
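
For anyone following along, here is a rough sketch of how the Mflop
numbers in this thread relate to a "N*clock" figure.  The function names
are mine, not anything from atlas, and the clock speed used at the bottom
is a made-up placeholder, not the speed of either machine discussed here.
The flop counts are the usual convention (2*M*N*K for real GEMM, 8*M*N*K
for complex).

```python
def gemm_flops(m, n, k, complex_data=False):
    """Conventional flop count for C = A*B + C.

    Real data: one multiply and one add per inner-product term (2*m*n*k).
    Complex data: each complex multiply-add costs 4 multiplies and
    4 adds, hence 8*m*n*k.
    """
    per_term = 8 if complex_data else 2
    return per_term * m * n * k


def mflops(flops, seconds):
    """Convert a flop count and a wall-clock time into Mflop/s."""
    return flops / seconds / 1.0e6


def flops_per_cycle(mflop_rate, clock_mhz):
    """Express an Mflop/s rate as a multiple of the clock (the 'N*clock' figure)."""
    return mflop_rate / clock_mhz


if __name__ == "__main__":
    # Rates quoted in this thread: kernel timer vs. full SGEMM.
    kernel_rate, full_rate = 760.0, 713.0

    # ASSUMPTION: placeholder clock in MHz, purely for illustration.
    hypothetical_clock_mhz = 600.0

    for label, rate in (("kernel timer", kernel_rate), ("full SGEMM", full_rate)):
        ratio = flops_per_cycle(rate, hypothetical_clock_mhz)
        print(f"{label}: {rate:.0f} Mflop/s = {ratio:.2f} * clock")

    # Time a 1120x1120x1120 SGEMM would take at the full-SGEMM rate.
    flops = gemm_flops(1120, 1120, 1120)
    print(f"1120^3 SGEMM at {full_rate:.0f} Mflop/s: "
          f"{flops / (full_rate * 1.0e6):.2f} s")
```

Plugging in the real clock of the box being timed gives the number to
compare against the 2.5*clock reported for the Athlon.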

Take care,

> I've been using your kernel to help debug my new install procedure, and
> now that you've sent in cleanup, it should be doubly useful.  I can use
> the UltraSparc kernel similarly, but due to NFS and slower proc, the ultrasparc
> install is almost 10 times longer than the PIII on my laptop . . .
> 
> Thanks,
> Clint
> 
> 

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah