Re: sgemm questions
> 2) Peter's ideas of a) unrolling fully with KB ~ 56, b) 1x4 strategy
> c) loading C at the beginning rather than at the end and
> (shockingly) d) doing no pipelining at all all seem to be wins. I
> couldn't believe d) when I saw it, but it's apparently true -- the
> PIII likes code like load(a) mul(b,a) add(a,c) best. Apparently,
> the parallelism between muls and adds mentioned by Doug Aberdeen in
> his earlier email only appears fully when the intermediary register
> is the same. Doug, maybe you can try this and see if you can get
> better than 0.75 clock? Or maybe I misunderstand you?
The following sequence gets 0.84 IPC, an improvement over 0.75, and
the best performance to date:
Note that each MULPS/ADDPS pair has no dependencies on adjacent
pairs. If it does, the IPC drops to 0.29, as in the following sequence:
There are pipeline stalls all over this one. I don't quite understand
how the first code does so well, since there should be a stall between
each MUL and ADD. They must do something funky in the hardware, or
perhaps the instruction re-ordering works really well in this case.
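To make the dependency difference concrete, here is a rough C sketch. It is not the SSE assembly I measured: the function names dot_chained and dot_independent are made up for illustration, and it uses scalar floats rather than MULPS/ADDPS. The first routine feeds every multiply-add into one accumulator, so adjacent pairs depend on each other. The second keeps four separate accumulators, so adjacent pairs are independent.

```c
#include <stddef.h>

/* Dependent chain: every multiply-add feeds the same accumulator,
   so each add has to wait for the previous add to complete. */
float dot_chained(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Independent pairs: four accumulators, so adjacent multiply-add
   pairs have no dependencies on one another. */
float dot_independent(const float *a, const float *b, size_t n)
{
    float c0 = 0.0f, c1 = 0.0f, c2 = 0.0f, c3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent multiply-add pairs per iteration. */
    for (; i + 4 <= n; i += 4) {
        c0 += a[i]     * b[i];
        c1 += a[i + 1] * b[i + 1];
        c2 += a[i + 2] * b[i + 2];
        c3 += a[i + 3] * b[i + 3];
    }

    /* Tail: leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        c0 += a[i] * b[i];

    /* Combine the partial sums once, at the end. */
    return (c0 + c1) + (c2 + c3);
}
```

Both routines compute the same dot product; only the dependency structure differs. Whether the second form is actually faster depends on the compiler and the CPU, so time it before relying on it.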
-Doug -- http://beaker.anu.edu.au, Ph:(02) 6279-8608, Fax:(02) 6279-8651
A pessimist is just a realist who has not been proved right... yet.