Re: sgemm questions
> 2) Peter's ideas of a) unrolling fully with KB ~ 56, b) 1x4 strategy
> c) loading C at the beginning rather than at the end and
> (shockingly) d) doing no pipelining at all all seem to be wins. I
> couldn't believe d) when I saw it, but it's apparently true -- the
> PIII likes code like load(a) mul(b,a) add(a,c) best. Apparently,
> the parallelism between muls and adds mentioned by Doug Aberdeen in
> his earlier email only appears fully when the intermediary register
> is the same. Doug, maybe you can try this and see if you can get
> better than 0.75 clock? Or maybe I misunderstand you?
The following sequence gets 0.84 IPC, an improvement over 0.75, and
the best performance to date:
MULPS(0, 1);
ADDPS(1, 2);
MULPS(3, 4);
ADDPS(4, 5);
MULPS(6, 7);
ADDPS(7, 0);
MULPS(3, 4);
ADDPS(4, 5);
Note that each MULPS/ADDPS pair has no dependencies on adjacent
pairs. If it does, the IPC drops to 0.29, as in the following
code:
MULPS(0, 1);
ADDPS(1, 2);
MULPS(2, 3);
ADDPS(3, 4);
MULPS(4, 5);
ADDPS(5, 6);
MULPS(6, 7);
ADDPS(7, 0);
There are pipeline stalls all over this one. I don't quite understand
how the first sequence does so well, since there should be a stall
between each MUL and ADD. The hardware must do something funky, or
perhaps the instruction re-ordering works really well in this case.
--
-Doug -- http://beaker.anu.edu.au, Ph:(02) 6279-8608, Fax:(02) 6279-8651
A pessimist is just a realist who has not been proved right... yet.