Re: sgemm questions
> Peter's code (sent in email, couldn't get complex to work?):
My fault. Complex should work in the next release.
> A few comments:
> 1) I like Peter's idea of using a generator to write C code and then
> compile, better than my approach of having the cpp preprocessor
> generate assembly from defined macros. I'd originally adopted the
> latter because I couldn't get rid of register thrashing as gcc
> switched between its asm and mine, but Peter's code generates very
> clean assembly, and gcc always handles the loop overhead best. I
> was further a little concerned about the documentation, which seems
> to indicate that gcc is free to insert whatever it wishes between
> asm() calls. We can currently produce good asm using multiple
> asm() calls because a) gcc currently doesn't reference the extended
> registers, and b) if we don't reference the ordinary registers in
> the asm() explicitly, gcc's optimizer can do a good job of
> maximizing register use across asm() calls. If and when gcc
> starts emitting references to SSE/MMX registers, of course, things
> will have to change.
Yes, I was counting on the same thing: that gcc never touches the extended
registers, and I never touch the normal registers. This is my first
experience with writing gcc inline assembly, so if you have any comments
on the macros I would welcome them. I am a bit concerned about the macros I
have now, because I don't declare that I am using the extended registers,
so, as you say, they will break one day with a newer compiler.
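
One way to make the register use explicit is to keep a whole load/multiply/add sequence inside a single asm() statement and name the extended registers in its clobber list, so the compiler has nowhere to insert code and knows which registers are overwritten. The following is only a minimal sketch under assumptions: the function name madd4 is hypothetical and not taken from the generator, it assumes an x86 target with SSE enabled, and it falls back to plain C elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the generator): c[0..3] += a[0..3] * b[0..3].
 * On x86 with SSE the work is done in ONE asm statement, and xmm0/xmm1 are
 * listed as clobbers so the compiler knows they are overwritten. */
static void madd4(const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
#if defined(__SSE__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile(
        "movups (%0), %%xmm0\n\t"      /* xmm0 = a[0..3] (unaligned load) */
        "movups (%1), %%xmm1\n\t"      /* xmm1 = b[0..3]                  */
        "mulps  %%xmm1, %%xmm0\n\t"    /* xmm0 = a * b                    */
        "movups (%2), %%xmm1\n\t"      /* xmm1 = c[0..3]                  */
        "addps  %%xmm0, %%xmm1\n\t"    /* xmm1 = c + a * b                */
        "movups %%xmm1, (%2)\n\t"      /* store the result back to c      */
        :                              /* no register outputs             */
        : "r"(a), "r"(b), "r"(c)
        : "xmm0", "xmm1", "memory");   /* declare what the asm touches    */
#else
    /* Portable fallback for targets without SSE (assumption of this sketch). */
    for (int i = 0; i < 4; ++i)
        c[i] += a[i] * b[i];
#endif
}
```

The "memory" clobber is there because the asm reads and writes through the pointers rather than through declared memory operands.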
> 2) Peter's ideas of a) unrolling fully with KB ~ 56, b) 1x4 strategy
> c) loading C at the beginning rather than at the end and
> (shockingly) d) doing no pipelining at all all seem to be wins. I
> couldn't believe d) when I saw it, but it's apparently true -- the
> PIII likes code like load(a) mul(b,a) add(a,c) best. Apparently,
> the parallelism between muls and adds mentioned by Doug Aberdeen in
> his earlier email only appears fully when the intermediary register
> is the same. Doug, maybe you can try this and see if you can get
> better than 0.75 clock? Or maybe I misunderstand you?
The reason that things work well with only one intermediate register might
be that as soon as a new load into that register occurs, it is mapped to
another physical register, so you end up using a whole new register. I don't know how
good the PIII is for doing these things, or how many physical registers it
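
For reference, the pattern from point 2 can be written out in plain C. This is only a scalar sketch under assumptions: it reads "1x4" as one row of A against four columns of B, the name kernel_1x4 and the ldb parameter are hypothetical, and it uses a loop where the real code is fully unrolled over KB. A compiler is free to schedule this differently; the point is only the order of operations: C loaded first, then load / multiply / add through a single temporary, with no software pipelining.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar sketch of a 1x4 block:
 * c[j] += sum_k a[k] * b[k + j*ldb], for j = 0..3.
 * b holds four columns, each kb long, separated by ldb floats. */
static void kernel_1x4(size_t kb, const float *a,
                       const float *b, size_t ldb, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* Load C at the beginning rather than adding it in at the end. */
    float c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];

    for (size_t k = 0; k < kb; ++k) {
        float t;                        /* the single intermediate register */
        t = a[k]; t *= b[k];           c0 += t;   /* load, mul, add */
        t = a[k]; t *= b[k + ldb];     c1 += t;
        t = a[k]; t *= b[k + 2 * ldb]; c2 += t;
        t = a[k]; t *= b[k + 3 * ldb]; c3 += t;
    }

    c[0] = c0; c[1] = c1; c[2] = c2; c[3] = c3;
}
```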
> 3) I noticed the practice of checking the loops at the end, so that the
> code fails if called with any length = 0. This seems reasonable,
> but I thought I'd point it out to ensure that atlas is making the
> calls accordingly.
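
To illustrate point 3: a loop whose test sits at the bottom always runs its body once, so a length of 0 has to be filtered out by the caller. The sketch below is a generic illustration with hypothetical names, not the kernel itself or the way atlas actually makes its calls.

```c
#include <assert.h>
#include <stddef.h>

/* Bottom-tested loop: the body executes before n is examined, so calling
 * this with n == 0 would read x[0] anyway. Precondition: n >= 1. */
static float sum_bottom_tested(const float *x, size_t n)
{
    assert(n >= 1);
    float s = 0.0f;
    size_t i = 0;
    do {
        s += x[i];
        ++i;
    } while (i < n);                    /* loop test at the end */
    return s;
}

/* Caller-side guard: never enter the bottom-tested loop with length 0. */
static float sum_guarded(const float *x, size_t n)
{
    return n ? sum_bottom_tested(x, n) : 0.0f;
}
```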
> 4) I really only did three things, and a few minor cleanups, to
> Peter's code: a) shaved an instruction off the main block of 4
> multiplies, b) tightened the writing of C, and c) with these, and
> the elimination of a few extraneous instructions, increased the
> optimal KB to 60 or 64.
> 5) Peter, if you'd like to make these changes in your generator, and
> maintain this code or its equivalent, that would be just fine with
> me. You're doing a great job, and atlas is all the better for it!
Thank you. Please send the changes to me in some easy-to-read way, and I
will update the code generator. Thanks for your feedback.
> 6) I've got a cleanup too, which works but isn't fully optimized, if
> anyone would like to look at it.
I am working on the k-cleanup now, using an idea of yours that Clint
mentioned: loop over every 4th column of A, because then you can use
aligned loads, since those columns will all be aligned the same way. Hopefully I can get
something working, but it seems to be the toughest problem to get good
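
The arithmetic behind the every-4th-column idea can be checked with a small helper. This is a sketch under assumptions: column-major storage, a leading dimension lda counted in 4-byte floats, and a 16-byte alignment requirement for aligned SSE loads; the function name is hypothetical. Columns j and j+4 start 4 * lda * 4 = 16 * lda bytes apart, which is always a multiple of 16, so they share the same offset within a 16-byte block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset (0..15) of the start of column j within a 16-byte block.
 * base is the address of the matrix as an integer; lda is the leading
 * dimension in floats. Hypothetical helper for illustration only. */
static unsigned column_misalignment(uintptr_t base, size_t lda, size_t j)
{
    assert(sizeof(float) == 4);         /* assumption of this sketch */
    return (unsigned)((base + j * lda * sizeof(float)) % 16u);
}
```

With base = 0 and lda = 3, for example, columns 1 and 5 both start 12 bytes into a 16-byte block, while column 2 starts 8 bytes in.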
> Take care,
> Camm Maguire email@example.com
> "The earth is but one country, and mankind its citizens." -- Baha'u'llah