
Re: PIII kni l2 kernels, all precisions


Sorry for the delay, I've been on vacation, followed by conference,
followed by conference, and am just now catching up on the backlog . . .

>Greetings!  I've put the latest PIII level2 stuff at 

I've downloaded the stuff, but haven't scoped it out yet; the speedups
you mention look great.  I am hoping to concentrate on adding support
for user-contributed cleanup code in GEMM, now that Peter and Viet Nguyen
have given me an example where it makes sense.  I would hope to get
the next developer release, including your stuff, out once that is rolling . . .

>2)  The gemv have a unified source file for all precisions, and
>    transpose/no transpose.
>	a) It might be a good idea to have a -DTRANSPOSE on the
>	    compile line, i.e. like -DSREAL, etc.

I didn't do this because gemvN can be a fundamentally different operation
than gemvT, in particular because it often pays to use an axpy-based code
for gemvN.  But, as you say, it is a judgement call, and could certainly
save some code in some cases . . .
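To make the distinction concrete, here is a hypothetical C sketch (not ATLAS code; column-major storage and unit-stride vectors assumed): gemvN naturally becomes a series of axpys down the columns of A, while gemvT naturally becomes a series of dot products against those same columns.

```c
#include <stddef.h>

/* Hypothetical sketch of why gemvN and gemvT can be fundamentally
 * different operations.  A is M x N, column-major, leading dim lda.
 * gemvN (y += A*x):   one axpy per column, accumulating into all of y.
 * gemvT (y += A^T*x): one dot product per column, accumulating a scalar. */
static void gemvN_axpy(size_t M, size_t N, const double *A, size_t lda,
                       const double *x, double *y)
{
   for (size_t j = 0; j < N; j++)
   {
      const double xj = x[j];
      for (size_t i = 0; i < M; i++)
         y[i] += A[i + j*lda] * xj;    /* unit stride through A(:,j) */
   }
}

static void gemvT_dot(size_t M, size_t N, const double *A, size_t lda,
                      const double *x, double *y)
{
   for (size_t j = 0; j < N; j++)
   {
      double dot = 0.0;
      for (size_t i = 0; i < M; i++)
         dot += A[i + j*lda] * x[i];   /* unit stride through A(:,j) */
      y[j] += dot;
   }
}
```

Both keep unit stride through A, but the axpy form updates a length-M vector per column while the dot form keeps a scalar in register, so register blocking and prefetch strategy can differ enough that a single macro-switched source is a squeeze.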

>	d)  I thought this might speed development when doing the
>	    merge originally, but am not so sure now.  Opinions as to the
>	    best way to manage this stuff are appreciated.

I'm not sure I understand the question, but I usually develop the double and
double complex kernels as type-specific code, and then generalize them for
SREAL and SCPLX; this obviously doesn't work for you, since SREAL is
fundamentally different from DREAL for KNI stuff.  It's not apparent
to me that selecting the type via a cpp macro saves you any time at all when
the precisions are different, in which case I'd be tempted to just develop the
typed routines, and satisfy ATLAS's cpp expectations in the most
straightforward way . . .
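For reference, the sort of cpp convention being discussed looks roughly like the following sketch: one source file compiled once per precision with -DSREAL, -DDREAL, etc. on the compile line.  The macro names and the scal routine here are illustrative, not actual ATLAS conventions (though the token-pasting Mjoin trick is standard cpp).

```c
/* Hypothetical single-source precision scheme: compile this file with
 * -DSREAL or -DDREAL to get the typed routine.  The default below is
 * only so the sketch compiles standalone. */
#if !defined(SREAL) && !defined(DREAL)
   #define DREAL
#endif

#ifdef SREAL
   #define TYPE float
   #define PRE  s
#else  /* DREAL */
   #define TYPE double
   #define PRE  d
#endif

/* two-level macro so PRE is expanded before the ## paste */
#define Mjoin_(a, b) a ## b
#define Mjoin(a, b)  Mjoin_(a, b)

/* expands to ATL_sscal or ATL_dscal depending on the compile line */
void Mjoin(Mjoin(ATL_, PRE), scal)(int N, TYPE alpha, TYPE *X)
{
   for (int i = 0; i < N; i++)
      X[i] *= alpha;
}
```

The win is one source file for all precisions; the cost, as noted above, is that it buys nothing when the single and double algorithms genuinely diverge, as with KNI.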

>	a) I don't suppose kni level1 routines are of any value?

At the moment, ATLAS has no support for optimized Level 1 at all, much
less user contribution.  I'm afraid it will be a while before we will
find the time to add this support explicitly . . .

>I'd like to take a shot at a gemm kernel when I get back.

That will be interesting, and all the more reason for me to get the cleanup
fixed . . .

>PS.  Has anyone seen the following performance comparison with atlas,
>and/or have any comments?

Interesting.  The only real bias I see in the numbers is that they iterate
on multiples of Greg Henry's block size, and they use ATLAS 2.0, which
is not as good at small problems and cleanup as the newest stuff.  My
assumption is that they are using Greg's GEMM: you will notice you cannot
download the gemm code from that page, and the rest of the codes are
gemm-based BLAS, so there is no way to verify the numbers.  In general,
ATLAS tends to beat Greg's released PII code, so I have some trouble
believing these numbers would hold up in reality.  However, for their
largest problem (512x512), ATLAS gets around what I would expect for
a 233MHz PII.  If I had the time, I'd benchmark Greg's code at these sizes
and see if it really does get these numbers: Greg's cleanup code is not
very good, so it is possible that he does indeed beat ATLAS on all multiples
of 32, but just loses to ATLAS on the rest (or at least, he used to,
back when I benchmarked the two codes); that would explain why my timings
usually show ATLAS beating Greg's: I usually time 100-1000 in multiples of
100 . . .
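The multiples-of-32 effect falls straight out of how blocked GEMM handles fringes.  A hypothetical sketch (row-major, not ATLAS or Greg's code): the optimized kernel covers only full NB x NB blocks, and generic code mops up the remainder, so if that cleanup code is slow, performance looks great on multiples of NB and sags everywhere else.

```c
#include <stddef.h>

#define NB 4   /* tiny so the fringe shows up at small sizes;
                  a real block size would be more like 32 */

/* generic "cleanup" code: handles any m x n x k block of C += A*B */
static void mm_generic(size_t m, size_t n, size_t k,
                       const double *A, size_t lda,
                       const double *B, size_t ldb,
                       double *C, size_t ldc)
{
   for (size_t i = 0; i < m; i++)
      for (size_t j = 0; j < n; j++)
      {
         double sum = 0.0;
         for (size_t p = 0; p < k; p++)
            sum += A[i*lda + p] * B[p*ldb + j];
         C[i*ldc + j] += sum;
      }
}

/* blocked C += A*B for square N x N matrices: full NB blocks go to the
 * fast kernel (stood in for by mm_generic here); the fringe (N % NB)
 * falls through to cleanup calls */
static void mm_blocked(size_t N, const double *A, const double *B, double *C)
{
   const size_t Nb = (N / NB) * NB;   /* extent covered by full blocks */
   for (size_t i = 0; i < Nb; i += NB)
      for (size_t j = 0; j < Nb; j += NB)
         for (size_t p = 0; p < Nb; p += NB)     /* fast kernel goes here */
            mm_generic(NB, NB, NB, A + i*N + p, N, B + p*N + j, N,
                       C + i*N + j, N);
   if (Nb < N)   /* cleanup: k-fringe, right columns, bottom rows */
   {
      mm_generic(Nb, Nb, N - Nb, A + Nb, N, B + Nb*N, N, C, N);
      mm_generic(Nb, N - Nb, N, A, N, B + Nb, N, C + Nb, N);
      mm_generic(N - Nb, N, N, A + Nb*N, N, B, N, C + Nb*N, N);
   }
}
```

When N is a multiple of NB, the cleanup calls never fire and the fast kernel does 100% of the flops; at N = multiple-of-NB + 1, a whole fringe of rows, columns, and k-depth runs at the generic code's speed, which is exactly why timing only on multiples of the block size flatters a code with weak cleanup.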