Re: Intel SSE and Atlas.. some questions.
>I think I found a couple of bugs. The first may not be a bug:
>ATLAS finds the best generated matmult has nb=56
>I have a kernel with <mb>=1 <nb>=5 <kb>=16 in scases.dsc
>The GetUserNB() function in ATLAS/tune/blas/gemm/mmsearch.c returns 0
>in this case, hence rejecting my kernel. I fudged a fix by setting
><mb>=-64 <nb>=-64 <kb>=-64.
Setting nb=5 insists that NB be a multiple of 5, which 64 is not. That
is why ATLAS is not using your kernel. If you handle the cleanup cases
where, for instance, nb%5 != 0 yourself, you want to set mb=nb=kb=1. I
think the variables you want to set are mu=1 nu=5 ku=16 . . .
>I think line 678:
> sprintf(ln, "CASES/%ccases.dsc", pre);
>should be:
> sprintf(ln, "../CASES/%ccases.dsc", pre);
>Since mmsearch is running in the ATLAS/tune/blas/gemm/<arch> directory
>(and to match all the other instances of the same line).
Yeah, kinda hard to argue that's not an error. I've applied the fix to
my basefile, thanks.
>I think line 681:
> for (i=0; i <= icase; i++)
>should be:
> for (i=0; i <= icase+1; i++)
>To take into account both the comment line and the line which
>specifies the number of user kernels when reading the best user kernel
>case from <pre>cases.dsc
Yep, another error. The routine is missing a:
fgets(ln, 256, fp); /* skip comment line */
before the for loop.
>I can give you my 1x5x16 matmult. I have that compiling into libatlas.a
>now. It gives an sgemm peak of about 650 MFlops (@450Mhz), not as fast
>as Emmerald, but a start.
OK, but no need to rush. I won't be able to scope out this part of ATLAS
for at least a couple of days (I'm in the middle of some unrelated work).
>I am doing that, but for anything larger than a 16X16 block,
>I need B[NB*4] which is larger than 128, thus cannot be directly
>translated to an indexed assembler instruction such as MOVUPS 48(%edx), %xmm0
I see. And I guess you can't, for instance, use only two columns of B, but
do each column with two accumulators, so that at the end you get
C[i,j] = c00a + c00b
or somesuch? I guess this is probably a bad idea, since it effectively
halves your K loop size, which, as I seem to recall from our previous
discussions, is very important . . .
>> My own temptation is to give the two choices: kernel only, or drop in gemm.
>> The mixed case of changing some internal routines and providing the kernel
>> seems problematic, though I wouldn't rule it out.
>I see your point. Rather than hacking up internal routines it would
>be easier for me to implement better clean up code in emmerald, and use
>the drop in option. But I don't want to give up on creating an ATLAS
>kernel yet. I still have a couple of ideas.
Great. If you do supply your drop-in replacement, nothing says you can't
call ATLAS's internals for cleanup. So if you really get hung up on not
being able to get enough performance to satisfy you with the pure-kernel
approach, you could, for instance, supply the best kernel you come up with,
add your own drop-in SGEMM, and call ATLAS for the cleanup cases from your
SGEMM . . .
Just to expand a bit on this point: even with a drop-in GEMM replacement,
ATLAS still uses its own generated small-case (non-copied) code, and finds
via timings where the supplied GEMM begins to beat the small-case code.
The best copied kernel found by the search is actually still in the library;
it is just no longer called by the ATLAS interface routine, so all the
machinery is still there to be called by the user GEMM, if required.
>I am currently trying to force ATLAS to compile my kernel with
><mb>=-160 <nb>=-160 <kb>=-160
>I am getting
>xemit_mm: ../emit_mm.c:874: regKloop: Assertion `i == ((i)/lat)*lat' failed.
If you are using separate multiply and add instructions, it is required
that (mu*nu*ku) % lat == 0, unless ku == KB. This is inherent in the pattern
matching necessary to do this kind of pipelining. You can avoid this
limitation either by setting your lat to an accepted value, or by saying
your kernel uses a combined muladd.