Trying to build an SSE optimized sgemm kernel for Athlon...



Hi Everyone,
   I'm trying to build an optimized single precision matrix * matrix (AxB)
kernel for the Athlon.  However, I'm having trouble getting high
throughput, and I thought maybe someone here could help me out.  I'm using SSE.

The kernel of my code.. involves multiplying a 64x64 submatrix of A
times a 64x64 submatrix of B.  The submatrices are prefetched into
cache.. and this kernel should fly at the speed of light.  Both the A
and B submatrices are in L1.  My efforts to date are just for testing
purposes, so the blocking factor of 64 is likely to change.  But for those
interested.. I have also tested blocking factors of NB = 36 and NB = 48.
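
(For reference, here's the overall structure in C.  This is just a sketch
with names of my own choosing, assuming the A block is packed row-major and
the B block is packed column-major, so that the rows of A and the columns
of B are both contiguous -- which is what the addressing in the kernel
below implies.)

#define NB 64   /* blocking factor under test; NB = 36 and 48 also tried */

/* Reference version of the block multiply the kernel implements:
 * C_block += A_block * B_block, with all three NB x NB blocks packed
 * contiguously and resident in cache.  A is row-major and B is
 * column-major, so every inner product runs over contiguous memory. */
static void sgemm_block(const float *A, const float *B, float *C)
{
    for (int j = 0; j < NB; j++)            /* columns of B */
        for (int i = 0; i < NB; i++) {      /* rows of A    */
            float s = 0.0f;
            for (int k = 0; k < NB; k++)
                s += A[i * NB + k] * B[j * NB + k];
            C[j * NB + i] += s;
        }
}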

I multiply 4 rows of submatrix A at a time against a single column of
submatrix B.  Then I move to the next 4 rows of submatrix A... and so
on.  The entire multiplication of submatrix A times a "single" column
of B is completely unrolled.  Then I loop over the columns of B.
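
(One pass of that scheme in scalar C, again just an illustrative sketch:
acc0..acc3 play the role of the four xmm accumulators, except that each
xmm register really carries 4 partial sums that get reduced at the end.)

/* Dot four consecutive rows of A against one contiguous column of B.
 * Bcol points at the current column of the packed B block. */
for (int i = 0; i < NB; i += 4) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (int k = 0; k < NB; k++) {          /* fully unrolled in the asm */
        acc0 += A[(i + 0) * NB + k] * Bcol[k];
        acc1 += A[(i + 1) * NB + k] * Bcol[k];
        acc2 += A[(i + 2) * NB + k] * Bcol[k];
        acc3 += A[(i + 3) * NB + k] * Bcol[k];
    }
    /* acc0..acc3 now hold four finished dot products */
}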

It's pivotal that I get "stellar performance" in the dot product of 4
rows of submatrix A against the 64 floats in a column of the B submatrix.
The data is arranged as such:

** register "edi" points to the first element of submatrix A

** register "esi" points to the column of submatrix B

Notes:
======
I bias the edi and esi registers by 128 bytes so that all 16 accesses
across the 64 floats (256 bytes) of a row of A use displacements in the
range [-128, +112], each of which fits in a signed 8-bit displacement byte.
In this format:

[edi-128] == address of the first element of the first row of submatrix A
[edi+112] == address of the last 4 elements of the first row of submatrix A

SSE uses xmm registers and each contains 4 floats.. or 16 bytes.  So I
load 16 bytes at a time into the xmm registers.

Ok.. the code goes something like this:
=========================================================================
.
..
...

add edi,128
add esi,128
mov eax,256 ; size in bytes of a single row of submatrix A
mov ebx,768 ; size in bytes of 3 rows of submatrix A

xorps xmm4,xmm4 ; clear the four dot-product accumulators
xorps xmm5,xmm5 ;   (xmm0-xmm7 are the only xmm registers
xorps xmm6,xmm6 ;    available in 32-bit code)
xorps xmm7,xmm7

movaps xmm0,XMMWORD PTR [edi-128]      ; First 4 floats of row 1 of A
movaps xmm1,XMMWORD PTR [edi+eax-128]  ; First 4 floats of row 2 of A
movaps xmm2,XMMWORD PTR [edi+eax*2-128]; First 4 floats of row 3 of A
movaps xmm3,XMMWORD PTR [edi+ebx-128]  ; First 4 floats of row 4 of A
mulps xmm0,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 1 with col
mulps xmm1,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 2 with col
mulps xmm2,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 3 with col
mulps xmm3,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 4 with col
addps xmm4,xmm0            ; accumulate dot product of row 1 with col
addps xmm5,xmm1            ; accumulate dot product of row 2 with col
addps xmm6,xmm2            ; accumulate dot product of row 3 with col
addps xmm7,xmm3            ; accumulate dot product of row 4 with col

;  WE HAVE HANDLED 4 FLOATS now.. so we must load xmm registers 
;      with data 16 bytes in front of our previous accesses

movaps xmm0,XMMWORD PTR [edi-112]      ; second group of 4 floats per row
movaps xmm1,XMMWORD PTR [edi+eax-112]
movaps xmm2,XMMWORD PTR [edi+eax*2-112]
movaps xmm3,XMMWORD PTR [edi+ebx-112]
mulps xmm0,XMMWORD PTR [esi-112]       ; times the next 4 floats of the col
mulps xmm1,XMMWORD PTR [esi-112]
mulps xmm2,XMMWORD PTR [esi-112]
mulps xmm3,XMMWORD PTR [esi-112]
addps xmm4,xmm0
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3

movaps xmm0,XMMWORD PTR [edi-96]       ; third group of 4 floats per row
movaps xmm1,XMMWORD PTR [edi+eax-96]
movaps xmm2,XMMWORD PTR [edi+eax*2-96]
movaps xmm3,XMMWORD PTR [edi+ebx-96]
mulps xmm0,XMMWORD PTR [esi-96]
mulps xmm1,XMMWORD PTR [esi-96]
mulps xmm2,XMMWORD PTR [esi-96]
mulps xmm3,XMMWORD PTR [esi-96]
addps xmm4,xmm0
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3

movaps xmm0,XMMWORD PTR [edi-80]       ; fourth group of 4 floats per row
movaps xmm1,XMMWORD PTR [edi+eax-80]
movaps xmm2,XMMWORD PTR [edi+eax*2-80]
movaps xmm3,XMMWORD PTR [edi+ebx-80]
mulps xmm0,XMMWORD PTR [esi-80]
mulps xmm1,XMMWORD PTR [esi-80]
mulps xmm2,XMMWORD PTR [esi-80]
mulps xmm3,XMMWORD PTR [esi-80]
addps xmm4,xmm0
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3

.
..
...
=========================================================================
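
(Not part of the original kernel, but for completeness: each accumulator
ends a column holding 4 partial sums, so the four dot products still need a
horizontal reduction.  Below is the same package in SSE1 intrinsics, plus
one way to do that reduction -- names like dot4 and hsum_ps are mine, and
haddps is deliberately avoided since it's SSE3, which the Athlon lacks.)

#include <xmmintrin.h>   /* SSE1 intrinsics */

enum { NB = 64 };

/* Sum the 4 lanes of v using only SSE1 shuffles. */
static float hsum_ps(__m128 v)
{
    __m128 t = _mm_add_ps(v, _mm_shuffle_ps(v, v, 0xB1));  /* swap pairs  */
    t = _mm_add_ps(t, _mm_shuffle_ps(t, t, 0x4E));         /* swap halves */
    float s;
    _mm_store_ss(&s, t);
    return s;
}

/* Dot rows i..i+3 of the packed A block against one packed column of B.
 * The k loop body is one 12-instruction "package"; the asm fully unrolls
 * it.  The final reduction is what the excerpt above elides. */
static void dot4(const float *A, const float *Bcol, float out[4], int i)
{
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    for (int k = 0; k < NB; k += 4) {
        __m128 b = _mm_load_ps(Bcol + k);   /* 16 bytes of the column */
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_load_ps(A + (i+0)*NB + k), b));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_load_ps(A + (i+1)*NB + k), b));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_load_ps(A + (i+2)*NB + k), b));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_load_ps(A + (i+3)*NB + k), b));
    }
    out[0] = hsum_ps(acc0);
    out[1] = hsum_ps(acc1);
    out[2] = hsum_ps(acc2);
    out[3] = hsum_ps(acc3);
}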


I'm not getting stellar performance out of each 12-SSE-instruction package
above.  Each package performs 32 floating point operations, and it's
taking, I believe, about 13 cycles to execute each package.  Consequently,
the throughput works out to 32/13 = 2.46 FLOPS/cycle.  This is much too
low; the Athlon's FADD and FMUL pipes should, in principle, sustain
something closer to 4 FLOPS/cycle.  Does anybody see anything wrong with
how I've set up these instructions?  I do realize that the first move
instruction in each package is 4 bytes and the other 3 are 5 bytes, which
means I cannot decode more than 1 of them in any given clock cycle.  Is
this a problem, or can the Athlon only decode AND EXECUTE 1 movaps
instruction per clock cycle?

Any and all help is greatly appreciated.  I do not have a P4 and am not
familiar with its capabilities.. though I wonder how many of the following
instructions:

movaps
mulps
addps

the P4 can execute in a given clock cycle.
Thanks for any assistance...

tim wilkens

BTW.. this message has also been posted on comp.lang.asm.x86 here:

http://groups.google.com/groups?hl=en&group=comp.lang.asm.x86&selm=7b1e74d1.0110211952.146ca68a%40posting.google.com