This works fine assuming is real, otherwise must be applied explicitly as a complex scalar, and then set in the above outline.
Therefore, in order to use this trick, upon copying and to block-major storage, ATLAS also splits the arrays into real and imaginary components. The only matrix not expressed as two real matrices is then , and to fix this problem, ATLAS demands that the complex L1 matmul stride by 2. An example will solidify the confusion.
A simple 3-loop implementation of an ATLAS complex L1 matmul is:
void ATL_USERMM (const int M, const int N, const int K, const double alpha, const double *A, const int lda, const double *B, const int ldb, const double beta, double *C, const int ldc) { int i, j, k; register double c00; for (j=0; j < N; j++) { for (i=0; i < M; i++) { #ifdef BETA0 c00 = 0.0; #else c00 = C[2*(i+j*ldc)]; #ifdef BETAX c00 *= beta; #endif #endif for (k=0; k < K; k++) c00 += A[k+i*lda] * B[k+j*ldb]; C[2*(i+j*ldc)] = c00; } } }
First we test that it produces the right answer:
make cmmutstcase pre=z nb=40 mmrout=../CASES/ATL_cmm1x1x1.c beta=0 make cmmutstcase pre=z nb=40 mmrout=../CASES/ATL_cmm1x1x1.c beta=1 make cmmutstcase pre=z nb=40 mmrout=../CASES/ATL_cmm1x1x1.c beta=8
Then we scope the awesome performance:
make cmmucase pre=z nb=40 mmrout=../CASES/ATL_cmm1x1x1.c beta=1 zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251 zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251 zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
Now, since we are clearly gluttons for punishment, we compare our masterwork to ATLAS's generated kernel:
speedy. cat res/zMMRES MULADD LAT NB MU NU KU FFTCH IFTCH NFTCH MFLOP 0 5 36 2 1 36 0 2 1 186.42 10 speedy. make mmcase muladd=0 lat=5 nb=36 mu=2 nu=1 ku=36 beta=1 dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.500000, mflop=195.581952 dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420 dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420