
P4 throttling in LU



Guys,

I'm rather amazed, but throttling appears to be quite easy to force on the
P4.  In short, we seem to have it happen with SSE2, SSE1, or straight x86
code, and it appears that it can happen in full GEMM and LU codes!
LU is particularly surprising: my example code is calling sgetrf, which
should have a mixture of level 1 and 3 ops doing x86 FPU code, and Level 3
ops using SSE1.  It's on a 1000x1000 matrix, which won't completely fit in
the L2 cache, so I would expect memory fetch to be providing some rest for
the fpu as well.  Nonetheless, if run long enough, LU performance degrades
from 2.4Gflop down to around 1.7Gflop (takes about 8 min to degrade fully).
All I'm doing is successively timing sgetrf of a 1000x1000 identity matrix.
Note that just to be sure I'm not crazy, I ran the same code on my Athlon,
and experienced no slowdown regardless of how long I ran it . . .
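
For readers without the tarball, the timing loop described above can be
sketched roughly as follows.  This is only an illustration, not the code in
thr.tar.gz: naive_lu is a hypothetical, unblocked stand-in for sgetrf (so it
will run far slower than ATLAS), and the helper names and the conventional
(2/3)n^3 flop count are assumptions.  With that count, a 1000x1000
factorization is about 0.67 Gflop of work, i.e. roughly 0.28 s per call at
2.4 Gflop and 0.39 s per call at 1.7 Gflop.

```c
#include <stddef.h>
#include <time.h>

/* Conventional flop count for LU of an n x n matrix: (2/3) n^3 (assumed). */
double lu_flops(int n)
{
    return 2.0 * n * (double)n * n / 3.0;
}

/* Fill a column-major n x n matrix with the identity. */
void set_identity(float *a, int n)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            a[i + (size_t)j * n] = (i == j) ? 1.0f : 0.0f;
}

/* Hypothetical stand-in for sgetrf: unblocked LU with partial pivoting on a
 * column-major matrix.  Returns 0 on success, k+1 if column k has no
 * nonzero pivot. */
int naive_lu(float *a, int n, int *ipiv)
{
    for (int k = 0; k < n; k++) {
        /* Find the largest-magnitude entry in column k, rows k..n-1. */
        int p = k;
        float maxv = a[k + (size_t)k * n];
        if (maxv < 0.0f) maxv = -maxv;
        for (int i = k + 1; i < n; i++) {
            float v = a[i + (size_t)k * n];
            if (v < 0.0f) v = -v;
            if (v > maxv) { maxv = v; p = i; }
        }
        ipiv[k] = p;
        if (maxv == 0.0f) return k + 1;
        /* Swap rows k and p across all columns. */
        if (p != k) {
            for (int j = 0; j < n; j++) {
                float t = a[k + (size_t)j * n];
                a[k + (size_t)j * n] = a[p + (size_t)j * n];
                a[p + (size_t)j * n] = t;
            }
        }
        /* Scale the column below the pivot, then do the rank-1 update. */
        float inv = 1.0f / a[k + (size_t)k * n];
        for (int i = k + 1; i < n; i++)
            a[i + (size_t)k * n] *= inv;
        for (int j = k + 1; j < n; j++) {
            float akj = a[k + (size_t)j * n];
            for (int i = k + 1; i < n; i++)
                a[i + (size_t)j * n] -= a[i + (size_t)k * n] * akj;
        }
    }
    return 0;
}

/* Factor the identity `reps` times and return the average Gflop rate.
 * Only the factorization is timed; clock() measures CPU time.
 * Returns 0.0 if the elapsed time is below the timer resolution. */
double time_lu_gflops(float *a, int *ipiv, int n, int reps)
{
    clock_t ticks = 0;
    for (int r = 0; r < reps; r++) {
        set_identity(a, n);
        clock_t t0 = clock();
        naive_lu(a, n, ipiv);
        ticks += clock() - t0;
    }
    double secs = (double)ticks / CLOCKS_PER_SEC;
    if (secs <= 0.0) return 0.0;
    return reps * lu_flops(n) / secs * 1e-9;
}
```

To look for throttling, one would call time_lu_gflops with n=1000 in an
outer loop for several minutes, print each rate, and watch whether it
drifts downward; swapping naive_lu for the real sgetrf call is what makes
the numbers comparable to those quoted above.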

I include below a tar.gz of the code necessary to get throttling calling
full GEMM and LU (assuming you have previously installed ATLAS).  My previous
mail shows how to demonstrate the problem using only an SSE1 kernel, and
Peter's original mail demonstrated it with SSE2.  Using the fc.c from my
previous mail, 
   make mmcase pre=d muladd=0 lat=1 nb=80 mu=1 nu=6 ku=80 moves=""
in ATLAS/tune/blas/gemm/<arch> does the same for an x86 FPU kernel.

Anyway, the throttling of an actual code such as LU is far more alarming than
any of this kernel stuff.  My initial reaction is quite strongly negative,
in that it really makes performance of real codes questionable.  I guess one
thing in favor of the P4 is that this particular machine is sitting in
my office, which is not extremely cold, but is air conditioned . . .

Clint

thr.tar.gz