
Re: 3.3.10



Camm,

>The issue is that the gemm kernel I submitted, with following
>dcases.dsc line:
>200   8   4   1   4 1 1 4  1  4 ATL_gemm_SSE.c          "Camm Maguire"
>
>Is never timed for nb's as high as 80 on torc19, for example.  Timing
>stops at much lower nb's, which, if I recall, you said was due to the
>unreliable cache detection on the P4.

Ah yes.  Actually, the problem is the timing is too reliable :)  The P4
actually has an 8K L1, but I don't think the FPU uses it (I know for sure
this is true on IA64; I have found no proof on the P4, but the timings seem
to indicate it), so the best case is in-L2 blockings, not in-L1.
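
As a rough back-of-the-envelope check on why nb=80 cannot be an in-L1
blocking, the sketch below computes the footprint of three nb x nb blocks.
The three-block model and the function name are assumptions made for
illustration, not ATLAS's actual search logic:

```c
#include <stddef.h>

/* Rough model (an assumption, not ATLAS's search logic): one nb x nb block
 * of each of A, B and C is resident in cache at the same time. */
static size_t gemm_block_bytes(size_t nb, size_t elem_size)
{
    return 3 * nb * nb * elem_size;
}
```

Under that model, double precision with nb=18 needs 7776 bytes, which fits
in 8K, while nb=80 needs 153600 bytes, which can only live in L2.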

We can fix this with an additional line in the dcases.dsc, as you say.  I'll
look into it when I update the arch defaults for the P4 . . .

>BTW, the docs don't specify how you define the muladd and latency
>parameters.  Could you please explain briefly?  Some of the code I've
>submitted may have non-conventional values.

This turns out to be useless information in ATLAS right now.  I originally
specified it because it is very important when the code is generated, but
not when it is prewritten.  Latency is the FPU latency: the number of cycles
you wait until you use the output of a flop.  Muladd is 0 if you program using
separate add and multiply instructions (usually separated by lat instructions),
or 1 if you use the combined multiply add instruction (c += a*b) . . .
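
To make the two settings concrete, here is a minimal sketch of a dot product
written both ways. The function names are made up for illustration and these
are not ATLAS kernels; the second version assumes lat=2, with each add
trailing its multiply by roughly two instructions:

```c
#include <stddef.h>

/* muladd = 1 style: every update is written as a combined
 * multiply-add (c += a*b). */
double dot_muladd1(const double *x, const double *y, size_t n)
{
    double c = 0.0;
    for (size_t i = 0; i < n; i++)
        c += x[i] * y[i];
    return c;
}

/* muladd = 0 style, assuming lat = 2: multiplies and adds are separate
 * statements, and a product is consumed only after later work is issued. */
double dot_muladd0_lat2(const double *x, const double *y, size_t n)
{
    double c = 0.0;
    size_t i = 0;
    if (n >= 2) {
        double m0 = x[0] * y[0];        /* prime the pipeline */
        double m1 = x[1] * y[1];
        for (i = 2; i + 1 < n; i += 2) {
            c += m0;                    /* product from the previous pass */
            m0 = x[i] * y[i];
            c += m1;
            m1 = x[i + 1] * y[i + 1];
        }
        c += m0;                        /* drain the two products in flight */
        c += m1;
    }
    for (; i < n; i++)                  /* leftover element, or n < 2 */
        c += x[i] * y[i];
    return c;
}
```

Both versions add the products in the same order; only the scheduling of
the multiplies relative to the adds differs.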

>1) On machines with little l2 cache, the k6 for example with 64KB,
>   supplying make config with a cache size of 128 or 256 results in a
>   build failure, as atlas can't get the timing parameters to within
>   tolerance.  Does this make sense, or does this have nothing to do
>   with cache size, but rather indicates some external machine load
>   during the timing?  I tried to leave the machine quiet.

Tough to say.  But I thought K6's had a really odd L1, combined inst+data,
64K or something?  If so, instruction load could do some really strange things
if your L2 is not big enough to catch overflow during OS interruptions.  Anyway,
I really don't know . . .

>2) request for config.c to output some simple line which a script
>   could read indicating what ISA extensions are going to be used.  My
>   understanding is that the current possibilities are
>   sse,sse2,3dnow,ev5,sparc64. (If it's easier, I could post this to
>   sourceforge.) 

The line you want is in Make.inc's ARCHDEFS macro.  It defines the OS
(e.g. Linux, SunOS, etc.) and the Arch (ATHLON, PIII, etc.).  It also has an
optional parameter that is an ISA extension.  The options on this are
presently:

   -DATL_AltiVec
   -DATL_SSE2
   -DATL_SSE1
   -DATL_3DNow2
   -DATL_3DNow1

They are searched in that order (so if a machine has both SSE and 3DNow!,
ATLAS would label it only SSE) . . .
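
If a script-like helper is wanted, something along these lines could pick
the extension out of an ARCHDEFS string using the same search order.  This
is only a sketch: the function name is hypothetical, it is not part of
config.c, and only the five flags listed above are taken from ATLAS:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of ATLAS: returns the first ISA-extension
 * flag found in an ARCHDEFS string, checking in the order listed above,
 * or NULL when none of the flags is present. */
static const char *first_isa_flag(const char *archdefs)
{
    static const char *const flags[] = {
        "-DATL_AltiVec", "-DATL_SSE2", "-DATL_SSE1",
        "-DATL_3DNow2", "-DATL_3DNow1"
    };
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++)
        if (strstr(archdefs, flags[i]) != NULL)
            return flags[i];
    return NULL;
}
```

Handing it the text of the ARCHDEFS line from Make.inc would then return
either one of the five flags or NULL.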

>3) I can't close the sourceforge issue assigned to me, which was fixed
>   with last night's cvs commit.  There appears to be no item on the
>   web page allowing me to do so.  Have I missed something?

OK, that's because I have not added you and Peter to the admin list for
the feature tracker.  What I was thinking was that I would close a given
feature once I rolled it into a developer release, and you'd add a note
saying "this is in, when you do the next devel release, it will be done".
That's what I've been doing myself (keeping them open until a dev release
fixing/adding them).  Does it seem like it would work better if I gave
developers admin privileges for the feature list?

>4) I've got a few l1 kernels, but the cases lines seem a bit messy.
>   For example:
> 4  2  1  scal_44_SSE.c     "C. Maguire" 
> 5  2  1  scal_45_SSE.c     "C. Maguire" 
> 6  2  1  scal_46_SSE.c     "C. Maguire" 
>  ....

Ow.  Looks a little long.  We discussed this a bit before, and as we kind of
hit on then, I think the best idea for right now is to find a couple of
target cases, and only enter them.  I think we may want to think one day of
a "normal" install, and a "give me the best no matter how long it takes"
install.  For the latter, having a mechanism to simply retime all prefetched
kernels with differing prefetch distances would be useful.  I've been thinking
about maybe standardizing some macro names for various prefetch distances, to
allow this kind of thing.  Still not ready for a concrete proposal, though . . .

Anyway, I think we ought to think about it for future releases, and just go
with a couple of good cases for now . . .

>	These files are identical save the definition of two CPP
>	macros indicating how far ahead to prefetch, and how far to
>	unroll the loop.  Is there any cleaner way of telling the
>	timer to simply take the file and time with ranges -DKB={beg
>	to end} -DPF={beg to end}?

There's a real clean way to do this for the Level 2, and a slightly less clean
one for the Level 1.  The level 2 stuff is in the doc/atlas_contrib.ps of the
latest dev release.  Getting the same functionality into the Level 3 and 1
might be a good feature request (you, or anybody, should be able to submit,
I think) . . .
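
For the single-file idea itself, a kernel parameterized by the two macros
might look roughly like the sketch below.  KB and PF are the names from the
question; the default values, the function name, and the use of the GCC/Clang
__builtin_prefetch in place of hand-written SSE are all assumptions made for
illustration:

```c
#include <stddef.h>

#ifndef KB
#define KB 4            /* unroll factor; placeholder default */
#endif
#ifndef PF
#define PF 16           /* prefetch distance in elements; placeholder default */
#endif

#if defined(__GNUC__)
#define PREFETCH(p_) __builtin_prefetch((p_))
#else
#define PREFETCH(p_) ((void)0)  /* no-op where the builtin is unavailable */
#endif

/* Computes x[i] *= alpha for i in [0, n). */
void scal_kb_pf(size_t n, float alpha, float *x)
{
    size_t i = 0;
    for (; i + KB <= n; i += KB) {
        if (i + PF < n)                     /* stay inside the array */
            PREFETCH(&x[i + PF]);
        for (size_t k = 0; k < KB; k++)     /* constant trip count, so the
                                               compiler is free to unroll */
            x[i + k] *= alpha;
    }
    for (; i < n; i++)                      /* remainder when n % KB != 0 */
        x[i] *= alpha;
}
```

The same file can then be compiled with different -DKB= and -DPF= values on
the compiler command line, one build per point in the range to be timed.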

>5) You had indicated that you'd like all prefetch stuff to use your
>   macros.  This can be done if you would supply an assembler version
>   in addition to the __asm__ wrapped one, e.g.
>#define a_prefetch(a_,b_,c_) "prefetch" str(a_) " " str(b_) "(" str(c_) ")\n\t"
>   where a_ is the prefetch flavour, b_ is the offset, and c_ is the
>   register containing the base address.  Or if you'd like to do
>   macros with different names for the different flavours, that's
>   obviously ok too.  I'm not sure if much mileage will be gained
>   here, as all the stuff I'm writing these days is just sse code,
>   which won't run anyway on athlon, etc.

Au contraire.  An AthlonMP has both SSE and 3DNow!.  SSE is IEEE compliant,
but slower, so it is not impossible that SSE-flops + 3DNow! prefetch would
turn out to provide the best code :)

So, I think that it might indeed turn out to be useful to have cpp macros
which do prefetch for the various assembly implementations, but I'm not
sure it is possible at that level of abstraction.  Not being an assembly
guy, I'm not sure I can figure it out, but I figure at that level you are
using actual register names, which will probably vary by arch . . .
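
To show what such a macro expands to, here is a small sketch built on the
a_prefetch form suggested above.  The str helpers and the example function
are assumptions, and the operand is spelled in x86 AT&T syntax, which is
exactly the arch-specific part:

```c
/* Two-level stringification: expand the argument first, then quote it. */
#define str_(x_) #x_
#define str(x_) str_(x_)

/* a_ = prefetch flavour, b_ = byte offset, c_ = base register. */
#define a_prefetch(a_, b_, c_) \
    "prefetch" str(a_) " " str(b_) "(" str(c_) ")\n\t"

/* Returns the assembler text for one prefetch.  %esi is an x86 register
 * name; another architecture would need a different spelling here. */
static const char *example_prefetch_text(void)
{
    return a_prefetch(nta, 64, %esi);
}
```

The macro only pastes text together, so the caller still has to supply a
flavour and a register name that the target assembler accepts.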

Cheers,
Clint