
Re: 3.3.10



R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Camm,
> 
> >The issue is that the gemm kernel I submitted, with the following
> >dcases.dsc line:
> >200   8   4   1   4 1 1 4  1  4 ATL_gemm_SSE.c          "Camm Maguire"
> >
> >Is never timed for nb's as high as 80 on torc19, for example.  Timing
> >stops at much lower nb's, which, if I recall, you said was due to the
> >unreliable cache detection on the P4.
> 
> Ah yes.  Actually, the problem is the timing is too reliable :)  The P4
> actually has an 8K L1, but I don't think the FPU uses it (I know for sure
> this is true on IA64; I have found no proof on the P4, but the timings seem
> to indicate it), so the best case is in-L2 blockings, not in-L1.
> 
> We can fix this with an additional line in the dcases.dsc, as you say.  I'll
> look into it when I update the arch defaults for the P4 . . .

OK, let me know if you need suggestions.  I know what works on torc19,
but I don't know in general what would be the best cases to time.

> >1) On machines with little l2 cache, the k6 for example with 64KB,
> >   supplying make config with a cache size of 128 or 256 results in a
> >   build failure, as atlas can't get the timing parameters to within
> >   tolerance.  Does this make sense, or does this have nothing to do
> >   with cache size, but rather indicates some external machine load
> >   during the timing?  I tried to leave the machine quiet.
> 
> Tough to say.  But I thought K6's had a really odd L1, combined inst+data,
> 64K or something?  If so, instruction load could do some really strange things
> if your L2 is not big enough to catch overflow during OS interruptions.  Anyway,
> I really don't know . . .
> 

What's odd about this is that the build finishes if make config uses a
*larger* cache flushing size than appropriate for the present cache.
The build even completes if this figure is so large as to cause swapping on
the system, although of course the timing results are then
meaningless.

So I take it though that to your understanding, there should be no
reason why flushing less cache could cause timing runs to fail to get
enough precision?  I know that flushing too little cache will not give
realistic numbers, but it would seem that at least they'd be
reproducible. 

> >2) request for config.c to output some simple line which a script
> >   could read indicating what ISA extensions are going to be used.  My
> >   understanding is that the current possibilities are
> >   sse,sse2,3dnow,ev5,sparc64. (If it's easier, I could post this to
> >   sourceforge.) 
> 
> The line you want is in Make.inc's ARCHDEFS macro.  It defines the OS
> (eg Linux, SunOS, etc), the Arch (ATHLON, PIII, etc).  It also has an
> optional parameter that is an ISA extension.  The options on this are
> presently:
> 
>    -DATL_AltiVec
>    -DATL_SSE2
>    -DATL_SSE1
>    -DATL_3DNow2
>    -DATL_3DNow1
> 

OK, Thanks!  But, as we had discussed some time ago, doesn't the GOTO
code use instructions requiring an ev5 or higher, and isn't there some
sparc64 code somewhere?  Shouldn't these be on this list, in the sense
that they are code which will run on only some of the machines of the
general architecture?
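
For a script that wants a single line to read, something like the following might do.  This is only a sketch: it assumes the ARCHDEFS flags are passed on the compile line and tests them in the order they are listed above.  The function names and the "ISAEXT=" line format are made up for illustration and are not part of ATLAS.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: name the ISA extension selected by ARCHDEFS,
 * testing the flags in the order they are listed above. */
const char *example_isa_ext(void)
{
#if defined(ATL_AltiVec)
    return "AltiVec";
#elif defined(ATL_SSE2)
    return "SSE2";
#elif defined(ATL_SSE1)
    return "SSE1";
#elif defined(ATL_3DNow2)
    return "3DNow2";
#elif defined(ATL_3DNow1)
    return "3DNow1";
#else
    return "none";   /* no extension flag was defined */
#endif
}

/* Write one "ISAEXT=<name>" line into buf for a script to parse.
 * Returns the number of characters written, excluding the terminator. */
int example_isa_line(char *buf, size_t len)
{
    int n = snprintf(buf, len, "ISAEXT=%s\n", example_isa_ext());
    assert(n > 0 && (size_t)n < len);   /* line must fit in the buffer */
    return n;
}
```

Extensions that have no ARCHDEFS flag (the ev5 and sparc64 cases asked about above) fall through to "none" here.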

> They are searched in that order (so if a machine has both SSE and 3DNow!,
> ATLAS would label it only SSE) . . .
> 
> >3) I can't close the sourceforge issue assigned to me, which was fixed
> >   with last night's cvs commit.  There appears to be no item on the
> >   web page allowing me to do so.  Have I missed something?
> 
> OK, that's because I have not added you and Peter to the admin list for
> the feature tracker.  What I was thinking was that I would close a given
> feature once I rolled it into a developer release, and you'd add a note
> saying "this is in, when you do the next devel release, it will be done".
> That's what I've been doing myself (keeping them open until a dev release
> fixing/adding them).  Does it seem like it would work better if I gave
> developers admin privileges for the feature list?
> 

Not a big deal -- I just wanted to make sure you saw the fix.  I take
it the cvs commits and comments are routinely brought to your
attention by the sourceforge system.  No need to add me if it's not
useful.  

> >4) I've got a few l1 kernels, but the cases lines seem a bit messy.
> >   For example:
> > 4  2  1  scal_44_SSE.c     "C. Maguire" 
> > 5  2  1  scal_45_SSE.c     "C. Maguire" 
> > 6  2  1  scal_46_SSE.c     "C. Maguire" 
>   ....
> 
> Ow.  Looks a little long.  We discussed this a bit before, and as we kind of
> hit on then, I think the best idea for right now is to find a couple of
> target cases, and only enter them.  I think we may want to think one day of
> a "normal" install, and a "give me the best no matter how long it takes"
> install.  For the latter, having a mechanism to simply retime all prefetched
> kernels with differing prefetch distances would be useful.  I've been thinking
> about maybe standardizing some macro names for various prefetch distances, to
> allow this kind of thing.  Still not ready for a concrete proposal, though . . .
> 

I've been mulling this over too, and, while it's doubtless a feature
request for the future, I think the ideal situation would go something
like this.  Have C and assembler macros for prefetch which only take
one argument, C object or register, respectively.  The other
parameters, flavour and distance, would be controlled by atlas via cpp
#defines.  Then for each routine, get an estimate of the cycles per
cache line by blocking as much as will fit into l1 and letting the
code iterate.  With this number, time 3 prefetch distance cases, one
simply the time to load a cache line from main memory divided by the
processing time of a cache line, and the other two plus/minus one
cache line from this calculated value.  Would simplify things a lot, it
would seem, as well as adapt atlas smoothly to machines with different
clock speeds.  Just an idea.  It is a bit cumbersome, as we had
discussed, as the processing time per cache line is very much routine
dependent.  But it should be a timeable quantity, no?
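
The arithmetic in the paragraph above can be sketched as follows.  The helper names are hypothetical, and rounding a fractional distance up to the next whole cache line is an assumption; the inputs would come from the timer, not from constants.

```c
#include <assert.h>

/* Hypothetical helper, not ATLAS code.  Central prefetch distance in
 * cache lines: main-memory load time divided by the per-line processing
 * time, rounded up (the rounding is an assumption). */
int pf_center(double mem_latency_cycles, double cycles_per_line)
{
    assert(mem_latency_cycles > 0.0 && cycles_per_line > 0.0);
    int d = (int)(mem_latency_cycles / cycles_per_line);
    if ((double)d * cycles_per_line < mem_latency_cycles)
        d++;                      /* round up a fractional distance */
    return d;
}

/* The three distances to time: center - 1, center, center + 1. */
void pf_candidates(double mem_latency_cycles, double cycles_per_line,
                   int out[3])
{
    int c = pf_center(mem_latency_cycles, cycles_per_line);
    out[0] = c > 1 ? c - 1 : 1;   /* clamp so distance 0 is never timed */
    out[1] = c;
    out[2] = c + 1;
}
```

With purely illustrative numbers, a 200 cycle load from main memory and 40 cycles of work per cache line give candidate distances of 4, 5 and 6 lines.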

> Anyway, I think we ought to think about it for future releases, and just go
> with a couple of good cases for now . . .
> 
> >	These files are identical save the definition of two CPP
> >	macros indicating how far ahead to prefetch, and how far to
> >	unroll the loop.  Is there any cleaner way of telling the
> >	timer to simply take the file and time with ranges -DKB={beg
> >	to end} -DPF={beg to end}?
> 
> There's a real clean way to do this for the Level 2, and a slightly less clean
> for the Level 1.  The level 2 stuff is in the doc/atlas_contrib.ps of the
> latest dev release.  Getting the same functionality into the Level 3 and 1
> might be a good feature request (you, or anybody, should be able to submit,
> I think) . . .
> 

OK, I'll try to look at this.


> >5) You had indicated that you'd like all prefetch stuff to use your
> >   macros.  This can be done if you would supply an assembler version
> >   in addition to the __asm__ wrapped one, e.g.
> >#define a_prefetch(a_,b_,c_) "prefetch" str(a_) " " str(b_) "(" str(c_) ")\n\t"
> >   where a_ is the prefetch flavour, b_ is the offset, and c_ is the
> >   register containing the base address.  Or if you'd like to do
> >   macros with different names for the different flavours, that's
> >   obviously ok too.  I'm not sure if much mileage will be gained
> >   here, as all the stuff I'm writing these days is just sse code,
> >   which won't run anyway on athlon, etc.
> 
> Au contraire.  An AthlonMP has both SSE and 3DNow!.  SSE is IEEE compliant,
> but slower, so it is not impossible that SSE-flops + 3DNow! prefetch would
> turn out to provide the best code :)
> 

:-)

> So, I think that it might indeed turn out to be useful to have cpp macros
> which do prefetch for the various assembly implementations, but I'm  not
> sure it is possible at that level of abstraction.  Not being an assembly
> guy, I'm not sure I can figure it out, but I figure at that level you are
> using actual register names, which will probably vary by arch . . .
> 

If you want, I could suggest a patch to the prefetch header file.
Again, for the future.
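
As a rough sketch of what the assembler-string macro from item 5 would expand to: str() is assumed here to be an ordinary two-level stringification macro, and the function names, the nta flavour, the offset of 64 and the %esi register are chosen only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumed helper: two-level stringification so macro arguments are
 * expanded before being turned into a string. */
#define str_(x) #x
#define str(x)  str_(x)

/* a_ = prefetch flavour, b_ = byte offset, c_ = base address register. */
#define a_prefetch(a_, b_, c_) \
    "prefetch" str(a_) " " str(b_) "(" str(c_) ")\n\t"

/* Returns the assembler text for one prefetch, in AT&T syntax. */
const char *example_prefetch_line(void)
{
    return a_prefetch(nta, 64, %esi);
}

/* Checks the expansion: adjacent string literals are concatenated. */
void example_check(void)
{
    assert(strcmp(example_prefetch_line(),
                  "prefetchnta 64(%esi)\n\t") == 0);
}
```

The register name is still written out literally at the call site, which is the arch dependence mentioned above; the macro only factors out the flavour and the offset.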

Take care,

> Cheers,
> Clint
> 
> 

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah