
Re: IA64 timings


>> You asked a while back whether anyone has gotten any decent results on this
>> machine.  Well, I finally have.  I'm getting roughly 70% of theoretical
>> peak now.  On the 666Mhz Itanium that Compaq has at their testdrive, the best
>> time I've seen on a full DGEMM is around 1845Mflop.  I include more complete
>> timings below.  The performance could probably be tweaked further, but
>> this will be good enough until after the release.
>I only succeeded in building ATLAS with gcc, which gave reasonable, but not
>great performance.
>ATLAS breaks SGI's C compiler - I am still waiting for an updated version.
>As an interim measure, please can I have a copy of your binaries?
>Also which compiler did you use - I guess it was Digital's ?

I used gcc.  I have no access to docs or anything, so I had no special flags,
just plain gcc -O.  The first thing I noticed is that this loop:
   for (k=K; k; k--)
is roughly 50% faster than this loop:
   for (k=72; k; k--)
(assuming K=72).  This is one of the reasons the previous version of ATLAS
doesn't do so well: the search heuristic never thinks to substitute a runtime
variable for a compile-time constant (the assumption being that loop 2 should
always be at least as fast as loop 1).  Making ATLAS's loops run-time variables
raised the generated code's performance from 1Gflop to 1.4Gflop.
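
To make the comparison concrete, here is a minimal sketch of the two loop
forms wrapped around a simple dot product.  The function names and the dot
product body are assumptions made for illustration; this is not ATLAS's
generated kernel.  Which form is faster depends on the compiler and machine;
the 50% figure above is for gcc -O on the Itanium.

```c
#include <assert.h>

#define KB 72   /* compile-time trip count; 72 is the value used above */

/* Loop 1: the trip count K is only known at run time. */
double dot_runtime(const double *a, const double *b, int K)
{
    double sum = 0.0;
    int k;

    assert(K >= 0);
    for (k = K; k; k--)
        sum += *a++ * *b++;
    return sum;
}

/* Loop 2: the trip count is a compile-time constant, so the compiler
   is free to unroll or otherwise specialize the loop. */
double dot_constant(const double *a, const double *b)
{
    double sum = 0.0;
    int k;

    for (k = KB; k; k--)
        sum += *a++ * *b++;
    return sum;
}
```

Timing both on vectors of length 72 (repeated enough times to get a stable
reading) is one way to check which form a given compiler prefers.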

However, the real key to performance on this machine is prefetching to
registers.  The generator presently has no real prefetch mechanisms, so
I wrote a user kernel, and that is what gets us to 1.8Gflop.
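
As a rough illustration of the idea (not the actual user kernel, which is
not shown here), prefetching to registers can be read as issuing the loads
for the next iteration before doing the arithmetic of the current one, so
the operands are already in registers when they are needed.  The function
name and the dot product body below are assumptions for the sketch.

```c
#include <assert.h>

/* Dot product with the loads for iteration k issued one iteration early.
   a0/b0 hold the operands being consumed; a1/b1 hold the ones fetched
   ahead of time for the next pass through the loop. */
double dot_regprefetch(const double *a, const double *b, int K)
{
    double sum = 0.0;
    double a0, b0, a1, b1;
    int k;

    assert(K >= 1);
    a0 = a[0];                /* prime the pipeline */
    b0 = b[0];
    for (k = 1; k < K; k++) {
        a1 = a[k];            /* fetch next operands first ... */
        b1 = b[k];
        sum += a0 * b0;       /* ... then use the ones already loaded */
        a0 = a1;
        b0 = b1;
    }
    sum += a0 * b0;           /* drain the last pair */
    return sum;
}
```

Whether the compiler keeps this ordering is up to its scheduler, so a real
kernel would need to check the generated assembly.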

I talked with Bruce Greer of Intel, and it turns out that MKL is now sitting
around 80-83% of peak for DGEMM, so quite a bit better than ATLAS's roughly
70%.  However, I think 70% will do for now, and perhaps as I get access to
some compiler docs, etc., this can be pushed a little further . . .

>  Is not the theoretical peak for singles:  2MMX x 4FPU x 666MHz = 5333 Mflops?

Not as far as I know.  MMX does not do floating point at all; for that you
need SSE, which IA64 does not have.  So, I believe that theoretical peak is
the same for single as for double: 4 flops per cycle, i.e. 4*MHz Mflop.
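
The arithmetic behind these figures can be sketched as follows, assuming
the 4 flops per cycle stated above; the helper names are made up for the
example.

```c
#include <assert.h>

/* Theoretical peak in Mflop: flops issued per cycle times clock in MHz. */
double peak_mflops(double flops_per_cycle, double clock_mhz)
{
    assert(flops_per_cycle > 0.0 && clock_mhz > 0.0);
    return flops_per_cycle * clock_mhz;
}

/* Fraction of peak achieved by a measured Mflop rate. */
double fraction_of_peak(double measured_mflops, double peak)
{
    assert(peak > 0.0);
    return measured_mflops / peak;
}
```

With peak_mflops(4.0, 666.0) the peak comes to 2664 Mflop, and the best
DGEMM time of 1845 Mflop works out to about 69% of that, which matches the
"roughly 70%" quoted earlier.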

>As an interim measure, please can I have a copy of your binaries?

Why don't you try the developer release?  I think that guy should get 
you all the performance we've got.  The stable release will be out before
Xmas . . .