
Re: Timing roundup (P4, IA64, Athlon)



Hi Clint!  This looks great!  Congratulations!

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Guys,
> 
> I include below some interesting timings, where I compare the following three
> systems, all running linux, and using a default ATLAS 3.2.0 install:
> 
> ATH : 1Ghz Athlon, SDRAM                           $1269
> P4  : 1.5Ghz Pentium 4, Rambus                     $2109
> IA64: 666Mhz Itanium, no idea on mem               ?????
> 
> So, the first thing to note is that the Athlon is using the old memory
> type (SDRAM, not the newer DDR SDRAM, or whatever the hell it is), and the
> P4 is using Rambus.  I have no idea what the Itanium has.  The price above
> is not what we paid for the machines (I have no idea), it's what Gateway
> tells me those machines with 256MB of memory cost.
> 
> All the numbers here are using the P4's normal FPU.  This machine will
> need SSE2 to really shine (that will pump its theoretical peak to 2*MHz).
> However, the normal FPU is what you get just using gcc on linux, so it's what
> linux people will be getting for a while, as well as MSVC++ people (Intel has
> a compiler for Windows that apparently generates SSE2 code automatically,
> which is what MKL is apparently already using to get dmatmul > MHz).
> 

Should be very little work (comparatively) to port the current SSE
stuff to SSE2.  Problem is, I don't have access to any such machine.
If you're ever interested in this project and would like my help (and
if I can find the time :-)), I'd love to see ATLAS shine in this area
too and could perhaps assist if you could provide ssh access to a
linux P4 somewhere.  Or I'm sure Peter's generator could do just fine
too.  Perhaps this is best put off though for some time to give us a
rest from the release!


These timings are very interesting.  How does the Athlon manage to
double the FPU peak with the ordinary instructions?  They must have
some scheduler on the chip farming out consistent sets of FPU
instructions to two different units?

Take care,

> So, the good news is that the P4 looks a lot like a PIII at the greater
> clock speed, even when using the normal FPU (I had heard rumors that the
> P4 FPU was crippled), since you get roughly 72% of peak with dgemm (the
> exact number the PII gets; PIIIs typically get more like 76%).  Here are
> some peak numbers (extracted from the detailed timings below):
> 
> 
>                   Theo   dMatmul      dLU    dMM %     dLU %
>             Mhz   peak   (MFLOP)   (MFLOP)   of Mhz    of Mhz
>            ====   ====   =======   ======    ======    ======
> ATH :      1000   2000    1192.6    1003.1    119.3     100.3
> P4  :      1500   1500    1073.9     986.1     71.6      65.7
> IA64:       666   2664    1866.3    1336.0    280.2     200.6
> 
>            Theoretical    dMatmul      dLU   dLU %
>            peak (Mflop)    % peak   % peak   of dMM
>            ============   =======   ======   ======
> ATH :             2000       59.6     50.2     84.1
> P4  :             1500       71.6     65.7     91.8
> IA64:             2664       70.1     50.2     71.6
> 
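
The percentage columns in the two tables above follow directly from the
measured peaks.  A minimal Python sketch (not part of the original timings;
the names `peaks` and `derived` are illustrative only) that recomputes them
from the clock, theoretical peak, dMatmul and dLU figures as quoted:

```python
# Values copied from the peak tables quoted above:
# name -> (clock in MHz, theoretical peak MFLOPS, dMatmul MFLOPS, dLU MFLOPS)
peaks = {
    "ATH":  (1000, 2000, 1192.6, 1003.1),
    "P4":   (1500, 1500, 1073.9,  986.1),
    "IA64": ( 666, 2664, 1866.3, 1336.0),
}

def derived(mhz, peak, dmm, dlu):
    """Return the five derived percentages shown in the two tables."""
    return {
        "dMM % of MHz":  100.0 * dmm / mhz,
        "dLU % of MHz":  100.0 * dlu / mhz,
        "dMM % of peak": 100.0 * dmm / peak,
        "dLU % of peak": 100.0 * dlu / peak,
        "dLU % of dMM":  100.0 * dlu / dmm,
    }

table = {name: derived(*vals) for name, vals in peaks.items()}

for name, cols in table.items():
    # One line per machine, each percentage to one decimal place.
    print(name, "  ".join(f"{label}: {value:.1f}" for label, value in cols.items()))
```

The results should agree with the quoted columns to within rounding in the
last digit.
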
> 
> OK, so peak performance-wise (where N=3000 is the largest size I timed: both
> Athlon and IA64 LU numbers were still getting better, as you would expect by
> looking at their LU % of MM numbers), without SSE2, it looks like the P4 will
> need to be clocked about 1.66 times faster than an Athlon to maintain the same
> GEMM peak, and about 1.53 times faster to maintain the same LU peak.  Since
> the LU peak should be perked up quite a bit by faster memory, it may look more
> like the MM numbers soon.  So, under these conditions, Athlon is the fp king
> of the two.  Athlon is far and away the flops/$ champion, and as far as I
> know, this is true of any machine on the market.
> 
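
The 1.66 and 1.53 clock ratios and the flops/$ claim in the paragraph above
can be checked with a few lines of arithmetic.  This is a rough Python sketch,
not part of the original timings: it assumes MFLOPS scale linearly with clock
speed, uses the Gateway list prices quoted at the top as the cost, and the
helper names `clock_ratio` and `mflops_per_dollar` are made up for
illustration.  The IA64 is left out because no price was given for it.

```python
# Peak figures and Gateway prices copied from the quoted message.
ath = {"mhz": 1000, "dmm": 1192.6, "dlu": 1003.1, "usd": 1269}
p4  = {"mhz": 1500, "dmm": 1073.9, "dlu":  986.1, "usd": 2109}

def clock_ratio(kernel):
    """Clock multiple the P4 needs over the Athlon to match it on `kernel`.

    Assumes (hypothetically) that MFLOPS scale linearly with clock speed,
    so only the per-MHz efficiency of each chip matters.
    """
    return (ath[kernel] / ath["mhz"]) / (p4[kernel] / p4["mhz"])

def mflops_per_dollar(machine, kernel):
    """Measured peak MFLOPS divided by the quoted system price."""
    return machine[kernel] / machine["usd"]

gemm_ratio = clock_ratio("dmm")
lu_ratio = clock_ratio("dlu")

print(f"P4/Athlon clock ratio for equal GEMM peak: {gemm_ratio:.2f}")
print(f"P4/Athlon clock ratio for equal LU peak:   {lu_ratio:.2f}")
print(f"Athlon dMM MFLOPS per dollar: {mflops_per_dollar(ath, 'dmm'):.2f}")
print(f"P4     dMM MFLOPS per dollar: {mflops_per_dollar(p4, 'dmm'):.2f}")
```

Computed from the unrounded MFLOPS figures, the GEMM ratio comes out slightly
above the 1.66 quoted, which is within the rounding of the percentage columns.
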
> Anyway, the full timings are given below.  You'll see that the P4 does well
> early (probably due to superior memory), with the IA64 doing really poorly
> for small problems (memory again).
> 
> Cheers,
> Clint
> 
> 
>              100    200    300    400    500    600    700    800    900   1000
>            =====  =====  =====  =====  =====  =====  =====  =====  =====  =====
> ATH  dMM   909.1 1010.5 1080.0 1163.6 1087.0 1136.8 1143.3 1190.7 1205.0 1156.1
> P4   dMM   952.4 1010.5 1080.0  984.6 1041.7 1080.0 1055.4 1077.9 1088.1 1075.3
> IA64 dMM   866.3 1247.9 1472.4 1566.6 1570.6 1708.0 1645.1 1730.3 1710.2 1741.5
> 
> ATH  dLU   477.4  611.8  695.0  709.8  780.1  777.4  815.8  793.1  823.0  865.2
> P4   dLU   435.8  611.8  718.2  788.6  805.2  821.8  878.5  874.4  882.9  888.2
> IA64 dLU   241.2  419.4  554.3  652.8  754.2  800.4  832.4  873.0  926.0  937.0
> 
>             1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
>            =====  =====  =====  =====  =====  =====  =====  =====  =====  =====
> ATH  dMM  1183.6 1172.6 1175.3 1192.6 1179.9 1175.3 1189.7 1191.2 1190.1 1187.3
> P4   dMM  1066.7 1067.7 1066.7 1065.3 1071.0 1073.9 1073.3 1072.7 1073.4  852.3
> IA64 dMM  1789.2 1809.9 1820.4 1858.1 1840.1 1823.3 1832.5 1810.5 1862.2 1866.3
> 
> ATH  dLU   878.8  923.4  925.2  943.2  950.3  965.5  974.9  983.5  994.6 1003.1
> P4   dLU   906.5  932.8  937.9  950.2  955.4  965.5  969.8  977.8  975.4  986.1
> IA64 dLU   990.7 1047.1 1077.9 1149.2 1179.4 1208.7 1240.7 1272.7 1305.7 1336.0
> 
>                           GEMM   SYMM   SYRK  SYR2K   TRMM   TRSM
>                          =====  =====  =====  =====  =====  =====
> ATH-1     500           1136.4 1000.0  835.0 1087.0  961.5  961.5
> P4-1.5    500           1041.7  961.5  835.0 1000.0  892.9 1041.7
> IA64-666  500           1610.1 1201.9 1462.9 1462.9 1082.5  816.6
> 
> 

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah