Re: P4 timing update



Hi Clint!  Great!

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Guys,
> 
> As I said, the first timings I sent out were just a default ATLAS install.
> I have since done some twiddling, which has resulted in the P4 getting
> an additional 15% or so speedup.  With this speedup, the 1.5Ghz P4 finally
> beats the 1Ghz Athlon for double precision flops.  As a matter of fact,
> this chip does pretty well up against the IA64 as well . . .
> 
> The trick was twofold: first, since the L1 cache is so small, and the L2
> cache is so fast, I found it was better to ignore the L1 cache and choose
> a large NB (say in the range of 72-80).  The second trick is that the P4
> seems to be better at out of order, or register renaming, or something along
> those lines than the PIII, 'cause you can choose lat=1 (rather than the real
> lat=12 or so) and use the extra registers for better register blockings.
> 

I wonder if this will apply to the SSE stuff too.  As you may recall,
we found, much to our surprise, that no pipelining performed best on
the P3, i.e. each multiply followed directly by its dependent add:
mul a,b ; add b,c.  I think I now understand your latency parameter to
refer to what I'd been calling pipeline depth.  If so, is it worth
reinvestigating the SSE pipeline conclusion?  Has Intel documented
anywhere how this *should* work?
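
For a rough sense of why an NB in the 72-80 range means giving up on
the L1 cache, here is a minimal Python sketch of the block footprint.
The cache sizes (8 KB L1 data, 256 KB L2) are an assumption about this
P4 and are not stated in Clint's mail; the helper name block_bytes is
made up for illustration, and ATLAS's actual kernel need not keep three
full blocks resident at once.

```python
# Footprint of NB x NB blocks of doubles versus assumed cache sizes.
# ASSUMPTION: 8 KB L1 data cache and 256 KB L2; neither figure is in the mail.
L1_BYTES = 8 * 1024
L2_BYTES = 256 * 1024
DOUBLE_BYTES = 8


def block_bytes(nb, nblocks=1):
    """Bytes occupied by nblocks square NB x NB blocks of doubles."""
    return nblocks * nb * nb * DOUBLE_BYTES


for nb in (72, 80):
    one = block_bytes(nb)
    three = block_bytes(nb, 3)  # one block each of A, B and C
    print(f"NB={nb}: one block {one} bytes, three blocks {three} bytes, "
          f"one fits L1: {one <= L1_BYTES}, three fit L2: {three <= L2_BYTES}")
```

Under those assumed sizes a single 80x80 block of doubles is 51200
bytes, far more than L1 holds, while three such blocks (153600 bytes)
still fit in L2.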

Take care,

> I include the double precision results below.  In the new install, I can't
> yet time single precision, because the new, larger, NB breaks the SSE cleanup
> routines, but it looks like if we can fix that, sMM peak will go from
> around 3.7Gflop to 4.2Gflop.
> 
> Some people seemed confused, so here's an explanation of P4's theoretical peak:
> (1) If you are using the x86 FPU, theoretical peak for all precisions is
>     the Mhz of the machine (so 1.5Gflop for our P4)
> (2) If you are using SSE1 instructions for single precision, the theoretical
>     peak is 4*Mhz (6Gflop)
> (3) If you are using SSE2 instructions for double precision, the theoretical
>     peak is 2*Mhz (3Gflop)
> 
> ATLAS presently uses (2) for single precision, and (1) for double.  Thus the
> 1262.3MFLOP observed dmatmul timing represents roughly 84% of theoretical
> peak. 
> 
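
The peak arithmetic above can be spelled out as a small Python sketch.
The flops-per-cycle figures are taken straight from cases (1)-(3); the
names peak_mflops and efficiency are hypothetical helpers, not part of
ATLAS.

```python
# Theoretical peak = clock (MHz) * flops per cycle, following cases (1)-(3).
FLOPS_PER_CYCLE = {
    "x87": 1,   # case (1): x86 FPU, all precisions
    "sse1": 4,  # case (2): SSE1, single precision
    "sse2": 2,  # case (3): SSE2, double precision
}


def peak_mflops(clock_mhz, unit):
    """Theoretical peak in MFLOPS for the given clock and instruction set."""
    return clock_mhz * FLOPS_PER_CYCLE[unit]


def efficiency(observed_mflops, clock_mhz, unit):
    """Observed rate as a fraction of theoretical peak."""
    return observed_mflops / peak_mflops(clock_mhz, unit)


clock = 1500  # the 1.5GHz P4
for unit in ("x87", "sse1", "sse2"):
    print(unit, peak_mflops(clock, unit), "MFLOPS peak")

# dMM at N=1800 from the table, measured against the x87 peak
print(f"dMM efficiency: {efficiency(1262.3, clock, 'x87'):.1%}")
```

The same helper puts the quoted sMM figures of 3.7Gflop and 4.2Gflop at
roughly 62% and 70% of the 6Gflop SSE1 peak.
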
> Cheers,
> Clint
> 
> ATH : 1Ghz Athlon, SDRAM                           $1269
> P4  : 1.5Ghz Pentium 4, Rambus                     $2109
> IA64: 666Mhz Itanium, no idea on mem               ?????
> P40  : my original, non-optimal, ATLAS install on the P4
> 
>              100    200    300    400    500    600    700    800    900   1000
>           ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
> ATH  dMM   909.1 1010.5 1080.0 1163.6 1087.0 1136.8 1143.3 1190.7 1205.0 1156.1
> P4   dMM  1025.6 1194.0 1181.2 1238.7 1209.7 1234.3 1247.3 1264.2 1276.8 1242.2
> P40  dMM   952.4 1010.5 1080.0  984.6 1041.7 1080.0 1055.4 1077.9 1088.1 1075.3
> IA64 dMM   866.3 1247.9 1472.4 1566.6 1570.6 1708.0 1645.1 1730.3 1710.2 1741.5
> 
> ATH  dLU   477.4  611.8  695.0  709.8  780.1  777.4  815.8  793.1  823.0  865.2
> P4   dLU   428.7  659.8  763.1  851.7  887.6  933.9  951.8  974.3 1033.2 1040.9
> P40  dLU   435.8  611.8  718.2  788.6  805.2  821.8  878.5  874.4  882.9  888.2
> IA64 dLU   241.2  419.4  554.3  652.8  754.2  800.4  832.4  873.0  926.0  937.0
> 
>             1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
>           ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
> ATH  dMM  1183.6 1172.6 1175.3 1192.6 1179.9 1175.3 1189.7 1191.2 1190.1 1187.3
> P4   dMM  1256.7 1250.1 1254.5 1262.3 1261.8 1258.6 1261.3 1261.7 1262.0 1260.5
> P40  dMM  1066.7 1067.7 1066.7 1065.3 1071.0 1073.9 1073.3 1072.7 1073.4  852.3
> IA64 dMM  1789.2 1809.9 1820.4 1858.1 1840.1 1823.3 1832.5 1810.5 1862.2 1866.3
> 
> ATH  dLU   878.8  923.4  925.2  943.2  950.3  965.5  974.9  983.5  994.6 1003.1
> P4   dLU  1046.6 1075.5 1091.8 1107.2 1110.7 1128.2 1126.3 1130.7 1139.5 1158.0
> P40  dLU   906.5  932.8  937.9  950.2  955.4  965.5  969.8  977.8  975.4  986.1
> IA64 dLU   990.7 1047.1 1077.9 1149.2 1179.4 1208.7 1240.7 1272.7 1305.7 1336.0
> 
>                           GEMM   SYMM   SYRK  SYR2K   TRMM   TRSM
>                          =====  =====  =====  =====  =====  =====
> ATH       500           1136.4 1000.0  835.0 1087.0  961.5  961.5
> P4        500           1209.7 1171.9 1002.0 1209.7 1056.3 1056.3
> P40       500           1041.7  961.5  835.0 1000.0  892.9 1041.7
> IA64      500           1610.1 1201.9 1462.9 1462.9 1082.5  816.6
> 
> 
> 

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah