
UPDATED MKL5.0 vs. ATLAS3.2.0 on 933MHz PIII



Guys,

This is the same timing mail, but with an error fixed.  I got to wondering
why MKL was beating us for DGER, and when I scoped the defaults, I noticed
it was not using Camm's prefetch ger, but my old axpy-based implementation.
I got rid of the erroneous default, and Camm's stuff provided GER speedup.

Updated timings and discussion follow,
Clint


Some guys at Intel have been asking me to publish some ATLAS vs MKL numbers,
since most of my previous graphs compared against Greg Henri's BLAS.  I used
to compare against Greg's BLAS 'cause MKL wasn't available under Linux, and
it is always a pain for me to get access to a Windows platform.  MKL 5.1
is presently in BETA, and it has a Linux version.  Since it's a BETA,
however, Intel requires you to agree to an NDA saying you won't publish
any benchmarks using it, and the Intel people have been unable to free
me from the NDA.

I've been working on the windows stuff lately, however, and once I figured
out how to call MKL, I was able to get numbers with MKL 5.0, which does not
have a no-publish NDA.  Because I agree that comparing against Greg's stuff
is not the thing to do, I tried to do a fairly wide range of timings to 
clear the air here.  I include these timings below.

If I had to summarize these PIII timings, it would be that ATLAS blows chunks
for Level 1 BLAS, tends to beat MKL for Level 2 BLAS, and varies between
quite a bit slower and quite a bit faster than MKL for Level 3 BLAS, 
depending on problem size and data type.

The Level 1 results are easily explained.  ATLAS's present Level 1 gets
its optimization mainly from the compiler.  This gives MKL two huge
advantages.  First, MKL can use the SSE prefetch instructions to speed up
pretty much all Level 1 ops.  The second advantage is in how ABS() is done.
ABS() *should* be a 1-cycle operation, since you can just mask off the
sign bit.  However, you cannot standardly do bit operations on floats in
ANSI C, so ATLAS has to use an if-type construct instead.  This spells
absolute doom for the performance of NRM2, ASUM and AMAX.
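
To make the ABS() point concrete, here is a minimal sketch of the two
approaches.  It is not code from ATLAS or MKL: the function names are made up
for illustration, and the masking version assumes a 32-bit IEEE 754 float with
the sign in the top bit (and uses C99's <stdint.h>), which is exactly the kind
of assumption strict ANSI C does not let you make.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Branch-style absolute value: the portable "if-type construct". */
static float abs_branch(float x)
{
    return (x < 0.0f) ? -x : x;
}

/* Sign-bit masking.
 * ASSUMPTION: float is a 32-bit IEEE 754 single whose top bit is the sign.
 * memcpy is used to move the bits so no aliasing rules are broken. */
static float abs_mask(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);   /* guard the size assumption */
    memcpy(&bits, &x, sizeof bits);    /* float -> raw bits */
    bits &= 0x7FFFFFFFu;               /* clear the sign bit */
    memcpy(&x, &bits, sizeof x);       /* raw bits -> float */
    return x;
}
```

Both functions return the same value for ordinary inputs; the difference is
that the masked form has no data-dependent branch in it.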

For the Level 2 and 3, ATLAS has its usual advantage of leveraging basic
kernels to the maximum.  This means that all Level 3 ops follow the performance
of GEMM, and Level 2 ops follow GER or GEMV.  MKL has the usual disadvantage
of optimizing all these routines separately, leading to widely varying
performance.
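
As a small illustration of building one routine on top of a lower-level
kernel, the older axpy-based GER mentioned at the top of this mail can be
sketched roughly as follows.  This is a simplified stand-in, not the ATLAS
source: the helper names are hypothetical, storage is assumed column-major
with unit-stride vectors, and there is no prefetch.

```c
#include <assert.h>
#include <stddef.h>

/* y := alpha*x + y for unit-stride vectors (hypothetical helper). */
static void axpy(size_t n, double alpha, const double *x, double *y)
{
    size_t i;
    for (i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* A := alpha*x*y^T + A, with A an m-by-n column-major matrix of leading
 * dimension lda.  Each column of A gets one axpy with scalar alpha*y[j]. */
static void ger_via_axpy(size_t m, size_t n, double alpha,
                         const double *x, const double *y,
                         double *a, size_t lda)
{
    size_t j;
    assert(lda >= m);
    for (j = 0; j < n; j++)
        axpy(m, alpha * y[j], x, a + j * lda);
}
```

Any speedup given to the inner kernel is inherited by the routine built on it,
which is why routine performance tracks kernel performance.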

For Level 2, ATLAS wins for pretty much all operations, sizes and precisions
other than small case [S,D] TRSV and TRMV.  ATLAS's success here is due mainly
to Camm's excellent prefetched Level 2 GEMV and GER kernels.

For the Level 3, we really have a mixed bag.  ATLAS's main weakness is in its
complex TRSM.  This is because TRSM cannot use the GEMM kernel as much as
the rest of the operations.  Anytime TRSM runs slower than TRMM, this is
the reason.  Complex is hit harder than real because I wrote a hand-tuned
kernel for real, while for complex we must recur down to 1.  The fix for this
poor performance requires some theory that we don't yet have: details
of the problem are posted on the developer site, if anyone is interested.
ATLAS is also in general less good at small problems than MKL.

The main weakness of MKL in the Level 3 operations is in its handling of
single precision complex, where it doesn't look like they have SSE
optimizations yet.  MKL also tends to lose to ATLAS on pretty much everything
except GEMM for large problems.

For the factorizations, ATLAS tends to lose for small problems, and win for
large.  In part, this is because we recur down to 1; I am hoping to include
LU and possibly LLt that stop the recursion before one in the next developer
release.  Preliminary timings show this to make a large performance difference
for small problem sizes.  For complex, the poor small-size TRSM performance
also has a definite impact, and a crushing one for LLt.
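
To show what stopping the recursion early means, here is a rough sketch of a
recursive LU with a cutoff.  It makes several simplifying assumptions that are
not ATLAS's: there is no pivoting (real GETRF pivots), the TRSM and GEMM steps
are plain loops rather than tuned kernels, and all the names, including the
cutoff parameter, are hypothetical.  With cutoff=1 it recurs all the way down
to single columns; with a larger cutoff, small panels go to the unblocked loop.

```c
#include <assert.h>
#include <stddef.h>

/* Unblocked, unpivoted LU of an m-by-n column-major panel (m >= n).
 * On exit the unit-lower L and the upper U overwrite the panel. */
static void lu_unblocked(size_t m, size_t n, double *a, size_t lda)
{
    size_t i, j, k;
    for (k = 0; k < n; k++) {
        for (i = k + 1; i < m; i++)
            a[i + k * lda] /= a[k + k * lda];
        for (j = k + 1; j < n; j++)
            for (i = k + 1; i < m; i++)
                a[i + j * lda] -= a[i + k * lda] * a[k + j * lda];
    }
}

/* Recursive LU that splits the columns in half until n <= cutoff. */
static void lu_recursive(size_t m, size_t n, double *a, size_t lda,
                         size_t cutoff)
{
    size_t n1, n2, i, j, k;
    double *a12, *a21, *a22;

    assert(cutoff >= 1 && m >= n && lda >= m);
    if (n <= cutoff) {                 /* stop recurring, use the loop code */
        lu_unblocked(m, n, a, lda);
        return;
    }
    n1 = n / 2;
    n2 = n - n1;
    a12 = a + n1 * lda;                /* n1 x n2 block      */
    a21 = a + n1;                      /* (m-n1) x n1 block  */
    a22 = a + n1 + n1 * lda;           /* (m-n1) x n2 block  */

    lu_recursive(m, n1, a, lda, cutoff);       /* factor left panel */

    /* TRSM step: A12 := inv(L11) * A12, L11 unit lower triangular */
    for (j = 0; j < n2; j++)
        for (k = 0; k < n1; k++)
            for (i = k + 1; i < n1; i++)
                a12[i + j * lda] -= a[i + k * lda] * a12[k + j * lda];

    /* GEMM step: A22 := A22 - A21 * A12 */
    for (j = 0; j < n2; j++)
        for (k = 0; k < n1; k++)
            for (i = 0; i < m - n1; i++)
                a22[i + j * lda] -= a21[i + k * lda] * a12[k + j * lda];

    lu_recursive(m - n1, n2, a22, lda, cutoff); /* factor trailing part */
}
```

The factors come out the same for any cutoff (up to rounding); only the point
at which the recursion hands off to the loop code changes.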

Cheers,
Clint

*******************************************************************************
*                                  NOTES                                      *
*******************************************************************************
All timings were taken on a 933MHz PIII, 256K L2, under Windows 2000, using
MKL 5.0 and ATLAS 3.2.0.

The ATLAS timers were used: this may mean performance is less than with
other timers, as ATLAS flushes the data caches before each call.
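
For anyone unfamiliar with the idea, a cache-flushing timer can be sketched
along these lines.  This is only an illustration of the technique, not the
ATLAS timer: the function names are hypothetical, clock() stands in for
whatever timer is really used, and the caller is assumed to supply a scratch
buffer comfortably larger than the cache (256K L2 on this machine).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Walk a scratch buffer so that previously cached operands are evicted.
 * The running sum is returned so the compiler cannot drop the loop. */
static double flush_cache(double *buf, size_t nelem)
{
    size_t i;
    double sum = 0.0;
    assert(buf != NULL);
    for (i = 0; i < nelem; i++) {
        buf[i] += 1.0;
        sum += buf[i];
    }
    return sum;
}

/* Average seconds per call of op(arg), flushing before every call.
 * Only the call itself is inside the timed region. */
static double time_with_flush(void (*op)(void *), void *arg,
                              double *buf, size_t nelem, int reps)
{
    clock_t total = 0;
    volatile double sink = 0.0;
    int r;
    assert(reps > 0);
    for (r = 0; r < reps; r++) {
        clock_t t0;
        sink += flush_cache(buf, nelem);   /* not timed */
        t0 = clock();
        op(arg);                           /* timed, operands start cold */
        total += clock() - t0;
    }
    (void)sink;
    return (double)total / CLOCKS_PER_SEC / reps;
}
```

A timer that skips the flush lets small problems run with their operands
already in cache, so it will generally report higher numbers.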

For all timings, M=K=N, alpha=1.0, beta=1.0, Side='Left', Uplo='Lower', 
TRANS='Notrans', DIAG='Nonunit', except for the Level 1, where alpha=2.0 for
real, and (2.0, 2.2) for complex.

No timings are given for 500x500 HERK and HER2K for MKL, 'cause this call gave
an access violation.

MKL does not possess the Level 1 routines DSDOT and SDSDOT.

No timings are given for N=100 or 200 complex Cholesky, 'cause our timer
couldn't get enough accuracy to be repeatable.

There's a lot of other timings that could be done, but I'm unlikely to do them.
I will be posting the library I built to do these timings to the prebuilt page
(and it was just a standard ATLAS install, anyway, if you want to install
yourself), if other people would like to time further.

Timings either have problem size or operation along X axis.  When problem
size is along the X axis, library (MKL for MKL 5.0, ATL for ATLAS 3.2.0),
data type (S: single real, D: double real, C: single complex, Z: double complex)
and operation are given along Y.  When operation is along the X axis, 
library, data type and problem size are given along Y.
LU is GETRF, LLT is POTRF.

Theoretical peak for double precision for this machine is 933 MFLOP.  For
single precision using SSE (as both libraries do), theoretical peak is
3.732 GFLOP.
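
The peak figures above are just clock rate times flops per cycle (1 for
double, 4 for SSE single, as implied by the numbers quoted).  A small sketch
of that arithmetic follows, together with the usual 2*N^3 flop count for real
GEMM with M=N=K.  The helper names are hypothetical, and reading the table
entries as MFLOPS is my assumption.

```c
#include <assert.h>

/* Theoretical peak in MFLOPS: clock in MHz times flops issued per cycle. */
static double peak_mflops(double mhz, double flops_per_cycle)
{
    assert(mhz > 0.0 && flops_per_cycle > 0.0);
    return mhz * flops_per_cycle;
}

/* MFLOPS achieved by a real GEMM with M=N=K=n that took `seconds`. */
static double gemm_mflops(double n, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * n * n * n / seconds / 1.0e6;
}

/* Achieved rate as a percentage of theoretical peak. */
static double percent_of_peak(double mflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * mflops / peak;
}
```

For example, percent_of_peak(677.0, peak_mflops(933.0, 1.0)) puts the ATLAS
DGEMM entry at N=1000 at a bit under 73% of double precision peak.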

*******************************************************************************
*                            LEVEL 3 TIMINGS                                  *
*******************************************************************************
             100    200    300    400    500    600    700    800    900   1000
           =====  =====  =====  =====  =====  =====  =====  =====  =====  =====
MKL SGEMM 1327.7 1445.3 1400.6 1672.4 1584.3 1592.5 1661.6 1724.6 1675.9 1662.5
ATL SGEMM  911.6 1359.5 1347.4 1492.8 1502.4 1543.9 1544.5 1569.3 1599.9 1610.3
MKL DGEMM  640.2  648.4  648.0  664.4  680.3  673.9  697.2  704.7  691.3  699.5
ATL DGEMM  551.9  622.3  635.3  646.5  653.6  673.9  665.4  682.7  675.9  677.0
MKL CGEMM  773.8  818.8  766.0  819.2  810.4  825.6  820.6  829.5  825.7  825.8
ATL CGEMM 1094.9 1449.1 1542.9 1561.0 1524.4 1556.8 1554.7 1588.8 1595.2 1610.0
MKL ZGEMM  610.8  664.4  692.3  745.3  727.3  747.4  734.9  753.4  737.6  740.9
ATL ZGEMM  599.0  647.6  727.9  668.4  681.2  682.7  683.3  682.7  688.6  690.0

MKL SLU    477.8  751.1  846.4  839.1  810.1  837.4  812.9  909.4  887.7  906.3
ATL SLU    385.7  633.3  748.1  860.3  931.9  995.3 1019.7 1064.0 1109.9 1152.5
MKL DLU    366.5  462.0  475.6  487.3  484.7  497.6  504.2  519.0  518.2  526.6
ATL DLU    337.5  430.5  459.4  504.6  514.7  525.9  541.3  560.0  555.0  568.4
MKL CLU    606.4  667.2  644.5  641.8  646.1  669.3  664.9  682.3  690.8  696.4
ATL CLU    459.0  681.4  768.5  910.2  969.7 1052.4 1083.1 1134.4 1173.4 1201.3

MKL SLLT   288.4  459.2  568.5  644.8  683.2  753.0  763.9  782.0  779.3  821.2
ATL SLLT   244.3  407.1  530.0  632.1  730.0  808.4  833.9  887.4  953.3  970.4
MKL DLLT   298.5  416.3  428.9  442.4  461.3  461.7  473.4  775.6  486.8  496.8
ATL DLLT   256.5  348.6  403.9  428.3  445.5  478.0  505.9  508.9  501.9  534.1
MKL CLLT                 585.0  613.0  629.4  616.4  639.0  635.1  642.8  648.1
ATL CLLT                 585.0  686.5  715.5  840.3  862.4  912.8  959.1  983.3
MKL ZLLT                 465.0  550.1  695.8  597.3  599.7  616.7  629.9  638.2
ATL ZLLT                 385.9  456.5  466.3  499.3  506.4  527.8  537.8  524.7

                                               HEMM   HERK  HER2K
                                        GEMM   SYMM   SYRK  SYR2K   TRMM   TRSM
                                      ====== ====== ====== ====== ====== ======
MKL S100                              1362.4  581.7  504.0  414.8  800.0  711.2
ATL S100                               941.6 1049.3  688.2  912.3  598.1  542.3
MKL S500                              1560.1 1000.0 1079.7  959.1 1453.5  901.7
ATL S500                              1524.4 1422.5 1144.6 1500.0 1305.5 1102.5
MKL S1000                             1662.5 1163.5 1256.0 1142.9 1560.1 1033.1
ATL S1000                             1600.0 1524.4 1334.7 1600.0 1455.6 1333.3

MKL D100                               640.0  419.7  376.3  326.5  569.0  512.2
ATL D100                               556.3  543.0  400.0  541.5  473.9  465.7
MKL D500                               693.4  551.9  572.8  551.9  648.8  545.1
ATL D500                               666.7  615.8  522.6  666.7  600.0  600.0
MKL D1000                              699.3  606.6  621.7  598.1  666.7  566.6
ATL D1000                              688.2  656.4  587.4  666.7  639.8  639.8

MKL C100                               771.0  533.3  487.5  492.1  522.8  607.9
ATL C100                              1067.2 1033.1  608.1 1023.8  725.1  425.3
MKL C500                               810.4  718.9  712.7  703.2  727.5  728.5
ATL C500                              1522.1 1488.1 1187.2 1488.1 1334.7 1001.0
MKL C1000                              825.8  756.3  753.6  748.6  778.4  740.2
ATL C1000                             1605.1 1585.1 1370.8 1475.5 1515.3 1249.5

MKL Z100                               656.8  473.9  441.0  411.3  579.3  620.9
ATL Z100                               609.8  595.2  457.1  597.6  462.8  392.1
MKL Z500                               718.9  646.8                681.0  653.4
ATL Z500                               681.2  659.6  553.0  681.2  616.4  582.7 
MKL Z1000                              725.2  683.6  692.6  679.0  719.5  672.3
ATL Z1000                              689.1  678.1  625.0  681.8  660.1  638.7


*******************************************************************************
*                            LEVEL 2 TIMINGS                                  *
*******************************************************************************
                                        HEMV                 GERU    HER   HER2
                                 GEMV   SYMV   TRMV   TRSV    GER    SYR   SYR2
                               ====== ====== ====== ====== ====== ====== ======
MKL s100                        253.3  178.8  230.2  223.8  155.4   96.4  164.1
ATL s100                        301.7  323.2  176.8  175.9  188.2  163.3  246.2
MKL s500                        211.9  183.9  175.8  227.0  165.0  101.6  191.6
ATL s500                        340.6  463.8  227.0  223.8  192.8  172.1  283.1
MKL s1000                       319.0  215.5  201.2  301.8  173.3  105.6  195.9
ATL s1000                       414.2  358.5  340.4  333.3  185.4  174.9  273.0

MKL d100                        202.5  146.8  193.9  205.2   97.9   83.8  118.9
ATL d100                        186.1  145.4   89.9   86.0   95.8   83.8  119.0
MKL d500                        166.7  151.7  122.6  178.8  100.9   63.6  115.1
ATL d500                        203.8  192.8  157.6  159.2   96.4   93.0  147.5
MKL d1000                       167.8  152.6  123.1  173.0  100.0   65.6  117.9
ATL d1000                       208.4  189.8  176.8  176.8   95.2   90.4  147.9

MKL c100                        381.1  266.6  200.0  228.6  301.9  177.7  271.1
ATL c100                        695.4  615.7  355.6  296.2  323.2  275.9  421.2
MKL c500                        414.6  275.2  249.7  363.3  312.9  190.3  282.3
ATL c500                        693.7  693.7  581.5  570.9  322.4  307.4  419.9
MKL c1000                       429.0  276.4  258.2  397.0  311.4  182.9  284.1
ATL c1000                       706.1  676.3  635.4  622.6  314.4  300.3  408.2

MKL z100                        268.9  203.8  130.6  154.6  155.3  102.2  191.6
ATL z100                        380.8  304.9  205.1  189.3  192.7  169.3  225.3
MKL z500                        303.9  208.7  153.7  253.7  158.8  106.2  198.1
ATL z500                        375.6  301.0  316.5  307.4  186.7  178.6  234.6
MKL z1000                       317.8  207.7  159.6  294.2  160.4  105.4  195.2
ATL z1000                       373.8  305.5  345.1  341.5  177.5  174.8  224.2

*******************************************************************************
*                            LEVEL 1 TIMINGS                                  *
*******************************************************************************
                                                      DOTU
                  ROTM   SWAP   SCAL   COPY   AXPY     DOT  NRM2   ASUM   AMAX
                 ====== ====== ====== ====== ====== ====== ====== ====== ====== 
MKL s500         246.3   118.5   76.1  114.2  106.6  168.9  276.4  267.4  357.1
ATL s500         152.4    53.3   11.8   56.1   69.6   94.2   26.0   65.3   57.1
MKL d500         168.3    59.3   71.0   54.2   59.2  145.8  145.8  290.7  213.7
ATL d500          82.1    44.4   38.1   30.8   54.2   91.4   22.7   44.4   40.0
MKL c500                 110.1   21.8  110.4  188.0  320.5  641.0  320.5  400.0
ATL c500                  53.3   20.8   56.1  138.9  145.3   52.5   66.7   57.1
MKL z500                  57.1  118.5   60.3  103.3  127.9  454.5  228.3  228.3
ATL z500                  44.4   83.2   30.8   78.0  118.5   45.7   43.8   38.1