PDS: The Performance Database Server

mm_3

****************************************
* Matrix Multiply Algorithm Results    *
* Results file: mm_3.tbl               *
* Source file:  mm.c                   *
* RAM usage: Need at LEAST 10 MBytes   *
* Al Aburto, aburto@nosc.mil           *
* 01 Oct 1997                          *
****************************************

The Matrix Multiply program mm.c is by Mark Smotherman. His email address is: mark@cs.clemson.edu. Please contact Mark regarding the mm.c code, or with questions, comments, and results showing wide variations. Any results I receive (Al Aburto, aburto@nosc.mil) I will pass along to Mark as well.

This table of results is kept at 'ftp.nosc.mil' (128.49.192.51) in directory 'pub/aburto'. You can access this and other programs and results via anonymous ftp. I try to keep everything regularly updated.

mm.c is a collection of nine matrix multiply algorithms. The algorithms and options are shown below. Compile mm.c as: cc -O -DN=500 mm.c -o mm (or with whatever other compiler and compiler options you prefer) and then run mm with the options shown below. NOTE: You must use '-DN=500', otherwise the matrix size will be undefined.

The results are very interesting: they reveal the enormous effect that cache thrashing can have across different machines, algorithms, compilers, and compiler options.

There are even more efficient algorithms tuned for specific machines. Toshinori Maeno (tmaeno@cc.titech.ac.jp) of the Tokyo Institute of Technology has sent me a few examples for HP, IBM, DEC, and Sun.

The MFLOPS rating (for FADD and FMUL) can be obtained from the results. For example, for the T. Maeno algorithm (mm -m 20), the number of FADD and FMUL instructions (weighted equally) is N * N * ( 2 * N + 25 ) = 256250000 (for N = 500). Therefore MFLOPS = N*N*(2*N+25) / (1.0e6 * Runtime), where Runtime is in seconds (see table below). Thus the IBM RS/6000 Model 560 is working at 256250000 / (3.44 * 1.0e6) = 74.5 MFLOPS relative to equally weighted FADD and FMUL instructions. With a properly 'tuned' algorithm this could be further improved.
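
As a sanity check on that arithmetic, here is a minimal C sketch; the runtime is just the Model 560 figure quoted above, not a new measurement:

  #include <stdio.h>

  int main(void)
  {
      double n       = 500.0;
      double runtime = 3.44;                      /* seconds, from the text */
      double fpops   = n * n * (2.0 * n + 25.0);  /* 256250000 for N = 500  */

      printf("MFLOPS = %.1f\n", fpops / (1.0e6 * runtime));  /* prints 74.5 */
      return 0;
  }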

mm -n    :option n - normal matrix multiply
          Floating-Point Operations: 2 * N * N * N
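
For reference, the normal variant is the textbook triple loop, roughly as sketched below. All the sketches in this file assume the global N x N matrices from mm.c (compiled with -DN=500); the function names are illustrative, not Mark's actual code:

  /* Declarations assumed by this and the following sketches. */
  double a[N][N], b[N][N], c[N][N];

  /* Normal (ijk) multiply: N*N*N multiplies plus N*N*N adds = 2*N*N*N
   * FP ops (c assumed zeroed). The inner loop reads b by column, a
   * stride of N doubles, which is what invites cache thrashing.     */
  void mm_normal(void)
  {
      int i, j, k;
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++)
              for (k = 0; k < N; k++)
                  c[i][j] += a[i][k] * b[k][j];
  }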

mm -u 8  :option u - innermost loop unrolled by factor of 8
          Floating-Point Operations: 2 * N * N * N
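
A sketch of unrolling the innermost loop by a factor of 8; mm.c's actual arrangement may differ, and a cleanup loop covers the case where N is not a multiple of 8:

  void mm_unroll8(void)
  {
      int i, j, k;
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
              double sum = 0.0;
              for (k = 0; k + 7 < N; k += 8)   /* 8 muls, 8 adds per pass */
                  sum += a[i][k]   * b[k][j]   + a[i][k+1] * b[k+1][j]
                       + a[i][k+2] * b[k+2][j] + a[i][k+3] * b[k+3][j]
                       + a[i][k+4] * b[k+4][j] + a[i][k+5] * b[k+5][j]
                       + a[i][k+6] * b[k+6][j] + a[i][k+7] * b[k+7][j];
              for (; k < N; k++)               /* cleanup when N % 8 != 0 */
                  sum += a[i][k] * b[k][j];
              c[i][j] = sum;
          }
  }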

mm -t    :option t - matrix multiply using transpose of b matrix
          Floating-Point Operations: (2 * N - 1) * N * N
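
Roughly, the transpose variant first copies b into a transposed scratch matrix (bt below is an assumed name) so that both operands are then read with unit stride. Seeding the running sum with the first product is what yields N multiplies and N-1 adds per element, i.e. (2 * N - 1) * N * N:

  double bt[N][N];                     /* scratch: transpose of b    */

  void mm_transpose(void)
  {
      int i, j, k;
      double sum;
      for (i = 0; i < N; i++)          /* transpose pass (no FP ops) */
          for (j = 0; j < N; j++)
              bt[j][i] = b[i][j];
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
              sum = a[i][0] * bt[j][0];
              for (k = 1; k < N; k++)
                  sum += a[i][k] * bt[j][k];
              c[i][j] = sum;
          }
  }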

mm -b 32 :option b - matrix multiply using blocking (size 32) without unrolling.
          Floating-Point Operations: 2 * N * N * N
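
A generic sketch of blocking (tiling) with a 32 x 32 block; the exact loop order and bounds handling in mm.c may differ (NB and MIN are names introduced here):

  #define NB 32
  #define MIN(x, y) ((x) < (y) ? (x) : (y))

  void mm_blocked(void)
  {
      int ii, jj, kk, i, j, k;
      for (ii = 0; ii < N; ii += NB)
          for (kk = 0; kk < N; kk += NB)
              for (jj = 0; jj < N; jj += NB)
                  for (i = ii; i < MIN(ii + NB, N); i++)
                      for (k = kk; k < MIN(kk + NB, N); k++)
                          for (j = jj; j < MIN(jj + NB, N); j++)
                              c[i][j] += a[i][k] * b[k][j];  /* c zeroed */
  }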

mm -m 20 :option m - matrix multiply using Maeno method of blocking (size 20) and unrolling.
          Floating-Point Operations: (2 * N + 25) * N * N

mm -p    :option p - pointers used to access matrices.
          Floating-Point Operations: (2 * N - 1) * N * N
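
A sketch of the pointer variant: one pointer walks a row of a while another steps down a column of b by jumping N doubles at a time (this relies on b being a contiguous N x N array):

  void mm_pointers(void)
  {
      int i, j, k;
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
              double *ap  = &a[i][0];
              double *bp  = &b[0][j];
              double  sum = *ap++ * *bp;       /* first product seeds sum */
              for (k = 1; k < N; k++) {
                  bp  += N;                    /* next row, same column   */
                  sum += *ap++ * *bp;
              }
              c[i][j] = sum;
          }
  }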

mm -v    :option v - temporary variable in loop.
          Floating-Point Operations: (2 * N - 1) * N * N
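
The temporary-variable version accumulates into a scalar instead of re-reading and re-writing c[i][j] on every iteration, again seeding with the first product; a sketch:

  void mm_tempvar(void)
  {
      int i, j, k;
      double t;
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
              t = a[i][0] * b[0][j];
              for (k = 1; k < N; k++)
                  t += a[i][k] * b[k][j];
              c[i][j] = t;
          }
  }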

mm -i    :option i - interchanged inner loops.
          Floating-Point Operations: 2 * N * N * N
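
Loop interchange reorders the nest to ikj so the innermost loop sweeps rows of b and c with unit stride, which is often the cheapest of the simple variants on cached machines; a sketch (c assumed zeroed):

  void mm_interchange(void)
  {
      int i, j, k;
      double r;
      for (i = 0; i < N; i++)
          for (k = 0; k < N; k++) {
              r = a[i][k];                 /* invariant across inner loop */
              for (j = 0; j < N; j++)
                  c[i][j] += r * b[k][j];
          }
  }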

mm -w 50 :option w - D. Warner algorithm using 50 X 50 subarray
          Floating-Point Operations: 2 * N * N * N

mm -w 20 :option w - D. Warner algorithm using 20 X 20 subarray
          Floating-Point Operations: 2 * N * N * N

>>> RESULTS BELOW SHOW MFLOPS RATING FOR FASTEST OPTION <<<