PDS: The Performance Database Server

Mm_2

****************************************
*  Matrix Multiply Algorithm Results   *
*  Results file: mm_2.tbl              *
*  Source file:  mm.c                  *
*  RAM usage: Need at LEAST 10 MBytes  *
*  Al Aburto, aburto@nosc.mil          *
*  01 Oct 1997                         *
****************************************

The Matrix Multiply program mm.c is by Mark Smotherman. His email address is: mark@cs.clemson.edu. Please contact Mark with questions, comments, or results showing wide variations, or regarding the mm.c code itself. Any results I receive (Al Aburto, aburto@nosc.mil) I'll pass along to Mark as well.

This table of results is kept at 'ftp.nosc.mil' (128.49.192.51) in directory 'pub/aburto'. You can access this and other programs and results via anonymous ftp. I try to keep things regularly updated.

mm.c is a collection of nine matrix multiply algorithms. Four of those algorithms were selected for this database. The algorithms and options are shown below. Compile mm.c as: cc -O -DN=500 mm.c -o mm (or use whatever other compile options you prefer) and then run mm with the options shown below. NOTE: You must use '-DN=500' else the matrix size will be undefined.

The results are very interesting as they reveal the enormous effect that cache thrashing can have on the results with different machines, algorithms, compilers, and compiler options.
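To make the cache effects concrete, here is a sketch of the ideas behind the 'normal' (mm -v) and 'interchanged loops' (mm -i) algorithms. This is my own illustration, not the actual mm.c source; the three N x N double arrays alone take about 6 MBytes at N = 500, which is why the RAM note above applies.

```c
#include <stdio.h>
#define N 500

static double a[N][N], b[N][N], c[N][N];

/* Normal (i,j,k) multiply using a temp variable, in the spirit of 'mm -v'.
   The inner loop strides down a COLUMN of b (b[k][j] with k varying),
   touching a new cache line almost every iteration. */
void mm_normal(void)
{
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            double t = 0.0;
            for (k = 0; k < N; k++)
                t += a[i][k] * b[k][j];
            c[i][j] = t;
        }
}

/* Interchanged (i,k,j) multiply, in the spirit of 'mm -i'.
   The inner loop now walks ROWS of b and c sequentially, so consecutive
   iterations hit consecutive memory locations -- far kinder to the cache. */
void mm_interchanged(void)
{
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++) {
            double t = a[i][k];
            for (j = 0; j < N; j++)
                c[i][j] += t * b[k][j];
        }
}
```

Both versions compute the same product; only the memory access pattern differs, and that difference alone can account for large swings in the table below.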

There are even more efficient algorithms tuned for specific machines. Toshinori Maeno (tmaeno@cc.titech.ac.jp) of the Tokyo Institute of Technology has sent me a few examples for HP, IBM, DEC, and Sun.

The MFLOPS rating (for FADD and FMUL weighted equally) can be obtained from the results. For example, for the D. Warner algorithm (mm -w 50), the number of FADD and FMUL instructions is 2*N*N*N = 250,000,000 (for N = 500). Therefore MFLOPS = 2*N*N*N / (1.0e6 * Runtime), where Runtime is in seconds (see table below). Thus the IBM RS/6000 Model 950 is working at 250000000/3.65 = 68.5 million equally weighted FADD and FMUL instructions per second, i.e. 68.5 MFLOPS, with the D. Warner algorithm at blocking size 50. With a properly 'tuned' algorithm this could be improved further.

mm -p    :option p - matrix multiply using pointers
mm -v    :option v - normal matrix multiply using temp variable
mm -i    :option i - matrix multiply with interchanged loops
mm -w 50 :option w - matrix multiply using D. Warner method of
                     blocking (size 50) and unrolling.
mm -w 20 :option w - matrix multiply using D. Warner method of
                     blocking (size 20) and unrolling.
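The blocking idea behind the 'mm -w' options can be sketched as a plain tiled multiply. This is only an illustration of blocking, not the actual D. Warner code (which also unrolls the loops), and it assumes the block size BS divides N evenly, as 50 and 20 both divide 500.

```c
#include <stdio.h>
#define N  500
#define BS 50   /* block size, as in 'mm -w 50' */

static double a[N][N], b[N][N], c[N][N];

/* Tiled (blocked) multiply: each BS x BS tile of a, b, and c is
   reused many times while it is still resident in cache, instead of
   being evicted between uses as in the unblocked loops. */
void mm_blocked(void)
{
    int i, j, k, ii, jj, kk;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (ii = 0; ii < N; ii += BS)
        for (kk = 0; kk < N; kk += BS)
            for (jj = 0; jj < N; jj += BS)
                for (i = ii; i < ii + BS; i++)
                    for (k = kk; k < kk + BS; k++) {
                        double t = a[i][k];
                        for (j = jj; j < jj + BS; j++)
                            c[i][j] += t * b[k][j];
                    }
}
```

The best BS depends on the cache size of the machine, which is one reason the table records both 'mm -w 50' and 'mm -w 20'.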