Achieving High Performance on a Distributed-Memory Computer
Use an efficient data distribution.
- Block size (i.e., MB, NB) = 64.
- Square processor grid, Pr = Pc.
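To make the two distribution guidelines above concrete, here is a minimal Python sketch, not part of any library. The helper names process_grid and numroc are hypothetical; numroc is written to follow the logic of ScaLAPACK's NUMROC routine, assuming a one-dimensional block-cyclic distribution whose first block lives on process 0 by default. It picks the most nearly square Pr x Pc grid for a given process count and reports how many rows of a matrix each process row would own with block size 64.

```python
import math

def process_grid(nprocs):
    """Return (Pr, Pc) with Pr * Pc == nprocs and Pr <= Pc as close as possible.

    Hypothetical helper: a perfect square gives Pr == Pc.
    """
    pr = math.isqrt(nprocs)
    while nprocs % pr != 0:  # walk down to the nearest divisor
        pr -= 1
    return pr, nprocs // pr

def numroc(n, nb, iproc, nprocs, isrcproc=0):
    """Rows (or columns) of an n-long dimension owned by process iproc.

    Assumes a block-cyclic distribution with block size nb over nprocs
    processes, starting at process isrcproc.
    """
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                  # number of full blocks
    count = (nblocks // nprocs) * nb   # full rounds every process receives
    extra = nblocks % nprocs           # leftover full blocks
    if mydist < extra:
        count += nb                    # one extra full block
    elif mydist == extra:
        count += n % nb                # the final partial block
    return count

if __name__ == "__main__":
    MB = 64                            # block size from the guideline above
    pr, pc = process_grid(16)
    print(f"grid: {pr} x {pc}")
    for p in range(pr):
        print(f"process row {p}: {numroc(1000, MB, p, pr)} local rows")
```

When the process count is not a perfect square, this sketch falls back to the closest factorization with Pr <= Pc; that fallback is an assumption of the example, not a recommendation from the text.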
Use an efficient machine-specific BLAS (not the Fortran 77 reference implementation from netlib) and a nondebug build of the BLACS (BLACSDBGLVL=0 in Bmake.inc).