We suggest the following approach to obtain high performance with ScaLAPACK codes:
The standard data distribution will typically achieve 25-50% of the peak performance possible, depending in part on how many processors are ignored, i.e., the difference between the number of processors available and the number actually used in the process grid. We do not recommend experimenting with different data distributions until acceptable (or nearly acceptable) performance has been achieved. If each individual node requires a block size larger than 64 to achieve near-peak performance on local matrix-matrix multiply, the block size may have to be increased. This step is unlikely to be necessary, however, unless each node of the computer is a shared-memory multiprocessor with more than four processors.
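To make the notion of "ignored" processors concrete, the following minimal sketch counts how many processes are left idle when the largest square process grid is fitted into the available processes. The square-grid rule and the helper name `square_grid` are assumptions made for illustration; they are not taken from this section and are not part of the ScaLAPACK or BLACS API.

```python
import math


def square_grid(p):
    """Fit the largest square process grid into p processes.

    Hypothetical helper for illustration only; returns
    (grid_rows, grid_cols, ignored_processes).
    """
    if p < 1:
        raise ValueError("need at least one process")
    side = math.isqrt(p)          # largest integer with side * side <= p
    ignored = p - side * side     # processes that take no part in the grid
    return side, side, ignored


if __name__ == "__main__":
    for p in (8, 12, 16):
        rows, cols, ignored = square_grid(p)
        print(f"P={p}: grid {rows} x {cols}, ignored {ignored}")
```

With 12 processes, for example, this rule gives a 3 x 3 grid and leaves 3 processes idle, so a quarter of the machine does no work regardless of how well the remaining nodes perform.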