At the lowest level, the efficiency of the PBLAS is determined by the local performance of the BLAS and the BLACS. In addition, depending on the shape of their input and output distributed matrix operands, the PBLAS select the algorithm that minimizes data transfer across the process grid. This relatively simple selection process is transparent to the user and ensures high efficiency independent of the actual computation performed.
For example, there are algorithms for matrix-matrix products [22][19][10], such as PUMMA, that are much more efficient when the input and output matrices are of equal size. Some of these algorithms require a very large amount of workspace, which makes them impractical for library purposes. However, a simple implementation of common matrix multiplication operations has recently been shown to be highly efficient and scalable [26]. These algorithms, called SUMMA, have the advantage of requiring much less workspace than PUMMA. They have, in some sense, already been implemented as internal routines of the PBLAS [9]. Therefore, this work [26] will allow us to improve and generalize the model implementation. However, when one of the matrix operands is ``thin'' or ``fat'', the current model implementation employs different algorithms that exchange fewer messages on the network overall and are also usually much more economical in terms of workspace.
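The idea behind a SUMMA-style product can be illustrated with a small serial simulation. The sketch below is not the PBLAS implementation: it assumes a plain block distribution (rather than a block-cyclic one), replaces the row and column broadcasts by ordinary list copies, and uses the hypothetical names `block_ranges`, `summa_simulated`, and `panel`. It only shows that C = A B can be accumulated as a sequence of panel updates, so that each process needs workspace for one panel of A and one panel of B at a time.

```python
def block_ranges(n, parts):
    """Split range(n) into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def summa_simulated(A, B, grid_rows, grid_cols, panel=2):
    """Serially simulate a SUMMA-style product C = A * B on a process grid.

    A is m x k and B is k x n, both given as lists of rows. Process
    (pr, pc) is assumed to own the block of C with rows row_owner[pr]
    and columns col_owner[pc] (an assumption made for this sketch).
    """
    m, k, n = len(A), len(B), len(B[0])
    row_owner = block_ranges(m, grid_rows)
    col_owner = block_ranges(n, grid_cols)
    C = [[0] * n for _ in range(m)]

    # Sweep over the inner dimension one panel at a time.
    for k0 in range(0, k, panel):
        ks = range(k0, min(k0 + panel, k))
        for pr in range(grid_rows):
            # Column panel of A that would be broadcast along process row pr.
            a_panel = [[A[i][kk] for kk in ks] for i in row_owner[pr]]
            for pc in range(grid_cols):
                # Row panel of B that would be broadcast along process column pc.
                b_panel = [[B[kk][j] for j in col_owner[pc]] for kk in ks]
                # Local update of the block of C owned by process (pr, pc).
                for li, i in enumerate(row_owner[pr]):
                    for lj, j in enumerate(col_owner[pc]):
                        C[i][j] += sum(
                            a_panel[li][t] * b_panel[t][lj]
                            for t in range(len(ks))
                        )
    return C
```

In this sketch the per-process workspace is proportional to `panel` times the sum of the local row and column counts, independent of the inner dimension k.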
The current model implementation of the Level 3 PBLAS decides which algorithm to use depending on the shape of the matrix operands. This decision, however, could also be based on the amount of memory available during the execution, the local BLAS performance, and machine constants such as the latency and bandwidth of the network [4].
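A shape-based decision of this kind might be sketched as follows. This is an illustrative heuristic only, not the rule used by the model implementation: it assumes that the operand with the most entries should stay in place while the other two are communicated, and it pairs that choice with the standard latency-bandwidth estimate of communication time. The names `choose_stationary_operand` and `message_time` and their parameters are hypothetical.

```python
def choose_stationary_operand(m, n, k):
    """Pick which operand of C (m x n) = A (m x k) * B (k x n) stays in place.

    Assumed heuristic: keep the operand with the most entries where it is
    and move the other two. Ties favor C, which corresponds to a
    SUMMA-like update for equally sized matrices.
    """
    sizes = {"C": m * n, "A": m * k, "B": k * n}
    return max(sizes, key=sizes.get)


def message_time(num_messages, num_words, latency, time_per_word):
    """Latency-bandwidth estimate: each message pays `latency`, each word `time_per_word`."""
    return num_messages * latency + num_words * time_per_word
```

For instance, `choose_stationary_operand(1000, 8, 1000)` selects `"A"`, since B and C are both thin in that case. A fuller selection routine could compare `message_time` estimates for each candidate algorithm, given machine constants measured on the target system.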
Internally, the PBLAS currently rely on routines that require certain alignment properties to be satisfied [9]. These properties have been chosen so that maximum efficiency can be obtained on these restricted operations. Consequently, when redistribution or re-alignment of input or output data has to be performed, some performance will be lost. So far, the PBLAS do not perform such redistribution or alignment of the input/output matrix or vector operands when it is necessary. However, the PBLAS routines would provide greater flexibility and would be more similar in functionality to the BLAS if these operations were provided. The question of making the PBLAS more flexible remains open, and its answer largely depends on the needs of the user community.