J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley
May 1995
At first glance, the apparent simplicity of the sequential BLAS and the regularity of the data structures involved in dense linear algebra computations suggest that implementing an equivalent set of parallel routines that are portable, efficient, and easy to use should be relatively straightforward.
However, when these routines are actually coded, the problem becomes much more complex because of difficulties that do not arise in serial computing. First, many different parallel computer architectures are in use. It is therefore natural to choose a virtual machine topology that is convenient for dense linear algebra computations and to map this virtual machine onto the existing physical topologies. Second, the selected data distribution scheme must ensure good load balance in order to guarantee performance and scalability. Finally, for reasons of ease of use and software reusability, the interface of the top-level routines must closely resemble the sequential BLAS interface, yet remain flexible enough to take advantage of efficient parallel algorithmic techniques such as the overlapping of computation with communication, and pipelining.
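To make the load-balance requirement concrete, the block-cyclic distribution over a two-dimensional process grid, the scheme adopted by ScaLAPACK, spreads the blocks of a dense matrix evenly across processes. The following C sketch, with illustrative helper names that are not part of any library interface, shows how a one-dimensional block-cyclic mapping assigns a global index to an owning process and a local position; applying the same mapping independently to rows and columns yields the two-dimensional distribution.

```c
/* Minimal sketch of a one-dimensional block-cyclic mapping, the building
 * block of the two-dimensional distribution of dense matrices.
 * Helper names (owner_1d, local_index_1d) are illustrative only.
 *   g  : 0-based global index
 *   nb : block size
 *   p  : number of processes in this grid dimension
 */
#include <stdio.h>

static int owner_1d(int g, int nb, int p)
{
    return (g / nb) % p;                  /* process owning global index g */
}

static int local_index_1d(int g, int nb, int p)
{
    return (g / (nb * p)) * nb + g % nb;  /* position of g in that process */
}

int main(void)
{
    int nb = 2, p = 3;                    /* block size 2, 3 processes */
    for (int g = 0; g < 12; ++g)
        printf("global %2d -> process %d, local %d\n",
               g, owner_1d(g, nb, p), local_index_1d(g, nb, p));
    return 0;
}
```

Because consecutive blocks are dealt out to processes in round-robin fashion, every process receives roughly the same amount of data and, for the triangular and trailing-submatrix computations typical of dense factorizations, roughly the same amount of work.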
This paper presents a set of solutions to these problems that has been adopted in the design and implementation of the Parallel Basic Linear Algebra Subprograms (PBLAS). These subprograms can in turn be used to develop parallel libraries such as ScaLAPACK for a wide variety of distributed-memory MIMD computers.