Band matrix storage in ScaLAPACK follows the conventions of LAPACK. In LAPACK, an m-by-n band matrix with kl subdiagonals and ku superdiagonals may be stored compactly in a two-dimensional array with at least kl+ku+1 rows and n columns. Columns of the matrix are stored in corresponding columns of the array, and diagonals of the matrix are stored in rows of the array. The same array layout is specified in ScaLAPACK except the array is divided across the processes.
The optimal algorithm for solving banded (and as a special case, tridiagonal) linear systems depends on a variety of parameters, especially the bandwidth. Currently, only algorithms designed for the case are implemented. The general family of algorithms of this sort is variously known as partitioning methods, domain decomposition methods, or divide and conquer methods. A prototypical algorithm of this form is described in [16].
Partitioning methods utilize large-grained task parallelism, in contrast with the algorithms used in dense ScaLAPACK which are fine-grained. As such, the appropriate decomposition for these algorithms differs from the dense square matrix case, and a one-dimensional blocked decomposition is used.
The bottleneck that has traditionally limited the effectiveness of partitioning methods is the solution of the reduced system that represents the interaction of the systems stored on each process. Many implementations have adopted what are essentially sequential algorithms for the reduced system, but this detracts from scalability. For ScaLAPACK we have implemented a block odd-even reduction algorithm whose running time scales optimally.