The current model implementation of the PBLAS
assumes that matrix operands are distributed
according to the block-cyclic decomposition
scheme. This allows the routines to achieve
scalability and well-balanced computations,
and to minimize synchronization costs. It is not
the purpose of this paper to describe the data
mapping onto the processes in detail; for further
details see [13, 7]. Let us simply
say that the set of processes is mapped to a
virtual mesh, where every process is naturally
identified by its coordinates in this
grid. This virtual machine is in fact part of a
larger object defined by the BLACS and called a
context [14].
Figure 1: A matrix decomposed into
blocks mapped onto a
process grid.
An M_ by N_ matrix operand is first
decomposed into MB_ by NB_ blocks
starting at its upper left corner. These blocks
are then uniformly distributed across the process
mesh. Thus every process owns a collection of
blocks, which are locally and contiguously stored
in a two-dimensional ``column-major'' array.
We present in Fig. 1 the mapping
of a 5 by 5 matrix partitioned into
2 by 2 blocks onto a 2 by 2 process grid,
i.e., M_ = N_ = 5 and MB_ = NB_ = 2.
The local entries of every matrix column are
contiguously stored in the processes' memories.
It follows that a general M_ by N_ distributed matrix is defined by its dimensions, the size of the elementary MB_ by NB_ block used for its decomposition, the coordinates {RSRC_, CSRC_} of the process holding the first matrix entry in its local memory, and the BLACS context (CTXT_) in which the matrix is defined. Finally, a local leading dimension LLD_ is associated with the local memory address pointing to the data structure used for the local storage of this distributed matrix. In Fig. 1, we choose RSRC_ = CSRC_ = 0 for illustration purposes. In addition, the local arrays in process row 0 must have a leading dimension LLD_ greater than or equal to 3, and those in process row 1 a leading dimension greater than or equal to 2.
These pieces of information are grouped together into a single 8-element integer array, called the descriptor, DESC_. Such a descriptor is associated with each distributed matrix. The entries of the descriptor uniquely determine the mapping of the matrix entries onto the local processes' memories. Moreover, with the exception of the local leading dimension, the descriptor entries are global values characterizing the distributed matrix operand. Since vectors may be seen as a special case of distributed matrices or as proper submatrices, the scheme just defined encompasses their description as well.
For distributed symmetric and Hermitian matrices, only the upper (UPLO='U') triangle or the lower (UPLO='L') triangle is stored. For triangular distributed matrices, the argument UPLO serves to define whether the matrix is upper (UPLO='U') or lower (UPLO='L') triangular.
For a distributed Hermitian matrix the imaginary
parts of the diagonal elements are zero and thus
the imaginary parts of the corresponding FORTRAN
or C local arrays need not be set, but are assumed
to be zero. In the PxHER and
PxHER2 routines, these imaginary parts
will be set to zero on return, except when
alpha is equal to zero, in which case the routines exit
immediately. Similarly, in the PxHERK
and PxHER2K routines the imaginary parts
of the diagonal elements will also be set to zero
on return, except when beta is equal to one and
alpha or k is equal to zero, in which case
the routines exit immediately.