The current model implementation of the PBLAS assumes that the matrix operands are distributed according to the block-cyclic decomposition scheme. This allows the routines to achieve scalability and well-balanced computations, and to minimize synchronization costs. It is not the object of this paper to describe the data mapping onto the processes in detail; for further details see [13, 7]. Let us simply say that the set of processes is mapped to a virtual mesh, where every process is naturally identified by its coordinates in this grid. This virtual machine is in fact part of a larger object defined by the BLACS and called a context [14].

**Figure 1:** A matrix decomposed into
blocks mapped onto a process grid.

An `M_` by `N_` matrix operand is first
decomposed into `MB_` by `NB_` blocks
starting at its upper left corner. These blocks
are then uniformly distributed across the process
mesh. Thus every process owns a collection of
blocks, which are locally and contiguously stored
in a two-dimensional "column major" array.
We present in Fig. 1 the mapping
of a 5×5 matrix partitioned
into 2×2 blocks onto
a 2×2 process grid, i.e.,
`M_=N_=5` and `MB_=NB_=2`.
The local entries of every matrix column are
contiguously stored in the processes' memories.
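
For illustration, the process coordinate and local index corresponding to a global row index can be computed directly from the block size and the number of process rows; the column mapping is analogous. The following C sketch is a hypothetical helper written for this paper's example only (zero-based indices), not a PBLAS routine; it reproduces the row mapping of the Fig. 1 example.

```c
#include <stdio.h>

/*
 * Illustrative sketch (not a PBLAS routine): map a global row index ig
 * to the process-row coordinate that owns it and to its local row index,
 * assuming a block-cyclic distribution with block size mb over nprow
 * process rows, the first block residing in process row rsrc.
 * All indices and coordinates are zero-based.
 */
static void global_to_local(int ig, int mb, int nprow, int rsrc,
                            int *prow, int *il)
{
    int block = ig / mb;                     /* global block index   */
    *prow = (block + rsrc) % nprow;          /* owning process row   */
    *il   = (block / nprow) * mb + ig % mb;  /* local row index      */
}

int main(void)
{
    /* The 5x5 example of Fig. 1: MB_ = 2, 2 process rows, RSRC_ = 0. */
    for (int ig = 0; ig < 5; ++ig) {
        int prow, il;
        global_to_local(ig, 2, 2, 0, &prow, &il);
        printf("global row %d -> process row %d, local row %d\n",
               ig, prow, il);
    }
    return 0;
}
```

With these parameters, process row 0 owns global rows 0, 1, and 4 (three local rows) and process row 1 owns global rows 2 and 3 (two local rows), which is exactly the situation depicted in Fig. 1.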

It follows that a general `M_` by `N_`
distributed matrix is defined by its dimensions,
the size of the elementary `MB_` by `NB_`
block used for its decomposition, the coordinates
`{RSRC_,CSRC_}` of the process holding the first
matrix entry in its local memory, and the BLACS
context (`CTXT_`) in which this matrix is defined.
Finally, a local leading dimension `LLD_` is
associated with the local memory address pointing
to the data structure used for the local storage of
this distributed matrix. In Fig. 1,
we choose `RSRC_=CSRC_=0` for illustration
purposes. In addition, the local arrays in process
row 0 must have a leading dimension `LLD_` of at
least 3, and those in process row 1 a leading
dimension of at least 2.
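
The number of rows (or columns) owned by a given process, and hence a lower bound on `LLD_`, follows the same block-cyclic rule. ScaLAPACK provides a tool routine, `NUMROC`, for this computation; the C sketch below is an illustrative reimplementation (with zero-based process coordinates) that yields the bounds 3 and 2 quoted above for the 5×5 example.

```c
/*
 * Illustrative NUMROC-style computation: number of rows (or columns) of
 * a distributed matrix owned by the process at coordinate iproc, for a
 * global extent n, block size nb, nprocs processes in that grid
 * dimension, and first block held by process isrcproc.  Zero-based.
 */
static int num_local(int n, int nb, int iproc, int isrcproc, int nprocs)
{
    int mydist  = (nprocs + iproc - isrcproc) % nprocs;
    int nblocks = n / nb;                   /* number of full blocks      */
    int nlocal  = (nblocks / nprocs) * nb;  /* full cycles of blocks      */
    int extra   = nblocks % nprocs;         /* leftover full blocks       */

    if (mydist < extra)
        nlocal += nb;                       /* one extra full block       */
    else if (mydist == extra)
        nlocal += n % nb;                   /* the trailing partial block */
    return nlocal;
}

/* For the 5x5 example of Fig. 1: num_local(5, 2, 0, 0, 2) == 3 and
 * num_local(5, 2, 1, 0, 2) == 2, matching the LLD_ bounds above. */
```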

These pieces of information are grouped together
into a single 8-element integer array, called the
descriptor, `DESC_`. Such a descriptor is
associated with each distributed matrix. The
entries of the descriptor uniquely determine the
mapping of the matrix entries onto the local
processes' memories. Moreover, with the exception
of the local leading dimension, the descriptor
entries are global values characterizing the
distributed matrix operand. Since vectors may
be seen as a special case of distributed matrices
or proper submatrices, the larger scheme just
defined encompasses their description as well.
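
As a purely illustrative sketch of how these quantities might be grouped, the C fragment below fills an 8-entry descriptor. The index constants are hypothetical and chosen only for this example; the actual ordering of the descriptor entries is fixed by the PBLAS, not by this sketch.

```c
/*
 * Illustrative sketch: grouping the eight descriptor quantities of a
 * distributed matrix.  The entry ordering below is an assumption made
 * for this example only.
 */
enum { M_ = 0, N_ = 1, MB_ = 2, NB_ = 3,
       RSRC_ = 4, CSRC_ = 5, CTXT_ = 6, LLD_ = 7 };

void build_desc(int desc[8], int m, int n, int mb, int nb,
                int rsrc, int csrc, int ctxt, int lld)
{
    desc[M_]    = m;     /* global number of rows                 */
    desc[N_]    = n;     /* global number of columns              */
    desc[MB_]   = mb;    /* row block size                        */
    desc[NB_]   = nb;    /* column block size                     */
    desc[RSRC_] = rsrc;  /* process row owning the first entry    */
    desc[CSRC_] = csrc;  /* process column owning the first entry */
    desc[CTXT_] = ctxt;  /* BLACS context                         */
    desc[LLD_]  = lld;   /* local leading dimension               */
}
```

In process row 0 of the Fig. 1 example, for instance, one could call `build_desc(desca, 5, 5, 2, 2, 0, 0, ctxt, 3)`, where `ctxt` is the BLACS context handle.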

For distributed symmetric and Hermitian matrices,
only the upper (`UPLO='U'`) triangle or the
lower (`UPLO='L'`) triangle is stored. For
triangular distributed matrices, the argument
`UPLO` serves to define whether the matrix
is upper (`UPLO='U'`) or lower
(`UPLO='L'`) triangular.

For a distributed Hermitian matrix the imaginary
parts of the diagonal elements are zero and thus
the imaginary parts of the corresponding FORTRAN
or C local arrays need not be set, but are assumed
to be zero. In the `PHER` and
`PHER2` routines, these imaginary parts
will be set to zero on return, except when
`alpha` is equal to zero, in which case the routines exit
immediately. Similarly, in the `PHERK`
and `PHER2K` routines the imaginary parts
of the diagonal elements will also be set to zero
on return, except when `beta` is equal to one and
`alpha` or `K` is equal to zero, in which case
the routines exit immediately.
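
To make this diagonal convention concrete, the following C sketch (not part of the PBLAS) zeroes the imaginary parts of the locally owned diagonal entries of a block-cyclically distributed Hermitian matrix. Each complex entry is assumed stored as a (real, imaginary) pair of doubles in a column-major local array; all names and the zero-based indexing are assumptions of this example.

```c
/*
 * Illustrative sketch: zero the imaginary parts of the locally owned
 * diagonal entries of a block-cyclically distributed Hermitian matrix.
 * The local array `a` holds mloc-by-nloc complex entries in column-major
 * order with leading dimension lld, each entry stored as a
 * (real, imaginary) pair of doubles.  Zero-based indices throughout.
 */
static void zero_diag_imag(double *a, int lld, int mloc, int nloc,
                           int mb, int nb, int nprow, int npcol,
                           int myrow, int mycol, int rsrc, int csrc)
{
    for (int jl = 0; jl < nloc; ++jl) {
        /* Global column index corresponding to local column jl. */
        int jg = npcol * nb * (jl / nb) + jl % nb
               + ((npcol + mycol - csrc) % npcol) * nb;
        for (int il = 0; il < mloc; ++il) {
            /* Global row index corresponding to local row il. */
            int ig = nprow * mb * (il / mb) + il % mb
                   + ((nprow + myrow - rsrc) % nprow) * mb;
            if (ig == jg)
                a[2 * (jl * lld + il) + 1] = 0.0;  /* imaginary part */
        }
    }
}
```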
