The PDSYRK routine performs rank- updates
on an
symmetric matrix
with an
column of blocks
.
After broadcasting
rowwise and transposing it,
each process updates its local portion of
with its own copy
of
and
.
The update is complicated by the fact that the globally lower triangular
matrix
is not necessarily stored
in the lower triangular form in the local processes. For details
see [8].
The simplest way to do this is to repeatedly update one column of blocks
of
. However, if the block size is small,
this updating process will not be efficient.
It is more efficient to update several blocks of columns
at a time.
The PBLAS routine, PDSYRK efficiently updates
by combining
several blocks of columns at a time. For details, see [8].
Figure 10: Performance of the Cholesky factorization on the Intel iPSC/860,
Delta, and Paragon.
The effect of the block size on the performance of the Cholesky factorization
is shown in Figure 9 on and
processors of the Intel Delta.
The best performance was obtained at the block size of
, but
relatively good performance could be expected with the block size of
, since the routine updates multiple column panels at a time.
Figure 10 shows the performance of the
Cholesky factorization routine.
The best performance was attained with the aspect ratio of .
The routine ran at 1.8 Gflops for
on the iPSC/860;
10.5 Gflops for
on the Delta; and
16.9 Gflops for
on the Paragon.
Since it requires fewer floating point operations (
) than
the other factorizations, it is not surprising
that its flop rate is relatively poor.
If is not positive definite, the Cholesky factorization should be
terminated in the middle of the computation.
As outlined in Section 3.3,
a process
computes the Cholesky factor
from
.
After computing
, process
broadcasts a flag
to all other processes to stop the computation
if
is not positive definite.
If
is guaranteed to be positive definite,
the process of broadcasting the flag can be skipped,
leading to a corresponding increase in performance.