An important performance metric is *parallel efficiency*. The parallel efficiency, *E*(*N*, *P*), for a problem of size *N* on *P* nodes is defined in the usual way [65, 92] by

$$ E(N, P) = \frac{T_{seq}(N)}{P \, T(N, P)}, $$

where *T*(*N*, *P*) is the runtime of the parallel algorithm and $T_{seq}(N)$ is the runtime of the best sequential algorithm. For dense matrix computations, an implementation is said to be *scalable* if the parallel efficiency is an increasing function of $N^2/P$, the problem size per node. The algorithms implemented in the ScaLAPACK library are scalable in this sense.
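
As a concrete illustration, here is a minimal sketch of how the definition is evaluated (the timings are hypothetical, not measurements from this chapter):

```python
def parallel_efficiency(t_seq: float, t_par: float, p: int) -> float:
    """E(N, P) = T_seq(N) / (P * T(N, P))."""
    return t_seq / (p * t_par)

# Hypothetical timings: 100 s sequentially, 3.1 s on 64 nodes.
print(parallel_efficiency(100.0, 3.1, 64))  # ~0.50
```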

Figure 5.1 shows the scalability of the ScaLAPACK implementation of the *LU* factorization on the Intel XP/S Paragon computer. The nodes of the Intel XP/S Paragon are general-purpose (GP) or multiprocessor (MP) nodes, based on the Intel i860 XP RISC processor. Each Intel i860 processor is capable of a peak performance of 50 Mflop/s. On such a processor, however, the vendor-supplied BLAS matrix-matrix multiply routine DGEMM can achieve only approximately 45 Mflop/s. The computer used to obtain the performance results presented in this chapter consisted of MP nodes configured as follows: each MP node had three Intel i860 XP processors -- two to execute application code and a third used exclusively as a message coprocessor. On such a node, the vendor-supplied BLAS matrix-matrix multiply routine DGEMM can achieve approximately 90 Mflop/s, that is, roughly twice the single-processor DGEMM rate, since two processors execute application code.

**Figure 5.1:** LU Performance per Intel XP/S MP Paragon node

Figure 5.1 shows the speed in Mflop/s per node of the ScaLAPACK *LU* factorization routine PDGETRF for different computer configurations. This figure illustrates that when the number of nodes is scaled by a constant factor, the same efficiency or speed per node is achieved for equidistant problem sizes on a logarithmic scale. In other words, maintaining a constant memory use per node allows efficiency to be maintained. (This scalability behavior is also referred to as *isoefficiency* or *isogranularity*.) In practice, however, a slight degradation of efficiency is acceptable. The ScaLAPACK driver routines, in general, feature the same scalability behavior up to a constant factor that depends on the exact number of floating-point operations and the total volume of data exchanged during the computation.
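
A short sketch of this isogranularity rule, under the assumption that memory use per node scales as $N^2/P$: quadrupling the node count doubles the matrix order, so the problem sizes that achieve equal per-node speed are equidistant on a logarithmic scale.

```python
import math

def isogranular_order(n0: int, p0: int, p: int) -> int:
    """Matrix order N that keeps per-node memory N^2/P constant
    when scaling from p0 to p nodes, so N grows like sqrt(p)."""
    return round(n0 * math.sqrt(p / p0))

# Assumed starting point (illustrative only): order 2000 on 4 nodes.
for p in (4, 16, 64, 256):
    print(p, isogranular_order(2000, 4, p))  # 2000, 4000, 8000, 16000
```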

In large dense linear algebra computations, the computation cost dominates the communication cost. In the following, the time to execute one floating-point operation by one node is denoted by $t_f$. The time to communicate a message between two nodes is approximated by a linear function of the number of items communicated. This function is the sum of the time to prepare the message for transmission ($t_m$) and the time taken by the message to traverse the network to its destination, that is, the product of its length by the time to transfer one data item ($t_v$). Alternatively, $t_m$ is also called the *latency*, since it is the time to communicate a message of zero length. On most modern interconnection networks, the order of magnitude of the latency varies between a microsecond and a millisecond.
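
In other words, the model is $T_{msg}(L) = t_m + L \, t_v$ for a message of length *L*. A minimal sketch, with assumed, purely illustrative parameter values:

```python
def message_time(length: int, t_m: float, t_v: float) -> float:
    """Linear communication model: latency plus per-item transfer time."""
    return t_m + length * t_v

# Assumed parameters: 50 us latency, 100 ns per data item.
t_m, t_v = 50e-6, 100e-9
for length in (0, 1_000, 1_000_000):
    print(length, message_time(length, t_m, t_v))
# A zero-length message costs t_m alone; long messages are
# dominated by the length * t_v transfer term.
```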

The bandwidth of the network is also referred to as its *throughput*; it is proportional to the reciprocal of $t_v$. On modern networks, the bandwidth is on the order of megabytes per second. For a scalable algorithm with $N^2/P$ held constant, one expects the performance to be proportional to *P*. The algorithms implemented in ScaLAPACK are scalable in this sense. Table 5.1 summarizes the relevant constants used in our scalability analysis.

**Table 5.1:** Variable definitions

| Variable | Definition |
|----------|------------|
| *N*      | order of the matrix |
| *P*      | number of nodes |
| $t_f$    | time per floating-point operation |
| $t_m$    | time per message (latency) |
| $t_v$    | time per data item (reciprocal of bandwidth) |
| $C_f$    | constant such that the total number of floating-point operations is $C_f N^3$ |
| $C_v$    | constant such that the total volume of data communicated is $C_v N^2$ |
| $C_m$    | constant such that the total number of messages is $C_m N$ |

Using the notation presented in Table 5.1, the execution time of the ScaLAPACK drivers can be approximated by

$$ T(N, P) = C_f \frac{N^3}{P} t_f + C_v \frac{N^2}{\sqrt{P}} t_v + C_m N t_m. \qquad (5.1) $$

The corresponding parallel efficiency can then be approximated by

$$ E(N, P) = \left( 1 + \frac{C_v \, t_v}{C_f \, t_f} \frac{\sqrt{P}}{N} + \frac{C_m \, t_m}{C_f \, t_f} \frac{P}{N^2} \right)^{-1}. \qquad (5.2) $$
Equation 5.2 illustrates, in particular, that the communication versus computation performance ratio of a distributed-memory computer significantly affects parallel efficiency. The ratio of the latency to the time per flop ($t_m / t_f$) greatly affects the parallel efficiency of small problems. The ratio of the network throughput to the flop rate ($t_f / t_v$) significantly affects the parallel efficiency of medium-sized problems. For large problems, the node flop rate ($1 / t_f$) is the dominant factor contributing to the parallel efficiency of the parallel algorithms implemented in ScaLAPACK.
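
To make these regimes concrete, the following sketch (with purely illustrative constants $C_f = C_v = C_m = 1$ and assumed machine parameters, none taken from this chapter) evaluates the model of equation 5.2 across problem sizes:

```python
def model_efficiency(n: int, p: int, t_f: float, t_m: float, t_v: float,
                     c_f: float = 1.0, c_v: float = 1.0,
                     c_m: float = 1.0) -> float:
    """Equation 5.2: E(N, P) = 1 / (1 + bandwidth term + latency term)."""
    bandwidth_term = (c_v * t_v) / (c_f * t_f) * p**0.5 / n
    latency_term = (c_m * t_m) / (c_f * t_f) * p / n**2
    return 1.0 / (1.0 + bandwidth_term + latency_term)

# Assumed parameters: t_f = 20 ns (50 Mflop/s), t_m = 100 us, t_v = 0.8 us.
t_f, t_m, t_v = 2e-8, 1e-4, 8e-7
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(model_efficiency(n, 64, t_f, t_m, t_v), 3))
# Small N: the latency (t_m / t_f) term dominates; medium N: the
# bandwidth (t_v / t_f) term; large N: E -> 1 and performance
# approaches P / t_f.
```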
