The ScaLAPACK routines have been specifically designed to allow for an even distribution of the computational load and thus to achieve the highest possible performance. Therefore, the overall execution time is strongly related to the rate of floating-point operations per second (flop/s) that the slowest processor in the machine configuration can achieve.
This behavior can easily be observed if some factor slows a particular processor of the system. Consider, for instance, a ten-processor machine configuration. Suppose that nine of the processors can deliver a peak performance of 100 megaflop/s (Mflop/s) but that the tenth processor can achieve only 20 Mflop/s. (On a homogeneous system, different versions of the operating system and/or memory capacities, I/O traffic, or simply another user's program can easily cause such a performance degradation.) On such a ten-processor machine, the overall ScaLAPACK peak performance is thus limited to 200 Mflop/s, whereas the performance of the machine with nine 100-Mflop/s processors is 900 Mflop/s. Specifically, the most heavily loaded processor controls execution time.
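This arithmetic can be sketched in a few lines; the rates and the helper name below are illustrative, not part of ScaLAPACK:

```python
# Rates from the text's example: nine processors at 100 Mflop/s
# and a tenth at 20 Mflop/s.
rates = [100] * 9 + [20]

def aggregate_peak(rates):
    """With an even data distribution every processor does the same
    amount of work, so the slowest processor sets the pace: the
    machine runs at (number of processors) * (slowest rate)."""
    return len(rates) * min(rates)

print(aggregate_peak(rates))      # ten processors: 10 * 20 = 200 Mflop/s
print(aggregate_peak(rates[:9]))  # nine fast processors: 9 * 100 = 900 Mflop/s
```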
The implications are clear. If a user's code is running on nine unloaded processors and one processor with a load factor of 5, one can observe no more than a factor of 2 speedup.
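The bound on speedup follows from the same reasoning; the following sketch (the function name is illustrative) makes the calculation explicit:

```python
def speedup_bound(load_factors):
    """Each of p processors gets 1/p of the work; a processor with
    load factor f takes f times longer on its share, so elapsed time
    is max(f)/p of the one-processor time and speedup <= p / max(f)."""
    p = len(load_factors)
    return p / max(load_factors)

# Nine unloaded processors (load factor 1) and one with load factor 5.
print(speedup_bound([1] * 9 + [5]))  # 10 / 5 = 2.0
```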
Similarly, it is possible on some systems to spawn multiple processes on a single processor. In such a case, performance is limited by the slowest processor, presumably the one with the most processes. For example, if 10 processes are spawned on 9 identical processors, the speedup is limited to 5.
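The oversubscription case can be checked with a small sketch (the placement and helper name below are illustrative assumptions):

```python
from collections import Counter

def speedup_bound(n_processes, processor_of):
    """processor_of maps each process index to its processor. The
    busiest processor time-slices its k processes, so each of them
    runs k times slower than on a dedicated processor, and the
    overall speedup is at most n_processes / k."""
    k = max(Counter(processor_of).values())
    return n_processes / k

# 10 processes assigned round-robin to 9 processors:
# one processor ends up running 2 processes.
placement = [i % 9 for i in range(10)]
print(speedup_bound(10, placement))  # 10 / 2 = 5.0
```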
The load of the machine, in addition to the direct effect of offering a program only a portion of the total cycles, can have several indirect effects. If each processor is individually scheduled, performance can be arbitrarily poor because significant progress is possible only when all processes are concurrently scheduled. A loaded machine may also cause one's data to be swapped out to disk, which can greatly reduce peak performance.