Parallel Efficiency

An important performance metric is parallel efficiency. The parallel efficiency E(N, p) for a problem of size N on p processors is defined in the usual way [21, 28] by

E(N, p) = T_seq(N) / (p T(N, p)),

where T(N, p) is the runtime of the parallel algorithm, and T_seq(N) is the runtime of the best sequential algorithm. For dense matrix computations, an implementation is said to be scalable if the parallel efficiency is an increasing function of N^2/p, the problem size per processor. The algorithms implemented in the ScaLAPACK library are scalable in this sense.
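As a concrete illustration, the definition can be evaluated directly from measured runtimes. The timing values in this sketch are made-up placeholders, not measurements from the text:

```python
# Parallel efficiency E(N, p) = T_seq(N) / (p * T(N, p)).
# The timings below are illustrative placeholders, not measured data.

def parallel_efficiency(t_seq: float, t_par: float, p: int) -> float:
    """Efficiency of a p-processor run with parallel runtime t_par,
    relative to the best sequential runtime t_seq."""
    return t_seq / (p * t_par)

# Example: a problem that runs in 100 s sequentially and 30 s on 4 processors.
e = parallel_efficiency(100.0, 30.0, 4)
print(round(e, 3))  # 0.833
```

An efficiency of 1 would mean perfect speedup; values below 1 reflect communication overhead and load imbalance.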

Figure 1 shows the scalability of the ScaLAPACK implementation of the LU factorization on the Intel XP/S MP Paragon. The figure shows the speed in Mflop/s of the ScaLAPACK LU factorization routine PDGETRF for different machine configurations. When the number of nodes is scaled by a constant factor (2 in the figure), the same efficiency, or speed per node, is achieved for problem sizes that are equidistant on a logarithmic scale. In other words, maintaining a constant memory use per node allows efficiency to be maintained; in practice, however, a slight degradation is acceptable. The ScaLAPACK driver routines generally exhibit the same scalability behavior, up to a constant factor that depends on the exact number of floating-point operations and the total volume of data exchanged during the computation.

Figure 1: LU Performance per Intel Paragon node

The performance of the algorithms implemented in ScaLAPACK is also measured in megaflops per second (Mflop/s) or gigaflops per second (Gflop/s). This measure is appropriate for large dense linear algebra computations, since the computation cost dominates the communication cost. In the following, the time to execute one floating-point operation by one processor is denoted t_f. The time to communicate a message between two processors is approximated by a linear function of the number of items communicated: the sum of the time to prepare the message for transmission (t_m) and the time taken by the message to traverse the network to its destination, that is, the product of its length and the time to transfer one data item (t_v). The term t_m is also called the latency, since it is the time to communicate a message of zero length. On most modern interconnection networks, the order of magnitude of the latency varies between a microsecond and a millisecond.
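The linear communication model just described can be sketched as follows. The latency and per-item constants are illustrative assumptions, not measurements of any particular network:

```python
# Linear model for point-to-point message time: T(L) = t_m + L * t_v,
# where t_m is the latency (startup cost) and t_v the time per data item.
# Both constants below are illustrative assumptions, not measured values.

T_M = 50e-6   # latency: 50 microseconds per message (assumed)
T_V = 0.1e-6  # 0.1 microseconds per data item (assumed)

def message_time(n_items: int) -> float:
    """Estimated time to communicate a message of n_items data items."""
    return T_M + n_items * T_V

# A zero-length message costs exactly the latency:
print(message_time(0) == T_M)  # True
# For long messages, the per-item (bandwidth) term dominates the latency term:
print(message_time(10_000) > 10 * T_M)  # True
```

The two regimes of this model are what make latency dominant for short messages and throughput dominant for long ones.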

The bandwidth of the network is also referred to as its throughput; it is proportional to the reciprocal of t_v. On modern networks, the order of magnitude of the bandwidth is a megabyte per second. For a scalable algorithm with N^2/p held fixed, one expects the performance to be proportional to p, and the algorithms implemented in ScaLAPACK are scalable in this sense as well. The execution time of the ScaLAPACK drivers can thus be approximated by

T(N, p) = C_f (N^3 / p) t_f + C_v (N^2 / sqrt(p)) t_v + C_m N t_m,

where C_f, C_v, and C_m are constants that depend on the particular routine. The corresponding parallel efficiency can then be approximated by

E(N, p) = 1 / (1 + (C_v / C_f) (sqrt(p) / N) (t_v / t_f) + (C_m / C_f) (p / N^2) (t_m / t_f)).   (1)

Equation 1 illustrates, in particular, that the communication-to-computation performance ratio of a distributed-memory concurrent computer significantly affects parallel efficiency. The ratio of the latency to the time per flop, t_m / t_f, greatly affects the parallel efficiency of small problems. The ratio of the network throughput to the flop rate, t_f / t_v, significantly affects the parallel efficiency of medium-sized problems. For large problems, the processor flop rate 1 / t_f is the dominant factor contributing to the parallel efficiency of the parallel algorithms implemented in ScaLAPACK.
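The efficiency approximation above can be evaluated numerically to see the scalability property in action. In the sketch below, the constants C_f, C_v, C_m and the machine parameters are illustrative assumptions, not values taken from the text:

```python
from math import sqrt

# Parallel efficiency model:
# E(N, p) = 1 / (1 + (C_v/C_f)*(sqrt(p)/N)*(t_v/t_f)
#                  + (C_m/C_f)*(p/N**2)*(t_m/t_f))
# All constants below are illustrative assumptions, not measured values.

C_F, C_V, C_M = 1.0, 1.0, 1.0
T_F = 10e-9   # time per flop (assumed)
T_V = 0.1e-6  # time per data item (assumed)
T_M = 50e-6   # latency per message (assumed)

def efficiency(n: int, p: int) -> float:
    """Approximate parallel efficiency for problem size n on p processors."""
    bw_term = (C_V / C_F) * (sqrt(p) / n) * (T_V / T_F)
    lat_term = (C_M / C_F) * (p / n**2) * (T_M / T_F)
    return 1.0 / (1.0 + bw_term + lat_term)

# Holding N**2/p fixed keeps the efficiency constant under this model:
for n, p in [(1000, 4), (2000, 16), (4000, 64)]:
    print(n, p, round(efficiency(n, p), 3))  # efficiency is 0.962 in each case
```

Doubling N while quadrupling p leaves both correction terms unchanged, which is exactly the memory-per-node scaling argument made for Figure 1.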

Jack Dongarra
Sat Feb 1 08:18:10 EST 1997