
Parallel Efficiency

   

An important performance metric is parallel efficiency. Parallel efficiency, E(N,P), for a problem of size N on P nodes is defined in the usual way [65, 92] by

\begin{displaymath}
E(N,P) \;=\; \frac{1}{P}\,\frac{T_{seq}(N)}{T(N,P)} ,
\end{displaymath}

where T(N,P) is the runtime of the parallel algorithm and T_seq(N) is the runtime of the best sequential algorithm. For dense matrix computations, an implementation is said to be scalable if the parallel efficiency is an increasing function of N/P, the problem size per node. The algorithms implemented in the ScaLAPACK library are scalable in this sense.
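
As a purely illustrative sketch of this definition (the timings, node count, and function name below are hypothetical, not ScaLAPACK measurements), the efficiency of a run can be computed directly from measured times:

    # Parallel efficiency as defined above: E(N,P) = T_seq(N) / (P * T(N,P)).
    # The timings used here are made-up example values.
    def parallel_efficiency(t_seq, t_parallel, p):
        return t_seq / (p * t_parallel)

    # Hypothetical run: 100 s sequentially, 3.1 s on 36 nodes.
    print(parallel_efficiency(100.0, 3.1, 36))   # approximately 0.90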

Figure 5.1 shows the scalability of the ScaLAPACK implementation of the LU factorization on the Intel XP/S Paragon computer. The nodes of the Intel XP/S Paragon are either general-purpose (GP) or multiprocessor (MP) nodes, based on the Intel i860 XP RISC processor. Each Intel i860 processor is capable of a peak performance of 50 Mflop/s. On such a processor, however, the vendor-supplied BLAS matrix-matrix multiply routine DGEMM can achieve only approximately 45 Mflop/s. The computer used to obtain the performance results presented in this chapter consisted of MP nodes, each with three Intel i860 XP processors: two to execute application code and a third used exclusively as a message coprocessor. On such a node, the vendor-supplied BLAS matrix-matrix multiply routine DGEMM can achieve approximately 90 Mflop/s.

Figure 5.1: LU Performance per Intel XP/S MP Paragon node

Figure 5.1 shows the speed in Mflop/s per node of the ScaLAPACK LU factorization routine PDGETRF for different computer configurations. The figure illustrates that when the number of nodes is scaled by a constant factor, the same efficiency, or speed per node, is achieved for problem sizes that are equidistant on a logarithmic scale. In other words, maintaining a constant memory use per node allows the efficiency to be maintained. (This scalability behavior is also referred to as isoefficiency, or isogranularity.) In practice, however, a slight degradation is acceptable. The ScaLAPACK driver routines generally exhibit the same scalability behavior, up to a constant factor that depends on the exact number of floating-point operations and the total volume of data exchanged during the computation.
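
For a dense N-by-N matrix, constant memory use per node means that N^2/P is held fixed, so scaling the number of nodes by a constant factor corresponds to scaling N by the square root of that factor. The short sketch below (the starting order and node count are arbitrary illustrative choices) produces exactly the kind of problem sizes that are equidistant on a logarithmic scale:

    # Problem sizes that keep N**2/P, i.e., the memory use per node, constant.
    # The starting point (order 2000 on 4 nodes) is an arbitrary example.
    def scaled_order(n0, p0, p):
        return round(n0 * (p / p0) ** 0.5)

    for p in (4, 16, 64, 256):
        print(p, scaled_order(2000, 4, p))   # 2000, 4000, 8000, 16000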

In large dense linear algebra computations, the computation cost dominates the communication cost. In the following, the time to execute one floating-point operation on one node is denoted by t_f. The time to communicate a message between two nodes is approximated by a linear function of the number of items communicated: the sum of the time to prepare the message for transmission (t_m) and the time taken by the message to traverse the network to its destination, that is, the product of its length and the time to transfer one data item (t_v). The quantity t_m is also called the latency, since it is the time to communicate a message of zero length. On most modern interconnection networks, the latency is on the order of a microsecond to a millisecond.
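
A minimal sketch of this linear communication model follows; the latency and per-item transfer time used here are made-up round numbers, not Paragon measurements:

    # Point-to-point communication time: latency plus message length times
    # the per-item transfer time (illustrative values only).
    T_M = 50.0e-6   # t_m: time to prepare a message (latency), in seconds
    T_V = 0.1e-6    # t_v: time to transfer one data item, in seconds

    def message_time(n_items):
        return T_M + n_items * T_V

    print(message_time(0))        # a zero-length message costs the latency alone
    print(message_time(100000))   # about 0.01 s for a 100000-item message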

The bandwidth of the network, also referred to as its throughput, is proportional to the reciprocal of t_v. On modern networks, the bandwidth is on the order of megabytes per second. For a scalable algorithm with N/P held constant, one expects the performance to be proportional to P. The algorithms implemented in ScaLAPACK are scalable in this sense. Table 5.1 summarizes the relevant constants used in our scalability analysis.

  Symbol           Definition
  ---------------  ----------------------------------------------------------
  N                order of the matrix (problem size)
  P                number of nodes
  NB               block size used for the data distribution
  t_f              time per floating-point operation
  t_m              time to prepare a message for transmission (latency)
  t_v              time to transfer one data item
  C_f, C_v, C_m    constants depending on the particular routine
                   (floating-point operation, data-volume, and message counts)

Table 5.1: Variable definitions

Using the notation presented in Table 5.1, the execution time of the ScaLAPACK drivers can be approximated by

\begin{displaymath}
T(N,P) \;=\; C_f\,\frac{N^3}{P}\,t_f \;+\; C_v\,\frac{N^2}{\sqrt{P}}\,t_v \;+\; C_m\,\frac{N}{NB}\,t_m . \qquad (5.1)
\end{displaymath}
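
As a concrete, purely illustrative reading of Equation 5.1, the sketch below evaluates the time model for made-up machine constants; the values chosen for C_f, C_v, C_m, NB, t_f, t_v, and t_m are placeholders, not figures taken from Table 5.1 or measured on the Paragon:

    # Illustrative evaluation of the time model of Equation 5.1.
    # All constants below are made-up placeholders.
    C_F, C_V, C_M = 2.0 / 3.0, 3.0, 4.0   # flop, data-volume, and message-count constants
    NB = 64                               # block size
    T_F = 1.0 / 90.0e6                    # time per flop (s), i.e., a 90 Mflop/s node
    T_V = 0.1e-6                          # time to transfer one data item (s)
    T_M = 50.0e-6                         # latency per message (s)

    def model_time(n, p):
        # Approximate run time T(N,P) of a ScaLAPACK driver, Equation 5.1.
        return (C_F * n**3 / p * T_F
                + C_V * n**2 / p**0.5 * T_V
                + C_M * n / NB * T_M)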

The corresponding parallel efficiency can then be approximated by

\begin{displaymath}
E(N,P) \;=\; \left( 1 \;+\; \frac{C_v\,t_v}{C_f\,t_f}\,\frac{\sqrt{P}}{N} \;+\; \frac{C_m\,t_m}{C_f\,t_f}\,\frac{P}{NB\,N^2} \right)^{-1} . \qquad (5.2)
\end{displaymath}

Equation 5.2 illustrates, in particular, that the communication-to-computation performance ratio of a distributed-memory computer significantly affects parallel efficiency. The ratio of the latency to the time per flop, t_m / t_f, greatly affects the parallel efficiency of small problems. The ratio of the network throughput to the flop rate, t_f / t_v, significantly affects the parallel efficiency of medium-sized problems. For large problems, the node flop rate, 1 / t_f, is the dominant factor contributing to the parallel efficiency of the parallel algorithms implemented in ScaLAPACK.
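
Continuing the sketch given after Equation 5.1 (and reusing its made-up constants), the corresponding efficiency of Equation 5.2 can be evaluated while the memory use per node is held constant, mimicking the isogranularity experiment of Figure 5.1:

    # Approximate parallel efficiency E(N,P), Equation 5.2, for the model above.
    def model_efficiency(n, p):
        t_seq = C_F * n**3 * T_F          # best sequential time, C_f N^3 t_f
        return t_seq / (p * model_time(n, p))

    # Scale N with sqrt(P) so that N**2/P, the memory per node, stays constant.
    for p in (1, 4, 16, 64):
        n = 2000 * round(p ** 0.5)
        print(p, n, round(model_efficiency(n, p), 2))   # efficiency stays near 0.98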


