
Performance

 

An important performance metric is parallel efficiency. Parallel efficiency, E(N, P), for a problem of size N on P processors is defined in the usual way [19] as

  E(N,P) = \frac{T_{seq}(N)}{P \, T(N,P)}

where T(N,P) is the runtime of the parallel algorithm, and T_{seq}(N) is the runtime of the best sequential algorithm. An implementation is said to be scalable if the efficiency is an increasing function of N/P, the problem size per processor (in the case of dense matrix computations, N = n^2, the number of words in the input, where n is the matrix dimension).
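
As a concrete, purely illustrative example of this definition, the short C sketch below computes E(N,P) from a set of hypothetical measured runtimes; the timings are made up and are not taken from the experiments reported here.

/* Illustrative only: compute parallel efficiency E(N,P) = T_seq(N) / (P * T(N,P))
   from hypothetical measured runtimes. */
#include <stdio.h>

int main(void)
{
    double t_seq   = 480.0;                        /* hypothetical best sequential runtime (s) */
    double t_par[] = {480.0, 250.0, 135.0, 75.0};  /* hypothetical parallel runtimes (s)       */
    int    procs[] = {1, 2, 4, 8};

    for (int i = 0; i < 4; i++) {
        double e = t_seq / (procs[i] * t_par[i]);  /* E(N,P) */
        printf("P = %d  T(N,P) = %6.1f s  E(N,P) = %.2f\n", procs[i], t_par[i], e);
    }
    return 0;
}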

We will also measure the performance of our algorithms in megaflops (or gigaflops) per second. This is appropriate for large dense linear algebra computations, since floating point arithmetic dominates communication. For a scalable algorithm with N/P held fixed, we expect the performance to be proportional to P.
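
For instance, a megaflop rate for the LU factorization can be obtained from a measured runtime together with the standard 2n^3/3 flop count for factoring an n-by-n matrix. The sketch below does this for a hypothetical matrix size and time; neither value is taken from the figures.

/* Rough sketch: convert a measured LU factorization time into a flop rate,
   using the standard 2n^3/3 operation count for LU of an n-by-n matrix.
   The matrix size and time below are hypothetical. */
#include <stdio.h>

int main(void)
{
    double n = 8000.0;      /* hypothetical global matrix dimension */
    double t = 95.0;        /* hypothetical factorization time (s)  */
    double flops = 2.0 * n * n * n / 3.0;
    printf("LU rate: %.1f Mflop/s (%.2f Gflop/s)\n",
           flops / t / 1.0e6, flops / t / 1.0e9);
    return 0;
}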

We seek to increase the performance of our algorithms by reducing the overhead due to load imbalance, data movement, and algorithm restructuring. The way the data are distributed over the memory hierarchy of a computer is of fundamental importance to these factors. We present in this section extensive performance results on various platforms for the ScaLAPACK factorization and reduction routines. Figures 5 and 6 show the performance of the LU, Cholesky, and QR factorizations for various matrix sizes and numbers of processes on an Intel Paragon and an IBM SP-2. The number of processes used in each experiment is given by the process grid dimensions cited next to each curve in the graph. As can be seen, the algorithms are scalable as the problem size and the number of processes are increased.
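
The process grids used in these experiments are those cited in the figures. As an illustration of how such a grid shape might be chosen in practice (a common heuristic, not one prescribed by the text), the small helper below picks a nearly square Pr x Pc grid for a given total number of processes P.

/* A small helper (a common heuristic, not prescribed by the paper) that picks
   a nearly square Pr x Pc process grid for a given number of processes P. */
#include <math.h>
#include <stdio.h>

static void choose_grid(int p, int *pr, int *pc)
{
    int r = (int)sqrt((double)p);
    while (p % r != 0)      /* step down to the largest divisor <= sqrt(p) */
        r--;
    *pr = r;
    *pc = p / r;
}

int main(void)
{
    int sizes[] = {4, 8, 16, 64, 144};
    for (int i = 0; i < 5; i++) {
        int pr, pc;
        choose_grid(sizes[i], &pr, &pc);
        printf("P = %3d  ->  %d x %d process grid\n", sizes[i], pr, pc);
    }
    return 0;
}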

Performance data for the symmetric positive definite banded solvers are presented in Figure 7. The execution time reflects three factors: the execution time of the sequential LAPACK banded routines on each process, the factor-of-four redundancy in operation count caused by fill-in, and the logarithmic execution time of the reduced system. In Figure 7, we graph execution rate as a function of problem size while holding the work per process constant. There are two curves in Figure 7: the top curve plots the actual computational speed of the algorithm using the operation count of the parallel band algorithm, while the bottom curve plots speed relative to the sequential operation count. The latter curve actually shows slower execution as the number of processes increases from 1 to 2 and from 2 to 4, because of the extra operations that band partition algorithms incur through fill-in. Note the almost linear scalability beyond P=4, with the slight drop-off due to the logarithmic complexity of the block odd-even reduction. The top curve dips mildly going from P=1 to P=2 and P=4 because communication is introduced that is absent in the P=1 case.
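
The three contributions described above suggest a rough cost model of the form T(P) = 4 T_local + c log2(P) for P > 1, with T(1) = T_local. The sketch below evaluates such a model with made-up constants, purely to illustrate the qualitative behaviour seen in Figure 7; it is not a model taken from the paper.

/* A very rough model of the parallel banded solve time suggested by the text:
   sequential banded work on each process (with roughly 4x redundant operations
   from fill-in for P > 1) plus a term logarithmic in P for the reduced system.
   All constants below are made up for illustration only. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double t_local   = 10.0;   /* hypothetical local banded factorization time (s)      */
    double t_reduced = 0.5;    /* hypothetical cost per level of the odd-even reduction */

    for (int p = 1; p <= 64; p *= 2) {
        double t = (p == 1) ? t_local   /* P = 1: no fill-in, no reduced system */
                            : 4.0 * t_local + t_reduced * log2((double)p);
        printf("P = %2d  modeled time = %6.2f s\n", p, t);
    }
    return 0;
}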

Performance data for the symmetric eigensolver (PDSYEVX) are presented in [13].

The timings in Figure 8 represent a single super iteration (an iteration with multiple bulges) of the implicit multiple-double-shift nonsymmetric QR eigenvalue routine (PDLAHQR). PDLAHQR maintains accuracy similar to that of DLAHQR. Efficiencies are reported relative to PDLAHQR run on a single node. Because PDLAHQR applies Householder transforms in a blocked fashion, it is faster than DLAHQR on a single node. On 144 Paragon nodes we have seen speed-ups over DLAHQR that actually surpass 144 times the maximum performance of DLAHQR. In each case, a distribution block size between 50 and 125 was chosen, but these blocks were divided into smaller sets (12-30) for communication, and further divided into sets of three for computation. The largest problems run were chosen to fill the same amount of memory; in this case, the Hessenberg matrix, the Schur vectors, and the work buffers together fit into around 52 Mbytes. The work on the Schur vectors was also included. The number of processes used in the experiments is specified by the process grid dimensions cited next to each curve in the graph. Actual experiments indicate that the overall performance of the code, measured using standard flop count heuristics such as those found in [21], is sharply underestimated by looking at a single super iteration.




