In this section some preliminary performance results are presented for the parallel left-looking LU factorization algorithm running on an Intel Paragon concurrent computer. These results are intended to illustrate a few general points about the performance of the algorithms used, and do not constitute a detailed performance study. In the work presented here we were constrained by difficulties in getting exclusive access to the Paragon for sufficiently long periods. In addition, the parallel file system (PFS) of the Paragon to which we had access was close to full much of the time. We hope to overcome these problems and undertake a detailed performance study in future work. All the runs were made in exclusive use mode, i.e., with logins disabled to prevent other users from accessing the system. This was done because the performance of PFS is affected by the load on the service nodes, even if other users are just editing or compiling.
The first runs were done using the version of the algorithm that maintains the partially factored matrix in unpivoted form throughout the algorithm. Timing results for a distributed out-of-core matrix are shown in Tables 1 and 2 for 4 × 4 and 8 × 8 process meshes. In these cases we say that the matrix was both logically and physically distributed because each process opens a separate file. As expected for this version of the algorithm, the time spent writing to PFS is much less than the time spent reading. However, the most striking aspect of the timings is the fact that pivoting dominates. The large amount of time spent pivoting arises because each time a superblock is read in, all the pivots evaluated so far must be applied to it. For a sequential algorithm (i.e., P = Q = 1), every previously factored superblock is re-pivoted in this way, so pivoting entails a number of element exchanges of the same order as the I/O cost; the same argument applies in the parallel case, with the relevant width being that of a superblock. Thus, for the version of the algorithm that stores the matrix in unpivoted form to be asymptotically faster than the version that stores the matrix in pivoted form, the condition of Eq. 22, relating the per-element I/O and pivoting costs, must hold, where W and R are the costs of writing and reading an element, respectively, and P is the cost of pivoting an element.
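To make the trade-off concrete, the sketch below gives a rough cost model for the two variants in the sequential case. The symbol w for the superblock width, the per-element costs R, W and P, and the order-of-magnitude element counts are assumptions made for illustration only; this is not the expression of Eq. 22 nor the code of the actual routines.

```python
# Rough, illustrative cost model for the two out-of-core variants
# (sequential case, M x M matrix, superblock width w).  The element
# counts are order-of-magnitude estimates under our own assumptions,
# not the expressions used in the paper.

def unpivoted_cost(M, w, R, W, P):
    """Variant that stores the matrix unpivoted and re-applies pivots
    on the fly each time a previously factored superblock is read."""
    reads = M**3 / (2 * w)    # every earlier superblock is re-read
    writes = M**2             # each superblock is written out once
    pivots = M**3 / (2 * w)   # pivots re-applied to each superblock read
    return R * reads + W * writes + P * pivots

def pivoted_cost(M, w, R, W, P):
    """Variant that keeps the stored matrix in pivoted form, so earlier
    superblocks are rewritten after new pivots are applied."""
    reads = M**3 / (2 * w)
    writes = M**3 / (2 * w)   # dominant term: earlier superblocks rewritten
    return R * reads + W * writes

if __name__ == "__main__":
    M, w = 10000, 400         # illustrative sizes only
    R = W = 1.0               # assume reading and writing cost the same
    P = 0.5                   # assumed cost of one pivot exchange
    print("unpivoted:", unpivoted_cost(M, w, R, W, P))
    print("pivoted:  ", pivoted_cost(M, w, R, W, P))
```

Varying the relative values of R, W and P in this model shows how the balance between the two variants shifts with the per-element I/O and pivoting costs.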
Table 1: Timings in seconds for the main phases of out-of-core LU
factorization of M × M matrices. Results are shown for M=5000,
8000 and 10000. In all cases P=4 and Q=4.
The version of the algorithm that stores the matrix in unpivoted form and
performs pivoting on the fly was used.
The out-of-core matrix was physically and logically distributed.
Table 2: Timings in seconds for the main phases of out-of-core LU
factorization of M × M matrices. Results are shown for M=5000,
8000 and 10000. In all cases P=8 and Q=8.
The version of the algorithm that stores the matrix in unpivoted form and
performs pivoting on the fly was used.
The out-of-core matrix was physically and logically distributed.
In general, there is no reason why writing should be substantially faster than reading, so we would not expect Eq. 22 to hold. Thus, the version of the algorithm that stores the matrix in pivoted form is expected to be faster. This is borne out by the timings presented in Table 3 for an 8 × 8 process mesh. These timings are directly comparable with those of Table 2, and show that the version of the algorithm that stores the matrix in pivoted form is faster by 10-15%. Note that the time for writing is slightly more than half the time for reading, suggesting that it takes slightly longer to write a superblock than to read it.
Table 3: Timings in seconds for the main phases of out-of-core LU
factorization of M × M matrices. Results are shown for M=5000,
8000 and 10000. In all cases P=8 and Q=8.
The version of the algorithm that stores the matrix in pivoted form was used.
The out-of-core matrix was physically and logically distributed.
We next investigated the effect of varying the width of the superblock by increasing the superblock width parameter from 2 to 10. The results are shown in Table 4. A problem will fit in core if the memory required in each process to hold two superblocks exceeds that required to hold the process's portion of the entire matrix. Thus, for the parameters of Table 4 the M=5000 and M=8000 cases fit in core, so we just read in the whole matrix, factorize it using the standard ScaLAPACK routine P_GETRF, and then write it out again. In Table 4 it takes about 58 seconds to perform an in-core factorization of the M=5000 matrix, compared with 191 seconds for the corresponding out-of-core factorization (see Table 3). The M=8000 case in Table 4 failed, presumably because PFS was not able to handle the need to simultaneously read 8 Mbytes from each of 64 separate files. The M=10000 case ran successfully out-of-core, and the results in Table 4 should be compared with those in Table 3, from which we observe that increasing the superblock width increases the times for I/O and factorization, but decreases the times for all other phases of the algorithm. The increase in I/O is unexpected, since a wider superblock should decrease the I/O cost. Perhaps the larger superblock increases the I/O cost because larger amounts of data are read and written in each operation, leading to congestion in the parallel I/O system.
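The in-core test reduces to a one-line comparison. The sketch below assumes the superblock width is measured in matrix columns spanning the whole process mesh; the function and argument names are hypothetical and are not part of the actual routines. Under that assumption the process mesh dimensions cancel from the comparison.

```python
def fits_in_core(M, superblock_cols):
    """Return True when two superblocks occupy at least as much memory per
    process as the whole M x M matrix, so the problem can be factored in
    core.  'superblock_cols' (a hypothetical name) is the total superblock
    width in columns across the process mesh."""
    # Per process the matrix needs roughly (M/P)*(M/Q) elements and one
    # superblock roughly (M/P)*(superblock_cols/Q), so P and Q cancel.
    return 2 * superblock_cols >= M

# Example with assumed numbers: a superblock spanning 4000 columns lets an
# 8000 x 8000 problem run in core, but not a 10000 x 10000 problem.
assert fits_in_core(8000, 4000)
assert not fits_in_core(10000, 4000)
```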
Table 4: Timings in seconds for the main phases of out-of-core LU
factorization of M × M matrices. Results are shown for M=5000,
8000 and 10000. In all cases P=8 and Q=8.
The version of the algorithm that stores the matrix in pivoted form was used.
Note that the M=5000 and 8000 cases ran in-core, and that the
M=8000 case failed.
The out-of-core matrix was physically and logically distributed.
To understand the effect of varying the superblock width on the time for the triangular solve, matrix multiplication, and factorization phases of the algorithm, we consider the number of floating-point operations performed in each phase as a function of the matrix size and the superblock width. In the sequential case (P = Q = 1) the total operation count for the three computational phases is 2M^3/3, but the way these operations are distributed among the phases depends on the width of the superblock; the corresponding counts for the parallel algorithm follow from an analogous substitution for the superblock width. An increase in the superblock width results in an increase in the factorization time and a decrease in the time for matrix multiplication. If the superblock width is sufficiently small compared with the matrix size, a small increase in the width results in an increase in the triangular solve time; if the superblock width is large, however, an increase will decrease the triangular solve time. It should be remembered that all three of these phases run in parallel, so communication time also influences the total running time. In general, increasing the block size or the superblock width should decrease communication time on the Paragon, since data are communicated in larger blocks. If the times for the computational phases in Tables 3 and 4 are summed, we get about 524 seconds for the narrower superblock (Table 3) and about 432 seconds for the wider one (Table 4), which suggests that a larger superblock width results in more efficient parallel computation overall. Communication overhead, together with the floating-point operation count, determines how the performance of the computational phases varies with the superblock width.
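One reconstruction of how the 2M^3/3 operations may divide among the three computational phases is sketched below. The closed forms, and in particular their lower-order terms, are an assumption, with w denoting the total superblock width in the sequential case; they are chosen to sum to 2M^3/3 and to reproduce the qualitative trends described above, and are not taken from the original derivation.

```python
# Illustrative (reconstructed) sequential operation counts for the three
# computational phases of the left-looking algorithm as a function of the
# superblock width w.  The formulas are assumptions: they sum to the usual
# 2*M**3/3 total for LU and reproduce the trends described in the text,
# but the lower-order terms may differ from the original expressions.

def phase_flops(M, w):
    trsm = M * w * (M - w) / 2             # triangular solves
    factor = M * w * (3 * M + w) / 6       # superblock factorizations
    gemm = 2 * M**3 / 3 - trsm - factor    # matrix multiplications
    return trsm, gemm, factor

if __name__ == "__main__":
    M = 10000
    for w in (100, 500, 2500, 5000):
        trsm, gemm, factor = phase_flops(M, w)
        print(f"w={w:5d}  trsm={trsm:.3e}  gemm={gemm:.3e}  factor={factor:.3e}")
```

In this model the triangular solve count peaks at w = M/2, matching the observation that a small increase in the superblock width raises the solve time when the superblock is narrow and lowers it when the superblock is wide.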
The failure of the M=8000 case in Table 4 prompted us to devise a second way of implementing logically distributed files. Instead of opening a separate file for each process, the new method opens a single file and divides it into blocks, assigning one block to each process. This does not change the user interface to the BLAPIOS described in Sec. 5. We refer to this type of file as a physically shared, logically distributed file. It should be noted that the terms ``physically shared'' and ``physically distributed'' refer to the view of the parallel file system from within the BLAPIOS. At the hardware level the file, or files, may be striped across multiple disks, as is the case for the Intel Paragon.
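As an illustration, the sketch below shows one way such a physically shared, logically distributed file could be laid out: a single file divided into equal, contiguous per-process blocks. The helper names and the fixed-block layout are assumptions for illustration and are not the BLAPIOS interface.

```python
# Sketch of a physically shared, logically distributed file: one file is
# divided into equal fixed-size blocks, one block per process.  The layout
# and helper names are illustrative, not the BLAPIOS interface.

def block_offset(rank, block_bytes):
    """Byte offset of a process's block within the shared file."""
    return rank * block_bytes

def write_local(path, rank, block_bytes, local_offset, data):
    """Write 'data' at 'local_offset' within this process's block
    of the (already created) shared file."""
    with open(path, "r+b") as f:
        f.seek(block_offset(rank, block_bytes) + local_offset)
        f.write(data)

def read_local(path, rank, block_bytes, local_offset, nbytes):
    """Read 'nbytes' starting at 'local_offset' within this process's block."""
    with open(path, "rb") as f:
        f.seek(block_offset(rank, block_bytes) + local_offset)
        return f.read(nbytes)
```

Because each process addresses only its own block, the interface seen by the out-of-core routines is the same as for physically distributed files; only the mapping from a process's logical file to byte offsets in the shared file changes.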
The rest of the results presented in this section are for physically shared, logically distributed files, and the version of the algorithm that stores the matrix in pivoted form. In Tables 5 and 6 results are presented for the same problems on 4 × 4 and 8 × 8 process meshes. It is interesting to note that increasing the number of processors from 16 to 64 results in only a very small decrease in the time for the triangular solve phase, indicating that the parallel efficiency of this phase is low. This contrasts with the matrix multiplication phase, which exhibits almost perfect speedup.
Table 5: Timings in seconds for the main phases of out-of-core LU
factorization of M × M matrices. Results are shown for M=5000,
8000 and 10000. In all cases P=4 and Q=4.
The version of the algorithm that stores the matrix in pivoted form was used.
The out-of-core matrix was logically distributed, but physically shared.
Table 6: Timings in seconds for the main phases of out-of-core LU
factorization of M × M matrices. Results are shown for M=5000,
8000 and 10000. In all cases P=8 and Q=8.
The version of the algorithm that stores the matrix in pivoted form was used.
The out-of-core matrix was logically distributed, but physically shared.
In Table 7 timings are presented for the wider superblock (the same superblock width as in Table 4) on an 8 × 8 process mesh. Comparing these results first with those given in Table 4 for a physically and logically distributed file, the decrease in the times for reading and writing is striking. Secondly, of course, the physically shared case no longer fails for the M=8000 in-core case. Comparison of Tables 6 and 7 shows that, for a physically shared file, an increase in the superblock width results in a decrease in I/O time, as expected from the dependence of the I/O time on the superblock width. However, the decrease is less than the expected factor of 5, particularly for the writes. The results in Table 8, for a still wider superblock, show a read time for the M=10000 case that is about the same as for the narrowest superblock, and a write time that is substantially less. This again shows that as the superblock width increases, thereby increasing the amount of data being read and written in each I/O operation, I/O performance starts to degrade quite significantly once the width is sufficiently large.
Table 8 shows timings for the M=10000 case with the same problem parameters as in Table 7, but for a still wider superblock. Comparing the results in Tables 6, 7, and 8, we see that the time for writing data does not decrease monotonically as the superblock width increases, but is smallest for the intermediate width of Table 7. Again we ascribe this behavior to the apparent degradation in I/O performance when the volume of simultaneous I/O is large.
Table 7: Timings in seconds for the main phases of out-of-core LU
factorization of M × M matrices. Results are shown for M=5000,
8000 and 10000. In all cases P=8 and Q=8.
The version of the algorithm that stores the matrix in pivoted form was used.
Note that the M=5000 and 8000 cases ran in-core.
The out-of-core matrix was logically distributed, but physically shared.
Table 8: Timings in seconds for the main phases of out-of-core LU
factorization of M × M matrices. Results are shown for
M=10000 with P=8 and Q=8.
The version of the algorithm that stores the matrix in pivoted form was used.
The out-of-core matrix was logically distributed, but physically shared.