In this section, we shall examine techniques for optimizing the basic
LU factorization code presented in Section 4.1.
Among the issues to be
considered are the assignment of processes to physical processors, the
arrangement of the data in the local memory of each process, the trade-off
between load imbalance and communication latency, the potential for overlapping
communication and calculation, and the type of algorithm used to broadcast data.
Many of these issues are interdependent, and they must also be weighed against
the portability of the code and the ease with which it can be maintained and used. For further
details of the optimization of parallel
LU factorization algorithms for specific concurrent machines, together with
timing results, the reader is referred to the work of Chu and George [12],
Geist and Heath [32], Geist and Romine [33], Van de Velde [48], Brent [8],
Hendrickson and Womble [35], Lichtenstein and Johnsson [41], and Dongarra
and co-workers [10, 24].