13.2.7 Static Performance Estimator

Next: 13.2.8 Conclusion Up: A Software Tool Previous: 13.2.6 Communication Analysis Algorithm

13.2.7 Static Performance Estimator

Given the results of the communication analysis in a program segment, the performance estimator can be used to predict the performance of that program segment on the target machine. The realization of such an estimator requires a simple static model of performance that is based on (1) target machine parameters such as the number of processors, the message startup and transmission costs, and the average times to perform different floating-point operations; (2) the size of the input data set; and (3) the data partitioning scheme.

We undertook a study of published performance models [Chen:88b], [Fox:88a], [Gustafson:88a], [Saltz:87b] for use in the performance estimator, and noticed that these theoretical models did not give accurate predictions in many cases. We concluded that the theoretical models suffered from the following deficiencies:\

Most of the models suggested in the literature were aimed at being ``general-purpose,'' that is, intended to model the performance of any distributed-memory MIMD computer. This generality created problems in some cases, when machine-specific peculiarities tended to skew the observed results from the ones predicted by the model.
The models also did not account for all of the software overhead involved in implementing the low-level communication utilities on the machine. While the models accounted for things like message startup costs and packetizing costs, they often ignored factors such as internal buffer sizes, peculiarities of the algorithms used to implement the message passing protocols, and so on.

Our effort to correct these defects resulted in an increased complexity of the model, and also necessitated the introduction of several machine-specific features. We felt that this was undesirable, and decided to investigate alternative methods [Balasundaram:90d].

We constructed a program that tested a series of communication patterns using a set of basic low-level portable communication utilities. This program, called a ``training set,'' is executed once on the target machine. The program computes timings for the different communication operations and averages them over all the processors. These timings are determined for a sequence of increasing data sizes. Since the graph of communication cost versus data size is usually a linear function, it can easily be described by specifying a few parameters (e.g., the slope). The training set thus generates a table whose entries contain the minimal information necessary to completely define the performance characteristic for each communication utility. This table is used in place of the theoretical model for the purposes of performance prediction.

Figure 13.10 shows some communication cost characteristics created using a part of our training set on 32 processors of an nCUBE. The data space was assumed to be a two-dimensional array that was partitioned columnwise; that is, each processor was assigned a set of consecutive columns. The communication utilities tested here are:\

iSR: nearest-neighbor individual element send and receive, using the EXPRESS calls exwrite() and exread().
vSR: nearest-neighbor vector send and receive along one direction, using the EXPRESS calls exvwrite() and exvread().
EXCH1: nearest-neighbor vector exchange along one direction, using the EXPRESS call exvchange().
vSRSR: nearest-neighbor vector sends and receives along two directions, using the EXPRESS calls exvwrite() and exvread().
EXCH2: nearest-neighbor vector exchange along two directions, using two calls to exvchange().
COMBN: combine operation over all processors, using the EXPRESS call excombine().
BCAST: one to all broadcast , using the EXPRESS call exbroadcast().

Figure 13.10: Communication cost characteristics of some EXPRESS utilities on the nCUBE

The table generated by the training set for the characteristics shown in Figure 13.10 is:\

The communication cost estimate for a particular data size is then calculated using the formula:\

where ``pkt size'' is the size of each message packet, which on the nCUBE is 1024 bytes ().

The static performance model is meant primarily to help the programmer discriminate between different data partitioning schemes. Our approach is to provide the programmer with the necessary tools to experiment with several data partitioning strategies, until he can converge on the one that is likely to give him a satisfactory performance. The tool provides feedback information about performance estimates each time a partitioning is done by the programmer.

Next: 13.2.8 Conclusion Up: A Software Tool Previous: 13.2.6 Communication Analysis Algorithm

Guy Robinson
Wed Mar 1 10:19:35 EST 1995