
The Workload Model

Workload parameters are the only dynamic server-dependent parameters required to perform the computation of the predicted execution time T.

Each instance of the agent holds a cached value of the workload of every computational server. By cached, we mean that this value is used directly in the computation of T and that it is updated only periodically. Admittedly, this value may be out of date and may occasionally lead to a wrong estimate of T. Nevertheless, we prefer, on average, to risk a wrong estimate rather than pay the cost of obtaining a constantly accurate one.
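As an illustration of what "cached" means here, the following minimal Python sketch models the agent side only. The class name WorkloadCache, its methods, and the default value are hypothetical and are not taken from NetSolve; the formula for T is not restated in this section, so the sketch stops at the lookup.

```python
from dataclasses import dataclass, field


@dataclass
class WorkloadCache:
    """Hypothetical agent-side cache: last workload received from each server."""
    values: dict = field(default_factory=dict)

    def on_broadcast(self, server: str, workload: float) -> None:
        # Overwrite the cached value whenever a server broadcasts its workload.
        self.values[server] = workload

    def lookup(self, server: str, default: float = 0.0) -> float:
        # No network round trip happens here, so the value may be stale.
        # The default for unknown servers is an assumption of this sketch.
        return self.values.get(server, default)


cache = WorkloadCache()
cache.on_broadcast("M", 35.0)
workload_for_T = cache.lookup("M")  # value the agent would feed into T
```

The lookup never contacts the server, which is the cost saving the text argues for; the price is that workload_for_T reflects the last broadcast rather than the current state of M.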

We emphasize that we have tried to make this estimate as accurate as possible, while minimizing the cost of its computation. Figure 3 shows the scheme we used to manage the workload broadcast.

 
Figure 3: Workload Policy in NetSolve 

Let us consider a computational server M and an instance of the agent C. C computes T according to the last value of M's workload that it knows. M broadcasts its workload periodically. In Figure 3, we call time slice the delay between two consecutive workload broadcasts from M. The figure shows the workload of M as a function of time.

The simplest solution would be to broadcast the workload at the beginning of each time slice. However, experience shows that the workload of a machine can stay the same for a very long time. Therefore, most of the time, the same value would be broadcast again and again over the network. To avoid this useless communication, we chose to broadcast the workload only when it has changed significantly. In the figure, the shaded areas are called the confidence interval. Each time the workload is broadcast, the NetSolve computational server decides that the next value to be broadcast must differ sufficiently from the last broadcast one, that is, it must fall outside this confidence interval. In Figure 3, the workload is broadcast three times during the first five time slices.
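The broadcast rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not NetSolve's actual code: the function name broadcast_decisions, the parameter half_width, and the choice of an interval centered on the last broadcast value are all hypothetical, as is the sample data.

```python
def broadcast_decisions(samples, half_width):
    """Return (slice_index, value) pairs for the samples that get broadcast.

    samples:    workload assessed at the start of each time slice.
    half_width: assumed half-width of the confidence interval, centered
                on the last broadcast value (an assumption of this sketch).
    """
    sent = []
    last = None  # last value actually broadcast
    for i, w in enumerate(samples):
        # Broadcast the first value, then only values outside the interval.
        if last is None or abs(w - last) > half_width:
            sent.append((i, w))
            last = w
    return sent


# Made-up workload samples, one per time slice.
samples = [10, 12, 11, 30, 31, 29, 60]
print(broadcast_decisions(samples, half_width=5))
# prints [(0, 10), (3, 30), (6, 60)]
```

With these made-up samples, seven assessments result in only three broadcasts; the four values that stay within the interval around the last broadcast are never sent.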

Two parameters are involved in this workload management: the width of a time slice and the width of the confidence interval. These parameters must be chosen carefully. A time slice that is too narrow causes the workload to be assessed often, which is costly in terms of CPU cycles. We have to remember that a NetSolve server is supposed to run on a host for a long period of time; it cannot be allowed to monopolize a lot of CPU time. The width of the confidence interval must also be considered carefully. A narrow confidence interval causes a lot of useless workload broadcasting, which is costly in terms of network bandwidth.
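To make the two costs concrete, the sketch below counts workload assessments (CPU cost) and broadcasts (network cost) for a few parameter choices. Everything in it is an assumption made for illustration: the synthetic workload curve, the function names, the units, and the parameter values are not measurements or settings from NetSolve.

```python
import math


def workload(t):
    # Synthetic, purely illustrative workload curve (not measured data).
    return 50 + 40 * math.sin(t / 20.0)


def cost(duration, time_slice, half_width):
    """Count workload assessments and broadcasts over `duration` time units."""
    assessments = broadcasts = 0
    last = None  # last value actually broadcast
    t = 0.0
    while t < duration:
        w = workload(t)
        assessments += 1  # one assessment per time slice
        if last is None or abs(w - last) > half_width:
            broadcasts += 1  # value left the confidence interval
            last = w
        t += time_slice
    return assessments, broadcasts


for time_slice, half_width in [(1, 1), (1, 10), (10, 1), (10, 10)]:
    print(time_slice, half_width, cost(600, time_slice, half_width))
```

In this toy model, the number of assessments depends only on the time slice, while the number of broadcasts drops as the confidence interval widens; a wider interval also leaves the agents with a coarser view of the workload.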

Choosing an effective time slice and confidence interval serves another function. It helps keep the workload information held by the instances of the agent as accurate as possible, so that the estimated value of T is reasonable. We emphasize that experimentation is needed to determine the most suitable time slice and confidence interval. One possibility currently under investigation is to have each server tune its confidence interval and time slice dynamically at runtime.






Joint Institute for Computational Science
Mon Apr 29 13:00:40 EDT 1996