As we explained in an earlier section, slaves get work from their masters in a self-scheduled way in order to achieve a simple type of load balancing. This turns out not to be enough, however. By the nature of alpha-beta, the time necessary to search two different subtrees of the same depth can vary quite dramatically. A factor of 100 variation in search times is not unreasonable. Self-scheduling is somewhat helpless in such a situation. In these cases, a single slave would have to grind out the long search, while the other slaves (and conceivably, the entire rest of the machine) would merely sit idle. Another problem, near the bottom of the chess tree, is the extremely rapid time scales involved. Not only do the search times vary by a large factor, but this all happens at millisecond time scales. Any load-balancing procedure will therefore need to be quite fast and simple.
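The limitation of pure self-scheduling can be seen in a small simulation. The sketch below (hypothetical names; not the actual program's code) models slaves pulling subtree searches from a queue as they finish. With the factor-of-100 variation in search times mentioned above, one long search dominates the total time while the other slaves sit idle.

```python
# A minimal sketch of self-scheduled work distribution: slaves pull
# subtree-search tasks from a shared queue as each finishes its current one.
import heapq

def self_schedule(task_times, n_slaves):
    """Return the makespan when n_slaves greedily pull tasks from a queue."""
    # Each slave is represented by the time at which it next becomes free.
    free_at = [0.0] * n_slaves
    heapq.heapify(free_at)
    for t in task_times:
        start = heapq.heappop(free_at)   # next slave to become idle takes the task
        heapq.heappush(free_at, start + t)
    return max(free_at)

# Nine short searches and one 100x-longer one: the long search alone
# determines the makespan, far from the ideal of (9 + 100) / 4 ~ 27.
tasks = [1.0] * 9 + [100.0]
print(self_schedule(tasks, 4))  # -> 102.0
```

Nothing in the self-scheduling rule lets the three idle slaves help with the one long search; that is exactly the gap the dynamic load balancing below fills.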
These ``chess hot spots'' must be dealt with explicitly. The master and submaster processors, besides waiting for search answers, updating alpha-beta bounds, and so forth, also monitor the load balance among their slaves. In particular, if some minimum number of slaves are idle and a search has been in progress for some minimum amount of time, the master halts the search in the slave containing the hot spot, reorganizes all of his idle slaves into one large team, and restarts the search in this new team. This process is entirely local to the master and his slaves, and happens recursively at all levels of the processor tree.
This ``shoot-down'' procedure is governed by two parameters: the minimum number of idle slaves, and the minimum time before a search is considered a potential hot spot. These parameters prevent the halting and processor rearrangement, with their associated overhead, in cases which are not actually hot spots. Both parameters are tuned for maximum performance.
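The trigger condition a master applies while polling can be sketched as follows. The parameter names and values here are hypothetical placeholders, not those of the actual program; the point is only that both conditions must hold before a shoot-down is attempted.

```python
# A minimal sketch of the shoot-down test: a running search becomes a
# hot-spot candidate only when enough slaves are idle AND the search has
# already run for some minimum time.
MIN_IDLE_SLAVES = 3      # tuning parameter: minimum number of idle slaves
MIN_SEARCH_TIME = 0.005  # tuning parameter: seconds before suspecting a hot spot

def should_shoot_down(idle_slaves, search_elapsed,
                      min_idle=MIN_IDLE_SLAVES, min_time=MIN_SEARCH_TIME):
    """Decide whether to halt the running search and regroup the idle
    slaves into one large team behind it."""
    return idle_slaves >= min_idle and search_elapsed >= min_time

print(should_shoot_down(idle_slaves=5, search_elapsed=0.02))   # True: both conditions met
print(should_shoot_down(idle_slaves=1, search_elapsed=0.02))   # False: too few idle slaves
print(should_shoot_down(idle_slaves=5, search_elapsed=0.001))  # False: search still too young
```

Requiring both conditions is what keeps the test cheap and conservative: brief searches and nearly-busy teams are left alone, so the rearrangement overhead is paid only where a hot spot is likely.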
The payoff of dynamic load balancing has been quite large. Once the load-balancing code was written, debugged, and tuned, the program was approximately three times faster than before load balancing. Through observations of the speedup (to be discussed below), and by looking directly at the execution of the program across the nCUBE (using the parallel graphics monitor, also to be discussed below), we have become convinced that the program is well load balanced, and we are optimistic about the prospects for scaling to larger speedups on larger machines.
An interesting point regarding asynchronous parallel programming was brought out by the dynamic load-balancing procedure. It concerns the question, ``Once we've rearranged the teams and completed the search, how do we return to the original hierarchy so as to have a reasonable starting point for the next search?'' Our first attempts at resetting the processor hierarchy met with disaster. It turned out that processors would occasionally not make it back into the hierarchy (that is, become the slave of someone) in time for another search to begin. This happened because of the asynchronous nature of the program and the variable amount of time that messages take to travel through the machine. Once this happened, the processor would end up in the wrong place in the chess tree and the program would soon crash. A natural remedy is to demand that all processors be reconnected before beginning a new search, but we rejected this as being tantamount to a global resynchronization and hence very costly. We therefore took an alternate route: the code was written carefully so that processors could actually stay disconnected from the processor tree while the whole system still functioned correctly. A disconnected processor would reconnect eventually; it would just miss one search. This solution works quite well both in terms of conceptual simplicity and speed.
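The lazy-reconnection rule can be sketched as a simple team-formation step. The names and structure below are hypothetical, not the program's own; the sketch shows only the key idea that a search's team is drawn from whoever has reconnected by the time it starts, so a straggler misses one search instead of forcing a global resynchronization.

```python
# A minimal sketch of lazy reconnection: a processor that has not rejoined
# the hierarchy when a search starts is simply left out of that search and
# is picked up again at the next one.
def start_search(all_slaves, connected):
    """Form this search's team from whoever has reconnected so far."""
    team = [s for s in all_slaves if s in connected]
    missed = [s for s in all_slaves if s not in connected]
    return team, missed

slaves = ["s0", "s1", "s2", "s3"]
connected = {"s0", "s1", "s3"}          # s2's reconnect message is still in flight
team, missed = start_search(slaves, connected)
print(team, missed)                     # s2 just misses this one search

connected.add("s2")                     # ...and has rejoined before the next
team2, missed2 = start_search(slaves, connected)
print(team2, missed2)                   # full team restored, no global barrier needed
```

The design choice is to tolerate a temporarily smaller team rather than make every search wait on the slowest reconnect message, which is what keeps the scheme both simple and fast.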