There is no general prescription for selecting a learning rate in back-propagation that avoids oscillations and converges to a good local minimum of the energy in a short time. In many applications some kind of ``black magic,'' i.e., a trial-and-error process, is employed. In addition, no single fixed learning rate is usually appropriate for the entire learning session.
Both problems can be solved by adapting the learning rate to the local structure of the energy surface.
We start with a given learning rate (its initial value is not critical) and monitor the energy after each weight update. If the energy decreases, the learning rate for the next iteration is increased by a factor $\rho$. Conversely, if the energy increases (an ``accident'' during learning), this is taken as an indication that the step was too long: the learning rate is decreased by a factor $\sigma$, the last change is cancelled, and a new trial is made. The reduction is repeated until a step that decreases the energy value is found (this will eventually happen, because the search direction is that of the negative gradient). An example of the size of the learning rate as a function of the iteration number is shown in Figure 9.24.
Figure 9.24: Learning rate magnitude as a function of the iteration number for a test problem.
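The rule can be summarized in a few lines of code. The following is a minimal sketch in Python: the function names, the stopping criterion, and the quadratic test energy are our own illustrative assumptions, while the factors $\rho$ and $\sigma$ play the roles described above.

```python
import numpy as np

def bold_driver(energy, gradient, w, lr=0.1, rho=1.1, sigma=0.5,
                n_epochs=100, lr_min=1e-12):
    """Batch gradient descent with the Bold Driver learning-rate rule.

    Default values for rho and sigma are illustrative assumptions.
    """
    E = energy(w)
    for _ in range(n_epochs):
        # Try a step along the negative gradient; on an "accident"
        # (energy increase) cancel it, shrink the rate, and retry.
        while lr > lr_min:
            w_new = w - lr * gradient(w)
            E_new = energy(w_new)
            if E_new < E:
                break
            lr *= sigma          # step was too long: reduce and retry
        else:
            return w, lr         # rate underflow: gradient is ~0, stop
        w, E = w_new, E_new      # accept the successful step
        lr *= rho                # reward success with a larger rate
    return w, lr

# Illustrative test problem: quadratic energy E(w) = w^T A w / 2.
A = np.diag([1.0, 10.0])
w_final, lr_final = bold_driver(lambda v: 0.5 * v @ A @ v,
                                lambda v: A @ v,
                                w=np.array([3.0, -2.0]))
```

Note that each accepted step strictly decreases the energy, so the sketch mirrors the monitoring-and-cancelling behavior described above; only the bound on the number of reductions is added as a numerical safeguard.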
The name ``Bold Driver'' was chosen for its analogy with the learning process of young and inexperienced car drivers.
By using this ``brutally heuristic'' method, learning converges in a time comparable to, and usually better than, that of standard (batch) back-propagation with an optimal and fixed learning rate. The important difference is that the time-consuming meta-optimization phase for choosing the learning rate is avoided. The values of $\rho$ and $\sigma$ can be fixed once and for all (e.g., $\rho = 1.1$, $\sigma = 0.5$), and performance does not depend critically on their choice.