The multilayer perceptron, initialized with random weights, produces random output patterns. We would like to carry out a learning stage, progressively modifying the values of the connection strengths in order to bring the outputs closer to the prescribed ones.
It is straightforward to transform the learning task into an optimization problem (i.e., a search for the minimum of a specified function, henceforth called the energy). If the energy is defined as the sum of the squared errors between obtained and desired output patterns over the set of examples, minimizing it will accomplish the task.
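Written out explicitly (in notation chosen for this note; the pattern index $p$, output index $i$, outputs $o_{i}^{p}(w)$, and targets $t_{i}^{p}$ are our labels, not fixed by the text), the energy is
\begin{displaymath}
E(w) \;=\; \sum_{p}\sum_{i}\bigl(o_{i}^{p}(w)-t_{i}^{p}\bigr)^{2},
\end{displaymath}
where $w$ collects all the weights, $o_{i}^{p}(w)$ is the output of unit $i$ on example $p$, and $t_{i}^{p}$ is the corresponding desired output.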
A large fraction of the theoretical and applied research in supervised
learning is based on the steepest descent method for minimization.
The negative gradient of the energy with respect to the weights is
calculated during each iteration, and a step is taken in that direction (if
the step is small enough, energy reduction is assured). In this way, one
obtains a sequence of weight vectors, $w(0), w(1), w(2), \ldots$, that
converges to a local minimum of the energy function:
\begin{displaymath}
w(k+1) \;=\; w(k) \;-\; \epsilon\,\nabla E\bigl(w(k)\bigr).
\end{displaymath}
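As a minimal sketch of this iteration in code (the names steepest_descent, gradient, and the fixed step epsilon below are illustrative choices, not part of the original text):
\begin{verbatim}
import numpy as np

def steepest_descent(gradient, w0, epsilon=0.01, n_steps=1000):
    # Plain steepest descent: w(k+1) = w(k) - epsilon * grad E(w(k)).
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - epsilon * gradient(w)  # small step against the gradient
    return w
\end{verbatim}
With a small enough epsilon, each such step cannot increase the energy (to first order), which is the guarantee mentioned above.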
Now, given that we are interested in converging to the local minimum in the
shortest time (this is not always the case: to combat noise some slowness may
be desired), there is no good reason to restrict ourselves to steepest
descent, and there are at least a couple of reasons in favor of other
methods. First, the learning speed, $\epsilon$, is a free
parameter that has to be chosen carefully for each problem (if it is too
small, progress is slow; if too large, oscillations may be created).
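Both failure modes are visible already on a one-dimensional quadratic (an example invented for this note): take $E(w)=\frac{1}{2}\lambda w^{2}$, so that $\nabla E(w)=\lambda w$ and the update becomes
\begin{displaymath}
w(k+1) \;=\; (1-\epsilon\lambda)\,w(k).
\end{displaymath}
The iteration converges only if $\epsilon<2/\lambda$; for $\epsilon\ll 1/\lambda$ progress is very slow, while for $1/\lambda<\epsilon<2/\lambda$ the weight overshoots the minimum and oscillates in sign at every step.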
Second, even in the optimal case, in which each step along the steepest
descent direction reaches the minimum of the energy along that direction
(an exact line search), it can be proved that steepest descent can be arbitrarily slow,
particularly when ``the search space contains long ravines that are
characterized by sharp curvature across the ravine and a gently sloping
floor'' [Rumelhart:86b]. In other words, if we are unlucky with the
choice of units along the different dimensions (and this is a frequent event
when the number of weights is 1000 or 10,000), it may be the case that during
each iteration, the previous error is reduced by 0.000000001%!
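For a quadratic energy this slowness can be made precise: by a classical worst-case bound (quoted here for reference; it is not derived in the text), steepest descent with exact line searches may, for unfavourable starting points, shrink the excess energy per iteration by only a factor of
\begin{displaymath}
\left(\frac{\lambda_{\max}-\lambda_{\min}}{\lambda_{\max}+\lambda_{\min}}\right)^{2},
\end{displaymath}
where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of the Hessian; as their ratio grows, this factor approaches 1 and the progress per iteration becomes negligible.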
The problem is essentially caused by the fact that the gradient does not necessarily point in the direction of the minimum, as shown in Figure 9.23.
Figure 9.23: Gradient Direction for Different Choice of Units
If the energy is quadratic, a large ratio of the maximum to minimum eigenvalues of the Hessian causes the ``zigzagging'' motion illustrated. In the next sections, we will describe two suggestions for tuning the parameters in an adaptive way and for selecting better descent directions.
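A small numerical sketch of this zigzagging (the particular quadratic, step size, and variable names below are chosen for illustration and do not come from the text):
\begin{verbatim}
import numpy as np

# Ill-conditioned quadratic: E(w) = 0.5 * (w[0]**2 + 100.0 * w[1]**2).
# The Hessian eigenvalues are 1 and 100, an eigenvalue ratio of 100.
def grad(w):
    return np.array([w[0], 100.0 * w[1]])

w = np.array([10.0, 1.0])
epsilon = 0.019            # just below the stability limit 2/100 = 0.02
for k in range(50):
    w = w - epsilon * grad(w)
    # w[1] flips sign every step (zigzag across the ravine), while
    # w[0] shrinks by only 1.9% per step along the gently sloping floor.
print(w)
\end{verbatim}
Shrinking epsilon removes the zigzag but makes progress along the floor of the ravine even slower; this is the dilemma that the adaptive schemes of the next sections are meant to resolve.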