Here we give some details of our results and compare them with the standard approach. As mentioned in the previous section, an MLP (1024 inputs, 32 hidden units, 26 output units) was trained on the set of Figure 6.32 using the multiscale method. Outputs are never exactly 0 or 1, so we defined a ``successful'' recognition to occur when the output value of the desired letter was greater than 0.9 and all other outputs were less than 0.1. The training on the coarse grid used back-propagation with a momentum term and went through the exemplars sequentially. The weights are changed to reduce the error function for the current character only, so the system does not reach an absolute minimum; rather, at long times the weight values oscillate with a period equal to the time of one sweep through all the exemplars. This is not a serious problem, as the oscillations are very small in practice. Figure 6.34 shows the training curve for this problem. The first part of the curve is the training of the coarsest network; even though this grid is a bit coarse, almost all of the characters can be memorized. Proceeding to the next grid by scaling the mesh size by a factor of two and using the exemplars at that resolution, we obtained the second part of the learning curve in Figure 6.34. The net got 315/320 correct, and after 12 additional sweeps a perfect score of 320/320 was achieved. The third part of Figure 6.34 shows the result of the final boost to the full $32\times 32$ grid. In just two cycles on the full net, a perfect score of 320/320 was again achieved and the training was stopped. It is useful to compare these results with a direct use of back-propagation on the full mesh without the multiscale procedure. Figure 6.35 shows the corresponding learning curve, with the result from Figure 6.34 drawn in for comparison. Learning via the multiscale method takes much less computer time. In addition, the internal structure of the resultant network is quite different, and we now turn to this question.
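For concreteness, the per-grid training sweep and the success test described above can be sketched as follows in NumPy. This is an illustrative reconstruction, not the original code: it assumes a plain sigmoid MLP with the 1024-32-26 shape quoted above, omits bias terms, and uses function and variable names of our own choosing.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sweep(X, T, W1, W2, V1, V2, lr=0.1, mom=0.9):
    """One sequential sweep of back-propagation with a momentum term.

    X: (n_exemplars, n_inputs) pixel vectors; T: (n_exemplars, 26) targets.
    W1 (hidden x inputs) and W2 (26 x hidden) are updated in place, as are
    the momentum buffers V1, V2 (same shapes as W1, W2)."""
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x)                      # hidden activations
        y = sigmoid(W2 @ h)                      # output activations
        dy = (y - t) * y * (1.0 - y)             # output-layer error signal
        dh = (W2.T @ dy) * h * (1.0 - h)         # back-propagated hidden error
        V2[:] = mom * V2 - lr * np.outer(dy, h)  # momentum update, output layer
        V1[:] = mom * V1 - lr * np.outer(dh, x)  # momentum update, hidden layer
        W2 += V2
        W1 += V1

def count_successes(X, T, W1, W2):
    """Count 'successful' recognitions: desired output > 0.9, all others < 0.1."""
    n = 0
    for x, t in zip(X, T):
        y = sigmoid(W2 @ sigmoid(W1 @ x))
        k = int(np.argmax(t))
        if y[k] > 0.9 and np.all(np.delete(y, k) < 0.1):
            n += 1
    return n
```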
How do these two networks compare on the real task of recognizing exemplars not belonging to the training set? As a generalization test set, we used 156 additional handwritten characters. Although this test set contains no ambiguities for a human reader, the networks did make mistakes: the network from the direct method made errors 14% of the time, and the multiscale network made errors 9% of the time. We feel the improved performance of the multiscale net is due to the difference in quality of the feature extractors in the two cases. In a two-layer MLP, we can think of each hidden-layer neuron as a feature extractor that looks for a certain characteristic shape in the input; the function of the output layer is then to perform the higher-level operation of classifying the input based on which features it contains. By looking at the weights connecting a hidden-layer neuron to the inputs, we can determine what feature that neuron is looking for.
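This kind of inspection amounts to reshaping a hidden unit's row of input weights back into the pixel grid and displaying it, which is essentially what the feature-extractor figures below show. A minimal sketch of such a display (our own illustration using matplotlib, assuming the $32\times 32$ grid of the full net):

```python
import numpy as np
import matplotlib.pyplot as plt

def show_feature(W1, unit, grid=32):
    """Display the input weights of one hidden-layer neuron as a feature map.

    Each weight is drawn at the position of the input pixel it connects to,
    with positive weights shown dark and negative weights shown light."""
    w = W1[unit].reshape(grid, grid)
    lim = np.abs(w).max()
    plt.imshow(w, cmap="gray_r", vmin=-lim, vmax=lim)
    plt.title(f"Feature extractor of hidden unit {unit}")
    plt.colorbar()
    plt.show()
```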
Figure 6.34: The Learning Curve for our Multiscale Training Procedure
Applied to 320 Handwritten Characters. The first part of the curve is
the training on the coarsest net, the second on the intermediate net,
and the last on the full $32\times 32$ net. The curve is plotted
as a function of CPU time and not sweeps through the presentation set,
in order to exhibit the speed of training on the smaller networks.
Figure 6.35: A Comparison of Multiscale Training with the Usual, Direct
Back-propagation Procedure. The curve labelled ``Multiscale'' is the same as in
Figure 6.34, only rescaled by a factor of two. The curve
labelled ``Brute Force'' is from directly training the full network,
from a random start, on the learning set. The direct approach does not quite
learn all of the exemplars, and takes much more CPU time.
For example, Figure 6.36 shows the input weights of two neurons in the coarsest net. The neuron in (a) seems to be looking for a stroke extending downward and to the right from the center of the input field. This is a feature common to letters like A, K, R, and X. The feature extractor in (b) seems to be a ``NOT S'' recognizer and, among other things, discriminates between ``S'' and ``Z''.
Figure 6.36: Two Feature Extractors for the Trained Coarsest Net. This figure
shows the connection weights between one hidden-layer neuron and all the
input-layer neurons. Black boxes depict positive weights, while white boxes
depict negative weights; the size of the box shows the magnitude. The
position of each weight in the grid corresponds to the position
of the input pixel. We can view these pictures as maps of the features which
each hidden-layer neuron is looking for. In (a), the neuron is looking for a
stroke extending down and to the right from the center of the input field;
this neuron fires upon input of the letter ``A,'' for example. In (b), the
neuron is looking for something in the lower center of the picture, but it
also has a strong ``NOT S'' component. Among other things, this neuron
discriminates between an ``S'' and a ``Z''. The outputs of several such
feature extractors are combined by the output layer to classify the original
input.
Figure 6.37: The Same Feature Extractor as in Figure 6.36(b),
after the Boost to the Next, Finer Grid. There is an obvious correspondence between
each connection in Figure 6.36(b) and the clumps
of connections here. This is due to the multiscale procedure, and leads to
spatially smooth feature extractors.
Even at the coarsest scale, the feature extractors usually look for blobs rather than correlating a scattered pattern of pixels. This is encouraging, since it matches the behavior we would expect from a ``good'' character recognizer. The multiscale process accentuates this locality, since a single pixel grows to a local clump of four pixels at each rescaling. This effect can be seen in Figure 6.37, which shows the feature extractor of Figure 6.36(b) after scaling to the next, finer grid and further training. Four-pixel clumps are quite obvious in the network. The feature extractors obtained by direct training on large nets are much more scattered (less smooth) in nature.
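To make the clump structure concrete, here is one plausible way the grid transfer could be implemented. It assumes, as our reading of the text suggests but the text does not state explicitly, that images are coarsened by averaging $2\times 2$ blocks of pixels and that each coarse-grid input weight is spread over the corresponding $2\times 2$ clump of fine-grid pixels when the net is boosted; the division by four, which keeps each hidden unit's net input unchanged, is likewise our assumption.

```python
import numpy as np

def coarsen_image(img):
    """Average 2x2 pixel blocks to produce the next-coarser grid."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def prolong_input_weights(W1_coarse, coarse_grid):
    """Spread each coarse-grid input weight over a 2x2 clump of fine-grid pixels.

    The factor 1/4 compensates for the 2x2 averaging used in coarsening, so a
    hidden unit initially responds to a fine image exactly as it did to the
    coarsened image; further training then refines the clumps."""
    n_hidden = W1_coarse.shape[0]
    fine_grid = 2 * coarse_grid
    W1_fine = np.empty((n_hidden, fine_grid * fine_grid))
    for u in range(n_hidden):
        w = W1_coarse[u].reshape(coarse_grid, coarse_grid)
        clumps = np.repeat(np.repeat(w, 2, axis=0), 2, axis=1) / 4.0
        W1_fine[u] = clumps.ravel()
    return W1_fine
```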