We would like to train large, high-resolution nets. If one tries to do this directly, by simply starting with a very large network and training by the usual back-propagation methods, not only is the training slow (because of the large size of the network), but the generalization properties of such nets are poor. As described above, a large net with many weights from the input layer to the hidden layer tends to ``grandmother'' the problem, leading to poor generalization.
The hidden units of an MLP form a set of feature extractors. Considering a complex pattern such as a Chinese character, it seems clear that some of the relevant features that distinguish it are large, long-range structures requiring little detail, while other features are fine-scale and require high resolution. Some sort of multiscale decomposition of the problem therefore suggests itself. The method we will present below builds in long-range feature extractors by training on small networks and then uses these as an intelligent starting point on larger, higher resolution networks. The method is somewhat analogous to the multigrid technique for solving partial differential equations.
Let us now present our multiscale training algorithm. We begin with the training set, such as the one shown in Figure 6.32, defined at the high resolution. Each exemplar is coarsened by a factor of two in each direction using a simple grey-scale averaging procedure: $2\times 2$ blocks of pixels in which all four pixels were ``on'' map to an ``on'' pixel, those in which three of the four were ``on'' map to a ``3/4 on'' pixel, and so on. The result is that each exemplar is mapped to an exemplar at half the resolution in such a way as to preserve the large-scale features of the pattern. The procedure is then repeated until a suitably coarse representation of the exemplars is reached; in our case, we stopped after coarsening to $8\times 8$.
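The grey-scale averaging step can be sketched as follows. This is a hypothetical helper (the function name and array representation are our assumptions, not the original implementation), assuming each exemplar is stored as a 2-D array of pixel values in $[0,1]$:

```python
import numpy as np

def coarsen(img: np.ndarray) -> np.ndarray:
    """Average each 2x2 block of pixels into a single grey-scale pixel.

    A block with all four pixels "on" maps to 1.0, a block with three
    of four "on" maps to 0.75, and so on, preserving the large-scale
    structure of the pattern.
    """
    h, w = img.shape
    assert h % 2 == 0 and w % 2 == 0, "dimensions must be even"
    # Group pixels into 2x2 blocks, then average within each block.
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```

Applying `coarsen` repeatedly halves the resolution at each stage, stopping once a suitably coarse representation is reached.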
At this point, an MLP is trained to solve the coarse mapping problem by one's favorite method (back-propagation, simulated annealing, and so on). In our case, we set up an MLP of 64 inputs (corresponding to the $8\times 8$ coarse grid), 32 hidden units, and 26 output units. This was then trained on the set of 320 coarsened exemplars using the simple back-propagation method with a momentum term [Rumelhart:86a], Chapter 8. Satisfactory convergence was achieved after approximately 50 cycles through the training set.
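The momentum term blends each weight step with the previous one, damping oscillations in back-propagation. A minimal sketch of a single update (the learning rate and momentum coefficients below are illustrative values, not those used in the experiments):

```python
def momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One back-propagation weight update with a momentum term.

    The new step is a blend of the current (negative) gradient and the
    previous step, stored in `velocity`. Returns the updated weight and
    the new velocity for use on the next iteration.
    """
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```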
We now wish to boost back to a high-resolution MLP, using the results of the coarse net. We use a simple interpolating procedure which works well. The number of hidden units is left unchanged. Each weight from the input layer to the hidden layer is split, or ``un-averaged,'' into four weights (each now attached to its own pixel), each $1/4$ the size of the original. The thresholds are left untouched during this boosting phase. This procedure gives a higher resolution MLP with an intelligent starting point for additional training at the finer scale. In fact, before any training at all is done with the boosted MLP, it recalls the exemplars quite well, which is a measure of how much information was lost in the coarsening step. The boost-and-train process is repeated until the desired high-resolution MLP is reached. The entire multiscale training process is illustrated in Figure 6.33.
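The un-averaging step can be sketched as follows, assuming the input-to-hidden weights are stored with one row per hidden unit and one column per input pixel (the function name and array layout are our assumptions):

```python
import numpy as np

def boost_weights(w_coarse: np.ndarray, side: int) -> np.ndarray:
    """Split each coarse input-to-hidden weight into four fine weights.

    `w_coarse` has shape (n_hidden, side * side): one weight per coarse
    pixel. Each weight is replicated onto the corresponding 2x2 block of
    fine pixels at 1/4 its original value. Hidden-unit thresholds are
    left untouched and are not handled here.
    """
    n_hidden = w_coarse.shape[0]
    grid = w_coarse.reshape(n_hidden, side, side)
    # Replicate each coarse weight over its 2x2 fine block, scaled by 1/4.
    fine = np.repeat(np.repeat(grid, 2, axis=1), 2, axis=2) / 4.0
    return fine.reshape(n_hidden, (2 * side) ** 2)
```

A pleasant property of this choice: if a coarse pixel is the average of its four fine pixels, the net input to each hidden unit is exactly preserved by the boost, which is why the boosted net recalls the exemplars well before any further training.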
Figure 6.33: An Example Flowchart for the Multiscale Training Procedure. This was the procedure used in this text, but the averaging and boosting can be continued through an indefinite number of stages.