The recurrent network falls into the framework described by Rumelhart, Hinton and Williams [4]. It may be viewed as a single-layer error propagation (back-propagation) network, part of whose output is fed back to the input after a single frame time delay. This is shown in figure 1, where the external input, $u(t)$, and the state input, $x(t)$, together form the input vector, the output vector being composed of the external output, $y(t)$, and the state output, $x(t+1)$.
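As a minimal sketch of this structure in Python (assuming sigmoid units and a bias input; the function name frame_step and the array shapes are illustrative, not taken from the paper), one frame of the network can be written as:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def frame_step(W, u_t, x_t, n_out):
        """One frame: a single weight layer maps the joined external
        input u(t) and state x(t) to the external output y(t) and the
        next state x(t+1), which is fed back on the following frame."""
        inp = np.concatenate([u_t, x_t, [1.0]])   # bias unit assumed
        out = sigmoid(W @ inp)
        return out[:n_out], out[n_out:]           # y(t), x(t+1)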
In practice, the external output is not trained to classify the current input vector, $u(t)$, but that of $n$ frames previously, $u(t-n)$. This allows some forward context in the classification; backward context is already available through the state vector. For most of these experiments, a four frame delay was used, which corresponds to 64ms.
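A short sketch of this target alignment (names are hypothetical; the 16ms frame period is implied by four frames covering 64ms):

    n_delay = 4   # four frames at 16ms per frame = 64ms of forward context

    def delayed_target_pairs(labels, n_delay=4):
        """Pair the output at frame t with the label of frame t - n_delay,
        so the network sees n_delay future frames before classifying."""
        return [(t, labels[t - n_delay]) for t in range(n_delay, len(labels))]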
Figure 1: The recurrent network
The ``time-expansion'' or ``batch'' method of training recurrent networks is adopted for reasons of computational efficiency. After 32 frames have been processed on each of the 64 transputers, the actual outputs are compared with the desired outputs and the contribution of these patterns to the gradient of the cross-entropy cost function is computed. Cross-entropy is used both because the output units can then be interpreted as probabilities and because it is found to reduce the training time needed.
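A sketch of one such block gradient computation (assuming sigmoid units throughout, targets already delay-aligned, and errors truncated at the block boundary; block_gradient and its shapes are illustrative, not the authors' implementation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def block_gradient(W, U, T, x0, n_out):
        """Cross-entropy gradient over one block of frames.

        W  : (n_out + n_state, n_in + n_state + 1) weight matrix
        U  : (n_frames, n_in) external inputs for the block
        T  : (n_frames, n_out) delay-aligned target outputs
        x0 : (n_state,) state carried in from the previous block
        """
        n_frames, n_in = U.shape
        n_state = x0.shape[0]
        grad = np.zeros_like(W)

        # Forward pass over the block, storing activations for reuse.
        x = x0
        inps, outs = [], []
        for t in range(n_frames):
            inp = np.concatenate([U[t], x, [1.0]])   # bias unit assumed
            out = sigmoid(W @ inp)
            inps.append(inp)
            outs.append(out)
            x = out[n_out:]                          # state fed back next frame

        # Backward pass through the unrolled copies of the single layer.
        dstate = np.zeros(n_state)                   # truncated at block edge
        for t in reversed(range(n_frames)):
            out = outs[t]
            delta = np.empty_like(out)
            # Sigmoid output units with a cross-entropy cost give the
            # simple output error (y - target); state units receive the
            # error fed back from frame t + 1 through the sigmoid slope.
            delta[:n_out] = out[:n_out] - T[t]
            delta[n_out:] = dstate * out[n_out:] * (1.0 - out[n_out:])
            grad += np.outer(delta, inps[t])
            # Error reaching this frame's state input (the columns of W
            # between the external inputs and the bias).
            dstate = (W.T @ delta)[n_in:n_in + n_state]

        return grad, x                               # x seeds the next block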
An adaptive step size algorithm was necessary to achieve training in reasonable time. Each weight has a step size associated with it, and the weight is changed by this amount in the direction of the locally computed gradient. If this gradient agrees in sign with the gradient smoothed by a first-order filter over the whole of the training set, then the step size is increased; otherwise it is decreased. In most experiments, the step size was increased by a factor of 1.116 and decreased by a factor of 0.9. In two cases, pzc and pow, this proved to be unstable, and increase factors of 1.1155 and 1.113 were used, respectively.
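A sketch of this update rule (the smoothing coefficient alpha is an assumed value, as the text does not give one; the function name is illustrative):

    import numpy as np

    def adaptive_step_update(W, grad, step, smooth,
                             up=1.116, down=0.9, alpha=0.99):
        """One weight update with per-weight adaptive step sizes."""
        # First-order filter of the gradient over the training set.
        smooth = alpha * smooth + (1.0 - alpha) * grad
        # Grow the step where the local gradient agrees in sign with the
        # smoothed gradient, shrink it where it disagrees.
        agree = np.sign(grad) == np.sign(smooth)
        step = step * np.where(agree, up, down)
        # Move each weight by its own step size, downhill along the
        # local gradient.
        W = W - step * np.sign(grad)
        return W, step, smooth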
For the majority of the work presented in this paper, 96 state nodes were used, which yields about 20,000 weights to be trained. Thirty-two passes through the training set were found to be sufficient, and this could be achieved in 17 hours on the transputer array.
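As a rough check on this figure (a sketch, assuming a single fully connected layer with a bias unit, as in figure 1): with $n_i$ external inputs, $n_o$ external outputs and $n_s = 96$ state units, the number of weights is
\[ (n_i + n_s + 1)\,(n_o + n_s), \]
which is of the order of 20,000 for the input and output dimensions used here.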