
Gradient Computation

Given the objective function, the training problem is to estimate the weights that minimise (11). Of the known algorithms for training recurrent nets, back-propagation through time (BPTT) was chosen as being the most efficient in space and computation [22,23]. The basic idea behind BPTT is illustrated in figure 5. The figure shows how the recurrent network can be expanded in time to represent an MLP in which the number of hidden layers equals the number of frames in the sequence. Training of the expanded recurrent network can be carried out in the same fashion as for an MLP (i.e., using standard error back-propagation [22]), with the constraint that the weights at each layer are tied. In this approach, the gradient of the objective function with respect to the weights is computed using the chain rule of differentiation.
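
The effect of the tying constraint on the gradient can be written out explicitly. As a sketch, using notation that is assumed here rather than taken from the text: let $E$ be the objective function, $\theta$ a shared weight, and $\theta^{(t)}$ the copy of that weight used in layer $t$ of the expanded network, with $\theta^{(t)} = \theta$ for every frame $t = 0, \ldots, N-1$ (the zero-based frame index is also an assumption). The chain rule then gives

```latex
\frac{\partial E}{\partial \theta}
  = \sum_{t=0}^{N-1}
      \frac{\partial E}{\partial \theta^{(t)}}
      \frac{\partial \theta^{(t)}}{\partial \theta}
  = \sum_{t=0}^{N-1} \frac{\partial E}{\partial \theta^{(t)}}
```

so the gradient for a tied weight is the sum of the gradients that ordinary back-propagation would assign to each of its per-layer copies.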

An overview of the gradient computation process for a sequence of N frames is as follows:

  1. Initialise the initial state.
  2. For each frame in turn, from the first to the last, compute the output and the next state by forward propagating the input and the current state as specified in (4)--(6).
  3. Set the error on the final state vector to zero, as the objective function does not depend on this last state vector. Set the error on the output nodes to be the target value given by the Viterbi alignment less the actual output, as in normal back-propagation training.
  4. For each frame in reverse order, from the last to the first, back-propagate the error vector through the network. The error corresponding to the outputs is specified by the Viterbi alignment, while the error corresponding to the state is computed in the same way as back-propagation of the error to hidden units in an MLP.
  5. Compute the gradient of the objective function with respect to the weights by accumulating over all frames.
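
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used for the system described here: it assumes a softmax output layer and a sigmoid state layer, both fed by a combined vector of bias, input and current state, and it assumes a cross-entropy objective against one target class per frame. The names `W`, `V`, `forward`, `cross_entropy` and `bptt`, and the choice of initial state `x0`, are hypothetical.

```python
import math

def softmax(a):
    m = max(a)                      # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def sigmoid(a):
    return [1.0 / (1.0 + math.exp(-v)) for v in a]

def matvec(M, z):
    return [sum(m * v for m, v in zip(row, z)) for row in M]

def forward(W, V, inputs, x0):
    """Step 2: forward propagate; returns combined vectors, outputs, states."""
    xs, zs, ys = [list(x0)], [], []
    for u in inputs:
        z = [1.0] + list(u) + xs[-1]          # assumed layout: bias, input, state
        zs.append(z)
        ys.append(softmax(matvec(W, z)))      # assumed softmax output layer
        xs.append(sigmoid(matvec(V, z)))      # assumed sigmoid state layer
    return zs, ys, xs

def cross_entropy(W, V, inputs, targets, x0):
    """Assumed objective: negative log-probability of the target classes."""
    _, ys, _ = forward(W, V, inputs, x0)
    return -sum(math.log(y[c]) for y, c in zip(ys, targets))

def bptt(W, V, inputs, targets, x0):
    """Return (dW, dV), the negative gradient of cross_entropy w.r.t. W and V."""
    n_in, n_state = len(inputs[0]), len(x0)
    zs, ys, xs = forward(W, V, inputs, x0)    # steps 1 and 2
    dW = [[0.0] * len(row) for row in W]
    dV = [[0.0] * len(row) for row in V]
    err_state = [0.0] * n_state               # step 3: zero error on final state
    off = 1 + n_in                            # where the state starts inside z
    for t in reversed(range(len(inputs))):    # step 4: backwards through time
        z, y, x_next = zs[t], ys[t], xs[t + 1]
        # output error: target (one-hot) less the actual output
        err_out = [(1.0 if k == targets[t] else 0.0) - y[k]
                   for k in range(len(y))]
        # state error passed back through the sigmoid, as for MLP hidden units
        err_pre = [e * s * (1.0 - s) for e, s in zip(err_state, x_next)]
        # step 5: accumulate over frames (the weights are tied across frames)
        for k, e in enumerate(err_out):
            for j, zj in enumerate(z):
                dW[k][j] += e * zj
        for i, e in enumerate(err_pre):
            for j, zj in enumerate(z):
                dV[i][j] += e * zj
        # error on the state at frame t, collected from both layers
        err_state = [
            sum(W[k][off + i] * err_out[k] for k in range(len(W)))
            + sum(V[m][off + i] * err_pre[m] for m in range(len(V)))
            for i in range(n_state)
        ]
    return dW, dV
```

Because the output error is defined as target less output, the accumulated `dW` and `dV` point in the direction that decreases the assumed cross-entropy, so they can be compared against a finite-difference estimate of its gradient with the sign flipped.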

 

 


Figure 5: The expanded recurrent network.

Note that the state units have no specific target vector. They are trained in the same way as hidden units in a feedforward network, so there is no obvious "meaning" that can be assigned to their values. The proposed method is subject to boundary effects, in that the frames at the end of a buffer do not receive an error signal from beyond the buffer. Although methods exist to eliminate these effects (e.g., [23]), in practice the length of the expansion (typically 256 frames) is long enough that the effects are inconsequential.
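
To make the buffer boundary concrete, the following sketch splits a frame sequence into consecutive buffers. It assumes non-overlapping buffers of a fixed maximum length, which the text does not specify; the function name `split_into_buffers` and its parameter `buffer_len` are hypothetical.

```python
def split_into_buffers(frames, buffer_len=256):
    """Split a frame sequence into consecutive buffers of at most buffer_len frames.

    Assumption: buffers do not overlap, so the last frames of each buffer
    receive no error signal from frames in the following buffer.
    """
    if buffer_len <= 0:
        raise ValueError("buffer_len must be positive")
    return [frames[i:i + buffer_len] for i in range(0, len(frames), buffer_len)]
```

With 600 frames and the default length, this yields three buffers, and only the frames close to the end of each buffer are affected by the missing error signal.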



Tony Robinson Sun Jun 4 20:04:56 BST 1995