There are a number of ways in which the gradient signal can be employed to
optimise the network. The approach described here has been found to be the
most effective in estimating the large number of parameters of the recurrent
network. On each update, a local gradient is computed from the training
frames in the nth subset of the training data. A positive step size is
maintained for every weight, and each weight is adjusted by this amount in
the direction opposite to the smoothed local gradient, so the smoothed
gradient determines the direction of the change and the step size determines
its magnitude.

The local gradient is smoothed using a ``momentum'' term, giving an averaged
gradient that changes more slowly than the local gradient. The smoothing
parameter is automatically increased from its initial value to its final
value according to a schedule that depends on N, the number of weight
updates per pass through the training data. The step size is geometrically
increased by a fixed factor if the sign of the local gradient is in agreement
with the averaged gradient; otherwise it is geometrically decreased.
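
The three rules described above can be written out as follows. This is a reconstruction from the prose only, and the notation is assumed rather than taken from the source: w for a single weight, g for its local gradient on the nth subset, \tilde{g} for the smoothed gradient, \Delta > 0 for the step size, \alpha_n for the smoothing parameter, and \phi > 1 and 0 < \psi < 1 for the increase and decrease factors. The convex-combination form of the momentum term and the comparison against the previous smoothed gradient (as in Jacobs-style rules) are likewise assumptions. The schedule for \alpha_n and the numerical values of the constants are not reproduced here.

```latex
\begin{align*}
\tilde{g}^{(n)} &= \alpha_n\,\tilde{g}^{(n-1)} + (1-\alpha_n)\,g^{(n)} \\
w^{(n+1)} &= w^{(n)} - \Delta^{(n)}\,\operatorname{sign}\!\left(\tilde{g}^{(n)}\right) \\
\Delta^{(n+1)} &=
  \begin{cases}
    \phi\,\Delta^{(n)} & \text{if } g^{(n)}\,\tilde{g}^{(n-1)} > 0 \\
    \psi\,\Delta^{(n)} & \text{otherwise}
  \end{cases}
\end{align*}
```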

In this way, random gradients produce little overall change.
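
The behaviour described above can be illustrated with a minimal Python sketch for a single weight. It is not the authors' implementation: the function names, the fixed smoothing parameter alpha = 0.9, the factor phi = 1.2, the use of 1/phi as the decrease factor, and the comparison of the local gradient against the previous smoothed gradient are all assumptions made for illustration. The source increases the smoothing parameter during training, which this sketch does not attempt to model.

```python
def update_weight(w, step, smoothed, grad, alpha=0.9, phi=1.2):
    """Apply one update to a single weight; returns (w, step, smoothed).

    alpha, phi and the 1/phi decrease factor are hypothetical values.
    """
    # Does the local gradient agree in sign with the previous averaged gradient?
    agree = grad * smoothed > 0
    # Smooth the local gradient with a momentum-style running average.
    smoothed = alpha * smoothed + (1.0 - alpha) * grad
    # Move by the current step size against the sign of the smoothed gradient.
    if smoothed > 0:
        w -= step
    elif smoothed < 0:
        w += step
    # Geometric increase on agreement, geometric decrease otherwise.
    step = step * phi if agree else step / phi
    return w, step, smoothed


def run(grads, w=0.0, step=0.01):
    """Feed a sequence of local gradients through update_weight."""
    smoothed = 0.0
    for grad in grads:
        w, step, smoothed = update_weight(w, step, smoothed, grad)
    return w, step


# A consistent gradient lets the step size grow and the weight travel far.
w_const, step_const = run([1.0] * 20)
# A gradient that flips sign on every update shrinks the step size instead,
# so the weight barely moves.
w_alt, step_alt = run([1.0, -1.0] * 10)
```

With these assumed settings the consistent sequence ends with a step size larger than its starting value, while the alternating sequence ends with a smaller one and a weight that stays within one initial step of its starting point.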
This approach is similar to the method proposed by Jacobs [24], except that a stochastic gradient signal is used and both the increase and the decrease in the scaling factor are geometric (as opposed to an arithmetic increase and a geometric decrease). Considerable effort was expended in developing this training procedure, and the result was found to give better performance than the other methods found in the literature. Other surveys of ``speed-up'' techniques reached a similar conclusion [25,26].