
System Training

 

Training the hybrid RNN/HMM system entails estimating both the parameters of the underlying Markov chain and the weights of the recurrent network. Unlike HMMs, which use exponential-family distributions to model the acoustic signal, the hybrid system does not (yet) have a unified approach (e.g., the EM algorithm [21]) for estimating both sets of parameters simultaneously. Instead, a variant of Viterbi training is used to estimate the system parameters, as described below.

The parameters of the system are adapted using Viterbi training to maximise the log likelihood of the most probable state sequence through the training data. First, a Viterbi pass is made to compute an alignment of states to frames. The parameters of the system are then adjusted to increase the likelihood of the frame sequence under that alignment. This maximisation comes in two parts: (1) maximisation of the emission probabilities and (2) maximisation of the transition probabilities. Emission probabilities are maximised using gradient descent, and transition probabilities through the re-estimation of the duration models and of the prior probabilities on multiple pronunciations. Thus, the training cycle takes the following steps:

  1. Assign a phone label to each frame of the training data. This initial label assignment is traditionally taken from hand-labelled speech (e.g., the TIMIT database).
  2. Based on the phone/frame alignment, construct the phone duration model and compute the phone priors needed for converting the RNN output to scaled likelihoods.
  3. Train the recurrent network based on the phone/frame alignment. This process is described in more detail in section 4.1.
  4. Using the parameters from step 2 and the recurrent network from step 3, apply Viterbi alignment to update the phone labels of the training data, then return to step 2.
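As an illustration of step 2, the following minimal sketch derives phone priors and per-phone duration samples from a frame-level alignment, and converts a vector of network outputs to scaled likelihoods by dividing each posterior by its prior. The function names, the integer phone indices, and the `floor` parameter are assumptions made for this example; the parametric form of the duration model is not specified here, so only raw run lengths are collected.

```python
from collections import Counter
from itertools import groupby


def phone_priors(labels, num_phones):
    """Relative frequency of each phone over all aligned frames."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(p, 0) / total for p in range(num_phones)]


def phone_durations(labels):
    """Run lengths (in frames) of each phone in a frame-level alignment.

    These samples are the raw material for a duration model; how the
    model is fitted to them is left open in this sketch.
    """
    durations = {}
    for phone, run in groupby(labels):
        durations.setdefault(phone, []).append(len(list(run)))
    return durations


def scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Divide each posterior by its prior.

    `floor` (an assumption of this sketch) guards against division by
    zero for phones that never occur in the alignment.
    """
    return [post / max(prior, floor) for post, prior in zip(posteriors, priors)]


# Toy alignment over three phones, indexed 0..2.
alignment = [0, 0, 0, 1, 1, 2, 2, 2, 2, 0]
priors = phone_priors(alignment, num_phones=3)
durations = phone_durations(alignment)      # {0: [3, 1], 1: [2], 2: [4]}
scaled = scaled_likelihoods([0.2, 0.2, 0.6], priors)
```

Because the priors are recomputed from the current alignment, they change each time step 4 relabels the training data.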

We generally find that four iterations of this Viterbi training are sufficient.
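The training cycle can be sketched end to end on toy data. This is a sketch under strong assumptions, not the system described here: the recurrent network is replaced by a hypothetical stand-in classifier (one mean per phone on a one-dimensional feature, combined with the phone priors to give posteriors), the duration model and pronunciation priors are omitted so that every transition is equally likely, and a single utterance with a known phone sequence is aligned. All function names are invented for the example.

```python
import math
from collections import Counter


def train_frame_classifier(features, labels, num_phones):
    """Stand-in for RNN training (step 3): per-phone feature mean and log prior."""
    counts = Counter(labels)
    means, log_priors = [], []
    for p in range(num_phones):
        frames = [x for x, lab in zip(features, labels) if lab == p]
        means.append(sum(frames) / len(frames) if frames else 0.0)
        log_priors.append(math.log(max(counts.get(p, 0), 1) / len(labels)))
    return means, log_priors


def log_posteriors(means, log_priors, x):
    """Normalised log posterior of each phone for one frame."""
    scores = [lp - (x - m) ** 2 for m, lp in zip(means, log_priors)]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def viterbi_align(log_scores, phone_seq):
    """Best left-to-right alignment of phone_seq to the frames (step 4).

    log_scores[t][p] is the log scaled likelihood of phone p at frame t.
    Each phone in phone_seq takes at least one frame, so the number of
    frames must be at least len(phone_seq).
    """
    T, N = len(log_scores), len(phone_seq)
    neg_inf = float("-inf")
    best = [[neg_inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    best[0][0] = log_scores[0][phone_seq[0]]
    for t in range(1, T):
        for i in range(N):
            stay = best[t - 1][i]
            move = best[t - 1][i - 1] if i > 0 else neg_inf
            if move > stay:
                best[t][i], back[t][i] = move, i - 1
            else:
                best[t][i], back[t][i] = stay, i
            best[t][i] += log_scores[t][phone_seq[i]]
    # Trace back from the last phone at the last frame.
    path, i = [0] * T, N - 1
    for t in range(T - 1, -1, -1):
        path[t] = phone_seq[i]
        i = back[t][i]
    return path


def viterbi_training(features, phone_seq, init_labels, num_phones, iterations=4):
    """Repeat steps 2 to 4, starting from the labels of step 1."""
    labels = list(init_labels)
    for _ in range(iterations):
        # Steps 2 and 3: priors and classifier from the current alignment.
        means, log_priors = train_frame_classifier(features, labels, num_phones)
        # Scaled likelihoods in the log domain: log posterior minus log prior.
        log_scores = [
            [lp - pr for lp, pr in zip(log_posteriors(means, log_priors, x), log_priors)]
            for x in features
        ]
        # Step 4: realign and use the new labels in the next pass.
        labels = viterbi_align(log_scores, phone_seq)
    return labels


features = [0.1, 0.0, 0.2, 1.1, 0.9, 1.0, 2.1, 1.9]
initial = [0, 0, 0, 0, 1, 1, 1, 2]          # deliberately misaligned start
final = viterbi_training(features, [0, 1, 2], initial, num_phones=3)
print(final)
```

On this toy input the alignment stops changing after the first pass; the default of four passes simply mirrors the iteration count quoted above.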





Tony Robinson Sun Jun 4 20:04:56 BST 1995