DISCRIMINATIVE METHODS IN HMM-BASED SPEECH RECOGNITION
Valtcho Valtchev
March 1995
Conventional speech recognition systems require information from two knowledge sources: a family of acoustic models and a language model. The acoustic models incorporate knowledge extracted from the speech waveform and are commonly based on hidden Markov models (HMMs). HMMs have been used successfully for speech recognition for many years; however, in many respects the assumptions behind the HMM framework are poor. The following issues can be considered:
1. HMMs are usually trained according to the Maximum Likelihood estimation (MLE) procedure, whose optimality, in the sense of providing the lowest possible error rate, rests on assumptions which are never satisfied in practice. Discriminative training techniques remove the need for these assumptions and instead attempt to optimise an information-theoretic criterion which is directly related to the performance of the recogniser. Unfortunately, discriminative methods require substantially more computation than MLE, and many previous implementations of such techniques have been based on the somewhat unreliable steepest-descent procedure.
2. In the HMM framework, the acoustic observations are assumed to be independent of each other; hence, speech dynamics cannot be modelled directly. Such information is typically provided in ``canned'' form by extending the feature vectors to accommodate differential components which reflect the changes in the standard coefficients (a typical regression formulation is given below). Although this approach results in improved recognition performance, it entails an increased number of model parameters and consequently longer training and recognition times. Another problem with differential coefficients is the implicit assumption that the slope and curvature of the parameter trajectories are the only useful dynamic features.
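A commonly used regression formulation for such differential (delta) coefficients, of the kind referred to in point 2 above, is
\[
\Delta c_t \;=\; \frac{\sum_{\theta=1}^{\Theta} \theta \left( c_{t+\theta} - c_{t-\theta} \right)}{2 \sum_{\theta=1}^{\Theta} \theta^{2}},
\]
where $c_t$ denotes a static coefficient at frame $t$ and $\Theta$ is the half-width of the regression window; acceleration (delta-delta) coefficients are obtained by applying the same formula to the delta coefficients. This standard formulation is given here only for illustration.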
In this dissertation we describe an implementation of the Maximum Mutual Information estimation (MMIE) discriminative training algorithm, in which an approximate second-order optimisation scheme is employed to update the parameters of the HMMs. This algorithm is shown to provide improved recognition performance after only a small number of iterations. A modification of the MMIE algorithm is also discussed, whereby a different weighting is given to each utterance in the training set based on the mutual information measure.
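In its standard form, the mutual information criterion maximised by MMIE training can be written as
\[
\mathcal{F}_{\mathrm{MMIE}}(\lambda) \;=\; \sum_{r=1}^{R} \log
\frac{p_{\lambda}(\mathcal{O}_r \mid \mathcal{M}_{w_r})\, P(w_r)}
     {\sum_{\hat{w}} p_{\lambda}(\mathcal{O}_r \mid \mathcal{M}_{\hat{w}})\, P(\hat{w})},
\]
where $\mathcal{O}_r$ is the $r$-th training observation sequence, $w_r$ its correct transcription, $\mathcal{M}_w$ the composite model for word sequence $w$, $P(w)$ the language model probability and $\lambda$ the set of HMM parameters; MLE, by contrast, maximises only the numerator terms. This is the generic form of the criterion and is given here for orientation rather than as a description of the particular implementation.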
The problem of providing compact and informative feature vectors is introduced and discussed in terms of feature selection and feature extraction algorithms. In this respect, the Minimal Mutual Information Change (MMIC) feature selection algorithm is proposed, in which the change in the mutual information criterion is used to rank-order the components of the feature vector.
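As a rough, self-contained illustration of this style of selection (not the algorithm developed in the dissertation), the following Python sketch rank-orders feature components by greedily discarding, at each step, the component whose removal changes a simple histogram-based mutual information estimate the least; the estimator is only a crude stand-in for the model-based criterion discussed in the text.

    import numpy as np

    def subset_mutual_information(X, y, dims, bins=8):
        # Crude estimate of I(X[:, dims]; y): sum of per-component histogram MIs.
        # This additive estimate is a stand-in for the model-based criterion.
        # X: (N, d) array of feature vectors; y: (N,) array of integer class labels.
        n_classes = int(y.max()) + 1
        mi = 0.0
        for d in dims:
            edges = np.histogram_bin_edges(X[:, d], bins=bins)
            q = np.digitize(X[:, d], edges[1:-1])       # quantise into 0..bins-1
            joint = np.zeros((bins, n_classes))
            np.add.at(joint, (q, y), 1.0)               # joint counts of (x_d, y)
            joint /= joint.sum()
            px = joint.sum(axis=1, keepdims=True)
            py = joint.sum(axis=0, keepdims=True)
            nz = joint > 0
            mi += np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz]))
        return mi

    def mmic_rank(X, y):
        # Backward selection: repeatedly drop the component whose removal causes
        # the smallest change in the criterion, recording the order of removal.
        remaining = list(range(X.shape[1]))
        removed = []
        while len(remaining) > 1:
            base = subset_mutual_information(X, y, remaining)
            changes = [base - subset_mutual_information(
                           X, y, [d for d in remaining if d != c])
                       for c in remaining]
            c = remaining[int(np.argmin(changes))]
            removed.append(c)
            remaining.remove(c)
        removed.append(remaining[0])
        return removed[::-1]   # components ranked from most to least informative

For example, with X an (N x d) array of feature vectors and y an array of integer class labels, mmic_rank(X, y) returns the component indices ordered from most to least informative under this simple estimate.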
Feature extraction techniques are investigated through the introduction of adaptive input transformations into the conventional HMM framework. Transformations of different topologies are initialised so that they implement meaningful transformations of the input parameters. Subsequently, during training, the parameters of the transformations are optimised simultaneously with the HMM parameters according to a discriminative objective criterion.
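For example, in the simplest (linear) case such a transformation replaces each observation vector $o_t$ by $\hat{o}_t = A\, o_t$, where the matrix $A$ is initialised to the identity (or to a known decorrelating transform) and its elements are subsequently updated, together with the HMM means, variances and mixture weights, by optimisation of the discriminative criterion. This linear case is given only as an illustration of the general scheme; more complex topologies follow the same pattern.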
The discriminative training algorithms are evaluated on a British English E-set task, an American alphabet recognition task, and a large continuous phone recognition task (TIMIT).
Keywords: speech recognition, hidden Markov models, discriminative training, feature extraction, feature selection, adaptive input transformations.