IMPROVED ACOUSTIC MODELLING FOR HMMS USING LINEAR TRANSFORMATIONS
Chris J Leggetter
Hidden Markov models (HMMs) have been used successfully for speech recognition for many years. However, in some respects the assumptions behind HMMs are poor. HMMs model only the within-class data, and no attempt is made to discriminate between classes. This is a problem, especially in speaker independent systems, which must cope with a wide variety of speakers. This thesis considers the problem of acoustic modelling in speaker independent systems in two ways: (a) by incorporating discrimination into the HMM framework; and (b) by adapting the HMMs to the chosen speaker (speaker adaptation). In both cases, linear transformation methods are proposed which aim to tune the model parameters on a class-specific basis to improve the modelling. Particular emphasis is placed on applications in large vocabulary continuous speech recognition.
The acoustic-class discrimination problem is addressed at the HMM state level by considering the feature space representation of each class. Confusable class distributions are identified, and class-specific mappings in the form of linear transforms are used to separate the within-class data from confusable data. The transforms reduce the dimensionality of the feature space so that those elements of the feature space which are confusable are discarded. Two methods of identifying confusable distributions are considered: one is data-driven, using data from the training set, and the other is based on the distances between class distributions.
For speaker adaptation, a new approach using transformations, termed maximum likelihood linear regression (MLLR), is derived. Transforms are associated with each component distribution within the HMM system and estimated using a maximum likelihood approach similar to standard HMM parameter estimation. The transforms capture the general speaker characteristics between the current system parameters and the new speaker. A flexible form of tying of transforms is derived to make efficient use of the available data, allowing adaptation on small amounts of example speech. The method can be implemented in supervised or unsupervised adaptation modes and the flexible framework can be extended for incremental adaptation.
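As a rough illustration of the transform idea described above, the sketch below applies a single shared affine transform to the mean vectors of several Gaussian components. The abstract does not give the exact form of the transform, so this assumes the form commonly published for MLLR, in which an n x (n+1) matrix W acts on an extended mean vector [1, mu]. The function name, the matrix values and the example means are hypothetical, and the maximum likelihood estimation of W from adaptation data is not shown.

```python
def apply_mllr_mean_transform(W, mean):
    """Return the adapted mean W * [1, mean].

    W is an n x (n+1) matrix given as a list of rows; its first column
    acts as a bias term. This form is an assumption, not taken from
    the abstract.
    """
    extended = [1.0] + list(mean)  # extended mean vector with leading 1
    if any(len(row) != len(extended) for row in W):
        raise ValueError("each row of W must have len(mean) + 1 entries")
    return [sum(w * x for w, x in zip(row, extended)) for row in W]


# Hypothetical 2-dimensional example: one transform tied across
# two component distributions, so both means share the same W.
W = [
    [0.5, 1.0, 0.0],
    [-1.0, 0.0, 2.0],
]
means = [[1.0, 2.0], [3.0, 4.0]]
adapted = [apply_mllr_mean_transform(W, m) for m in means]
print(adapted)  # [[1.5, 3.0], [3.5, 7.0]]
```

Because the same W is reused for every component tied to it, a few seconds of adaptation speech can update many components at once, which is the motivation for tying given in the abstract.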
The discriminative and speaker adaptation transformations have been evaluated on a 1000-word task (Resource Management), and the speaker adaptation has also been evaluated on a larger 5000-word task (Wall Street Journal).