Abstract for tuerk_thesis

PhD Thesis, University of Cambridge

THE STATE BASED MIXTURE OF EXPERTS HMM WITH APPLICATIONS TO THE RECOGNITION OF SPONTANEOUS SPEECH

A. Tuerk

September 2001

Although the performance of speech recognition systems has increased substantially over the last decades, a number of tasks still pose considerable problems for current state-of-the-art techniques. One of these tasks is the recognition of spontaneous speech, which differs from read or planned speech in that its underlying dynamics change frequently over time. The negative effect of changes in acoustic background condition on recognition performance can also be observed in other situations, for instance in speech that is corrupted by non-stationary noise.

This thesis is concerned with the development of an acoustic model for speech recognition which automatically detects changes in the background condition of a signal and compensates for the model-data mismatch by combining the information of several expert models. These experts are specialised for the different acoustic conditions under consideration, and their influence on the recognition process is determined by how well their associated condition matches the signal. This approach gives rise to the state based mixture of experts hidden Markov model (SBME-HMM), which is studied in this thesis both theoretically and in a number of recognition experiments. In principle, the SBME-HMM can be applied to distinguish implicitly between an arbitrary set of discrete acoustic conditions. Since, however, the main focus of this thesis is the application of the SBME-HMM to spontaneous speech, the only conditions considered here correspond to speech at different speaking rates.
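The weighting of experts described above can be illustrated with a minimal sketch. It assumes two experts (for example slow and fast speech), each modelled as a one-dimensional Gaussian, and it assumes that the experts are combined as a convex combination of their likelihoods with weights obtained from a softmax. The function names, the expert parameters and the scores are hypothetical and are not taken from the thesis.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combined_likelihood(x, experts, weights):
    """Weighted sum of the expert likelihoods for one observation x."""
    return sum(w * gaussian_pdf(x, mean, var)
               for w, (mean, var) in zip(weights, experts))


# Hypothetical experts given as (mean, variance), one per acoustic condition.
experts = [(0.0, 1.0), (2.0, 1.0)]
# Hypothetical scores saying how well each condition matches the signal.
weights = softmax([0.5, -0.5])
observation = 0.5
expert_likelihoods = [gaussian_pdf(observation, m, v) for m, v in experts]
likelihood = combined_likelihood(observation, experts, weights)
```

Because the weights sum to one, the combined likelihood always lies between the smallest and the largest expert likelihood, so a well-matched expert dominates without the others being discarded outright.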

The SBME-HMM is an extension of the standard HMM which uses an additional hidden variable $v$ whose states are meant to correspond to the different acoustic conditions in the speech signal. The decision whether an acoustic condition is present is implemented in the SBME-HMM via a so-called indicating feature, a continuous feature that contains information about the state of the hidden variable $v$. This information is expressed in the SBME-HMM by a posterior probability distribution over the states of $v$ given the indicating feature. The theoretical development of the SBME-HMM in this thesis concerns both the estimation of the model parameters with the EM algorithm and its application in speech recognition. Special attention is given to the estimation of the posterior probability distributions over the states of the hidden variable $v$, which are modelled by softmax functions with polynomial exponents. It is shown that training these functions with the EM algorithm leads to an optimisation problem which can be linked to the cross-entropy error function. Although there are no closed-form reestimation formulae for the softmax parameters, they can be estimated robustly with an iterative scheme such as the Newton-Raphson algorithm with line search and back-tracking. This is due to the convexity of the cross-entropy error surface, which is established by proving the positive definiteness of the Hessian of the cross-entropy error with respect to the softmax parameters. In addition, this thesis addresses the problem of initialising the SBME-HMM and develops two different methods, namely the median split initialisation (MSI) and the relabelled training initialisation (RTI), which can initialise indicating features whose output probability density functions (pdfs) are either Gamma densities or Gaussian mixtures. For recognition, three different types of model topology are proposed which either use the state of the hidden variable $v$ explicitly in the recognition process or sum over all the states of $v$ and thereby reduce the model topology to that of a standard HMM.
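The iterative estimation of the softmax parameters can be sketched in a simplified setting. The example below assumes a binary hidden variable, for which the softmax reduces to a logistic sigmoid, a scalar indicating feature and a quadratic polynomial exponent. The soft targets stand in for state posteriors from an E-step and are made-up numbers. The small ridge term on the Hessian is an added safeguard for the linear solve. None of the names or values come from the thesis.

```python
import math


def basis(f, degree):
    """Polynomial basis [1, f, f^2, ...] of a scalar indicating feature."""
    return [f ** k for k in range(degree + 1)]


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def cross_entropy(w, X, targets):
    """Cross-entropy between soft targets and sigmoid outputs."""
    total = 0.0
    for phi, g in zip(X, targets):
        a = sum(wi * pi for wi, pi in zip(w, phi))
        softplus = max(a, 0.0) + math.log1p(math.exp(-abs(a)))
        total += softplus - g * a
    return total


def gradient_and_hessian(w, X, targets, ridge=1e-8):
    """Gradient and (ridge-stabilised) Hessian of the cross-entropy."""
    n = len(w)
    grad = [0.0] * n
    hess = [[ridge if i == j else 0.0 for j in range(n)] for i in range(n)]
    for phi, g in zip(X, targets):
        p = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
        for i in range(n):
            grad[i] += (p - g) * phi[i]
            for j in range(n):
                hess[i][j] += p * (1.0 - p) * phi[i] * phi[j]
    return grad, hess


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit(feature_values, targets, degree=2, iterations=25, tol=1e-12):
    """Newton-Raphson with a back-tracking line search on the step length."""
    X = [basis(f, degree) for f in feature_values]
    w = [0.0] * (degree + 1)
    for _ in range(iterations):
        grad, hess = gradient_and_hessian(w, X, targets)
        direction = solve(hess, [-g for g in grad])
        slope = sum(g * d for g, d in zip(grad, direction))
        if -slope < tol:  # Newton decrement is tiny: converged
            break
        current = cross_entropy(w, X, targets)
        step = 1.0
        trial = [wi + di for wi, di in zip(w, direction)]
        # Halve the step until the error decreases sufficiently (Armijo rule).
        while (cross_entropy(trial, X, targets) > current + 1e-4 * step * slope
               and step > 1e-8):
            step *= 0.5
            trial = [wi + step * di for wi, di in zip(w, direction)]
        w = trial
    return w


# Hypothetical indicating-feature values and soft targets for P(v = 1 | feature).
feature_values = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
soft_targets = [0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95]
params = fit(feature_values, soft_targets)
design = [basis(f, 2) for f in feature_values]
initial_error = cross_entropy([0.0, 0.0, 0.0], design, soft_targets)
final_error = cross_entropy(params, design, soft_targets)
final_gradient, _ = gradient_and_hessian(params, design, soft_targets)
```

In this sketch the Hessian is a sum of non-negatively weighted outer products of the basis vectors, so every Newton direction is a descent direction and the back-tracking loop only has to guard against overshooting.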

The indicating feature in the SBME-HMM is the main source of information about the state of $v$. To ensure that the states of $v$ model the acoustic conditions of interest and that the indicating feature distinguishes well between them, it is necessary to find an indicating feature that is highly correlated with the acoustic conditions under consideration. Since this thesis applies the SBME-HMM to the recognition of spontaneous speech, a number of indicating features are derived whose usefulness is evaluated in classification experiments on fast and slow instances of phones. Two of these features, which show good classification performance, are later used in recognition experiments.

The performance of the SBME-HMMs for different combinations of initialisation, model topology and indicating features is evaluated on the 1997 Broadcast News task, which contains a substantial proportion of spontaneous speech. The SBME-HMMs in the recognition experiments are used to rescore cross-word triphone lattices, and hidden variables $v$ with both 2 and 3 states are explored. The recognition experiments with a binary hidden variable show small but consistent improvements over a standard HMM with a comparable number of parameters, giving in one particular experiment a statistically significant overall improvement of 2.8% relative. For a variable $v$ with 3 states, no additional improvements over the SBME-HMMs with a binary hidden variable can be observed. On the contrary, the high number of model parameters and the fast saturation of the training likelihoods with the EM algorithm lead to a small performance degradation.


