Abstract for nock_thesis

PhD Thesis, Cambridge University


H.J. Nock

May 2001

Systems which automatically transcribe carefully dictated speech are now commercially available, but their performance degrades dramatically when the speaking style of users becomes more relaxed or conversational. This dissertation focuses on techniques that aim to improve the robustness of statistical speech transcription systems to conversational speaking styles.

The dissertation shows first that the performance degradation occurring as speech becomes more conversational is severe and is partially attributable to differences in the acoustic realizations of sentences. Hypothesizing that the quantifiably wider range of pronunciation in conversational speech contributes to these differences, the dissertation then focuses on techniques for modelling the phonological processes underlying pronunciation change. Such techniques may be classified as explicit (operating at or close to the level of the word pronunciation dictionary) or implicit (operating at or close to the subword statistical models of the acoustic signal), and both types are considered.

An existing explicit technique, motivated by linear phonology and originally evaluated on a dictated speech task, has recently been extended for conversational speech tasks. Rather than model pronunciations using phonemic units (which are by definition abstract units with highly variable acoustic realizations), a statistical mapping is constructed from the abstract phonemic units to their context-dependent realizations as surface phonetic units (which are by definition less abstract and less variable in acoustic realizations). If the map from phonemic units to phonetic realizations is sufficiently accurate, the task of modelling the acoustic realizations of words should be simplified. Small but statistically significant performance improvements can be obtained on the SWITCHBOARD transcription task. However, further experiments by the author and by other researchers suggest that schemes modelling pronunciation change in terms of speech segments have only limited potential.
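
As a rough illustration of what a statistical map from phonemic units to context-dependent surface phones might look like, the sketch below estimates P(surface phone | left phoneme, phoneme, right phoneme) by relative frequency. This is only a minimal stand-in: the keywords indicate that the thesis uses a decision tree pronunciation model, which is not reproduced here, and the function name, the phone symbols and the toy alignment are all invented for illustration.

```python
from collections import Counter, defaultdict


def train_surface_model(aligned):
    """Estimate P(surface | left, phoneme, right) by relative frequency.

    `aligned` is an iterable of (left, phoneme, right, surface) tuples,
    i.e. a hypothetical phoneme-to-phone alignment of a training corpus.
    """
    counts = defaultdict(Counter)
    for left, phoneme, right, surface in aligned:
        counts[(left, phoneme, right)][surface] += 1

    model = {}
    for context, surface_counts in counts.items():
        total = sum(surface_counts.values())
        model[context] = {s: n / total for s, n in surface_counts.items()}
    return model


# Toy alignment, invented for illustration only (not data from the thesis).
toy_alignment = [
    ("ah", "t", "er", "dx"),  # /t/ realized as a flap between vowels
    ("ah", "t", "er", "dx"),
    ("ah", "t", "er", "t"),
    ("s", "t", "aa", "t"),    # /t/ realized canonically after /s/
]

model = train_surface_model(toy_alignment)
```

A table like this cannot generalize to unseen contexts, which is one reason a tree that clusters contexts is the more usual choice in practice.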

This analysis suggests that a more implicit approach, capable of describing variable degrees of pronunciation change at levels below the segment, may be more appropriate. This motivates investigation into a family of statistical models that could form the basis of such an approach: Loosely-coupled or Factorial Hidden Markov Models (FHMMs). The theory of FHMMs is described and it is then shown that they generalize several standard speech models. Two specific FHMMs are investigated. Analysis of an existing FHMM in the literature, the Mixed-Memory Assumption FHMM, finds that it has potential weaknesses for speech modelling. This leads us to propose a new FHMM, the Parameter-Tied FHMM, which makes fewer a priori assumptions about the data to be modelled. Estimation and decoding of FHMMs are potentially computationally expensive, so approximate algorithms are also developed. Empirical studies using the ISOLET speech classification task (1) show that FHMMs scale to speech modelling tasks, (2) show that the Parameter-Tied FHMM achieves performance comparable to the Mixed-Memory Assumption FHMM for speech modelling, and (3) identify an approximate algorithm for decoding and estimation that is adequate for more extensive experimentation. A short study using the TI DIGITS task shows that FHMMs can be scaled to continuous speech recognition whilst continuing to achieve classification performance competitive with more conventional models.
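
To give a feel for why exact FHMM computation can be expensive, the sketch below takes the simplest textbook factorial HMM, in which each chain has its own independent transition matrix and the observation depends on all chains jointly, and flattens it into an ordinary HMM over the Cartesian product of the chain states. It does not implement the Mixed-Memory Assumption or Parameter-Tied FHMMs, whose chains may be coupled; all names and the toy parameters are assumptions made for illustration.

```python
from itertools import product
from math import prod


def forward_likelihood(init, trans, emit, obs):
    """Standard HMM forward algorithm; returns P(obs) for discrete symbols."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)


def flatten_fhmm(inits, transs, emit_fn, n_symbols):
    """Build the single-chain HMM equivalent to an uncoupled factorial HMM.

    inits[k] and transs[k] are the initial distribution and transition
    matrix of chain k; emit_fn(joint_state, symbol) gives the probability
    of a symbol given the tuple of all chain states (hypothetical interface).
    """
    chains = range(len(inits))
    joint = list(product(*[range(len(i)) for i in inits]))
    init = [prod(inits[k][s[k]] for k in chains) for s in joint]
    trans = [
        [prod(transs[k][r[k]][s[k]] for k in chains) for s in joint]
        for r in joint
    ]
    emit = [[emit_fn(s, o) for o in range(n_symbols)] for s in joint]
    return joint, init, trans, emit


# Toy model: two binary chains; the symbol is a noisy XOR of the two states.
inits = [[0.6, 0.4], [0.5, 0.5]]
transs = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.4, 0.6]]]


def noisy_xor(state, symbol):
    return 0.9 if symbol == (state[0] ^ state[1]) else 0.1


joint, init, trans, emit = flatten_fhmm(inits, transs, noisy_xor, n_symbols=2)
likelihood = forward_likelihood(init, trans, emit, [0, 1, 1, 0])
```

With K chains of N states each, the flattened model has N^K joint states and an N^K by N^K transition matrix, so the exact forward pass grows exponentially in the number of chains. That growth is the kind of cost that motivates approximate estimation and decoding.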

The thesis ends with a summary and possible directions for future research.

Keywords: speech recognition, pronunciation variability, pronunciation modelling, decision tree pronunciation model, Hidden Markov Model, Factorial Hidden Markov Model, multiple loosely-coupled time series, variational approximation, chainwise Viterbi algorithm, ISOLET, TI DIGITS, SWITCHBOARD, MULTIREG.

(ftp:) nock_thesis.pdf (http:) nock_thesis.pdf