Abstract for niesler_thesis

PhD Thesis, Cambridge University

CATEGORY-BASED STATISTICAL LANGUAGE MODELS

Thomas Niesler

June 1997

Language models are computational techniques and structures that describe word sequences produced by human subjects, and the work presented here considers primarily their application to automatic speech-recognition systems. Due to the very complex nature of natural languages as well as the need for robust recognition, statistically based language models, which assign probabilities to word sequences, have proved most successful. This thesis focuses on the use of linguistically defined word categories as a means of improving the performance of statistical language models. In particular, an approach that aims to capture both general grammatical patterns and particular word dependencies using different model components is proposed, developed and evaluated. To account for grammatical patterns, a model employing variable-length n-grams of part-of-speech word categories is developed. The often local syntactic patterns in English text are captured conveniently by the n-gram structure, and the reduced sparseness of the data allows larger n to be employed. A technique that optimises the length of individual n-grams is proposed, and experimental tests show it to lead to improved results. The model allows words to belong to multiple categories in order to cater for different grammatical functions, and may be employed as a tagger to assign category classifications to new text. While the category-based model has the important advantage of generalisation to unseen word sequences, it is by nature not able to capture relationships between particular words. An experimental comparison with word-based n-gram approaches reveals this ability to be important to language model quality, and consequently two methods allowing the inclusion of word relations are developed. The first allows the incorporation of selected word n-grams within a backoff framework.
The number of word n-grams added may be controlled, and the resulting tradeoff between size and accuracy is shown to surpass that of standard techniques based on n-gram cutoffs. The second technique addresses longer-range word-pair relationships that arise due to factors such as the topic or the style of the text. Empirical evidence is presented of an approximately exponentially decaying behaviour when considering the probabilities of related words as a function of an appropriately defined separating distance. This definition, which is fundamental to the approach, is made in terms of the category assignments of the words. It minimises the effect syntax has on word co-occurrences while taking particular advantage of the grammatical word classifications implicit in the operation of the category model. Since only related words are treated, the model size may be constrained to reasonable levels. Methods by which related word pairs may be identified from a large corpus, as well as techniques allowing the estimation of the parameters of the functional dependence, are presented and shown to lead to performance improvements. The proposed combination of the three modelling approaches is shown to lead to considerable perplexity reductions, especially for sparse training sets. Incorporation of the models has led to a significant improvement in the word error rate of a high-performance baseline speech-recognition system.
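
As a rough illustration of the category-based idea summarised above, the following minimal sketch computes word probabilities from a category bigram in which a word may belong to more than one category. It is not the thesis's model: it uses a fixed-length bigram rather than variable-length n-grams, and the category names, the probability tables TRANS and EMIT, and the function names are all hypothetical values invented for this example.

```python
import math

# Hypothetical toy parameters, invented for illustration only.
# TRANS[prev][cat] = P(cat | prev); "<s>" marks the sentence start.
TRANS = {
    "<s>":  {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
    "DET":  {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
# EMIT[cat][word] = P(word | cat); "walks" belongs to two categories.
EMIT = {
    "DET":  {"the": 1.0},
    "NOUN": {"dog": 0.6, "walks": 0.4},
    "VERB": {"walks": 0.5, "barks": 0.5},
}


def word_probabilities(words):
    """Return P(w_i | w_1 .. w_{i-1}) for each word in the sequence."""
    # belief[c] = P(category of the previous word is c | words so far)
    belief = {"<s>": 1.0}
    probs = []
    for word in words:
        joint = {}
        for cat, emit in EMIT.items():
            # Predict the next category, then weight by the word's membership.
            p_cat = sum(b * TRANS[prev].get(cat, 0.0) for prev, b in belief.items())
            joint[cat] = p_cat * emit.get(word, 0.0)
        p_word = sum(joint.values())
        if p_word == 0.0:
            raise ValueError(f"no category can produce {word!r} in this context")
        probs.append(p_word)
        # Renormalise to obtain the category belief for the next step.
        belief = {c: p / p_word for c, p in joint.items() if p > 0.0}
    return probs


def perplexity(words):
    """Geometric-mean inverse probability of the word sequence."""
    probs = word_probabilities(words)
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


if __name__ == "__main__":
    print(word_probabilities(["the", "dog", "walks"]))
    print(perplexity(["the", "dog", "walks"]))
```

Because probabilities are attached to categories rather than to word pairs, a word sequence absent from the training text can still receive a non-zero probability, which is the generalisation property the abstract refers to. The per-word category belief computed in the loop is also the quantity a tagger would inspect to choose the most likely category for each word.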

(ftp:) niesler_thesis.ps.gz (http:) niesler_thesis.ps.gz
PDF (automatically generated from original PostScript document - may be badly aliased on screen):
(ftp:) niesler_thesis.pdf | (http:) niesler_thesis.pdf
