ARTICULATORY METHODS FOR SPEECH PRODUCTION AND RECOGNITION
C.S. Blackburn
February 1997
The past 15 years have seen dramatic improvements in the performance of computer algorithms which attempt to recognise human speech. The falling error rates achieved by the best speech recognition systems on limited tasks have recently enabled the development of a diverse range of applications which promise to have a significant impact on many aspects of society. Examples of these range from dictation systems for personal computers to automated over-the-telephone enquiry services and interactive voice-controlled computing and mobility aids for disabled users.
Engineering research into the recognition of acoustic signals has focused on the development of efficient, trainable models which are adapted to specific recognition tasks. While the acoustic signal parameterisations employed are usually chosen to crudely model the behaviour of the human auditory system, little or no use is typically made of knowledge regarding the mechanisms of speech production.
Physical and inertial constraints on the movement of articulators in the vocal tract cause variations in the acoustic realisations of sounds according to their phonetic contexts. The difficulty of accurately modelling these contextual variations in the frequency domain represents a fundamental limitation on the performance of existing recognition systems.
This dissertation describes the design and implementation of a self-organising articulatory speech production model which attempts to incorporate production-based knowledge into the recognition framework. An explicit time-domain articulatory model of the mechanisms of co-articulation is intended to give a more accurate model of contextual effects in the acoustic signal, while using fewer parameters than traditional acoustically-driven approaches.
Separate articulatory and acoustic models are provided, and in each case the parameters of the models are automatically optimised over a training data set. A predictive statistically-based model of co-articulation is described, and is found to yield improved articulatory modelling accuracy when evaluated against X-ray articulatory traces. Parameterised acoustic vectors are synthesised by a set of artificial neural networks, and the resulting acoustic representations are used to re-score $N$-best recognition hypothesis lists produced by an HMM-based recogniser. The system is evaluated on two test databases, one including speaker-specific X-ray training data and the other acoustic data alone, and improvements in word recognition accuracy are obtained in each case.
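The general idea of re-scoring an $N$-best list with synthesised acoustic vectors can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names `rescore_nbest` and `synthesise`, the mean-squared-error match score, the `weight` parameter and the toy data are all assumptions made for illustration, and the synthesised frames are assumed to be already time-aligned with the observed frames.

```python
import numpy as np


def rescore_nbest(hypotheses, observed, synthesise, weight=1.0):
    """Re-rank N-best hypotheses using a synthesis-match score.

    hypotheses: list of (words, recogniser_log_score) pairs.
    observed:   array of shape (frames, dims) of parameterised acoustic vectors.
    synthesise: hypothetical callable mapping words to an array with the same
                shape as `observed` (alignment is assumed, not computed here).
    weight:     assumed scale factor applied to the match score.
    """
    rescored = []
    for words, recogniser_score in hypotheses:
        synthetic = synthesise(words)
        # Assumed match score: negative mean squared error between the
        # synthesised and observed vectors (higher is better).
        match = -float(np.mean((synthetic - observed) ** 2))
        rescored.append((words, recogniser_score + weight * match))
    # Best combined score first.
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored


# Toy data: two one-word hypotheses and a stand-in "synthesiser".
observed = np.array([[1.0, 0.0], [0.0, 1.0]])
templates = {
    "a": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "b": np.zeros((2, 2)),
}
hypotheses = [(("b",), -1.0), (("a",), -1.2)]
ranked = rescore_nbest(hypotheses, observed, lambda words: templates[words[0]])
```

In this toy case the recogniser alone prefers "b", but the hypothesis whose synthesised vectors match the observation more closely is promoted to the top of the list after re-scoring.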