AUTOMATIC TRANSCRIPTION OF CONVERSATIONAL TELEPHONE SPEECH - DEVELOPMENT OF THE CU-HTK 2002 SYSTEM
Thomas Hain, Phil Woodland, Gunnar Evermann, Mark Gales, Andrew Liu, Gareth Moore, Dan Povey & Lan Wang
December 2003
This report describes the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. The most important techniques in front-end processing, acoustic modelling and model training, and language and pronunciation modelling are discussed in detail. These include conversation-side-based cepstral normalisation, vocal tract length normalisation, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion-network-based decoding and confidence score estimation, pronunciation selection, language model interpolation and class-based language models. The transcription system developed for the 2002 NIST Rich Transcription evaluation of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. The derivation of faster systems with moderate performance degradation is discussed in the context of the 2002 CU-HTK 10xRT conversational speech transcription system.