Abstract for tranter_icassp04

Proc. ICASSP 2004, Volume I, pp. 753-756


S. E. Tranter, K. Yu, G. Evermann, P. C. Woodland

May 2004

Speech recognition systems for conversational telephone speech require the audio data to be automatically divided into regions of speech and non-speech. The quality of this audio segmentation affects the recognition accuracy. This paper describes several approaches to segmentation and compares the resulting recogniser performance. It is shown that using Gaussian Mixture Models outperforms an energy-detection method, and that using the output from the speech recogniser itself increases performance further. An upper bound on possible performance was obtained by deriving a segmentation from a forced alignment of the reference words, and this outperformed using manually marked word times. Finally, the correlation between an appropriately defined segmentation score and word error rate (WER) is shown to be over 0.95 across three data sets, suggesting that segmentations can be evaluated directly without the need for full decoding runs.
