REDUCING WORD ERROR RATE OF FOUND SPEECH: XPERT TOOL FOR TRANSCRIPTION ANALYSIS
Automatic Speech Recognition (ASR) research is moving increasingly away from clean-speech dictation systems, such as single-speaker voice dictation, towards so-called ``found'' speech: natural speech that has already been recorded, for example from a television broadcast, for which an automatically generated transcription is required. Time constraints on such systems are less stringent, and multi-pass strategies can be used, but the recognition problem itself becomes much more difficult.
An illustration of a found-speech task is maintaining an archive of audio, for example of transmitted broadcast news. If an accurate transcription can be made of all the audio, the space required to store the information content is reduced, information-retrieval methods can provide efficient audio indexing, and the archive becomes an audio library in which people can scan documents and find the information they need without having to listen to the entire recording. The priority for a found-speech recogniser in this case is therefore to produce as low a word error rate as possible.
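For reference, word error rate is conventionally computed by counting all three error types against the length of the reference transcription. The small function below is an illustrative sketch of this standard formula only; it is not part of any tool described here.

```python
def word_error_rate(substitutions, deletions, insertions, n_reference_words):
    """Standard WER: substitutions, deletions and insertions are all
    counted against the number of words in the reference transcription."""
    errors = substitutions + deletions + insertions
    return errors / n_reference_words

# e.g. 3 substitutions, 1 deletion and 2 insertions over a 50-word reference
print(word_error_rate(3, 1, 2, 50))  # 0.12, i.e. 12% WER
```

Note that because insertions are included in the numerator but not the denominator, WER can exceed 100% on badly recognised material.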
Found speech raises many new problems which have not previously been tackled in single-speaker clean-speech dictation. Several different speakers will occur during the sound-track (indeed, in extreme cases more than one person may be speaking at once), and there is no artificial indication of when a speaker change occurs. Also, since no restriction is placed on who is speaking, there may be non-native speakers whose voice characteristics are very different from the recogniser model. Similarly, speaking styles may vary: speech is no longer in the controlled form people use when they know they are talking to a machine. Some of the speech may be prepared, producing grammatical sentences with a clear voice pattern, but some may be spontaneous. The latter shows greater variability in speaking rate, more frequent hesitations and false starts of words and sentences, and generally less grammaticality than is found in prepared speech.
Broadcast News transcription is also complicated by the presence of different audio conditions. A reporter in the field may be speaking over a telephone line, an announcer in a studio may be reading headlines over background music, and there may be background noise, varying channel properties, degraded acoustics, or any combination of these at any time during the sound-track.
Finally, the audio stream is continuous, so methods of automatically segmenting the speech into homogeneous segments, ideally at sentence boundaries, become necessary. Each segment should contain only one speaker and one acoustic condition, but segmentation is not always reliable, and so another source of error is introduced.
Since found speech is much more difficult to transcribe than clean speech, and the number of sources of error is greatly increased, there is a need for a specialised analysis tool to help identify systematic errors and, where possible, why they occur. Once identified, ways of tackling the causes of these errors can be devised.
To make it possible to identify errors, their characteristics and the correlations between them, a multi-media browsing tool, the X-Program for Evaluating Recogniser Transcriptions (xpert), has been designed. It allows the analyst to listen to the audio, view the waveform and read both the correct transcription and the recogniser output simultaneously. It highlights the errors which occur, and allows the user to zoom in at several levels to analyse them in more detail.
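The error highlighting described above rests on a word-level alignment between the reference transcription and the recogniser output, conventionally obtained by Levenshtein dynamic programming. The sketch below is a minimal illustration of such an alignment, labelling each position as correct, substituted, deleted or inserted; the function name and tie-breaking choices are our own, and a real scoring tool may differ in its details.

```python
def align(ref, hyp):
    """Align reference and hypothesis word lists by minimum edit distance,
    returning one label per aligned position:
    'COR' (correct), 'SUB' (substitution), 'DEL' (deletion), 'INS' (insertion)."""
    R, H = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        d[i][0] = i
    for j in range(1, H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitute
                          d[i - 1][j] + 1,         # delete ref word
                          d[i][j - 1] + 1)         # insert hyp word
    # Backtrace from the bottom-right corner to recover the labels.
    labels, i, j = [], R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            labels.append('COR' if ref[i - 1] == hyp[j - 1] else 'SUB')
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            labels.append('DEL')
            i -= 1
        else:
            labels.append('INS')
            j -= 1
    labels.reverse()
    return labels

ref = "the cat sat on the mat".split()
hyp = "the cat sat the mat quickly".split()
print(align(ref, hyp))
# ['COR', 'COR', 'COR', 'DEL', 'COR', 'COR', 'INS']
```

Counting the SUB, DEL and INS labels against the reference length gives the word error rate, and the per-word labels are exactly what an analysis tool needs in order to highlight errors in context.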