[Univ of Cambridge] [Dept of Engineering]

Proceedings of the ESCA ETRW workshop

Accessing information in spoken audio


Speech is more than just a way of inputting commands and text into a computer. It is an information-rich medium in itself, capable of holding expressions of thought, ideas, feelings and emphasis which would be lost in even the most accurate transcription. It is now easy to store hours or even years of speech; but how can we access the stored information easily and quickly?

This ESCA Tutorial and Research Workshop was held on the 19th and 20th April 1999 in Cambridge, UK.

The workshop was organised by Tony Robinson and Steve Renals in conjunction with the Cambridge Programme for Industry.

All the proceedings are available from the links below. Additionally the complete proceedings are available in PostScript format as esca99cam.tar.gz or esca99cam.zip and references to all papers are available in BibTeX format.

Following on from the workshop there will be a special issue of Speech Communication on this topic. The Call For Papers has a deadline of Friday 23 July 1999.

Title page, contents page and author index
PostScript PDF

Spoken Document Retrieval: 1998 Evaluation and Investigation of New Metrics
John S. Garofolo, Ellen M. Voorhees, Cedric G. P. Auzanne and Vincent M. Stanford
PostScript PDF

This paper describes the 1998 TREC-7 Spoken Document Retrieval (SDR) Track which implemented an evaluation of retrieval of broadcast news excerpts using a combination of automatic speech recognition and information retrieval technologies. The motivations behind the SDR Track and background regarding its development and implementation are discussed. The SDR evaluation collection and topics are described and summaries and analyses of the results of the track are presented. Alternative metrics for automatic speech recognition as applicable to retrieval applications are also explored. Finally, plans for future SDR tracks are described.

General Query Expansion Techniques for Spoken Document Retrieval
Pierre Jourlin, Sue E. Johnson, Karen Sparck Jones and Philip C. Woodland
PostScript PDF

This paper presents some developments in query expansion and document representation of our Spoken Document Retrieval (SDR) system since the 1998 Text REtrieval Conference (TREC-7).

We have shown that a modification of the document representation combining several techniques for query expansion can improve Average Precision by 17% relative to a system similar to that which we presented at TREC-7. These new experiments have also confirmed that the degradation of Average Precision due to a Word Error Rate (WER) of 25% is relatively small (around 2% relative). We hope to repeat these experiments when larger document collections become available to evaluate the scalability of these techniques.

The THISL Broadcast News Retrieval System
Dave Abberley, David Kirby, Steve Renals and Tony Robinson
PostScript PDF

This paper describes the THISL spoken document retrieval system for British and North American Broadcast News. The system is based on the Abbot large vocabulary speech recognizer, using a recurrent network acoustic model, and a probabilistic text retrieval system. We discuss the development of a realtime British English Broadcast News system, and its integration into a spoken document retrieval system. Detailed evaluation is performed using a similar North American Broadcast News system, to take advantage of the TREC SDR evaluation methodology. We report results on this evaluation, with particular reference to the effect of query expansion and of automatic segmentation algorithms.

HALPIN: A multimodal and conversational system for information seeking on the World Wide Web
Jose Rouillard and Jean Caelen
PostScript PDF

Giving computers the ability to talk and to understand natural language conversation is a major field of research. We have developed the HALPIN (Hyperdialogue avec un Agent en Langage Proche de l’Interaction Naturelle) system to implement our multimodal conversational model for information retrieval. This dialogue-oriented interface allows access to the INRIA database (Institut National de Recherche en Informatique et Automatique, 83297 documents available) on the internet in a natural language (NL) way, and gives its oral responses via usual browsers. The results of the first experiments show that the HALPIN system provides some interesting dialogues (in particular with beginners), according to the user’s goals and skills, that lead to information retrieval success, while searches with the original user interface (traditional web form) failed.

Latent Semantic Indexing by Self-Organizing Map
Mikko Kurimo and Chafic Mokbel
PostScript PDF

An important problem for the information retrieval from spoken documents is how to extract those relevant documents which are poorly decoded by the speech recognizer. In this paper we propose a stochastic index for the documents based on the Latent Semantic Analysis (LSA) of the decoded document contents. The original LSA approach uses Singular Value Decomposition to reduce the dimensionality of the documents. As an alternative, we propose a computationally more feasible solution using Random Mapping (RM) and Self-Organizing Maps (SOM). The motivation for clustering the documents by SOM is to reduce the effect of recognition errors and to extract new characteristic index terms. Experimental indexing results are presented using relevance judgments for the retrieval results of test queries and using a document perplexity defined in this paper to measure the power of the index models.

For more information see http://www.idiap.ch/~kurimo/thisl.html
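The Random Mapping step named in the abstract above can be illustrated with a small sketch. It assumes each document is already a dense term-count vector and projects it to a lower dimension with a fixed random Gaussian matrix; the function name `random_mapping`, the Gaussian entries and the scaling are assumptions made for this illustration, not details taken from the paper.

```python
import random
from typing import List


def random_mapping(doc_vectors: List[List[float]], out_dim: int, seed: int = 0) -> List[List[float]]:
    """Project document vectors to out_dim dimensions with a random matrix.

    Assumes doc_vectors is non-empty and all vectors have the same length.
    """
    rng = random.Random(seed)
    in_dim = len(doc_vectors[0])
    # One random row per output dimension; entries are Gaussian, scaled by
    # 1/sqrt(out_dim) so that vector lengths are roughly preserved on average.
    proj = [
        [rng.gauss(0.0, 1.0) / out_dim ** 0.5 for _ in range(in_dim)]
        for _ in range(out_dim)
    ]
    # Each output coordinate is the dot product of a projection row and the document.
    return [
        [sum(w * x for w, x in zip(row, vec)) for row in proj]
        for vec in doc_vectors
    ]


# Hypothetical 4-term vocabulary, three toy documents.
docs = [[1.0, 0.0, 2.0, 0.0], [1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0]]
reduced = random_mapping(docs, out_dim=2, seed=1)
```

Because the projection is linear and fixed by the seed, identical documents map to identical reduced vectors, which is what a later clustering stage such as a SOM would rely on.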

A Hybrid Approach to Spoken Query Processing in Document Retrieval System
Nathalie Colineau and Ariane Halber
PostScript PDF

In the context of the THISL spoken document retrieval system, we present a hybrid approach to spoken query processing, which makes it possible to increase recognition rates and to extract relevant information for the application. The query processing is distributed between grammar and language model, based on the assumption that a query can be decomposed into two relatively independent parts: the addressing form, which is parsed with a grammar, and the queried content, which is scored with a language model. [...] retrieve the content sequence, which allows us to consult the database, but also to keep information about the query formulations in order to develop an interaction between the user and the retrieval engine. This leads us to work closely with the speech recogniser and to carry out the recognition and the query analysis together.

Language processing for spoken dialogue systems: Is shallow parsing enough?
Ian Lewin, Ralph Becket, Johan Boye, David Carter, Manny Rayner and Mats Wiren
PostScript PDF

With maturing speech technology, spoken dialogue systems are increasingly moving from research prototypes to fielded systems. The fielded systems however generally employ much simpler linguistic and dialogue processing strategies than the research prototypes. We describe an implemented spoken-language dialogue system for a travel planning domain which supports a mixed initiative dialogue strategy. The system accesses a commercially available travel information webserver. The system architecture combines both shallow and deep linguistic processors, partly so that a robust if shallow analysis is always available to the dialogue manager, and partly so that we can begin to examine where significant gains can be made by employing more advanced linguistic processing. We present the results of a preliminary investigation using data from a Wizard of Oz experiment. The results lend limited support to our original hypothesis that deep linguistic processing will prove useful at points where the user takes the initiative in driving the dialogue forward.

Statistical Annotation of Named Entities in Spoken Audio
Yoshihiko Gotoh and Steve Renals
PostScript PDF

In this paper we describe a stochastic finite state model for named entity (NE) identification, based on explicit word-level n-gram relations. NE categories are incorporated in the model as word attributes. We present an overview of the approach, describing how the extensible vocabulary model may be used for NE identification. We report development and evaluation results on a North American Broadcast News task. This approach resulted in average precision and recall scores of around 83% on hand transcribed data, and 73% on the SPRACH recogniser output. We also present an error analysis and a comparison of our approach with an alternative statistical approach.

Task Dependent Loss Functions in Speech Recognition: Application to Named Entity Extraction
Vaibhava Goel and William Byrne
PostScript PDF

We present a risk-based decoding strategy for the task of Named Entity identification from speech. This approach does not select the most likely utterance produced by an ASR system, which would be the maximum a-posteriori (MAP) strategy, but instead chooses an utterance from an N-best list in an attempt to minimize the Bayes Risk under loss functions derived specifically for the Named Entity task. We describe our experimentation with three risk-based decoders corresponding to the following three performance evaluation criteria: the F-measure, the slot error rate, and the fraction of correctly identified reference slots. An unsupervised optimization is also applied to these decoders. The MAP decoder is used as the baseline for comparison. Our preliminary experiments with these task dependent decoders, using N-best lists of depth 200, show small but encouraging improvements in performance with respect to both manually tagged and machine tagged reference.
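The difference between MAP selection and risk-based selection from an N-best list, as described in the abstract above, can be sketched as follows. This is a minimal illustration: the function names and the toy position-mismatch loss are assumptions made here, standing in for the task-specific losses (F-measure, slot error rate, correctly identified slots) that the paper derives.

```python
from typing import Callable, Sequence, Tuple

Hypothesis = Tuple[str, ...]


def mbr_select(
    nbest: Sequence[Tuple[Hypothesis, float]],
    loss: Callable[[Hypothesis, Hypothesis], float],
) -> Hypothesis:
    """Return the N-best entry with the lowest expected loss (Bayes risk)."""
    total = sum(p for _, p in nbest)  # renormalise scores over the list
    best_hyp, best_risk = nbest[0][0], float("inf")
    for cand, _ in nbest:
        # Expected loss of choosing `cand`, treating every list entry in turn
        # as the possible truth, weighted by its normalised posterior.
        risk = sum((p / total) * loss(ref, cand) for ref, p in nbest)
        if risk < best_risk:
            best_hyp, best_risk = cand, risk
    return best_hyp


def mismatch_loss(ref: Hypothesis, hyp: Hypothesis) -> float:
    """Toy loss (an assumption): differing positions plus length difference."""
    return sum(r != h for r, h in zip(ref, hyp)) + abs(len(ref) - len(hyp))


# Hypothetical 3-best list with posterior-like scores.
nbest = [(("a", "b"), 0.4), (("a", "c"), 0.3), (("d", "c"), 0.3)]
map_choice = max(nbest, key=lambda item: item[1])[0]
mbr_choice = mbr_select(nbest, mismatch_loss)
```

In this toy list the MAP choice is the single most probable entry, `("a", "b")`, while the risk-based choice is `("a", "c")`, because it is closer on average to the other plausible hypotheses.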

Evaluating Content Extraction from Audio Source
Lynette Hirschman, John Burger, David Palmer and Patricia Robinson
PostScript PDF

This paper discusses evaluation of content extraction from audio sources. The most straightforward approach is to adapt existing methods for written sources to handle audio input. A transcription then becomes the representation of the audio source in written form; it must capture the word stream, but also other information that aids in decoding the overall structure and content of the audio source, e.g., music, speaker changes, and speech repairs. The transcription must also support content annotation superimposed on the underlying speech transcription. When automated speech recognition is used to generate the transcription, there is the additional problem of evaluating content extraction from a noisy transcription. In addition, audio sources differ from their written counterparts in genre and therefore in structure, vocabulary, and even in how names are used. If the audio includes spontaneous conversational speech, as opposed to planned speech, these differences become still more pronounced. We discuss how these differences affect the adaptation of text-based extraction evaluation to audio input. In addition, we describe two new content extraction evaluations that have been designed for use with both audio and written materials.

Phoneme-level indexing for fast and vocabulary-independent voice/voice retrieval
Alexandre Ferrieux and Stephane Peillon
PostScript PDF

This paper reports explorations on a novel approach for speech information retrieval with spoken queries. The method uses a two-layer decoding scheme, where the intermediary representation of speech is based on phonemes, which makes the system vocabulary-independent. Moreover, the use of synchronized lattices at this intermediary level is shown to improve the discriminative performance while decreasing the size of the parameter space, and with a very reasonable additional computational cost.

Phonetic Transcriptions on Phrase-Level
Roeland J. F. Ordelman, Arjan J. van Hessen and David A. van Leeuwen
PostScript PDF

Whereas within-word co-articulation effects are nowadays usually dealt with sufficiently in automatic speech recognition, this is not always the case with phrase level co-articulation effects (PLC). This paper describes a first approach to dealing with phrase level co-articulation by applying these rules to the reference transcripts used for training our recogniser and by adding a set of temporary PLC phones that are later mapped onto the original phones. In effect, we temporarily break down acoustic context into a general and a PLC context. With this method, more robust models can be trained because phones that are confused due to PLC effects, for example /v/-/f/ and /z/-/s/, receive their own models. A first attempt to apply this method is described.

Recognition-Compatible Speech Compression for Stored Speech
Roger Tucker, Tony Robinson, James Christie and Carl Seymour
PostScript PDF

Two important components of a speech archiving system are the compression scheme and the search facility. We investigate two ways of providing these components. The first is to run the recogniser directly from the compressed speech - we show how even with a 2.4kbit/sec codec it is possible to produce good recognition results; but the search is slow. The second is to preprocess the speech and store the extra data in a compressed form along with the speech. In the case of an RNN-HMM hybrid system, the posterior probabilities provide a suitable intermediate data format. Vector quantizing these at just 625 bits/sec enables the search to run many times real-time and still maintain good recognition accuracy.

Prediction of keyword spotting performance based on phonemic contents
David A. van Leeuwen, Wessel Kraaij and Rudie Ekkelenkamp
PostScript PDF

In word spotting, one of the main difficulties is the false alarms, especially for small words. A model is presented for predicting the false alarm rate on the basis of the phonemic content of a word. This model is tested for a word spotter that has been used in the TREC Spoken Document Retrieval (SDR) track. Finally, results are presented for the retrieval task.

Speaker-based segmentation for audio data indexing
Perrine Delacourt, David Kryze and Christian J. Wellekens
PostScript PDF

In this paper, we address the problem of speaker-based segmentation, which is the first necessary step for several indexing tasks. It consists in recognizing from their voices the sequence of people engaged in a conversation. In our context, we make no assumptions about prior knowledge of the speaker characteristics (no speaker model, no speech model, no training phase). However, we assume that people do not speak simultaneously. Our segmentation technique takes advantage of two different types of segmentation algorithms. It is organized in two passes: first, the most likely speaker change points are detected, and then they are validated or discarded. Our algorithm is efficient at detecting speaker change points even when they are close to one another and is thus suited to segmenting conversations containing segments of any length.

Speaker Tracking in Broadcast Audio Material in the Framework of the THISL Project
Laurent Couvreur and Jean-Marc Boite
PostScript PDF

In this paper, we present a first approach to building an automatic system for broadcast news speaker-based segmentation. Based on a "Chop-and-Recluster" method, this system is developed in the framework of the THISL project. A metric-based segmentation is used for the "Chop" procedure and different distances have been investigated. The "Recluster" procedure relies on a "bottom-up" clustering of segments obtained beforehand and represented by non-parametric models. Various hierarchical clustering schemes have been tested. Some experiments on BBC broadcast news recordings show that the system can detect real speaker changes with high accuracy (mean error = 0.7s) and fair false alarm rate (mean false alarm rate = 5.5%). The "Recluster" procedure can produce homogeneous clusters but it is not yet robust enough to tackle overly complex classification tasks.

Text Segmentation and Event Tracking on Broadcast News Via a Hidden Markov Model Approach
P. van Mulbregt, I. Carp, L. Gillick, S. Lowe and J. Yamron
PostScript PDF

Continuing progress in the automatic transcription of broadcast speech via speech recognition has raised the possibility of applying information retrieval techniques to the resulting (errorful) text. In this paper we describe a general methodology based on Hidden Markov Models and classical language modeling techniques for automatically inferring story boundaries (segmentation) and for retrieving stories relating to a specific topic (tracking). We will present in detail the features and performance of the Segmentation and Tracking systems submitted by Dragon Systems for the 1998 Topic Detection and Tracking evaluation.

Optimal Parameters for Segmenting a Stream of Audio into Speech Documents
Gerard Quinn and Alan Smeaton
PostScript PDF

Indexing and retrieval of spoken documents is a desirable feature in a digital library as there is a wealth of contemporary information available to us uniquely in this medium. In this paper we describe experiments carried out on the TREC 6 SDR collection to determine the optimal parameters for our speech IR system. In the TREC task the data corresponded to complete news broadcasts and the boundaries between news stories were marked up manually. In an operational news speech retrieval system such as our own, news story boundaries are not always part of the speech data, making it difficult to automatically detect shifts in stories being broadcast. We describe our approach of splitting the stream of audio into speech documents of fixed length and analyse the results from each method culminating in an optimal solution.

Using Location Information from Speech Recognition of Television News Broadcasts
Alexander G. Hauptmann and Andreas M. Olligschlaeger
PostScript PDF

The Informedia Digital Video Library system extracts information from digitized video sources and allows full content search and retrieval over all extracted data. This extracted 'metadata' enables users to rapidly find interesting news stories and to quickly identify whether a retrieved TV news story is indeed relevant to their query. Through the extraction of named entity information from broadcast news we can determine what people, organizations, dates, times and monetary amounts are mentioned in the broadcast. With respect to location data, we have been able to use location analysis derived from the speech transcripts to allow the user to visually follow the action in the news story on a map and also allow queries for news stories by graphically selecting a region on the map.

Audio Meeting History Tool: Interactive Graphical User-Support for Virtual Audio Meetings
David M. Roy and Saturnino Luz
PostScript PDF

Interactive graphical user support within an internet based virtual meeting place is provided by an Audio Meeting History Tool. The tool addresses the representation, storage, navigation and processing of meeting memory where speech is assumed to be the primary modality of interpersonal communication. Communicative turns are integrated with non-acoustic data to form the meeting history through a GUI component based on a "musical score" metaphor.

Summarisation of Spoken Audio Through Information Extraction
Robin Valenza, Tony Robinson, Marianne Hickey and Roger Tucker
PostScript PDF

Automatic summarisation of spoken audio is a fairly new research pursuit, in large part due to the relative novelty of technology for accurately decoding audio into text. Techniques that account for the peculiarities and potential ambiguities of decoded audio (high error rates, lack of syntactic boundaries) appear promising for culling summary information from audio for content-based browsing and skimming. This paper combines acoustic confidence measures with simple information retrieval and extraction techniques in order to obtain accurate, readable summaries of broadcast news programs. It also demonstrates how extracted summaries, full-text speech recogniser output and audio files can be usefully linked together through an audio-visual interface. The results suggest that information extraction based on statistical information can produce viable summaries of decoded audio.

Finding Information in Audio: A New Paradigm for Audio Browsing/Retrieval
Julia Hirschberg, Steve Whittaker, Don Hindle, Fernando Pereira and Amit Singhal
PostScript PDF

Information retrieval from audio data is sharply different from information retrieval from text, not simply because speech recognition errors affect retrieval effectiveness, but more fundamentally because of the linear nature of speech, and of the differences in human capabilities for processing speech versus text. We describe SCAN, a prototype speech retrieval and browsing system that addresses these challenges of speech retrieval in an integrated way. On the retrieval side, we use novel document expansion techniques to improve retrieval from automatic transcription to a level competitive with retrieval from human transcription. Given these retrieval results, our graphical user interface, based on the novel WYSIAWYH ("What you see is almost what you hear") paradigm, infers text formatting such as paragraph boundaries and highlighted words from acoustic information and information retrieval term scores to help users navigate the errorful automatic transcription. This interface supports information extraction and relevance ranking demonstrably better than simple speech-alone interfaces, according to results of empirical studies.

Last updated on 22 April 1999
Tony Robinson ajr@eng.cam.ac.uk