Video Mail Retrieval Using Voice

Project Staff:

Prof. Steve Young ¹ (sjy@eng.cam.ac.uk)
Dr. Karen Spärck Jones ² (ksj@cl.cam.ac.uk)
Dr. Jonathan Foote ¹ (jtfoote@bigfoot.com)
Dr. Gareth Jones ¹ ² (gareth@dcs.ex.ac.uk)
Dr. Martin Brown ³ (mgb@orl.co.uk)

¹ Cambridge University Engineering Department
² Cambridge University Computer Laboratory
³ Olivetti Research Laboratory

Introduction

Interest in video-based communications and multimedia is growing rapidly. Cambridge University, in collaboration with Olivetti Research Laboratory (ORL), is developing the Medusa networked multimedia system, which is now in regular use on a high-speed ATM network covering ORL, the Computer Laboratory and the Engineering Department's SVR Group. One of the most popular Medusa services is video mail, and users are now amassing large archives of stored video messages. However, unlike regular electronic mail, which is easily searched using conventional text retrieval methods, video mail contains only images and sound. The goal of this project is to develop retrieval methods based on spotting keywords in the audio soundtrack. The project has successfully integrated speech recognition methods and information retrieval technology to yield a practical audio and video retrieval system.

This project was supported by the UK DTI Grant IED4/1/5804 and SERC Grant GR/H87629.

Project Objectives

To develop robust unrestricted keyword spotting algorithms for use in audio and video document retrieval.
To adapt existing text-based information retrieval techniques to work effectively on voice and video data types.
To develop and demonstrate a practical system providing video document retrieval using voice.

Progress

The project is organised in three stages, each lasting one year and culminating in a prototype demonstration system. The first stage prototype was completed in September 1994 and successfully demonstrated message retrieval from known speakers using a set of 35 predefined keywords. The second stage, completed in September 1995, extended this to allow unknown speakers. In July 1996 the final stage demonstrated open-keyword video document retrieval from arbitrary speakers, as well as a video mail browser allowing random access to video documents.

Document Retrieval

Finding interesting material in a large collection of documents is often time consuming and inefficient. In automated text retrieval systems, statistical techniques are applied to a search query, seeking relevant material present in the document archive. The retrieval system outputs a list of potentially relevant documents for the user to inspect, ranked by a score reflecting the match between the query and each individual document. Effective retrieval systems for electronic text archives have been developed using this statistical approach, and similar methods are now being used for Web search engines. But while text documents may be readily indexed by their contents, determining the information content of audio documents is considerably more difficult. In the absence of manually generated transcriptions, spoken documents can be retrieved only if their contents can be indexed using an automatic speech recognition (ASR) system. Speech recognition is a non-trivial process and presents several challenges in the retrieval domain.

Speech Recognition

Ideally, a speech recogniser would generate an exact transcription of the document contents, regardless of speaking style, vocabulary, or the acoustic environment. However, despite recent advances in ASR technology, this ideal system is not yet practical. State-of-the-art ASR systems can recognise vocabularies of many thousands of words, but out-of-vocabulary (OOV) words, such as many proper nouns, cannot be recognised. This is a particular problem in retrieval applications, where users frequently wish to search using OOV words including names of people, products, places or jargon. The VMR system overcomes this problem using a novel approach. The ASR system generates a generalised sub-word or phone lattice. Because spoken words can be decomposed into a sequence of phone units (of which there are about 45 in British English), any query word can be expressed as a phone string, whether or not it is in the recogniser's vocabulary. Because speech recognition is computationally expensive, the recognition phase is performed in advance of retrieval. To search for a query word during retrieval, the pre-computed lattices may be rapidly scanned (many times faster than real-time) for phone strings corresponding to the query word. As with all ASR systems, lattice word spotting is imperfect and is prone to false alarms (hypotheses of a word when it is not present) and misses (failures to hypothesise words which are present). Experiments have shown that statistical methods allow robust retrieval despite these search errors. ASR in the VMR prototype is implemented using the Cambridge/Entropic HTK toolkit with speaker-independent hidden Markov models.
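As a rough illustration of lattice scanning, the sketch below searches a toy phone lattice for every path whose labels spell out a query word's phone sequence. The edge format, the phone symbols, the scores and the name find_phone_string are all assumptions made for this example; this is not the HTK lattice format or the VMR implementation.

```python
from collections import defaultdict


def find_phone_string(edges, phones):
    """Return (start_node, end_node, total_score) for each matching path.

    edges  : iterable of (start_node, end_node, phone, log_score) tuples
             (a hypothetical lattice representation for this sketch)
    phones : list of phone labels for the query word
    """
    if not phones:
        return []

    # Index the lattice by start node so paths can be extended quickly.
    outgoing = defaultdict(list)
    for start, end, phone, score in edges:
        outgoing[start].append((end, phone, score))

    hits = []

    def extend(node, index, origin, total):
        # All query phones matched: record one keyword hypothesis.
        if index == len(phones):
            hits.append((origin, node, total))
            return
        for end, phone, score in outgoing[node]:
            if phone == phones[index]:
                extend(end, index + 1, origin, total + score)

    # A match may begin at any node that has outgoing edges.
    for node in list(outgoing):
        extend(node, 0, node, 0.0)
    return sorted(hits)


# Toy lattice: alternative phone hypotheses between consecutive nodes.
toy_lattice = [
    (0, 1, "k", -1.0), (0, 1, "g", -1.5),
    (1, 2, "ey", -0.5), (1, 2, "ae", -0.75),
    (2, 3, "m", -0.25), (2, 3, "n", -1.0),
    (3, 4, "b", -0.5),
]

print(find_phone_string(toy_lattice, ["k", "ey", "m"]))
```

Each returned tuple is one keyword hypothesis. A real system would also apply a score threshold, since low-scoring paths are a likely source of the false alarms described above.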

The Retrieval System

Figure 1. Block diagram of video mail retrieval system

Figure 1 shows an overview of the VMR system. New mail is passed to the ASR which computes a phone lattice for the message. To search for interesting messages, the user inputs words that indicate the information need. A match score is then computed between the query and each of the messages. The user is then presented with a ranked list of the potentially most interesting messages.

The match score does not require all the query words to be present in the message, but rather forms a query/message correlation score. Individual words in each message are assigned a weight that depends on the frequency of the word in the message, the number of messages in which the word appears as well as the length of the message. Thus, high scoring messages may actually have fewer matching words than lower scoring ones. The retrieval system's user interface presents the matching score graphically so that interesting messages may be quickly identified.
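The weighting described above can be sketched with a combined weight of the kind used in text retrieval, where a term's contribution grows with its frequency in the message, shrinks with the number of messages containing it, and is normalised by message length. The exact formula, the constant k and the toy messages below are assumptions for illustration, not the VMR system's actual scoring.

```python
import math


def match_scores(query_terms, messages, k=1.0):
    """Score each message against the query; higher means a better match.

    messages is a list of term lists (for spoken messages these would be
    hypothesised keyword hits). The formula is an assumed combined weight:
        log(N / n_t) * tf * (k + 1) / (k * ndl + tf)
    where N is the number of messages, n_t the number of messages that
    contain the term, tf the term frequency in the message, and ndl the
    message length divided by the average message length.
    """
    n_messages = len(messages)
    avg_len = sum(len(m) for m in messages) / n_messages
    scores = []
    for msg in messages:
        ndl = len(msg) / avg_len
        score = 0.0
        for term in set(query_terms):
            tf = msg.count(term)
            if tf == 0:
                continue  # missing query terms simply contribute nothing
            n_t = sum(1 for m in messages if term in m)
            score += math.log(n_messages / n_t) * tf * (k + 1) / (k * ndl + tf)
        scores.append(score)
    return scores


messages = [
    ["folk", "music", "tonight"],
    ["festival", "cambridge", "budget", "meeting", "notes", "agenda"],
    ["cambridge", "festival", "parking"],
    ["festival", "cambridge", "lunch"],
]
query = ["folk", "festival", "cambridge"]

scores = match_scores(query, messages)
ranking = sorted(range(len(messages)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy archive the first message matches only one query word, but that word is rare, so it outranks messages that match two common words. The long second message scores lowest of the four because of length normalisation.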

Figure 2. Video mail retrieval GUI

Figure 2 shows the user interface displaying a ranked list of messages, the result of the query "folk festival cam-bridge." Prior to a search, the message archive is shown ranked by date and time of sending. The list can be narrowed to show only messages from selected originators. When a query is entered, the list is re-ranked by the query/message match score. The bars to the right of the messages graphically indicate the relative score of each message.

The Video Browser

While there are convenient methods for the graphical browsing of text, e.g. scroll bars, "page-forward" commands, and word-search functions, existing video and audio playback interfaces almost universally adopt the "tape recorder" metaphor. To scan an entire message, it must be auditioned from start to finish to ensure that no parts are missed. Even if there is a "fast forward" button, it is generally a hit-or-miss operation to find a desired section in a lengthy message. In contrast, the transcription of a minute-long message is typically a paragraph of text, which may be scanned by eye in a matter of seconds. Clearly there must be more economical ways to access and review audio/video data.

Figure 3. The video mail browser

The video browser shown in Figure 3 attempts to represent a dynamic time-varying process (the video stream) by a static image that can be taken in at a glance. A message is represented as a horizontal timeline, and keyword events are displayed graphically along it. Time runs from left to right, and events are represented proportionally to when they occur in the message; for example, events at the beginning appear on the left side of the bar and short-duration events are short. In the browser shown above, the timeline is the black bar and the scale indicates time in seconds. During playback, or when pointed at with the mouse, a keyword hit is highlighted and its name is displayed. (In the figure, the keyword "FESTIVAL" has just been played.) The message may be played starting at any time simply by clicking at the desired time in the time bar; this lets the user selectively play regions of interest, rather than the entire message.
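The proportional timeline layout and click-to-play behaviour amount to a linear mapping between message time and horizontal position. The sketch below shows one way to write that mapping; the function names, pixel units and minimum marker width are assumptions for this example and do not describe the browser's actual code.

```python
def event_to_pixels(start_s, end_s, duration_s, bar_width_px):
    """Map a keyword event (in seconds) to (left_px, width_px) on the bar."""
    scale = bar_width_px / duration_s           # pixels per second
    left = round(start_s * scale)
    # Keep very short events visible by giving them at least one pixel.
    width = max(1, round((end_s - start_s) * scale))
    return left, width


def click_to_time(x_px, duration_s, bar_width_px):
    """Map a mouse click on the bar to a playback start time in seconds."""
    x = min(max(x_px, 0), bar_width_px)         # clamp clicks to the bar
    return x / bar_width_px * duration_s


# A 60-second message drawn on a 600-pixel bar.
print(event_to_pixels(15.0, 16.0, 60.0, 600))
print(click_to_time(300, 60.0, 600))
```

Because both functions use the same scale, clicking at the left edge of a keyword marker starts playback at that keyword's start time.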

Conclusion and Future Work

The VMR project has demonstrated automatic retrieval of spoken documents both experimentally and with a working prototype. Future work will focus on combining phone lattice information with automatic transcription and extending the retrieval techniques to handle larger and more diverse message sets.

A new project on Multimedia Document Retrieval, building on this work, is now underway at Cambridge University.

Publications:

J.T. Foote, S.J. Young, G.J.F. Jones and K. Spärck Jones
Unconstrained keyword spotting using phone lattices with application to spoken document retrieval
Computer Speech and Language, 11, 1997, pp. 207-224
G.J.F. Jones, J.T. Foote, K. Spärck Jones and S.J. Young:
Video mail retrieval using voice: report on topic spotting
(Deliverable Report on VMR Task No. 6), Technical Report 430, Computer Laboratory, University of Cambridge, 1997.
S.J. Young, M.G. Brown, J.T. Foote, G.J.F. Jones and K. Spärck Jones
Acoustic indexing for multimedia retrieval and browsing
Proc. ICASSP-97, Vol. 1, 1997, pp. 199-202
G.J.F. Jones, J.T. Foote, K. Spärck Jones and S.J. Young:
Video mail retrieval using voice: report on collection of naturalistic requests and relevance assessments
Technical Report 402, Computer Laboratory, University of Cambridge, 1996.
M. G. Brown, J. T. Foote, G. J. F. Jones, K. Spärck Jones, and S. J. Young.
Open-vocabulary speech indexing for voice and video mail retrieval.
Proc. ACM Multimedia 96, pp. 307-316 Boston, November 1996. ACM.
Best Paper Award
G. J. F. Jones, J. T. Foote, K. Spärck Jones, and S. J. Young.
The Video Mail Retrieval project: Experiences in retrieving spoken documents.
Intelligent Multimedia Information Retrieval, 1996.
Editor: M. T. Maybury, Menlo Park CA: AAAI Press, Cambridge MA: MIT Press, 1997, pp. 191-214
G. J. F. Jones, J. T. Foote, K. Spärck Jones, and S. J. Young.
Retrieving spoken documents by combining multiple index sources.
Proc. SIGIR 96, Research and Development in Information Retrieval, pp. 30-38
Zürich, August 1996. ACM. Best Paper Award
G. J. F. Jones, J. T. Foote, K. Spärck Jones, and S. J. Young.
Robust talker-independent audio document retrieval.
Proc. ICASSP 96, Vol. I, pp. 311-314, Atlanta, GA, May 1996.
G. J. F. Jones, J. T. Foote, K. Spärck Jones, and S. J. Young.
Video Mail Retrieval Using Voice: An Overview of the Stage 2 System.
Proc. of the Final Workshop on Multimedia Information Retrieval (MIRO '95)
I. Ruthven, editor, Electronic Workshops in Computing, Springer-Verlag, March 1996.
K. Spärck Jones, G.J.F. Jones, J.T. Foote, S.J. Young
Experiments in Spoken Document Retrieval
Information Processing and Management, 32 (4), pp. 399-417, 1996
K. Spärck Jones
Spoken Document Retrieval
Video of the Seminar: Computer Laboratory, University of Cambridge, 1995
M. G. Brown, J. T. Foote, G. J. F. Jones, K. Spärck Jones, and S. J. Young.
Automatic content-based retrieval of broadcast news.
Proc. ACM Multimedia 95, pp. 35-43, San Francisco, November 1995. ACM.
J. T. Foote, G. J. F. Jones, K. Spärck Jones, and S. J. Young.
Talker-independent keyword spotting for information retrieval.
Proc. Eurospeech 95, Vol. 3, pp. 2145-2148, Madrid, September 1995. ESCA.
J. T. Foote, M. G. Brown, G. J. F. Jones, K. Spärck Jones, and S.J. Young.
Video Mail Retrieval by voice: Towards intelligent retrieval and browsing of multimedia documents.
Proc. IMMI-1, First International Workshop on Intelligence and Multimodality in Multimedia Interfaces, Edinburgh, Scotland, July 1995.
K. Spärck Jones, J. T. Foote, G. J. F. Jones and S. J. Young.
Retrieving spoken documents: VMR Project experiments
Technical Report 366, Computer Laboratory, University of Cambridge, 1995.
G. J. F. Jones, J. T. Foote, K. Spärck Jones, and S. J. Young.
Video Mail Retrieval: the effect of word spotting accuracy on precision.
Proc. ICASSP 95, Vol. 1, pp. 309-312, Detroit, May 1995. IEEE.
K. Spärck Jones, J. T. Foote, G. J. F. Jones, and S. J. Young.
Spoken document retrieval --- a multimedia tool.
Fourth Annual Symposium on Document Analysis and Information Retrieval,
pp. 1-11, University of Nevada, Las Vegas, January 1995.
M. G. Brown, J. T. Foote, G. J. F. Jones, K. Spärck Jones, and S. J. Young.
Video Mail Retrieval using Voice: An overview of the Cambridge/Olivetti retrieval system. (see also ORL Tech Report 94-8)
Proc. ACM Multimedia 94 Workshop on Multimedia Database Management Systems,
pp. 47-55, San Francisco, CA, October 1994.
G. J. F. Jones, J. T. Foote, K. Spärck Jones, and S. J. Young.
Video Mail Retrieval using Voice: Report on keyword definition and data collection.
Technical Report 335, Computer Laboratory, University of Cambridge, May 1994.

sej28@eng.cam.ac.uk
Mon Nov 10 1997
