Multimedia Document Retrieval (MDR)

Main Project Staff

Industrial Collaborators

Entropic, http://www.entropic.com
Dr. Ken Wood (krw@uk.research.att.com), AT&T Laboratories, Cambridge, http://www.uk.research.att.com/
Project Objectives
The work will pursue the following specific objectives against which the success of the project may be judged.
Progress to Date (summary)
September 1997 - Hub 4 Evaluation and Recogniser Improvements

Work initially focussed on the 1997 Hub 4 Broadcast News evaluation. New segmentation and clustering algorithms and better modelling were introduced. The final HTK system yielded an overall word error rate of 22.0% on the 1996 unpartitioned broadcast news development test data and just 16.2% on the evaluation test set, the lowest overall word error rate in the 1997 DARPA broadcast news evaluation by a statistically significant margin (see the NIST report).

Our adaptation software was revised and extended to include speaker adaptive training (SAT), and a new maximum-likelihood clustering scheme was developed and shown to reduce word error rates in recognition. In addition, a tool for locating and analysing errors in found speech was developed.
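The word error rates quoted throughout this page follow the standard definition: substitutions, deletions and insertions from a minimum-cost word alignment, divided by the number of reference words. A minimal sketch (not the project's actual scoring tool, which followed the NIST conventions):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / #ref words,
    found with a standard Levenshtein alignment over whitespace-split words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution and one deletion against four reference words.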
January 1998 - Speech and IR Development

In preparation for the TREC-7 Spoken Document Retrieval evaluation, work was done by Entropic Ltd. to speed up the decoder. The final two-pass transcription system, incorporating the new segmentation and clustering algorithms, gender/bandwidth-dependent models, MLLR adaptation, a 4-gram language model and a 65k vocabulary, ran in approximately 50 times real time and gave a word error rate of around 25% on the 50 hours of TREC-6 SDR data, which we subsequently used for information retrieval development work.
IR development was carried out on a query set developed in-house. Standard notions of stopping and stemming were applied, with extra text processing to deal with abbreviations (e.g. "U.S.A.") and known stemming exceptions (e.g. news/new; Californian/California). Query terms were weighted by their part of speech, and some modest statistical pre-search query expansion was used to add terms to the query that were more common statistical hyponyms of the query terms. Finally, term-position indexing was implemented to allow phrasal terms, extracted from the query using the part-of-speech information, to be included. The combined weight formula was used to score the documents. All of these measures were shown to increase performance on our own development query set.
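The document scoring described above can be illustrated with an Okapi-style combined weight, applied after stopping and stemming. This sketch is only indicative: the constants `K1` and `b` and the exact weight formula here are illustrative assumptions, not the project's tuned values.

```python
import math

def combined_weight_score(query_terms, doc_terms, docs, K1=1.4, b=0.6):
    """Score one document (a list of preprocessed terms) against a query
    with an Okapi-style combined weight over the collection `docs`.
    Constants are illustrative, not the project's tuned values."""
    N = len(docs)
    avg_dl = sum(len(d) for d in docs) / N
    ndl = len(doc_terms) / avg_dl                    # normalised document length
    score = 0.0
    for t in set(query_terms):
        n_t = sum(1 for d in docs if t in d)         # document frequency
        if n_t == 0:
            continue
        cfw = math.log((N - n_t + 0.5) / (n_t + 0.5))  # collection frequency weight
        tf = doc_terms.count(t)                        # within-document frequency
        score += cfw * tf * (K1 + 1) / (K1 * ((1 - b) + b * ndl) + tf)
    return score
```

Documents containing more of the (rarer) query terms receive higher scores; a document with no query terms scores zero.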
May 1998 - The TREC-7 Evaluation

The overall word error rate for our recognition system on the 100 hours of TREC-7 SDR data was 24.8%, the lowest in the TREC-7 SDR evaluation. Retrieval was run on these transcriptions and on those from other competing sites, with word error rates ranging from 29% to 66%. Our retrieval system performed well, and the evaluation showed that for this relatively small task (in IR terms) the degradation of performance with word error was small (6% in mean average precision at a 25% word error rate). We introduced the concept of a Processed Term Error Rate (PTER) to evaluate transcription errors from a retrieval-based perspective, and showed that it varied approximately linearly with mean average precision for our system.
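The idea behind PTER is to score transcription errors on the terms retrieval actually sees, i.e. after stopping and stemming, comparing documents as bags of terms. A minimal sketch of that idea, under the assumption that a substitution counts as one missing plus one spurious term (the project's exact definition may differ in detail):

```python
from collections import Counter

def processed_term_error_rate(ref_terms, hyp_terms):
    """Term error rate after retrieval-style preprocessing: the inputs are
    already stopped and stemmed, and are compared as bags (order ignored),
    so errors that preprocessing hides do not count."""
    ref_c, hyp_c = Counter(ref_terms), Counter(hyp_terms)
    missing = sum((ref_c - hyp_c).values())    # reference terms the hypothesis lost
    spurious = sum((hyp_c - ref_c).values())   # hypothesis terms with no reference match
    return (missing + spurious) / sum(ref_c.values())
```

A one-word substitution such as "china" recognised as "chain" therefore costs 2 against the reference term count, while word-order errors cost nothing.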
November 1998 - Improving the Probabilistic IR Model

Partially ordered sets (posets) were introduced into the improved probabilistic retrieval framework to allow semantically related words to be added to the query, and were shown to increase IR performance. Work on relevance feedback, both on the test collection and on a (larger) parallel collection, showed that both techniques can increase retrieval performance across a wide range of transcription error rates.
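Poset-based expansion can be pictured as follows: each query term may have narrower, semantically related terms beneath it, which are added to the query at reduced weight. The poset data and the down-weighting factor here are toy assumptions for illustration; the project's semantic resources and weighting scheme were richer.

```python
# Toy poset: each term maps to its narrower (more specific) terms.
# Illustrative data only -- not the project's semantic resource.
POSET = {
    "vehicle": ["car", "truck", "bicycle"],
    "car": ["hatchback", "saloon"],
}

def expand_query(terms, related_weight=0.5):
    """Add terms one level down the poset at reduced weight, keeping the
    original query terms at full weight."""
    weights = {t: 1.0 for t in terms}
    for t in terms:
        for related in POSET.get(t, []):
            # setdefault so an original full-weight query term is never demoted
            weights.setdefault(related, related_weight)
    return weights
```

A query containing "vehicle" would then also match documents mentioning "car" or "truck", at half the weight of a direct match.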
November 1998 - Hub4 Evaluation

Systems were built for both the time-unlimited and 10x real time 1998 DARPA/NIST Hub4 evaluations, in conjunction with Entropic. Improvements included vocal-tract length normalisation; cluster-based cepstral mean and variance normalisation; better acoustic and language models; and improved adaptation using a full variance transformation in addition to standard MLLR. The final HTK unconstrained-compute system ran in about 300xRT and gave an overall word error rate of 13.8% on the complete evaluation data set (not statistically significantly different from the best system). The 10xRT system used a two-pass strategy with a highly optimised decoder, giving an error rate of 16.1% (the lowest for the 10xRT task by a statistically significant margin).
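Cluster-based cepstral mean and variance normalisation, mentioned above, shifts and scales each cepstral coefficient using statistics estimated over the frames assigned to one cluster (e.g. one speaker or audio condition). A minimal pure-Python sketch of the per-cluster normalisation step:

```python
def cmvn(frames):
    """Cepstral mean and variance normalisation over one cluster's frames:
    each coefficient is shifted to zero mean and scaled to unit variance.
    `frames` is a list of equal-length feature vectors (lists of floats)."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dim)]
    # guard against a zero-variance coefficient (constant over the cluster)
    stds = [v ** 0.5 or 1.0 for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]
```

Pooling the statistics per cluster rather than per utterance gives more robust estimates when individual segments are short.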
March 1999 - Modelling Speakers and Speaking Rates

Further work in speaker clustering showed that by modifying the clustering system used in recognition, the automatically generated segments could be split into speaker groups quite successfully. This could allow, for example, all speech spoken by the announcer in a news show to be extracted and presented to the user as a summary of the main news of the day.

The correlation between an inter-frame distance measure and a phone-based measure of speaking rate was also investigated. It was shown that it is possible to build MAP estimators to distinguish between fast and slow phonemes, and recognition was improved by incorporating this information into an HMM system.
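The fast/slow MAP decision can be sketched as a two-class problem: model a scalar rate feature with one Gaussian per class and pick the class with the higher posterior. The Gaussian form and the parameters below are illustrative assumptions, not the project's trained estimators.

```python
import math

def map_fast_slow(x, fast_mean, fast_var, slow_mean, slow_var, p_fast=0.5):
    """MAP decision between 'fast' and 'slow' for a scalar speaking-rate
    feature x, given Gaussian class models and a prior on 'fast'.
    (Illustrative parameters; not the project's trained values.)"""
    def log_gauss(x, mean, var):
        # log of a univariate Gaussian density
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    log_p_fast = log_gauss(x, fast_mean, fast_var) + math.log(p_fast)
    log_p_slow = log_gauss(x, slow_mean, slow_var) + math.log(1 - p_fast)
    return "fast" if log_p_fast > log_p_slow else "slow"
```

The class decision could then select rate-dependent HMM parameters for the corresponding phone.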
July 1999 - The TREC-8 Evaluation

Work then focussed on the 1999 TREC-8 SDR evaluation. New models and parameter sets were built for segmentation and clustering. A novel algorithm was developed to detect commercials by searching for repeated audio in the broadcasts; it rejected 42.3 hours of audio, of which 41.4 hours were labelled as non-story content in the reference. A larger vocabulary of 108k words was used to reduce the OOV problem, whilst the highly optimised decoder from Entropic allowed the system to run in 13 times real time. Improved acoustic and language modelling also helped increase accuracy over the TREC-7 SDR system. The final WER on the 10-hour scored subset of the TREC-8 SDR corpus was 20.6% (the lowest in the evaluation).
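The commercial detector exploits the fact that adverts are broadcast verbatim more than once. A minimal sketch of the repeated-audio idea, assuming each fixed-length audio window has already been reduced to a coarse fingerprint (how those fingerprints are computed from the acoustics is not shown here, and the run length is an illustrative parameter):

```python
def find_repeats(fingerprints, min_run=3):
    """Return indices of windows lying in a run of `min_run` consecutive
    fingerprints that recurs elsewhere in the broadcast -- repeated audio
    being a strong cue for commercials."""
    seen = {}
    for i in range(len(fingerprints) - min_run + 1):
        key = tuple(fingerprints[i:i + min_run])
        seen.setdefault(key, []).append(i)
    repeated = set()
    for starts in seen.values():
        if len(starts) > 1:               # this run occurs more than once
            for s in starts:
                repeated.update(range(s, s + min_run))
    return sorted(repeated)
```

Flagged regions can then be rejected before recognition and indexing.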
The IR system was augmented with:
November 1999 - Analysing Results from the TREC-8 Evaluation

We achieved an average precision of 55.3% on our own transcriptions (the best in the evaluation) and 41.5% for the story-unknown evaluation. Work then focussed on analysing the effects of the various components of the system, and the average precision for the story-unknown case was increased to 46.5% by minor modifications.
February 2000 - Investigating Out of Vocabulary (OOV) Effects

A key issue in SDR is the effect of OOV words. If OOV words occur in the spoken documents, recognition errors result; if OOV terms appear in written queries, those terms cannot match any document in the collection. We studied the impact of OOV effects in the context of the TREC-8 SDR corpus and query set. We transcribed the 500 hours of TREC-8 broadcast news material with five fast recognisers that differed only in vocabulary size (55k, 27k, 13k, 6k, 3k) and therefore covered a large range of OOV rates. We then ran a series of IR experiments on these transcription sets with IR systems with and without query and document expansion. Query expansion used blind relevance feedback from the document collection and/or from a large parallel collection of newspaper texts. Document expansion used the spoken documents as queries to the parallel collection to add terms to the original documents, and can directly compensate for OOV effects. The experiments showed that the use of parallel corpora for query and document expansion can compensate for the effects of OOV words to a large extent at moderate OOV rates. These results imply that, at least for the type of queries and documents used in TREC-8, further compensation for OOV words using, for example, phone lattices is not needed.
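The blind relevance feedback used for query expansion can be sketched simply: treat the top-ranked documents from a first retrieval pass as if they were relevant, and append their most frequent non-query terms to the query. The cutoffs below (`n_docs`, `n_terms`) are illustrative; the project's term selection was weighted rather than raw-count based.

```python
from collections import Counter

def blind_relevance_feedback(query, ranked_docs, n_docs=3, n_terms=2):
    """Blind (pseudo) relevance feedback: assume the top `n_docs` of the
    first-pass ranking are relevant and append their `n_terms` most
    frequent terms not already in the query."""
    qset = set(query)
    counts = Counter()
    for doc in ranked_docs[:n_docs]:               # each doc: list of terms
        counts.update(t for t in doc if t not in qset)
    return query + [t for t, _ in counts.most_common(n_terms)]
```

Run against a parallel newspaper collection, the same mechanism adds correctly spelled in-vocabulary terms that the recogniser could never output for an OOV query word, which is why it compensates for OOV effects.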
March 2000 - Consolidating the MDR Demo

The MDR demo system, which downloads RealAudio from the WWW, automatically transcribes it and allows the user to search the resulting database, was expanded and improved. Both filtering with fixed queries and conventional retrieval with new queries were supported, with the user able to browse the transcripts as well as listen to the original audio. Interactive query expansion using semantic posets and relevance feedback was also included, whilst keyword highlighting and display of the best-matching extract allowed the user to find relevant passages more quickly. The system was also extended to form artificial story boundaries for convenient retrieval when the original audio was not pre-segmented by topic. The demo was presented at both RIAO 2000 and SIGIR 2000 and attracted considerable interest.
August 2000 - TREC-9

The TREC-9 SDR evaluation was similar to the TREC-8 evaluation, using the same document collection and main tasks, but had a few subtle differences. The story-unknown task became the main focus, and the use of automatically derived non-lexical information (such as the gender of the speaker, or the presence of music) was allowed in retrieval for the first time. We generated the following non-lexical tags: segment, gender, bandwidth, high-energy, no-speech (silence, noise or music), repeats, and commercials. We focussed on improving the performance of the system and obtained over 51% AveP for the story-unknown case and over 60% for the story-known case in trials using the TREC-8 queries (cf. 41.5% and 55.3% respectively in the TREC-8 evaluation).

The team's results in TREC-9 were very good, clearly demonstrating effective retrieval performance in the story-unknown task and showing that spoken document retrieval, even when the recognition word error rate is non-trivial, is a perfectly practical proposition. The final project publications include the TREC-9 paper, two papers in the International Journal of Speech Technology and a comprehensive technical report.
Related Projects and Issues

Previous Work

This project follows on from a project on Video Mail Retrieval Using Voice, which was a collaboration between Cambridge University Computer Laboratory and Engineering Department, and ORL (now AT&T Laboratories). It explored spoken document retrieval for video mail messages using a small corpus of messages along with two query sets. This initial study of spoken document retrieval showed that probabilistic retrieval techniques could be successfully combined with classical speech recognition methods for acceptable system performance.

Future Work

Issues which might be appropriate for future work and grants are
This work is funded by EPSRC grant GR/L49611.

This page is maintained by Sue Johnson, sej28@eng.cam.ac.uk