Multimedia Document Retrieval
September 1997 - Hub 4 Evaluation and Recogniser Improvement |
In parallel with the Hub4 system development, a number of algorithm improvements have been made. Our adaptation software was completely revised and extended to include speaker adaptive training (SAT). A new scheme was developed for finding speaker clusters in found speech such as broadcast news; this scheme has been shown to increase the data likelihood in MLLR-based adaptation and subsequently to reduce word error rates in recognition (see [5]). In addition, a tool for locating and analysing errors in found speech was developed (see [6]).
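The clustering scheme itself is described in [5]; as a rough, hypothetical illustration of the general idea only (not the algorithm evaluated there), speech segments could be merged bottom-up on simple acoustic statistics until each cluster holds enough data to estimate MLLR transforms reliably:

    import numpy as np

    def cluster_segments(segment_feats, min_frames=3000):
        """Toy bottom-up clustering of speech segments for adaptation.

        segment_feats: list of (frames x dim) feature arrays, one per segment.
        Segments are merged greedily by Euclidean distance between their mean
        vectors until every cluster contains at least min_frames frames.
        (Illustrative only; the actual scheme and likelihood criterion are in [5].)
        """
        clusters = [{'segs': [i], 'frames': f.shape[0], 'mean': f.mean(axis=0)}
                    for i, f in enumerate(segment_feats)]
        while len(clusters) > 1 and min(c['frames'] for c in clusters) < min_frames:
            # find and merge the closest pair of cluster means
            pairs = [(np.linalg.norm(clusters[a]['mean'] - clusters[b]['mean']), a, b)
                     for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
            _, a, b = min(pairs)
            ca, cb = clusters[a], clusters.pop(b)
            total = ca['frames'] + cb['frames']
            ca['mean'] = (ca['mean'] * ca['frames'] + cb['mean'] * cb['frames']) / total
            ca['segs'] += cb['segs']
            ca['frames'] = total
        return [c['segs'] for c in clusters]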
January 1998 - Speech and IR development for TREC-7 |
Work on information retrieval began by creating a pool of 60 queries of our own and generating manual relevance assessments. This gave us the opportunity to try out different retrieval methods and examine their relative benefits and disadvantages. The basic system began by splitting the transcriptions into pre-marked "stories", then removing words given in a stop-word list; these are mainly function words such as "a", "the", etc. Several small features were added at this stage to remove unfinished words and to deal with abbreviations (e.g. "U.S.A."), double words (e.g. and/or), and words containing punctuation (e.g. Martha's). A mapping list was included to standardise the spellings of words which are frequently spelt incorrectly. A Porter stemmer was then applied to strip the standard suffixes from words (e.g. managing, manager, managed -> manage). This algorithm is well established, but is known to make errors both of over-conflation (news -> new) and of omission (Californian / California are not conflated). A dictionary of known problems was compiled to form a stemmer-exceptions list and was applied using the mapping functionality. The combined weight formula (see [1]) was used to score the documents.
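A rough sketch of this kind of preprocessing and scoring pipeline is shown below; the stop list, mapping table, exception list and parameter values are placeholders, and the combined weight is written in the usual Okapi-style form rather than necessarily the exact formulation of [1]:

    import math
    from collections import Counter
    from nltk.stem import PorterStemmer

    STOP_WORDS = {"a", "the", "of", "and", "to", "in"}               # placeholder stop list
    SPELL_MAP = {"colour": "color"}                                   # placeholder mapping list
    STEM_EXCEPTIONS = {"news": "news", "californian": "california"}   # placeholder exceptions

    stemmer = PorterStemmer()

    def preprocess(text):
        """Stop, map and stem a raw transcription into index terms."""
        terms = []
        for tok in text.lower().replace(".", "").split():
            tok = SPELL_MAP.get(tok, tok)
            if tok in STOP_WORDS:
                continue
            terms.append(STEM_EXCEPTIONS.get(tok, stemmer.stem(tok)))
        return terms

    def combined_weight(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
        """Okapi-style combined weight for one term in one document
        (placeholder constants; see [1] for the exact formulation used)."""
        cfw = math.log((n_docs - df + 0.5) / (df + 0.5))   # collection frequency weight
        return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * doc_len / avg_len) + tf)

    def score(query_terms, doc_terms, doc_freq, avg_len, n_docs):
        """Score a document as the sum of combined weights over matching query terms."""
        tf = Counter(doc_terms)
        return sum(combined_weight(tf[t], doc_freq[t], len(doc_terms), avg_len, n_docs)
                   for t in query_terms if tf[t])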
Additional work looked into weighting the query terms depending on their part of speech, determined with a Brill tagger.
Another development was using statistical pre-search query expansion, where
a term x is expanded with a term y if y
occurs in more documents than x and x occurs more often when
y is present than when it is absent.
y can be seen as a statistical hyponym of x.
The original combined weight, weighted by the part-of-speech (POS),
is replaced by an expanded combined weight:
ecw(x,j) = POS(x) * sum_y [P(x|y) * cw(y,j)] / sum_y P(x|y)
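A small sketch of how such an expansion and expanded weight could be computed from document-level co-occurrence counts is given below; the selection rule follows the description above, but including x itself in the sum with P(x|x) = 1 is an assumption of this sketch, and the helper names are hypothetical:

    def expansion_terms(x, doc_sets, n_docs):
        """Find expansion terms y for query term x: y occurs in more documents
        than x, and x occurs more often when y is present than when it is absent.
        doc_sets maps each term to the set of document ids containing it."""
        dx = doc_sets[x]
        expansions = {}
        for y, dy in doc_sets.items():
            if y == x or len(dy) <= len(dx) or len(dy) == n_docs:
                continue
            both = len(dx & dy)
            p_x_given_y = both / len(dy)
            p_x_given_not_y = (len(dx) - both) / (n_docs - len(dy))
            if p_x_given_y > p_x_given_not_y:
                expansions[y] = p_x_given_y
        return expansions

    def expanded_cw(x, doc_j, doc_sets, n_docs, pos_weight, cw):
        """ecw(x,j) = POS(x) * sum_y P(x|y) cw(y,j) / sum_y P(x|y).
        Assumes x itself is included with P(x|x) = 1; cw(term, doc) is the
        combined-weight function and pos_weight the POS-based query weighting."""
        probs = expansion_terms(x, doc_sets, n_docs)
        probs[x] = 1.0
        total = sum(p * cw(y, doc_j) for y, p in probs.items())
        return pos_weight[x] * total / sum(probs.values())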
Finally, term-position indexing was implemented to allow phrasal terms
to be added to the query. These were found by locating unstopped noun
compounds or adjective-noun groups in the query and weighting them
with a tuned bigram weight.
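As a small illustration of the phrasal-term step (Penn Treebank-style tags and the helper name are assumptions here, and the bigram weight itself would be tuned as described above):

    def phrasal_terms(tagged_query, stop_words):
        """Extract noun-noun and adjective-noun bigrams from a POS-tagged query.

        tagged_query: list of (word, tag) pairs, e.g. output of a Brill tagger
        with Penn Treebank-style tags (NN*, JJ*). Stopped words are skipped, and
        each returned pair would be scored with the tuned bigram weight."""
        phrases = []
        for (w1, t1), (w2, t2) in zip(tagged_query, tagged_query[1:]):
            if w1 in stop_words or w2 in stop_words:
                continue
            if t1.startswith(("NN", "JJ")) and t2.startswith("NN"):
                phrases.append((w1, w2))
        return phrases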
All of these measures were shown to increase performance on our own
development query set (see [8] for more details).
May 1998 - The TREC-7 evaluation system |
We used all the retrieval techniques described above in the TREC-7 evaluation. A detailed breakdown of the effects of each of these techniques on the results can be found in [7]. Our retrieval system performed well, and the evaluation showed that for this relatively small task (in IR terms) the degradation of performance with word error was small (6% mean average precision at a 25% word error rate). We introduced the concept of a Term Error Rate (TER), which evaluates transcription errors from a retrieval-based perspective ([8]), and showed that the pre-processed term error rate (PTER) varies approximately linearly with mean average precision for the TREC-7 data using our retrieval system (see [7]).
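A minimal sketch of such a measure is given below, assuming TER simply counts per-term occurrence differences between reference and hypothesis, ignoring word order; the exact definition used is given in [8]. Running it over stopped and stemmed transcriptions (e.g. the output of a preprocess step like the one sketched earlier) gives a PTER-style figure:

    from collections import Counter

    def term_error_rate(ref_terms, hyp_terms):
        """Term error rate between reference and hypothesis transcriptions.

        Counts, for every term, the absolute difference in occurrence counts and
        normalises by the number of reference terms. Word order is ignored, so
        recognition errors that do not change term counts are not penalised
        (an assumption of this sketch; see [8] for the definition used)."""
        ref, hyp = Counter(ref_terms), Counter(hyp_terms)
        diff = sum(abs(ref[t] - hyp[t]) for t in set(ref) | set(hyp))
        return diff / sum(ref.values())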
November 1998 - Improving the Probabilistic IR Model |
Experiments using blind relevance feedback were also conducted. The results show that if a (large) parallel source of data is used for the feedback, the average precision increases on all transcriptions, but if the (small) test collection itself is used, then average precision only increases for the more accurate transcriptions ([14]). With all these improvements, the difference in retrieval performance between the manual transcriptions and our own was reduced to 1% ([10]). A detailed set of experiments and subsequent analysis was made in [17].
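The general shape of blind relevance feedback is sketched below; the term-selection score is a standard offer-weight style ranking, and the number of top documents and added terms are placeholder values rather than those tuned in [14]. The same procedure can be run with the top documents drawn from a parallel collection instead of the test collection:

    import math

    def blind_relevance_feedback(query_terms, ranked_docs, doc_freq, n_docs,
                                 top_r=10, n_new_terms=5):
        """Expand a query using the top-ranked documents as pseudo-relevant.

        ranked_docs: documents (as lists of terms) ordered by first-pass score,
        drawn either from the test collection or from a parallel collection.
        Candidate terms are ranked by an offer-weight style score r * rw, where
        rw is the usual relevance weight, and the best few are added to the query."""
        R = min(top_r, len(ranked_docs))
        pseudo_rel = [set(d) for d in ranked_docs[:R]]
        scores = {}
        for term in set().union(*pseudo_rel):
            if term in query_terms:
                continue
            r = sum(1 for d in pseudo_rel if term in d)   # pseudo-relevant docs with term
            n = doc_freq[term]                            # all docs containing the term
            rw = math.log((r + 0.5) * (n_docs - n - R + r + 0.5) /
                          ((n - r + 0.5) * (R - r + 0.5)))
            scores[term] = r * rw
        new_terms = sorted(scores, key=scores.get, reverse=True)[:n_new_terms]
        return list(query_terms) + new_terms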
November 1998 - Hub-4 evaluation |
Additions to the unconstrained system relative to the one used in the 1997 evaluation included vocal-tract length normalisation; cluster-based cepstral mean and variance normalisation; the use of twice as much acoustic training data; an improved language model using a merged interpolated model and a more appropriate training data pool; and improved adaptation using a full variance transformation in addition to standard MLLR. The final HTK unconstrained compute system gave an overall word error rate of 13.8% on the complete evaluation data set (the difference from the best system was not statistically significant) and 7.8% on the baseline F0 broadcast speech condition (the lowest error rate); this represented a 13% relative reduction in error rate over the 1997 HTK Hub4 system. The unconstrained compute system ran in about 300xRT on a Sun Ultra2. Further details of the systems developed can be found in [11] and [15].
The 10xRT system was based on the 1997 evaluation system but discarded the quinphone stage, and also used the enlarged training set and improved language modelling. The same overall two-pass strategy as used in the full system was employed, with a highly optimised decoder supplied by Entropic which allowed the system to run in less than 10xRT on a 450MHz Pentium II based computer. The 1998 10xRT system gave the same error rate on the 1997 evaluation data (15.8%) as the full 1997 HTK system, and 16.1% error on the 1998 evaluation set, which was the lowest error rate for a system running in less than 10xRT by a statistically significant margin. Further details of the 10xRT system are given in [12] and [15].
The complete results from the evaluation can be browsed at ftp://jaguar.ncsl.nist.gov/csr98/h4e_98_official_scores_990119
March 1999 - Modelling Speakers and Speaking Rates |
The correlation between an inter-frame distance measure and a phone-based concept of speaking rate was also investigated. It was shown that it is possible to build MAP estimators to distinguish between fast and slow phonemes. This speaking-rate information was then incorporated into a standard HMM structure by adding the distance to the feature vector and changing the topology of the HMM. A slight overall improvement in recognition was shown, due to a strong improvement in the spontaneous and non-native speaker conditions (see [16] for more details).
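As a rough, hypothetical illustration only (not the distance measure or model changes of [16]), an inter-frame distance can be appended to each acoustic feature vector so that the models see a crude indicator of local speaking rate:

    import numpy as np

    def add_interframe_distance(feats):
        """Append a simple inter-frame distance to each feature vector.

        feats: (n_frames x dim) array of acoustic features (e.g. cepstra).
        The appended value is the Euclidean distance to the previous frame,
        which tends to be small in slowly articulated, stationary speech and
        larger when the spectrum is changing quickly. (Illustrative only;
        [16] defines the actual measure and HMM topology changes used.)
        """
        diffs = np.linalg.norm(np.diff(feats, axis=0), axis=1)
        dist = np.concatenate(([0.0], diffs))      # first frame has no predecessor
        return np.hstack([feats, dist[:, None]])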
July 1999 - The TREC-8 evaluation |
In ASR, the recogniser speed was greatly increased (from 50xRT to 13xRT) thanks to Entropic's highly optimised decoder. This also meant that a larger vocabulary of 108k words could be used, reducing the problems caused by out-of-vocabulary information-bearing words in the larger, 500+ hour TREC-8 corpus. The error rate was also reduced, by about 10% relative, to 15.7% on the 1998 Hub4 evaluation data, due to improved acoustic and language models and the much larger vocabulary. Our word error rate on the 10 hour scored subset of TREC-8 SDR was 20.6%, the lowest in the evaluation.
A novel algorithm was also developed for detecting commercials, which looks for and rejects repeated audio. The algorithm removes both whole commercials and within-broadcast jingles, and has the advantage not only of deleting material in which the user is (typically) not interested, but also of providing some structure for broadcasts. It removed 65% of the commercials in the ABC news shows, whilst erroneously removing only 28 seconds (0.02%) of news stories. Since the commercial detection stage removed around 8% of the entire audio (of which 97.8% was marked as non-story information in the reference), it also conveniently reduced the amount of data the transcription engine needed to process, by 43 hours.
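A toy sketch of the repeated-audio idea is given below; the fingerprinting scheme, window length and matching threshold are placeholder choices for illustration, not those of the detector used in the evaluation:

    import numpy as np

    def find_repeated_windows(feats, win=500, step=250, tol=1.0):
        """Flag audio windows whose acoustic content recurs elsewhere.

        feats: (n_frames x dim) feature array for a whole broadcast (or several).
        Each window is reduced to a coarse fingerprint (its mean vector here);
        windows whose fingerprints lie closer than tol to a window seen earlier
        are flagged as repeats, i.e. likely commercials or jingles."""
        prints, repeats = [], []
        for start in range(0, feats.shape[0] - win + 1, step):
            fp = feats[start:start + win].mean(axis=0)
            if any(np.linalg.norm(fp - p) < tol for p, _ in prints):
                repeats.append((start, start + win))
            prints.append((fp, start))
        return repeats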
In IR, we augmented our TREC-7 SDR baseline system with five further retrieval techniques.
We also built a system for the story-boundary-unknown evaluation. The structure within the broadcast supplied by the commercial detector, along with the audio segmentation, was used to force story breaks. We then applied a sliding-window technique and performed retrieval on the pseudo-stories defined by the windows, using all of the methods above apart from document feedback. Pseudo-stories nearby in time were then combined as retrieved documents.
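A rough outline of the windowing and recombination step is sketched below; the window length, step and merging gap are illustrative values, and the handling of forced breaks is simplified:

    def window_passages(words, word_times, win_s=30.0, step_s=15.0):
        """Cut a boundary-less transcript into overlapping pseudo-stories.

        words: recognised words in time order; word_times: their start times (s).
        Returns (start, end, words) windows; forced breaks from the commercial
        detector / audio segmenter could be handled by windowing each
        break-delimited chunk separately."""
        windows, t = [], word_times[0]
        while t <= word_times[-1]:
            in_win = [w for w, wt in zip(words, word_times) if t <= wt < t + win_s]
            if in_win:
                windows.append((t, t + win_s, in_win))
            t += step_s
        return windows

    def merge_nearby(ranked_windows, max_gap_s=30.0):
        """Merge retrieved (start, end, score) windows that are close in time,
        keeping the best score for each merged region."""
        merged = []
        for start, end, score in sorted(ranked_windows):
            if merged and start - merged[-1][1] <= max_gap_s:
                s0, e0, sc0 = merged[-1]
                merged[-1] = (s0, max(e0, end), max(sc0, score))
            else:
                merged.append((start, end, score))
        return merged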
November 1999 - Analysing Results from the TREC-8 evaluation |
Work on the story-unknown case showed that the automatic elimination of commercials increased the average precision of the overall system, whilst also reducing the amount of data to be recognised by 8%. This improvement could also be achieved on the transcriptions from other sites by applying a filter to remove the "commercials" after retrieval but before scoring. Further experimentation reported in [21] showed that the average precision of the story-unknown system could be increased to 46.5% by slight modification of the retrieval strategies and parameters.
Work on the direct audio search method developed for the evaluation showed that the technique was able to find exact matches of audio with 100% accuracy. The search runs hundreds of times faster than real time and requires only a fraction of a second of cue audio (see [19] for more details).
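The actual method is described in [19]; purely as a toy illustration of a fingerprint-and-hash lookup (window size, quantisation and function names are assumptions here), archived audio could be indexed so that a short cue maps straight to its matching positions:

    import numpy as np

    def build_audio_index(feats, win=50):
        """Index coarsely quantised feature windows for fast exact-match lookup.

        feats: (n_frames x dim) acoustic features for the archived audio.
        Toy fingerprint: the rounded mean of each window; see [19] for the
        method actually used in the evaluation."""
        index = {}
        for start in range(feats.shape[0] - win + 1):
            key = tuple(np.round(feats[start:start + win].mean(axis=0), 1))
            index.setdefault(key, []).append(start)
        return index

    def find_cue(index, cue_feats, win=50):
        """Return frame positions in the archive whose fingerprint matches
        the first win frames of the cue audio."""
        key = tuple(np.round(cue_feats[:win].mean(axis=0), 1))
        return index.get(key, [])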
February 2000 - Investigating Out of Vocabulary Effects |
March 2000 - Consolidating the MDR demo |
August 2000 - The TREC-9 Evaluation |
The team's results in TREC-9 were very good, clearly demonstrating effective retrieval performance in the story-unknown task and showing that spoken document retrieval, even when the word error rate in recognition is not trivial, is a perfectly practical proposition. The final project publications include the TREC-9 paper [24]; two papers in the International Journal of Speech Technology, one about the story-unknown retrieval system [26] and the other about the MDR demo system [25]; and a comprehensive technical report covering a summary of the project's experiments [27].
This work is funded by EPSRC grant GR/L49611
This Page is maintained by Sue Johnson,
sej28@eng.cam.ac.uk Sun 7 Oct 2001 |