Compression for speech corpora




One important use for lossless waveform compression is to compress speech corpora for distribution on CDROM. State-of-the-art speech recognition systems require gigabytes of acoustic data for model estimation, which takes many CDROMs to store. Use of compression software reduces both the distribution cost and the number of CDROM changes required to read the complete data set.

The key factors in the design of compression software for speech corpora are that there must be no perceptual degradation in the speech signal and that the decompression routine must be fast and portable.

There has been much research into efficient speech coding techniques and many standards have been established. However, most of this work has been for telephony applications, where dedicated hardware can be used to perform the coding and where it is important that the resulting system operates at a well-defined bit rate. In such applications lossy coding is acceptable, and indeed necessary in order to guarantee that the system operates at the fixed bit rate.

Similarly, there has been much work on the design of general-purpose lossless compressors for workstation use. Such systems do not guarantee any compression for an arbitrary file, but in general achieve worthwhile compression in reasonable time on general-purpose computers.
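The lack of a guarantee for arbitrary files can be seen with a small experiment. The sketch below is illustrative only: it uses Python's zlib module as a stand-in for a general-purpose lossless compressor, and the two test inputs (seeded pseudo-random bytes and a repeated phrase) are assumptions chosen for the demonstration, not data from this report.

```python
import random
import zlib

# Fixed seed so the run is repeatable; the inputs are hypothetical test data.
rng = random.Random(0)
random_data = bytes(rng.getrandbits(8) for _ in range(65536))
structured_data = b"speech corpus " * 4096


def compression_ratio(data):
    """Compressed size divided by original size (below 1.0 means a saving)."""
    return len(zlib.compress(data, 9)) / len(data)


random_ratio = compression_ratio(random_data)
structured_ratio = compression_ratio(structured_data)

print(f"pseudo-random bytes: ratio {random_ratio:.3f}")
print(f"repeated phrase:     ratio {structured_ratio:.3f}")

# Both inputs are recovered bit for bit, whatever the ratio was.
assert zlib.decompress(zlib.compress(random_data, 9)) == random_data
assert zlib.decompress(zlib.compress(structured_data, 9)) == structured_data
```

The pseudo-random input should come out at roughly its original size or slightly larger, while the repetitive input shrinks substantially. In both cases decompression restores the input exactly.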

Speech corpora compression needs some features of both systems. Lossless compression is an advantage as it guarantees there is no perceptual degradation in the speech signal. However, the established compression utilities do not exploit the known structure of the speech signal. Hence shorten was written to fill this gap and is now in use in the distribution of CDROMs containing speech databases [1].
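As a rough illustration of what exploiting waveform structure can mean, the sketch below compares the zeroth-order empirical entropy of a smooth sampled waveform with that of its first differences. Everything in it is an assumption made for the example: the signal is a synthetic sum of two sinusoids rather than speech, and the first-difference predictor is chosen only because it is simple. It is not presented as the method shorten uses.

```python
import math
from collections import Counter

SAMPLE_RATE = 16000  # Hz, matching the 16 kHz format mentioned in the text


def synth_waveform(n_samples):
    """Hypothetical stand-in for speech: two sinusoids rounded to integers."""
    return [
        round(4000 * math.sin(2 * math.pi * 197 * n / SAMPLE_RATE)
              + 1500 * math.sin(2 * math.pi * 463 * n / SAMPLE_RATE))
        for n in range(n_samples)
    ]


def first_difference(samples):
    """Keep the first sample, then store each sample minus its predecessor."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]


def undo_first_difference(residual):
    """Invert first_difference exactly with a running sum."""
    out, total = [], 0
    for r in residual:
        total += r
        out.append(total)
    return out


def empirical_entropy_bits(values):
    """Zeroth-order empirical entropy in bits per value."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


samples = synth_waveform(SAMPLE_RATE)  # one second of signal
residual = first_difference(samples)
raw_bits = empirical_entropy_bits(samples)
residual_bits = empirical_entropy_bits(residual)

print(f"samples:     {raw_bits:.2f} bits/value")
print(f"differences: {residual_bits:.2f} bits/value")

# The transform is lossless: a running sum restores the original samples.
assert undo_first_difference(residual) == samples
```

Because adjacent samples of a smooth waveform are close in value, the differences occupy a much narrower range than the samples themselves, so an entropy coder has less to encode per value. A byte-oriented general-purpose compressor has no built-in notion of this sample-to-sample correlation.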

The recordings used as examples in section 3 and section 5 are from the TIMIT corpus, which is distributed as 16-bit, 16 kHz linear PCM samples. This format is in common use for continuous speech recognition research corpora. The recordings were collected using a Sennheiser HMD 414 noise-cancelling head-mounted microphone in low-noise conditions. All ten utterances from speaker fcjf0 are used, which amount to a total of 24 seconds or about 384,000 samples.
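The sample count follows directly from the format. The short calculation below is a sketch using only the figures quoted above; it treats the duration as exactly 24 seconds, which the text gives only as an approximate total, and the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling rate
BITS_PER_SAMPLE = 16      # 16-bit linear PCM
DURATION_S = 24           # approximate total for the ten utterances

n_samples = SAMPLE_RATE_HZ * DURATION_S
n_bytes = n_samples * BITS_PER_SAMPLE // 8

print(n_samples)  # 384000
print(n_bytes)    # 768000
```

At two bytes per sample, the uncompressed example data therefore occupies about 768,000 bytes.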






Tony Robinson: ajr4@cam.ac.uk