Q3.1: Speech compression techniques

Provided by Tony Robinson:

The aim of speech compression is to produce a compact representation of speech sounds such that when reconstructed it is perceived to be close to the original. The two main measures of closeness are intelligibility and naturalness.

The standard reference point is toll quality speech, which is what would be expected over a telephone line: for example, speech sampled at 8 kHz using 8 bit ulaw coding, with a maximum frequency of about 3.3 kHz. This is a bit rate of 64 kbps, and as such is already a compressed form relative to (say) the 16 bit, 16 kHz speech (256 kbps) which is the standard in speech recognition work.
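As a rough illustration of the figures above, the following sketch applies the continuous ulaw companding curve and quantises the result to 8 bits. It is not the bit-exact segmented encoder from G.711, and the function names (mulaw_compress, encode8, bit_rate and so on) are made up for this example.

```python
import math

MU = 255  # companding constant used for 8 bit telephony ulaw


def mulaw_compress(x, mu=MU):
    """Map a sample in [-1, 1] onto the companded range [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def encode8(x):
    """Quantise the companded sample to an integer code in 0..255."""
    y = mulaw_compress(x)
    return int(round((y + 1.0) / 2.0 * 255))


def decode8(code):
    """Turn an 8 bit code back into an approximate sample value."""
    y = code / 255 * 2.0 - 1.0
    return mulaw_expand(y)


def bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate in bits per second for uncompressed fixed-width samples."""
    return sample_rate_hz * bits_per_sample


if __name__ == "__main__":
    print(bit_rate(8000, 8))     # telephone style ulaw
    print(bit_rate(16000, 16))   # 16 bit, 16 kHz speech
    print(decode8(encode8(0.01)), decode8(encode8(0.5)))
```

Because the curve is logarithmic, quiet samples are quantised with a much finer step than loud ones, which is why 8 bits are enough for telephone quality.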

ulaw coding does not exploit the (normally large) sample to sample correlations found in speech. ADPCM is the next family of speech coding techniques, and does exploit this redundancy by using a simple linear filter to predict the next sample of speech. The resulting prediction error is typically quantised to 4 bits, giving a bit rate of 32 kbps at 8 kHz sampling (see, for example, the software in Q3.3: 32 kbps ADPCM, G.711/721/723 Compression, shorten). The advantages of ADPCM are that it is simple to implement and has very low delay.
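The idea of predicting each sample and sending only the quantised error can be sketched as below. This is a toy coder, not G.721 or any other standard: the fixed first-order predictor (a = 0.9), the step size adaptation rule and all the names are assumptions chosen to keep the example short.

```python
import math


def adapt(step, code, levels):
    """Grow the step after large codes, shrink it after small ones."""
    factor = 1.5 if abs(code) >= levels // 2 else 0.9
    return min(0.5, max(1e-4, step * factor))


def adpcm_encode(samples, a=0.9, bits=4):
    """Return one signed code per sample, each fitting in `bits` bits."""
    levels = 1 << (bits - 1)          # 8 for 4 bits: codes lie in -8..7
    step, prev = 0.02, 0.0
    codes = []
    for x in samples:
        pred = a * prev               # predict from the last reconstruction
        code = int(round((x - pred) / step))
        code = max(-levels, min(levels - 1, code))
        codes.append(code)
        prev = pred + code * step     # track what the decoder will rebuild
        step = adapt(step, code, levels)
    return codes


def adpcm_decode(codes, a=0.9, bits=4):
    """Rebuild the signal using the same predictor and step adaptation."""
    levels = 1 << (bits - 1)
    step, prev = 0.02, 0.0
    out = []
    for code in codes:
        prev = a * prev + code * step
        out.append(prev)
        step = adapt(step, code, levels)
    return out


if __name__ == "__main__":
    # 200 Hz tone sampled at 8 kHz, standing in for a speech signal
    signal = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
    decoded = adpcm_decode(adpcm_encode(signal))
    print(max(abs(s - d) for s, d in zip(signal, decoded)))
```

The encoder predicts from its own reconstruction rather than from the original samples, so encoder and decoder stay in step and quantisation errors do not accumulate. There is also no framing or lookahead, which is where the very low delay comes from.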

To obtain more compression, specific properties of the speech signal must be modelled. The main assumption is known as the source filter model of speech production. This assumes that a source (voicing or fricative excitation) is passed through a filter (the vocal tract response) to produce the speech. The simplest implementation of this is known as an LPC synthesiser (e.g. LPC10e). At every frame the speech is analysed to compute the filter coefficients, the energy of the excitation, a voicing decision, and a pitch value if voiced. At the decoder a regular set of pulses for voiced speech or white noise for unvoiced speech is passed through the linear filter and multiplied by the gain to produce the speech. This is a very efficient system and typically produces speech coded at 1200-2400 bps. With clever acoustic vector prediction this can be reduced to 300-600 bps. The disadvantages are a loss of naturalness over most of the speech and occasionally a loss of intelligibility.
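A minimal sketch of the analysis and synthesis steps described above is given below, using the autocorrelation method with the Levinson-Durbin recursion. It is not the LPC10e algorithm: there is no pitch tracker, voicing detector or parameter quantisation, and the function names and the convention x[n] ~ sum of a[k] * x[n-k] are choices made for this example. The gain is applied to the excitation before filtering, which is equivalent for a linear filter.

```python
import random


def autocorr(frame, order):
    """Autocorrelation of one frame at lags 0..order."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion.

    Returns (a, err): predictor coefficients a[0..order-1] for lags
    1..order, and the remaining prediction error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                       # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def synthesise(a, gain, n, pitch_period=None, seed=0):
    """Drive the all-pole filter with pulses (voiced) or noise (unvoiced)."""
    rng = random.Random(seed)
    hist = [0.0] * len(a)                   # most recent output first
    out = []
    for t in range(n):
        if pitch_period:
            e = 1.0 if t % pitch_period == 0 else 0.0
        else:
            e = rng.gauss(0.0, 1.0)
        y = gain * e + sum(ak * h for ak, h in zip(a, hist))
        out.append(y)
        hist = [y] + hist[:-1]
    return out


if __name__ == "__main__":
    # Make a noise-excited test signal with a known filter, then re-estimate it.
    true_a = [1.3, -0.8]
    frame = synthesise(true_a, 1.0, 4000, pitch_period=None, seed=1)
    est_a, _ = levinson(autocorr(frame, 2), 2)
    print(true_a, est_a)
    # Voiced resynthesis: one pulse every 80 samples (100 Hz at 8 kHz).
    voiced = synthesise(est_a, 1.0, 240, pitch_period=80)
```

Only the filter coefficients, gain, voicing flag and pitch need to be sent for each frame, which is why the bit rate is so low. It is also why the output sounds unnatural: the real excitation is replaced by an idealised pulse train or noise.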

The CELP family of coders compensates for the lack of quality of the simple LPC model by using more information in the excitation. Each vector in a codebook of excitation vectors is tried in turn, and the index of the one that best matches the original speech is transmitted. This results in an increase in the bit rate to typically 4800-9600 bps. Most speech coding research is currently directed towards CELP coders. (See, for example, CELP 3.2a, a TMS implementation, a G.728 LD-CELP vocoder, and the L&H implementation.)
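The codebook search can be sketched as follows. This is a heavily simplified illustration that assumes a zero initial filter state, a least-squares gain per candidate, and no perceptual weighting or adaptive (pitch) codebook, all of which real CELP coders add. The names synth_filter and search_codebook are invented for this example.

```python
import random


def synth_filter(excitation, a):
    """All-pole filter y[n] = e[n] + sum_k a[k] * y[n-1-k], zero start state."""
    out = []
    for n, e in enumerate(excitation):
        y = e + sum(a[k] * out[n - 1 - k]
                    for k in range(len(a)) if n - 1 - k >= 0)
        out.append(y)
    return out


def search_codebook(target, codebook, a):
    """Try every excitation vector; return (index, gain, squared error)."""
    best = None
    for index, vec in enumerate(codebook):
        y = synth_filter(vec, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue                        # an all-zero vector cannot match
        gain = sum(t * v for t, v in zip(target, y)) / energy
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    a = [1.3, -0.8]                         # example filter coefficients
    codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(16)]
    # Build a target that entry 5 can reproduce exactly with gain 0.7.
    target = [0.7 * v for v in synth_filter(codebook[5], a)]
    print(search_codebook(target, codebook, a))
```

The decoder holds the same codebook, so only the winning index and a quantised gain need to be sent alongside the filter parameters. The search is costly because every candidate has to be passed through the synthesis filter before it can be compared with the original.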



Last Revision: 02:00 12-Apr-1996