SPEECH MODELLING: MODELS, PARAMETER ESTIMATION AND ITS APPLICATION TO SPEECH ENHANCEMENT
G.A. Smith, A.J. Robinson, M. Niranjan
August 1999
In this report system identification techniques are applied to speech. The purpose is to compare different models and different parameter estimation techniques on both a theoretical and an empirical basis. Results are then used in the practical problem of speech enhancement in additive white Gaussian noise.
Two model families are identified: polynomial models and state space models. These are compared within a common state space framework, which makes explicit the assumptions and constraints of each model regarding process noise, observation noise, input-output delays and initial conditions. State space models are then cast in block matrix form, because this is the representation used by subspace state space system identification (4SID) methods. Next, three common parameter estimation techniques are compared: prediction error minimisation (PEM), instrumental variables (IV) and 4SID. The models and parameter estimation techniques are compared through experiments on real clean and noisy speech data, and evaluated in terms of their prediction errors, their spectrograms and the perceptual quality of the one-step-ahead predicted waveform. Finally, the best models are used to initialise Kalman filters, which are then used to filter noisy speech (where the noise is white, additive and Gaussian) in the speech enhancement problem.
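For orientation, a common state space framework of the kind described above can be written as follows. This is a generic textbook form, not necessarily the notation of the report: the symbols $x_t$, $u_t$, $y_t$, $w_t$, $v_t$ and the matrices $A$, $B$, $C$, $D$, $Q$, $R$ are assumptions introduced here for illustration.

```latex
\begin{align}
  x_{t+1} &= A x_t + B u_t + w_t, & w_t &\sim \mathcal{N}(0, Q) && \text{(process noise)} \\
  y_t     &= C x_t + D u_t + v_t, & v_t &\sim \mathcal{N}(0, R) && \text{(observation noise)}
\end{align}
```

In this form the modelling choices named in the abstract appear as explicit constraints: setting $R = 0$ removes observation noise, setting $D = 0$ imposes at least one sample of input-output delay, and fixing $x_0 = 0$ corresponds to zero initial conditions.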
In general, the results show that modelling accuracy is improved by using the glottal waveform, a more general noise model and non-zero initial conditions. This is evident from reduced model prediction errors and better noise model spectrograms. Voiced speech can be modelled more accurately than non-voiced speech. Regarding parameter estimation techniques, PEM gives the smallest prediction errors, followed by 4SID and then IV. 4SID methods nevertheless have advantages: for example, the one-step-ahead predicted waveform does not seem to suffer from the musical noise heard with PEM methods. Other advantages include numerical stability and the use of a frequency-weighted balanced state space basis, which allows model order to be reduced in a simpler and better manner. In addition, PEM and 4SID weight modelling errors differently in the frequency domain. Models estimated by PEM, 4SID and IV can all be used to initialise a Kalman filter, which can then be applied to speech enhancement.
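The final step summarised above, using an estimated model to initialise a Kalman filter for enhancement in additive white Gaussian noise, can be illustrated with a minimal sketch. It is not the report's implementation: it assumes a plain autoregressive (AR) model fitted by least squares (which coincides with PEM under a quadratic criterion for this model class), a synthetic AR(2) signal standing in for speech, and a known noise variance. All function names and parameter values are hypothetical.

```python
import numpy as np


def fit_ar_least_squares(y, order):
    """Fit y[t] ~ sum_k a[k] * y[t-1-k] by least squares.

    Returns the AR coefficients and the residual (prediction error) variance.
    """
    # Each row holds the `order` most recent past samples, newest first.
    X = np.asarray([y[t - order:t][::-1] for t in range(order, len(y))])
    target = y[order:]
    a, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ a
    return a, resid.var()


def ar_to_state_space(a):
    """Companion-form state space model with state [y_t, y_{t-1}, ...]."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a                  # first row applies the AR recursion
    A[1:, :-1] = np.eye(p - 1)   # remaining rows shift the past samples
    C = np.zeros((1, p))
    C[0, 0] = 1.0                # only the current sample is observed
    return A, C


def kalman_filter(z, A, C, q, r):
    """Filter noisy samples z; q is process noise variance, r observation noise variance."""
    p = A.shape[0]
    x = np.zeros(p)              # assumption: zero initial state
    P = np.eye(p)                # assumption: identity initial covariance
    Q = np.zeros((p, p))
    Q[0, 0] = q                  # excitation enters the newest sample only
    out = np.empty(len(z))
    for t, zt in enumerate(z):
        # Time update (prediction).
        x = A @ x
        P = A @ P @ A.T + Q
        # Measurement update.
        s = (C @ P @ C.T).item() + r
        k = (P @ C.T).ravel() / s
        x = x + k * (zt - (C @ x).item())
        P = P - np.outer(k, C @ P)
        out[t] = x[0]
    return out


def demo(seed=0, n=2000):
    """Return (MSE of noisy signal, MSE of filtered signal) on synthetic data."""
    rng = np.random.default_rng(seed)
    a_true = np.array([1.6, -0.8])   # hypothetical stable AR(2) coefficients
    e = rng.standard_normal(n)
    clean = np.zeros(n)
    for t in range(2, n):
        clean[t] = a_true @ clean[t - 2:t][::-1] + e[t]
    r = 4.0                          # assumed known observation noise variance
    noisy = clean + np.sqrt(r) * rng.standard_normal(n)
    a, q = fit_ar_least_squares(clean, order=2)   # model fitted on clean data
    A, C = ar_to_state_space(a)
    est = kalman_filter(noisy, A, C, q, r)
    return np.mean((noisy - clean) ** 2), np.mean((est - clean) ** 2)


if __name__ == "__main__":
    mse_noisy, mse_filtered = demo()
    print(f"MSE noisy: {mse_noisy:.3f}, MSE filtered: {mse_filtered:.3f}")
```

The sketch fits the model on the clean signal for simplicity; estimating the model from noisy data, or with IV or 4SID instead of least squares, would change only the step that produces `A`, `C` and `q`.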