Context is very important in speech recognition at multiple levels. On a short time scale such as the average length of a phone, limitations on the rate of change of the vocal tract cause a blurring of acoustic features which is known as co-articulation. Achieving the highest possible levels of speech recognition performance means making efficient use of all the contextual information.
Current HMM technology primarily approaches the problem from a top-down perspective by modelling phonetic context. The short-term contextual influence of co-articulation is handled by creating a model for every sufficiently distinct phonetic context. This entails a trade-off between creating enough models for adequate coverage and maintaining enough training examples per context for the parameters of each model to be well estimated. Clustering and smoothing techniques can enable a reasonable compromise to be made at the expense of model accuracy and storage requirements (e.g., [5,6]).
Acoustic context in HMMs is typically handled by increasing the dimensionality of the observation vector to include some parameterisation of the neighbouring acoustic vectors. The simplest way to accomplish this is to replace the single frame of parameterised speech by a vector containing several adjacent frames along with the original central frame. Alternatively, each frame can be augmented with estimates of the temporal derivatives of the parameters [7]. However, this expansion in dimensionality quickly makes it difficult to obtain good models of the data. Multi-layer perceptrons (MLPs) have been suggested as an approach to modelling the high-order correlations of such high-dimensional acoustic vectors. When trained as classifiers, MLPs approximate the posterior probability of class occupancy [8,9,10,11,12]. For a full discussion of the application of this result to speech recognition see [13,4].
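The two ways of adding acoustic context described above can be illustrated with a minimal NumPy sketch. The function names `stack_frames` and `add_deltas`, the window sizes, and the edge-padding behaviour are all assumptions made for illustration; the derivative estimate uses a common linear-regression formula, which is not necessarily the one used in [7].

```python
import numpy as np

def stack_frames(features, context=2):
    """Replace each frame by itself plus `context` neighbours on either side.

    `features` has shape (n_frames, n_dims); the result has shape
    (n_frames, (2 * context + 1) * n_dims).
    """
    n_frames = features.shape[0]
    # Assumption: repeat the first and last frames so edge frames get a full window.
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Slice i holds the frame at offset (i - context) relative to the centre.
    windows = [padded[i:i + n_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

def add_deltas(features, width=2):
    """Append a regression-based estimate of the first temporal derivative."""
    n_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    delta = np.zeros_like(features, dtype=float)
    for k in range(1, width + 1):
        ahead = padded[width + k:width + k + n_frames]    # frame t + k
        behind = padded[width - k:width - k + n_frames]   # frame t - k
        delta += k * (ahead - behind)
    return np.concatenate([features, delta / denom], axis=1)

# Hypothetical input: 10 frames of 2-dimensional parameterised speech.
frames = np.arange(20, dtype=float).reshape(10, 2)
stacked = stack_frames(frames, context=2)   # shape (10, 10)
with_deltas = add_deltas(frames, width=2)   # shape (10, 4)
```

With a context of two frames on either side, the observation vector grows fivefold, whereas appending first derivatives only doubles it. This makes concrete how quickly frame stacking inflates the dimensionality that the model must cope with.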