The phoneme recogniser presented here is derived from earlier work by two of the authors [1]. For all but the filterbank-based preprocessors, the 16 kHz digitised speech from the TIMIT database is Hamming windowed with a window duration of 32 ms and a frame separation of 16 ms, so that successive frames overlap by half. Each frame is passed to one of the preprocessors to yield a vector of about 20 coefficients, which forms the input to the recurrent network. The net is trained on a 64-processor array of T800 transputers offering about 60 Mflops. The output from the net is interpreted as a vector of probabilities, one per phoneme, that the frame was labelled with that phoneme. This vector stream can be segmented using dynamic programming to yield the most likely string of phoneme symbols given the probability distribution. Greater accuracy can be achieved by using durational and bigram transitional probabilities to constrain the phoneme string. Each of these steps will now be described in more detail, pointing out where the current system deviates from that previously reported.
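As a rough illustration of the first and last stages of this pipeline, the sketch below frames a 16 kHz signal with a 32 ms Hamming window at a 16 ms separation (512-sample frames, 256-sample hop) and then decodes a stream of per-frame phoneme probabilities by dynamic programming. It is not the authors' implementation: the function names (`hamming`, `frames`, `viterbi`, `collapse`) are hypothetical, the bigram is applied at every frame transition (including self-transitions) as a simplifying assumption, and the durational model mentioned above is omitted.

```python
import math

SAMPLE_RATE = 16000  # Hz, as stated in the text
FRAME_MS = 32        # window duration from the text
HOP_MS = 16          # frame separation from the text


def hamming(n):
    """Standard Hamming window of length n (0.54 / 0.46 form)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frames(signal, rate=SAMPLE_RATE, frame_ms=FRAME_MS, hop_ms=HOP_MS):
    """Split a signal into overlapping Hamming-windowed frames."""
    size = rate * frame_ms // 1000  # 512 samples at 16 kHz
    hop = rate * hop_ms // 1000     # 256 samples at 16 kHz
    win = hamming(size)
    return [
        [s * w for s, w in zip(signal[start:start + size], win)]
        for start in range(0, len(signal) - size + 1, hop)
    ]


def viterbi(frame_probs, bigram, floor=1e-12):
    """Most likely label sequence, one label per frame.

    frame_probs[t][j]: network output for phoneme j at frame t.
    bigram[i][j]: assumed frame-level probability of moving from i to j.
    """
    def log(p):
        return math.log(max(p, floor))  # floor avoids log(0)

    n = len(frame_probs[0])
    score = [log(p) for p in frame_probs[0]]
    back = []  # back[t][j]: best predecessor of label j at frame t + 1
    for probs in frame_probs[1:]:
        new_score, ptr = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: score[i] + log(bigram[i][j]))
            ptr.append(best)
            new_score.append(score[best] + log(bigram[best][j]) + log(probs[j]))
        score = new_score
        back.append(ptr)

    # Trace back from the best final label.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def collapse(path):
    """Merge runs of identical frame labels into a phoneme string."""
    out = []
    for label in path:
        if not out or out[-1] != label:
            out.append(label)
    return out
```

For example, `collapse(viterbi(probs, bigram))` turns a stream of per-frame probability vectors into a sequence of phoneme indices. A real system would additionally weight each segment by a duration model, as the text notes.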