Pronunciation Dictionary

In constructing a speech recognition system, pronunciation information must be provided for all words spoken in both test and training data. A British English Example Pronunciation Dictionary (BEEP) has recently been developed for large vocabulary speech recognition with the WSJCAM0 corpus in mind. Two thirds of this dictionary stem from a combination of two sources: the MRC Psycholinguistic Database [2] and CUVOALD [3] (the Computer Usable Oxford Advanced Learner's Dictionary). These databases have been publicly available for a number of years. In addition, a considerable number of new pronunciations have been provided by sources at Durham and Cambridge Universities.

The set of new symbols specific to British English was defined as follows: /oh/ for the vowel in "pot", and /ia/, /ea/ and /ua/ for the diphthongs in "peer", "pair" and "poor" respectively. The complete phone set used is shown in Appendix C of this document.

The accumulation of these diverse sources into a single standardised format produced over 100,000 word pronunciations. However, this process failed to cover 2,700 words of the 20,000-word lexicon evaluation task. Pronunciations for these additional words were constructed in a number of stages. First, a tool was written to derive pronunciations for inflected words automatically, by looking up the pronunciation of the stem and appending that of the correct inflection according to morphological rules of English [4]. This functioned for words with the following suffixes: [-es, -s, -'s, -', -ed, -d, -er, -r, -est, -ing].

This worked by considering each member of the 20,000-word list that has a suffix in the above set and searching for its stem in the current pronunciation dictionary. As a result, pronunciations for several hundred new words were found. The remaining words needed for the 20k evaluation task (currently around 2,000) were taken from the CMU [5] pronunciation dictionary. This source was used as a last resort, since its pronunciations are US English. It is hoped that the number of these entries will be reduced in subsequent releases of the pronunciation dictionary. Finally, fifty-seven words were not found in the CMU dictionary, and the pronunciations for these were entered into the dictionary manually.
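The stem look-up described above can be illustrated with a minimal sketch. It is not the tool that was actually written: the function names, the toy lexicon, the phone symbols other than /oh/, and the voicing rules used to choose the suffix phones are all assumptions made for illustration, and the rules of [4] may differ.

```python
# Hypothetical sketch of suffix-based pronunciation derivation.
# Phone symbols (apart from "oh") and all rules below are illustrative assumptions.

# Assumed phone classes used to choose the form of -s and -ed.
SIBILANTS = {"s", "z", "sh", "zh", "ch", "jh"}
VOICELESS = {"p", "t", "k", "f", "th", "s", "sh", "ch"}


def s_suffix(stem_phones):
    """Phones for -s, -es and -'s, chosen from the final phone of the stem."""
    last = stem_phones[-1]
    if last in SIBILANTS:
        return ["ih", "z"]
    if last in VOICELESS:
        return ["s"]
    return ["z"]


def ed_suffix(stem_phones):
    """Phones for -ed and -d, chosen from the final phone of the stem."""
    last = stem_phones[-1]
    if last in {"t", "d"}:
        return ["ih", "d"]
    if last in VOICELESS:
        return ["t"]
    return ["d"]


# Suffixes are tried longest first, so "-es" is tested before "-s".
SUFFIX_RULES = [
    ("est", lambda p: ["ih", "s", "t"]),
    ("ing", lambda p: ["ih", "ng"]),
    ("'s", s_suffix),
    ("es", s_suffix),
    ("ed", ed_suffix),
    ("er", lambda p: ["ax"]),
    ("s", s_suffix),
    ("d", ed_suffix),
    ("r", lambda p: ["ax"]),
    ("'", lambda p: []),  # a bare apostrophe adds no phones
]


def derive_pronunciation(word, lexicon):
    """Return stem phones plus suffix phones, or None if no stem is found."""
    for suffix, rule in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            if stem in lexicon:
                phones = lexicon[stem]
                return phones + rule(phones)
    return None


# Toy lexicon, invented for this example only.
TOY_LEXICON = {
    "walk": ["w", "ao", "k"],
    "box": ["b", "oh", "k", "s"],
    "dog": ["d", "oh", "g"],
    "bake": ["b", "ey", "k"],
}

for w in ["walked", "boxes", "dogs", "baked", "walking"]:
    print(w, derive_pronunciation(w, TOY_LEXICON))
```

A word whose stem is missing from the lexicon returns None, which corresponds to the case where a pronunciation had to be sought from another source. A purely orthographic approach like this one does not handle spelling changes such as consonant doubling ("stopped") or y-to-i ("carried"), so a real tool would need further rules.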

The pronunciation dictionary on first release contains over 96,000 word definitions. Its format has been designed with machine readability as a primary factor. It is thus unashamedly simple, and an extract is included below.

The dictionary is freely available for non-commercial use by Internet FTP from host svr-ftp.eng.cam.ac.uk with file name /pub/comp.speech/data/beep-0.3.tar.Z.


Tue Jan 17 18:52:43 GMT 1995