Recorded Material

All recorded sentences were taken from the Wall Street Journal (WSJ) text corpus. Since this corpus had previously been recorded in American English for the same purposes, we could benefit from the materials used for that effort (see Paul & Baker, 1992; also ftp: gov.nist.ncsl.jaguar): existing conventions, utilities, vocabularies, and large selections of processed texts from a real newspaper. The main problems in recording British English talkers reading WSJ prompts arose from the US origin of the text. This posed an extra pronunciation problem for some speakers, which compounded the difficulties with WSJ's financial jargon and written style. A modified pronunciation dictionary therefore had to be constructed to cover UK pronunciations of some US-specific words.

The recording text material was used as follows. A common set of 18 adaptation utterances was recorded at the start of the session for each speaker (see Appendix A for all sentences):

one 3-second recording of background noise
two phonetically balanced sentences
the first 15 of the 40 sentences designated for adaptation in the original WSJ0 corpus

The training sentences were taken from the WSJ0 training subcorpus of about 10,000 sentences. Each training speaker read about 90 training sentences, selected randomly in paragraph units. This was found to be the maximum number of sentences that could be recorded in a one-hour session. The same sentences were allowed to occur in several speakers' prompts, though never more than once per speaker.
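The training selection described above can be sketched roughly as follows. This is only an illustration, not the utility actually used for the corpus: the function name, the data layout (a list of paragraphs, each a list of unique sentence IDs), and the rule of skipping a paragraph that would overshoot the target are all assumptions.

```python
import random

def select_training_prompts(paragraphs, target=90, rng=None):
    """Pick whole paragraphs at random until about `target` sentences are chosen.

    `paragraphs` is a list of paragraphs, each a list of sentence IDs
    (hypothetical layout; IDs are assumed unique across paragraphs).
    Each paragraph is used at most once per call, so no sentence is
    repeated within one speaker's prompts.
    """
    rng = rng or random.Random()
    order = list(range(len(paragraphs)))
    rng.shuffle(order)  # random order of paragraph units
    chosen = []
    for idx in order:
        para = paragraphs[idx]
        if len(chosen) + len(para) > target:
            continue  # assumed rule: skip paragraphs that would overshoot
        chosen.extend(para)
        if len(chosen) == target:
            break
    return chosen
```

Calling the function once per speaker gives independent draws, so a sentence may turn up in several speakers' prompts but never twice for the same speaker.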

Each test speaker read 80 sentences from the subcorpus originally designated for development testing in WSJ0, consisting of 40 sentences from the 5,000-word corpus (which contained a total of 2,000 sentences) and 40 sentences from the 64,000-word corpus (a total of 4,000 sentences). The test sentences were randomly selected, and each test sentence was allowed to occur in only one speaker's prompt material. Since no sentence repetition between or within speakers was allowed for the testing portion of the corpus, this selection procedure exhausted the 5,000-word sentences almost completely (48 speakers by 40 sentences each). All sentences were taken from the non-verbalised punctuation texts (i.e. the prompt material contained no written-out punctuation words). All numerical data were written out in words in the prompts (i.e. so-called normalised texts were used).
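With the figures quoted above, 48 speakers at 40 sentences each use 48 x 40 = 1,920 of the 2,000 sentences in the 5,000-word set, and the same number of the 4,000 sentences in the 64,000-word set. A minimal sketch of such a disjoint allocation is given below; the function and argument names are hypothetical, and drawing one large sample without replacement and slicing it per speaker is an assumed way of enforcing the no-repetition rule, not necessarily the procedure that was used.

```python
import random

def allocate_test_prompts(pool_5k, pool_64k, n_speakers, per_pool=40, rng=None):
    """Give each speaker `per_pool` sentences from each pool, with no reuse.

    `pool_5k` and `pool_64k` are lists of sentence IDs (hypothetical names).
    Sampling without replacement guarantees that no sentence appears in
    more than one speaker's prompts.
    """
    rng = rng or random.Random()
    need = n_speakers * per_pool
    if need > len(pool_5k) or need > len(pool_64k):
        raise ValueError("not enough sentences for disjoint prompts")
    draw_5k = rng.sample(pool_5k, need)
    draw_64k = rng.sample(pool_64k, need)
    prompts = []
    for i in range(n_speakers):
        lo, hi = i * per_pool, (i + 1) * per_pool
        prompts.append(draw_5k[lo:hi] + draw_64k[lo:hi])
    return prompts
```

Under these assumptions a 49th and a 50th speaker would still fit (2,000 sentences exactly), while a 51st would raise the error, which is the sense in which the 5,000-word set is almost exhausted.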


Tue Jan 17 18:52:43 GMT 1995