Data types are differentiated by unique filename extensions. All files associated with the same utterance have the same basename. All filenames are unique across all WSJCAM and ARPA-collected WSJ corpora. Utterance IDs (basenames) will not be re-used. The filename format is as follows:
<UTTERANCE-ID>.<XXX>
where,
UTTERANCE-ID ::= <SSS><T><EE><UU>
where,
We were allocated the use of speaker IDs c00-czz. Speaker IDs c00-c2z were used for training speakers, and speaker IDs c30-c4z were used for test speakers (both development and evaluation).
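The speaker ID ranges above can be checked mechanically. The following is a minimal sketch, not part of the corpus tooling. It assumes that a speaker ID is the letter c followed by two characters drawn from 0-9 and a-z, so that the second character alone decides which range an ID falls in. The function name speaker_set and the label "unassigned" (for allocated IDs from c50 upward, which the text does not describe) are hypothetical.

```python
import re

# Assumption: a speaker ID is 'c' followed by two characters from 0-9 or a-z.
_SPEAKER_RE = re.compile(r"c[0-9a-z]{2}")


def speaker_set(speaker_id):
    """Return 'train', 'test', or 'unassigned' for a speaker ID such as 'c0f'."""
    if not _SPEAKER_RE.fullmatch(speaker_id):
        raise ValueError(f"not a speaker ID in the c00-czz block: {speaker_id!r}")
    second = speaker_id[1]
    if second in "012":   # c00-c2z: training speakers
        return "train"
    if second in "34":    # c30-c4z: test speakers (development and evaluation)
        return "test"
    # Allocated but not described in the text; the label is hypothetical.
    return "unassigned"
```

Under these assumptions, speaker_set("c2z") gives "train" and speaker_set("c30") gives "test". The sketch does not separate development from evaluation speakers, since the text gives no ID range for that split.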
The file extensions are interpreted as follows:
XXX ::= (data type)
        .wv1 (channel 1 - Sennheiser waveform)
        .wv2 (channel 2 - Canford waveform)
        .ptx (prompting text)
        .dot (detailed orthographic transcription)
        .ifo (information file about speaker)
        .phn (TIMIT style phone alignments)
        .wrd (TIMIT style word alignments)
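As an illustration of the naming scheme, the sketch below splits a filename into its utterance ID fields and looks up the data type for its extension. It is a hedged example rather than corpus software. The field widths (3, 1, 2 and 2 characters) are inferred from the number of letters in the placeholders SSS, T, EE and UU and are an assumption, as are the function name parse_filename and the sample filename in the usage note. The fields are returned under their placeholder names because their meanings are not spelled out here.

```python
from pathlib import Path

# Extension-to-data-type table, copied from the list above.
DATA_TYPES = {
    "wv1": "channel 1 - Sennheiser waveform",
    "wv2": "channel 2 - Canford waveform",
    "ptx": "prompting text",
    "dot": "detailed orthographic transcription",
    "ifo": "information file about speaker",
    "phn": "TIMIT style phone alignments",
    "wrd": "TIMIT style word alignments",
}


def parse_filename(filename):
    """Split '<UTTERANCE-ID>.<XXX>' into its fields (hypothetical helper)."""
    path = Path(filename)
    extension = path.suffix.lstrip(".")
    utterance_id = path.stem
    if extension not in DATA_TYPES:
        raise ValueError(f"unknown extension: {extension!r}")
    # Assumption: SSS, T, EE, UU are 3, 1, 2 and 2 characters wide.
    if len(utterance_id) != 8:
        raise ValueError(f"expected an 8-character utterance ID: {utterance_id!r}")
    return {
        "utterance_id": utterance_id,
        "sss": utterance_id[0:3],
        "t": utterance_id[3],
        "ee": utterance_id[4:6],
        "uu": utterance_id[6:8],
        "extension": extension,
        "data_type": DATA_TYPES[extension],
    }
```

For a made-up filename such as c02c0201.wv1, this returns sss "c02", t "c", ee "02", uu "01" and the data type "channel 1 - Sennheiser waveform". Because all files for one utterance share a basename, the utterance_id field can serve as a key for grouping a waveform with its transcription and alignment files.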