Speech Frames (TIMIT DR1) dataset

Dinoj Surendran, April/May 2004

dr1.dat contains a 3422 x 80 matrix of real numbers, one row per frame, representing 3422 frames of speech from the DR1 portion of TIMIT/TRAIN. Each frame represents 20 ms of speech and was computed with a multitapered FFT (multitaper.m). Frames were taken every 50 ms so that they do not overlap and to reduce the correlation between adjacent frames.

dr1.labels contains a 3422 x 1 vector of labels, one per frame. Each label has the form ph_a_b_fname_m, where a < b are integers. This means that the frame lies in a phoneme ph that is b milliseconds long, and that the 20 ms of the frame end a milliseconds after the start of ph. The phoneme comes from the file fname in TIMIT/TRAIN/DR1 (judging by the example, fname already includes the dr1/ prefix, so it is a path relative to TIMIT/TRAIN) and occurs m milliseconds from the start of that file.

For example, the first frame has the label sh_25_80_dr1/mdab0/sa1_175, meaning that it spans from 5 ms to 25 ms into a phoneme sh that is 80 milliseconds long and occurs 175 milliseconds into the file timit/train/dr1/mdab0/sa1.wav.
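The label format can be taken apart mechanically. The following is only a minimal Python sketch, not part of the dataset: the names FrameLabel and parse_label are made up for illustration, and it assumes that the phone name ph never contains an underscore and that m is the start of the phoneme in the file.

```python
from typing import NamedTuple


class FrameLabel(NamedTuple):
    """Fields of one label of the form ph_a_b_fname_m (hypothetical helper type)."""
    phone: str           # ph: phoneme the frame lies in
    frame_end_ms: int    # a: frame ends this many ms after the phoneme starts
    phone_len_ms: int    # b: length of the phoneme in ms
    fname: str           # source file, e.g. dr1/mdab0/sa1
    phone_start_ms: int  # m: offset of the phoneme in the file (assumed to be its start)


def parse_label(label: str) -> FrameLabel:
    """Split a label into its five fields.

    Assumption: ph contains no underscore. fname is allowed to contain
    underscores because m is split off from the right-hand end.
    """
    ph, a, b, rest = label.strip().split("_", 3)
    fname, m = rest.rsplit("_", 1)
    return FrameLabel(ph, int(a), int(b), fname, int(m))


FRAME_MS = 20  # frame length stated in this README

lab = parse_label("sh_25_80_dr1/mdab0/sa1_175")

# Where the frame starts, measured from the start of the phoneme.
frame_start_in_phone_ms = lab.frame_end_ms - FRAME_MS

# Where the frame starts, measured from the start of the file
# (only valid under the assumption about m noted above).
frame_start_in_file_ms = lab.phone_start_ms + frame_start_in_phone_ms
```

For the example label this gives a frame that starts 5 ms into the phoneme, matching the description above.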

dr1phones.labels is a smaller version of dr1.labels that contains only the phone information for each frame.

You may find the file loadcell.m useful for reading in the label files. It works like the Matlab load command, but loads into a cell array instead of a numeric matrix.
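If you are not working in Matlab, something similar can be done in a few lines of Python. This is a rough sketch and not a port of loadcell.m: the function name read_labels is made up, and it assumes the label files are plain text with one label per line, which this README does not state.

```python
from typing import List


def read_labels(path: str) -> List[str]:
    """Read a label file into a list of strings, one entry per non-empty line.

    Assumption: the file is plain text with one label per line.
    """
    with open(path, "r") as f:
        return [line.strip() for line in f if line.strip()]


# Example use (requires the dataset files to be present):
#   labels = read_labels("dr1.labels")
#   phones = read_labels("dr1phones.labels")
```

The result is a plain list of strings, which plays the same role as the cell array that loadcell.m returns.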