Data sets for Dinoj's thesis

PDF of thesis

PID68_Band90.zip (141 MB) contains a dataset of 163 195 examples (122 397 training, 40 798 testing), each with 158 dense features that were chosen as the most useful out of about 520 features. It has the following files:

  • featnames_PID68_Band90.txt has on each line the feature number, the feature's number in the original 520-feature set (which can be ignored), and the name of the feature. Features 1-6 are durational, 7-37 intensity, 38-68 pitch, and 69-158 band energy.
  • trainData_PID68_Band90.txt : contains a 122 397 x 158 dense matrix of data. Like the other files below, it has an entry for one Mandarin syllable per line.
  • trainLabels_PID68_Band90.txt : contains a 122 397 x 1 matrix; the entry on the n-th line is k, k=1..5, if the n-th training syllable had tone k (1=High tone, 2=Rising tone, 3=Low tone, 4=Falling tone, 5=Neutral tone).
  • testData_PID68_Band90.txt : contains a 40 798 x 158 dense matrix of test data.
  • testLabels_PID68_Band90.txt : contains a 40 798 x 1 matrix of labels for test data.
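
The files above could be read with something like the following Python sketch, which pairs each line of a data file with the matching line of the label file and streams one syllable at a time. It is only an illustration, not code from the thesis. The names iter_examples, tone_counts and feature_group are made up for this example. It also assumes that values within a line are separated by whitespace or commas, which is not stated above and should be checked against the actual files.

```python
from collections import Counter

# Tone codes as listed for the label files.
TONE_NAMES = {1: "High", 2: "Rising", 3: "Low", 4: "Falling", 5: "Neutral"}

# Feature-number ranges (1-based, inclusive) as listed for the feature-names file.
FEATURE_GROUPS = [
    (1, 6, "duration"),
    (7, 37, "intensity"),
    (38, 68, "pitch"),
    (69, 158, "band energy"),
]


def feature_group(number):
    """Return the group name for a 1-based feature number (hypothetical helper)."""
    for first, last, name in FEATURE_GROUPS:
        if first <= number <= last:
            return name
    raise ValueError(f"feature number out of range: {number}")


def iter_examples(data_path, labels_path, n_features=158):
    """Yield (features, tone) pairs, one per syllable, without loading whole files.

    Assumption: each data line holds n_features numbers separated by
    whitespace and/or commas, and each label line holds a single tone code.
    Note that zip() stops at the shorter file, so a mismatch in the number
    of lines is not detected here.
    """
    with open(data_path) as data_file, open(labels_path) as labels_file:
        for line_no, (row, label) in enumerate(zip(data_file, labels_file), start=1):
            if not row.strip() and not label.strip():
                continue  # skip lines that are blank in both files
            features = [float(tok) for tok in row.replace(",", " ").split()]
            if len(features) != n_features:
                raise ValueError(
                    f"line {line_no}: expected {n_features} values, got {len(features)}"
                )
            tone = int(float(label.strip()))  # accepts "3" as well as "3.0"
            if tone not in TONE_NAMES:
                raise ValueError(f"line {line_no}: unexpected tone label {tone}")
            yield features, tone


def tone_counts(data_path, labels_path, n_features=158):
    """Count how many syllables carry each tone code."""
    return Counter(tone for _, tone in iter_examples(data_path, labels_path, n_features))
```

For example, tone_counts("trainData_PID68_Band90.txt", "trainLabels_PID68_Band90.txt") would give the number of training syllables per tone, and feature_group(40) returns "pitch". Streaming line by line keeps memory use low; loading the full 122 397 x 158 matrix at once is better done with a numerical array library.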