Unsupervised Syllable Acquisition

Aim

Learn syllable nuclei directly from at most half an hour of clean, unlabelled speech data. The method should not try to find phonemes first. Results should be accurate across different languages.

So far...

Experiments on part of the TIMIT corpus suggest that the third and fourth Laplacian components of 20 ms speech frames capture sonorance and silence respectively.

Data set of TIMIT speech frames (under 4 minutes of speech). The first five Laplacian components were obtained from this using lapheat.m (and L2_distance.m). The first and second Laplacian components carry no useful information. The fifth has no information(!). The third and fourth are below:
The file phone.index associates each phoneme with an integer representing its sonorance/obstruence/silence value. Use it together with index2label.m and the Partiview color map file dr1lapheat.cmap. Partiview files showing the first five components (and used to produce the above picture) are contained in this zip file.

A dataset for Mandarin was then created. Running the Laplacian algorithm on it has so far failed: the nearest-neighbor graph has too many connected components. This may be because of musical intervals in the data. I will either have to modify the Laplacian code to deal with the components of the nearest-neighbor graph individually (initial attempts to do this have failed) or find another data set containing only speech data. Going with the latter for now (May 19); I will fix the Laplacian code while fixing other things and reading some papers.
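
One way to diagnose the connected-components problem is to count the components of the nearest-neighbor graph before running the Laplacian step, since the graph Laplacian has one zero eigenvalue per connected component. The following is a minimal sketch of that check, written in plain Python rather than MATLAB. It is not taken from lapheat.m: the function names knn_graph and connected_components, the choice of k, and the toy 2-D "frames" are all assumptions made for illustration.

```python
import math

def knn_graph(points, k):
    """Build an undirected k-nearest-neighbor graph as a list of adjacency sets.

    Hypothetical helper: brute-force Euclidean distances, O(n^2 log n).
    """
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        # Rank every other point by its distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)  # symmetrize so the graph is undirected
    return adj

def connected_components(adj):
    """Return the connected components as sorted lists of node indices."""
    seen = set()
    components = []
    for start in range(len(adj)):
        if start in seen:
            continue
        # Depth-first search from an unvisited node.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(component))
    return components

# Toy stand-in for speech frames: two well-separated clusters (assumed data).
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

components = connected_components(knn_graph(frames, k=2))
print(len(components), "components:", components)
```

With a small k the two clusters stay disconnected, which mirrors the situation described for the Mandarin data. Each component could then be embedded on its own, or k could be raised until the graph becomes connected; which of these suits the real data is an open question here.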