## Tone Recognition in Mandarin
Dinoj Surendran, Gina-Anne Levow and Yi Xu
## Phase 1 : Studying the interaction of tone and focus on a clean, focus-marked, lab speech data set
Data : tonefocus_is05.mat (zipped, 1.7Mb) : Matlab file with the data from Xu (1999) analyzed in Surendran, Levow & Xu (Proc. ICSLP/Eurospeech 2005), henceforth referred to as SLX'05. If you ever use it call it... say...

| Name | Size | Bytes | Class |
|---|---|---|---|
| L | 11520x1 | 92160 | double array |
| X | 11520x20 | 1843200 | double array |
| focus_npip | 11520x1 | 92160 | double array |
| sentence | 11520x1 | 92160 | double array |
| spkr | 11520x1 | 92160 | double array |
| test4 | 1x4 | 92400 | cell array |
| train4 | 1x4 | 276720 | cell array |

Each of the 11520 data points is a Mandarin syllable in lab speech that Xu collected for his 1999 JPhonetics paper from eight native speakers (four male, four female).
- L(i) has the tone of the i-th syllable. There are no neutral-tone syllables here, so the values in L are from 1 to 4.
- X(i,:) has a speaker-normalized pitch contour of the syllable, sampled at 20 points across its duration. Speaker-normalization means that you don't need to take the speaker into account when classifying. The normalization is just z-values taken over all syllables spoken by the same speaker.
- test4 and train4 have the indices used in all the 4-fold cross-validation experiments reported in SLX'05. train4{i} has the indices of examples (a subvector of [1:11520]) used for training in the i-th fold, while test4{i} has the indices used for testing. We split these so that each training set had all syllables of six speakers and each test set had all syllables from the remaining two speakers. (This was created using the command `[train4,test4] = makefolds(spkr,4)`.)

The above is all you need to predict tone from pitch (within the syllable anyway). But, we continue...

- spkr(i) has the speaker of the i-th syllable. Since the first 1440 syllables were spoken by the first speaker, the next 1440 by the second, and so on, this was just created with `spkr=ceil(([1:11520])/1440)'`.
- focus_npip(i) is 0 if the syllable was in a sentence without focus and 1/2/3 if it was before/in/after (respectively) the focused syllable of the sentence. In case you're wondering, npip stands for no/pre/in/post.
- sentence(i) is the number (from 1 to 3840) of the sentence in which this syllable was spoken. Each sentence has (in this dataset) 3 syllables: the first three syllables are from the first sentence, the next three from the second sentence, and so on, i.e. `sentence=ceil([1:11520]/3)';`. (Actually, there were five syllables per sentence, but the first and last tones were always the same and are therefore not in this dataset. If anyone wants them, ping me or Yi.)
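
The data layout described above can be sketched in a few lines of Matlab. This is only an illustration under stated assumptions: the pitch values are synthetic stand-ins (not the real data), the z-scores are taken over all pitch samples of a speaker (the page does not say whether normalization was per sample point), and holding out consecutive speaker pairs is a guess at what `makefolds` does, since its source is not shown here. The names `raw`, `Xz`, `train_idx` and `test_idx` are hypothetical.

```matlab
% Sketch: rebuild the bookkeeping vectors from the layout described above.
nsyll = 11520;          % syllables in the dataset
per_speaker = 1440;     % consecutive syllables per speaker
per_sentence = 3;       % syllables kept per sentence

spkr = ceil((1:nsyll)' / per_speaker);        % speaker index, 1..8
sentence = ceil((1:nsyll)' / per_sentence);   % sentence index, 1..3840

% Synthetic stand-in for unnormalized pitch contours (NOT the real data):
% a speaker-dependent offset plus some deterministic variation.
raw = repmat(100 + 20*spkr, 1, 20) ...
    + repmat(mod((0:nsyll-1)', 7), 1, 20) ...
    + repmat(1:20, nsyll, 1);

% Speaker-wise z-scoring. ASSUMPTION: mean and standard deviation are
% computed over every pitch sample of every syllable of that speaker.
Xz = zeros(size(raw));
for s = 1:max(spkr)
    rows = (spkr == s);
    vals = raw(rows, :);
    mu = mean(vals(:));
    sigma = std(vals(:));
    Xz(rows, :) = (raw(rows, :) - mu) / sigma;
end

% Speaker-grouped 4-fold split: two held-out speakers per fold.
% ASSUMPTION: speakers are held out in consecutive pairs (1-2, 3-4, ...).
nfolds = 4;
train_idx = cell(1, nfolds);
test_idx = cell(1, nfolds);
for k = 1:nfolds
    held_out = [2*k-1, 2*k];
    is_test = ismember(spkr, held_out);
    test_idx{k} = find(is_test);     % 2880 syllables
    train_idx{k} = find(~is_test);   % 8640 syllables
end
```

Whatever the actual speaker pairing, splitting by speaker rather than by syllable keeps every test speaker unseen during training, which is what the description of train4 and test4 requires.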
Tone Recognition using Focus (Surendran, Levow, Xu), Proceedings of Eurospeech/ICSLP 2005

Partiview 3d model (download, unzip, click, pray, see readme) showing the performance of the baseline svm on tone recognition. Uses the Parametric Embedding (Iwata et al, 2004) algorithm on LIBSVM's output with the "-b" flag (more documentation on this coming). Some pictures of this are below.

Another 3d model: tonefocus_slx05.zip : has the same data (4-th fold), but with different classifiers created for three different focus condition groups (predicted by the confidence of tone classification). The attributes are "focuscond" with values 0 (no-focus), 1 (pre-focus), 2 (in-focus), 3 (post-focus) and "tone" with values 1,2,3,4. It was made with

    load tonefocus_is05.mat
    load splitbyconftonepred3LIN.mat
    for i=1:2880, pic4{i}=sprintf('syllz%d.sgi',test4{4}(i)); end
    ndaona('publish','tonefocus_slx05', 'classprobs',FORFOLDS_lin.pe{4}, ...
           'picdir','./images', 'pics',pic4, ...
           'attrib',[focus_npip(test4{4}) L(test4{4})], 'attribnames',{'focuscond','tone'}, ...
           'classes',L(test4{4}), 'classnames',{'hi','rise','lo','fall'}, 'glyphsize',5);

## Some Results
## One SVM for everything

    [pred,wts,nc,cm,acc,probest,optsused] = batchtest(X,L,train4,test4,opts);

    opts =
           doscale: 0
        getweights: 0
          doweight: 1.0000e-03
         libsvmdir: '/export/d2/scratch/dinoj/libsvm-2.8'   % modify this to be the location of your svm-train
            nfolds: 5
       kernelparam: 'findeach'
        kerneltype: 2
       libsvmflags: '-m 1000'
         doprobest: 1
            labels: [1 2 3 4]

The probability estimates by themselves can be found in probest_rbf.dat. It is an 11520 x 4 matrix with the (i,j)-th entry representing the probability (according to LIBSVM's -b calculation) that the i-th point belongs to class j.
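
A matrix with this layout can be turned into tone predictions by taking the most probable column in each row. The sketch below uses a tiny made-up matrix `P` and made-up labels `true_tone` (both hypothetical) so that it runs on its own. With the real file one would presumably start from `P = load('probest_rbf.dat')` and compare against `L`.

```matlab
% Sketch: read predictions off a probability-estimate matrix whose (i,j)
% entry is the estimated probability that point i belongs to class j.
% P and true_tone are made-up illustration values, not real results.
P = [0.70 0.10 0.10 0.10;
     0.20 0.50 0.20 0.10;
     0.25 0.25 0.30 0.20;
     0.10 0.20 0.30 0.40];
true_tone = [1; 2; 4; 4];

% Row-wise maximum: conf is the winning probability, pred_tone its column.
% Columns are ordered as the labels [1 2 3 4], so the column index is the tone.
[conf, pred_tone] = max(P, [], 2);

% Fraction of syllables whose most probable tone matches the label.
acc = mean(pred_tone == true_tone);
```

The per-syllable confidence `conf` is the same quantity that the confidence-based focus prediction further down computes with `max(PElin')`.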
The corresponding probability estimates for the linear kernel can be found in probest_lin.dat.

## Predicting by groups of predicted focus (when focus known in training)
Using tonefocus_is05_version2.mat

| Name | Size | Bytes | Class |
|---|---|---|---|
| L | 11520x1 | 92160 | double array |
| X | 11520x20 | 1843200 | double array |
| X_PI_ALL | 11520x52 | 4792320 | double array |
| Ysent_foc | 3840x1 | 30720 | double array |
| focus_npip | 11520x1 | 92160 | double array |
| isfocused | 11520x1 | 92160 | double array |
| sentence | 11520x1 | 92160 | double array |
| spkr | 11520x1 | 92160 | double array |
| test4 | 1x4 | 92400 | cell array |
| train4 | 1x4 | 276720 | cell array |

Ysent_foc(j) is the focus condition of the j-th sentence/phrase. It equals n if the phrase has n-focus. n is between 0 and 3 inclusive.

The columns of X_PI_ALL are:
- Columns 1 to 20 : pitch contour
- Columns 21 to 40 : intensity contour
- 41: mean pitch of syllable
- 42: mean pitch of syllable - mean pitch of preceding syllable (or 0 if this syllable is the first syllable in its sentence)
- 43: mean pitch of syllable - mean pitch of following syllable (or 0 if this syllable is the last syllable in its sentence)
- 44: pitch range (max pitch - min pitch) of syllable
- 45: pitch range of syllable - pitch range of preceding syllable (or 0 if this syllable is the first syllable in its sentence)
- 46: pitch range of syllable - pitch range of following syllable (or 0 if this syllable is the last syllable in its sentence)
- 47: mean intensity of syllable
- 48: mean intensity of syllable - mean intensity of preceding syllable (or 0 if this syllable is the first syllable in its sentence)
- 49: mean intensity of syllable - mean intensity of following syllable (or 0 if this syllable is the last syllable in its sentence)
- 50: intensity range (max intensity - min intensity) of syllable
- 51: intensity range of syllable - intensity range of preceding syllable (or 0 if this syllable is the first syllable in its sentence)
- 52: intensity range of syllable - intensity range of following syllable (or 0 if this syllable is the last syllable in its sentence)
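
The neighbour-difference columns listed above (mean, mean minus previous, mean minus following, and the same three for range) can be sketched as follows. This is not the `createNBRfeatures` function used for the real data, whose source is not shown here. It is a minimal illustration of the column definitions on a made-up 6 x 4 contour matrix (two sentences of three syllables), and the names `stats`, `pos` and `F` are hypothetical.

```matlab
% Sketch: six neighbour features per syllable from a contour matrix.
% X here is a made-up 6 x 4 example: two sentences of three syllables.
X = [1 2 3 4;
     2 2 2 2;
     5 4 3 2;
     0 0 0 4;
     3 3 3 3;
     1 1 5 5];
sent_len = 3;                          % syllables per sentence
n = size(X, 1);
pos = mod((0:n-1)', sent_len) + 1;     % position of syllable in its sentence

% Per-syllable statistics: column 1 = mean, column 2 = range (max - min).
stats = [mean(X, 2), max(X, [], 2) - min(X, [], 2)];

F = zeros(n, 6);
for c = 1:2
    s = stats(:, c);
    d_prev = zeros(n, 1);              % stays 0 for sentence-initial syllables
    d_next = zeros(n, 1);              % stays 0 for sentence-final syllables
    for i = 1:n
        if pos(i) > 1, d_prev(i) = s(i) - s(i-1); end
        if pos(i) < sent_len, d_next(i) = s(i) - s(i+1); end
    end
    % Same ordering as columns 41-43 (mean) and 44-46 (range) above.
    F(:, 3*(c-1) + (1:3)) = [s d_prev d_next];
end
```

Applying the same steps to the intensity contour would give the analogue of columns 47 to 52.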
These additional features were created using the following (initially Xintensity was what is now X_PI_ALL(:,21:40)):

    X_pitch_features = createNBRfeatures(X,[1:3:11520]);
    X_intensity_features = createNBRfeatures(Xintensity,[1:3:11520]);
    X_PI_ALL = [X Xintensity X_pitch_features X_intensity_features];

isfocused is a binary vector; isfocused(j) is 1 iff the j-th syllable has focus OR if the j-th syllable is the final syllable in a 0-focus sentence. (This is so that 0-focus sentences are treated like 3-focus sentences.) It was created using

    isfocused=zeros(11520,1);
    for i=1:3840
        a=3*(i-1);
        if Ysent_foc(i), isfocused(a+Ysent_foc(i))=1; else isfocused(a+3)=1; end
    end

    opts2.labels=[0 1];
    opts2.kerneltype = 0;
    opts2.doweight = 0.001;
    opts2.doscale = 0;
    opts2.getweights = 0;
    opts2.libsvmdir = '/home/dinoj/libsvm-2.8'; % change for your system
    opts2.libsvmflags = '-m 1000';
    opts2.doprobest = 1;
    [pred,wts,nc,cm,acc,probest,optsused] = batchtest(X_PI_ALL,isfocused,train4,test4,opts2); % train focus predictor
    pe_isfocused_lin_PI_ALL=zeros(11520,2);
    for i=1:4, pe_isfocused_lin_PI_ALL(test4{i},:)=probest{i}; end
    conf_focuspred_lin_PI_ALL = pe_isfocused_lin_PI_ALL(:,2); % i-th entry is confidence of focus predictor that i-th syllable is focused
    [bestelemwise_predfocused_lin_PI_ALL,bestseqwise_predfocused_lin_PI_ALL]=choosebestscore(conf_focuspred_lin_PI_ALL,[1:3:11520],'last');
    linconfpred_pip_PI_ALL = zeros(11520,1);
    for i=1:3840
        b=bestseqwise_predfocused_lin_PI_ALL(i);
        for j=1:b-1, linconfpred_pip_PI_ALL(3*(i-1)+j)=1; end
        linconfpred_pip_PI_ALL(3*(i-1)+b)=2;
        for j=b+1:3, linconfpred_pip_PI_ALL(3*(i-1)+j)=3; end
    end
    [cm_syll_PI_ALL,nc_syll_PI_ALL]=getcm(focus_npip,linconfpred_pip_PI_ALL,[0:3]);
    [cm_sent_PI_ALL,nc_sent_PI_ALL]=getcm(Ysent_foc,bestseqwise_predfocused_lin_PI_ALL,[0:3]);
    [FORFOLDS_lin_PI_ALL,FORSPLITS_lin_PI_ALL,DETAILS_lin_PI_ALL,COMBINED_lin_PI_ALL] = splitbysomething(X,L,train4,test4,linconfpred_pip_PI_ALL,{1,2,3},opts);

The result of the above is saved in predfocus_lin_PI_ALL.mat.

    optsrbf.labels=[0 1];
    optsrbf.kerneltype = 2;
    optsrbf.kernelparam = 'findeach';
    optsrbf.doweight = 0.001;
    optsrbf.doscale = 0;
    optsrbf.getweights = 0;
    optsrbf.libsvmdir = '/export/d2/scratch/dinoj/libsvm-2.8'; % change for your system
    optsrbf.libsvmflags = '-m 1000';
    optsrbf.doprobest = 1;
    optsrbf.nfolds = 5;
    [pred,wts,nc,cm,acc,probest,optsused] = batchtest(X_PI_ALL,isfocused,train4,test4,optsrbf); % train focus predictor
    save ~/html/projects/tonefocus/predfocus_rbf_PI_ALL.mat pred nc cm acc probest optsused
    pe_isfocused_rbf_PI_ALL=zeros(11520,2);
    for i=1:4, pe_isfocused_rbf_PI_ALL(test4{i},:)=probest{i}; end
    conf_focuspred_rbf_PI_ALL = pe_isfocused_rbf_PI_ALL(:,2); % i-th entry is confidence of focus predictor that i-th syllable is focused
    [bestelemwise_predfocused_rbf_PI_ALL,bestseqwise_predfocused_rbf_PI_ALL]=choosebestscore(conf_focuspred_rbf_PI_ALL,[1:3:11520],'last');
    rbfconfpred_pip_PI_ALL = zeros(11520,1);
    for i=1:3840
        b=bestseqwise_predfocused_rbf_PI_ALL(i);
        for j=1:b-1, rbfconfpred_pip_PI_ALL(3*(i-1)+j)=1; end
        rbfconfpred_pip_PI_ALL(3*(i-1)+b)=2;
        for j=b+1:3, rbfconfpred_pip_PI_ALL(3*(i-1)+j)=3; end
    end
    [cm_syll_PI_ALL,nc_syll_PI_ALL]=getcm(focus_npip,rbfconfpred_pip_PI_ALL,[0:3]);
    [cm_sent_PI_ALL,nc_sent_PI_ALL]=getcm(Ysent_foc,bestseqwise_predfocused_rbf_PI_ALL,[0:3]);
    optsrbf.labels=[1:4];
    save ~/html/projects/tonefocus/predfocus_rbf_PI_ALL.mat pred nc cm acc probest optsused *rbf*PI_ALL
    [FORFOLDS_rbf_PI_ALL,FORSPLITS_rbf_PI_ALL,DETAILS_rbf_PI_ALL,COMBINED_rbf_PI_ALL] = splitbysomething(X,L,train4,test4,rbfconfpred_pip_PI_ALL,{1,2,3},optsrbf);
    save ~/html/projects/tonefocus/predfocus_rbf_PI_ALL.mat pred nc cm acc probest optsused *rbf*PI_ALL

The result of the above is saved in predfocus_rbf_PI_ALL.mat.

## Predicting by Confidence-predicted focus (when focus not known during training)
Results of running the below lines are in splitbyconftonepred3LIN.mat.

    opts.kerneltype = 0;
    opts.labels = [1 2 3 4];
    opts.doweight = 0.001;
    opts.doscale = 0;
    opts.getweights = 0;
    opts.libsvmdir = '/home/dinoj/libsvm-2.8'; % change for your system
    opts.nfolds = 5;
    opts.libsvmflags = '-m 1000';
    opts.doprobest = 1;
    PElin = load('probest_lin.dat');
    highestconf_lin = max(PElin');
    [bestelemwise,bestseqwise] = choosebestscore(highestconf_lin,[1:3:11520],'last');
    linconfpred_pip=zeros(11520,1);
    for i=1:3840
        b=bestseqwise(i);
        for j=1:b-1, linconfpred_pip(3*(i-1)+j)=1; end
        linconfpred_pip(3*(i-1)+b)=2;
        for j=b+1:3, linconfpred_pip(3*(i-1)+j)=3; end
    end
    [FORFOLDS_lin,FORSPLITS_lin,DETAILS_lin,COMBINED_lin] = splitbysomething(X,L,train4,test4,linconfpred_pip,{1,2,3},opts);

Results of running the below lines are in splitbyconftonepred3RBF.mat.

    opts.kerneltype = 2;
    opts.kernelparam = 'findeach'; % other opts as before
    PErbf = load('probest_rbf.dat');
    highestconf_rbf = max(PErbf');
    [bestelemwise,bestseqwise] = choosebestscore(highestconf_rbf,[1:3:11520],'last');
    rbfconfpred_pip=zeros(11520,1);
    for i=1:3840
        b=bestseqwise(i);
        for j=1:b-1, rbfconfpred_pip(3*(i-1)+j)=1; end
        rbfconfpred_pip(3*(i-1)+b)=2;
        for j=b+1:3, rbfconfpred_pip(3*(i-1)+j)=3; end
    end
    [FORFOLDS_rbf,FORSPLITS_rbf,DETAILS_rbf,COMBINED_rbf] = splitbysomething(X,L,train4,test4,rbfconfpred_pip,{1,2,3},opts);
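
The confidence-to-focus step used in both listings can be illustrated on a toy example. The source of `choosebestscore` is not shown on this page, so the sketch below rests on two assumptions: that it picks the highest-scoring syllable in each sentence, and that the `'last'` argument breaks ties in favour of the later syllable. The scores in `conf` are made up, and the names `best` and `pip` are hypothetical.

```matlab
% Sketch: pick one "focused" syllable per sentence from per-syllable
% confidences, then label every syllable pre/in/post (1/2/3).
% conf holds made-up scores for three sentences of three syllables each.
conf = [0.9 0.6 0.7, 0.5 0.8 0.8, 0.4 0.4 0.4]';
sent_len = 3;
nsent = numel(conf) / sent_len;

C = reshape(conf, sent_len, nsent)';   % one row per sentence
best = zeros(nsent, 1);
for i = 1:nsent
    % ASSUMPTION: highest score wins, ties go to the later syllable.
    best(i) = find(C(i, :) == max(C(i, :)), 1, 'last');
end

pip = zeros(numel(conf), 1);
for i = 1:nsent
    base = sent_len * (i - 1);
    pip(base + (1:best(i)-1)) = 1;           % pre-focus
    pip(base + best(i)) = 2;                 % in-focus
    pip(base + (best(i)+1:sent_len)) = 3;    % post-focus
end
```

As in the listings above, this labelling never assigns 0, so every sentence is treated as having a focused syllable.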