Tone Recognition in Mandarin

Dinoj Surendran and Gina-Anne Levow and Yi Xu

Phase 1 : Studying the interaction of tone and focus on a clean, focus-marked, lab speech data set

Data : tonefocus_is05.mat (zipped, 1.7Mb) : Matlab file containing the data from Xu (1999) analyzed in Surendran, Levow & Xu (Proc. ICSLP/Eurospeech 2005), henceforth referred to as SLX'05. If you ever use it, call it... say... XuTF99.

  Name             Size                   Bytes  Class

  L            11520x1                    92160  double array
  X            11520x20                 1843200  double array
  focus_npip   11520x1                    92160  double array
  sentence     11520x1                    92160  double array
  spkr         11520x1                    92160  double array
  test4            1x4                    92400  cell array
  train4           1x4                   276720  cell array

Each of the 11520 data points is a Mandarin syllable in lab speech that Xu collected for his 1999 Journal of Phonetics paper from eight native speakers (four male, four female).

  • L(i) has the tone of the i-th syllable. There are no neutral-tone syllables here, so the values in L are from 1 to 4 (1 = high, 2 = rising, 3 = low, 4 = falling).

  • X(i,:) has a speaker-normalized pitch contour of the syllable, sampled at 20 points across its duration. Speaker normalization means that you don't need to take the speaker into account when classifying; the normalization is just z-scores computed over all syllables spoken by the same speaker. (A sketch of this normalization, and of the fold split described below, appears after this list.)

  • test4 and train4 have the indices used in all the 4-fold cross-validation experiments reported in SLX'05. train4{i} has the indices of examples (a subvector of [1:11520]) used for training in the i-th fold, while test4{i} has the indices used for testing. The folds were split so that each training set had all syllables of six speakers and each test set had all syllables from the remaining two speakers. (They were created using the command [train4,test4] = makefolds(spkr,4).)

    The above is all you need to predict tone from pitch (within the syllable anyway). But, we continue...

  • spkr(i) has the speaker of the i-th syllable. Since the first 1440 syllables were spoken by the first speaker, the next 1440 by the second, and so on, this was just created with spkr=ceil(([1:11520])/1440)'.

  • focus_npip(i) is 0 if the syllable was in a sentence without focus and 1/2/3 if it was before/in/after (respectively) the focused syllable of the sentence. In case you're wondering, npip stands for no/pre/in/post .

  • sentence(i) is the number (from 1 to 3840) of the sentence in which this syllable was spoken. Each sentence has (in this dataset) 3 syllables: the first three syllables are from the first sentence, the next three from the second sentence, and so on, i.e. sentence=ceil([1:11520]/3)';. (Actually, there were five syllables per sentence, but the first and last syllables always had the same tone and are therefore not in this dataset - if anyone wants them, ping me or Yi.)
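The preprocessing code itself isn't on this page, but here is a minimal sketch, under stated assumptions, of the two steps mentioned above: per-speaker z-normalization of the pitch values, and the speaker-disjoint fold split that makefolds(spkr,4) produces. Xraw and the speakers-per-fold assignment are illustrative assumptions; the actual assignment stored in the .mat file may differ.

  % Per-speaker z-normalization: z-scores over all values of one speaker's syllables.
  Xnorm = zeros(size(Xraw));                 % Xraw: hypothetical raw pitch matrix, 11520x20
  for s = 1:8,
    rows = (spkr == s);
    v = Xraw(rows,:);
    Xnorm(rows,:) = (v - mean(v(:))) / std(v(:));
  end

  % Speaker-disjoint 4-fold split: for illustration, fold i tests on speakers 2i-1 and 2i.
  for i = 1:4,
    test4{i}  = find(spkr == 2*i-1 | spkr == 2*i);
    train4{i} = find(spkr ~= 2*i-1 & spkr ~= 2*i);
  end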

Tone Recognition using Focus (Surendran, Levow, Xu), Proceedings of Eurospeech/ICSLP 2005

Poster

Partiview 3d model (download, unzip, click, pray, see readme) showing the performance of the baseline SVM on tone recognition. It uses the Parametric Embedding algorithm (Iwata et al., 2004) applied to LIBSVM's probability output (obtained with the "-b" flag); more documentation on this is coming. Some pictures of this are below.

Another 3d model: tonefocus_slx05.zip : has the same data (4th fold), but with separate classifiers trained for three different focus-condition groups (predicted from the confidence of the tone classifier). The attributes are "focuscond" with values 0 (no-focus), 1 (pre-focus), 2 (in-focus), 3 (post-focus) and "tone" with values 1,2,3,4. It was made with

load tonefocus_is05.mat
load splitbyconftonepred3LIN.mat
for i=1:2880,pic4{i}=sprintf('syllz%d.sgi',test4{4}(i));end
ndaona('publish','tonefocus_slx05', 'classprobs',FORFOLDS_lin.pe{4}, 'picdir','./images', ...
  'pics',pic4,'attrib',[focus_npip(test4{4}) L(test4{4})],'attribnames',{'focuscond','tone'}, ...
  'classes',L(test4{4}),'classnames',{'hi','rise','lo','fall'},'glyphsize',5);

Some Results

One SVM for everything

RBF kernel: results_rbf_alltogether.mat. This was created with

[pred,wts,nc,cm,acc,probest,optsused] = batchtest(X,L,train4,test4,opts);

opts =

        doscale: 0
     getweights: 0
       doweight: 1.0000e-03
      libsvmdir: '/export/d2/scratch/dinoj/libsvm-2.8'            %     modify this to be the location of your svm-train
         nfolds: 5
    kernelparam: 'findeach'
     kerneltype: 2
    libsvmflags: '-m 1000'
      doprobest: 1
         labels: [1 2 3 4]
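batchtest is a local wrapper around LIBSVM. Assuming (this is a guess, not something documented here) that it simply shells out to the command-line tools, the options above map onto standard svm-train flags: kerneltype 2 is -t 2 (RBF), doprobest 1 is -b 1, and libsvmflags passes -m 1000 through; 'findeach' presumably means the RBF gamma is tuned separately for each fold by an internal cross-validation (nfolds: 5). A hypothetical sketch of the underlying call:

  % Hypothetical mapping of opts to an svm-train command line; gamma stands for
  % whatever value the 'findeach' search settles on for this fold.
  gamma = 0.05;                                % placeholder value, not from the paper
  cmd = sprintf('%s/svm-train -t %d -g %g -b %d %s fold1.train fold1.model', ...
                opts.libsvmdir, opts.kerneltype, gamma, opts.doprobest, opts.libsvmflags);
  system(cmd);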

The probability estimates by themselves can be found in probest_rbf.dat. It is an 11520 x 4 matrix with the (i,j)-th entry representing the probability (according to LIBSVM's -b calculation) that the i-th point belongs to class j.
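A quick sanity-check sketch (not part of the original scripts) for turning these probability estimates into hard tone predictions and scoring them against L:

  % assumes tonefocus_is05.mat has been loaded (for L)
  P = load('probest_rbf.dat');         % 11520 x 4 probability estimates
  [~, predtone] = max(P, [], 2);       % column with the largest probability = predicted tone
  acc = mean(predtone == L);           % overall tone accuracy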

Linear kernel: results_lin_alltogether.mat. This was created with the same call as above, but with opts.kerneltype first set to 0.

The probability estimates by themselves can be found in probest_lin.dat.

Predicting by groups of predicted focus (when focus known in training)

Using tonefocus_is05_version2.mat

  Name             Size                   Bytes  Class

  L            11520x1                     92160  double array
  X            11520x20                  1843200  double array
  X_PI_ALL     11520x52                  4792320  double array
  Ysent_foc     3840x1                     30720  double array
  focus_npip   11520x1                     92160  double array
  isfocused    11520x1                     92160  double array
  sentence     11520x1                     92160  double array
  spkr         11520x1                     92160  double array
  test4            1x4                     92400  cell array
  train4           1x4                    276720  cell array

Ysent_foc(j) is the focus condition of the j-th sentence/phrase: it equals n (1, 2, or 3) if focus falls on the n-th syllable of the phrase, and 0 if the phrase has no focus.
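Ysent_foc is consistent with the per-syllable focus_npip labels; as a sanity-check sketch (assuming, as described above, that sentence j occupies rows 3*(j-1)+1 to 3*j):

  % The focused syllable of a sentence is the one with focus_npip == 2;
  % 0-focus sentences have none.
  Ycheck = zeros(3840,1);
  for j = 1:3840,
    p = find(focus_npip(3*(j-1)+(1:3)) == 2);
    if ~isempty(p), Ycheck(j) = p; end
  end
  isequal(Ycheck, Ysent_foc)           % 1 if the two encodings agree as described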

X_PI_ALL is, like X, a set of features for each syllable. Its columns are described below (and yes, its first 20 columns equal X).

  • Columns 1 to 20 : pitch contour
  • Columns 21 to 40 : intensity contour
  • 41: mean pitch of syllable
  • 42: mean pitch of syllable - mean pitch of preceding syllable (or 0 if this syllable is the first syllable in its sentence)
  • 43: mean pitch of syllable - mean pitch of following syllable (or 0 if this syllable is the last syllable in its sentence)
  • 44: pitch range (max pitch - min pitch) of syllable
  • 45: pitch range of syllable - pitch range of preceding syllable (or 0 if this syllable is the first syllable in its sentence)
  • 46: pitch range of syllable - pitch range of following syllable (or 0 if this syllable is the last syllable in its sentence)
  • 47: mean intensity of syllable
  • 48: mean intensity of syllable - mean intensity of preceding syllable (or 0 if this syllable is the first syllable in its sentence)
  • 49: mean intensity of syllable - mean intensity of following syllable (or 0 if this syllable is the last syllable in its sentence)
  • 50: intensity range (max intensity - min intensity) of syllable
  • 51: intensity range of syllable - intensity range of preceding syllable (or 0 if this syllable is the first syllable in its sentence)
  • 52: intensity range of syllable - intensity range of following syllable (or 0 if this syllable is the last syllable in its sentence)

These additional features were created as follows (Xintensity here is what is now X_PI_ALL(:,21:40)):

  X_pitch_features = createNBRfeatures(X,[1:3:11520]);
  X_intensity_features = createNBRfeatures(Xintensity,[1:3:11520]);
  X_PI_ALL = [X  Xintensity  X_pitch_features X_intensity_features];
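createNBRfeatures itself is not included on this page. A hypothetical reconstruction, based only on the column descriptions above (the boundary handling and exact implementation are assumptions), would live in createNBRfeatures.m and look roughly like:

  function F = createNBRfeatures(M, starts)
  % M      : one row of contour samples per syllable
  % starts : indices of sentence-initial syllables (here [1:3:11520])
  % F      : [mean, mean-prev, mean-next, range, range-prev, range-next] per syllable,
  %          with the neighbour differences zeroed at sentence boundaries.
  n = size(M,1);
  m = mean(M,2);                          % per-syllable mean
  r = max(M,[],2) - min(M,[],2);          % per-syllable range
  isfirst = false(n,1); isfirst(starts) = true;
  islast  = false(n,1); islast([starts(2:end)-1 n]) = true;
  dprev_m = m - [0; m(1:end-1)];   dprev_m(isfirst) = 0;
  dnext_m = m - [m(2:end); 0];     dnext_m(islast)  = 0;
  dprev_r = r - [0; r(1:end-1)];   dprev_r(isfirst) = 0;
  dnext_r = r - [r(2:end); 0];     dnext_r(islast)  = 0;
  F = [m dprev_m dnext_m r dprev_r dnext_r];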

isfocused is a binary vector; isfocused(j) is 1 iff the j-th syllable has focus OR the j-th syllable is the final syllable in a 0-focus sentence. (This is so that 0-focus sentences are treated like 3-focus sentences.) It was created using

  isfocused=zeros(11520,1); 
  for i=1:3840,
    a=3*(i-1);
    if Ysent_foc(i),
      isfocused(a+Ysent_foc(i))=1;
    else 
      isfocused(a+3)=1;
    end;
  end

Linear SVM

  opts2.labels=[0 1];
  opts2.kerneltype = 0;
  opts2.doweight = 0.001;
  opts2.doscale = 0;
  opts2.getweights = 0;
  opts2.libsvmdir = '/home/dinoj/libsvm-2.8';             % change for your system
  opts2.libsvmflags = '-m 1000';
  opts2.doprobest = 1;

  [pred,wts,nc,cm,acc,probest,optsused] = batchtest(X_PI_ALL,isfocused,train4,test4,opts2);   % train focus predictor
  pe_isfocused_lin_PI_ALL=zeros(11520,2); 
  for i=1:4,pe_isfocused_lin_PI_ALL(test4{i},:)=probest{i};end;
  conf_focuspred_lin_PI_ALL = pe_isfocused_lin_PI_ALL(:,2);             % i-th entry is confidence of focus predictor that i-th syllable is focused
  [bestelemwise_predfocused_lin_PI_ALL,bestseqwise_predfocused_lin_PI_ALL]=choosebestscore(conf_focuspred_lin_PI_ALL,[1:3:11520],'last');
  linconfpred_pip_PI_ALL = zeros(11520,1);          % 1/2/3 = pre/in/post relative to the predicted focused syllable
  for i=1:3840,
    b=bestseqwise_predfocused_lin_PI_ALL(i);
    for j=1:b-1,
       linconfpred_pip_PI_ALL(3*(i-1)+j)=1;
    end;
    linconfpred_pip_PI_ALL(3*(i-1)+b)=2;
    for j=b+1:3,
       linconfpred_pip_PI_ALL(3*(i-1)+j)=3;
    end;
  end;
  [cm_syll_PI_ALL,nc_syll_PI_ALL]=getcm(focus_npip,linconfpred_pip_PI_ALL,[0:3]);
  [cm_sent_PI_ALL,nc_sent_PI_ALL]=getcm(Ysent_foc,bestseqwise_predfocused_lin_PI_ALL,[0:3]);
  [FORFOLDS_lin_PI_ALL,FORSPLITS_lin_PI_ALL,DETAILS_lin_PI_ALL,COMBINED_lin_PI_ALL] = splitbysomething (X,L,train4,test4,linconfpred_pip_PI_ALL,{1,2,3},opts);

The result of the above is saved in predfocus_lin_PI_ALL.mat.
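choosebestscore is another local helper. Based on how its second output is used here (this is an inference, not its documented behaviour), bestseqwise(i) appears to be the within-sentence position (1-3) of the highest-confidence syllable in sentence i, with 'last' breaking ties in favour of the later syllable. A minimal sketch of that part:

  % Hypothetical sketch of the seq-wise output of choosebestscore for this data.
  bestseqwise = zeros(3840,1);
  for i = 1:3840,
    s = conf_focuspred_lin_PI_ALL(3*(i-1)+(1:3));
    bestseqwise(i) = find(s == max(s), 1, 'last');
  end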

RBF Kernel

  optsrbf.labels=[0 1];
  optsrbf.kerneltype = 2;
  optsrbf.kernelparam = 'findeach';
  optsrbf.doweight = 0.001;
  optsrbf.doscale = 0;
  optsrbf.getweights = 0;
  optsrbf.libsvmdir = '/export/d2/scratch/dinoj/libsvm-2.8';             % change for your system
  optsrbf.libsvmflags = '-m 1000';
  optsrbf.doprobest = 1;
  optsrbf.nfolds = 5;

  [pred,wts,nc,cm,acc,probest,optsused] = batchtest(X_PI_ALL,isfocused,train4,test4,optsrbf);   % train focus predictor

  save ~/html/projects/tonefocus/predfocus_rbf_PI_ALL.mat pred nc cm acc probest optsused

  pe_isfocused_rbf_PI_ALL=zeros(11520,2); 
  for i=1:4,pe_isfocused_rbf_PI_ALL(test4{i},:)=probest{i};end;
  conf_focuspred_rbf_PI_ALL = pe_isfocused_rbf_PI_ALL(:,2);             % i-th entry is confidence of focus predictor that i-th syllable is focused
  [bestelemwise_predfocused_rbf_PI_ALL,bestseqwise_predfocused_rbf_PI_ALL]=choosebestscore(conf_focuspred_rbf_PI_ALL,[1:3:11520],'last');
  rbfconfpred_pip_PI_ALL = zeros(11520,1);
  for i=1:3840,
    b=bestseqwise_predfocused_rbf_PI_ALL(i);
    for j=1:b-1,
       rbfconfpred_pip_PI_ALL(3*(i-1)+j)=1;
    end;
    rbfconfpred_pip_PI_ALL(3*(i-1)+b)=2;
    for j=b+1:3,
       rbfconfpred_pip_PI_ALL(3*(i-1)+j)=3;
    end;
  end;
  [cm_syll_PI_ALL,nc_syll_PI_ALL]=getcm(focus_npip,rbfconfpred_pip_PI_ALL,[0:3]);
  [cm_sent_PI_ALL,nc_sent_PI_ALL]=getcm(Ysent_foc,bestseqwise_predfocused_rbf_PI_ALL,[0:3]);

  optsrbf.labels=[1:4];

  save ~/html/projects/tonefocus/predfocus_rbf_PI_ALL.mat pred nc cm acc probest optsused *rbf*PI_ALL

  [FORFOLDS_rbf_PI_ALL,FORSPLITS_rbf_PI_ALL,DETAILS_rbf_PI_ALL,COMBINED_rbf_PI_ALL] = splitbysomething (X,L,train4,test4,rbfconfpred_pip_PI_ALL,{1,2,3},optsrbf);

  save ~/html/projects/tonefocus/predfocus_rbf_PI_ALL.mat pred nc cm acc probest optsused *rbf*PI_ALL

The result of the above is saved in predfocus_rbf_PI_ALL.mat.

Predicting by Confidence-predicted focus (when focus not known during training)

Results of running the lines below are in splitbyconftonepred3LIN.mat.

  
  opts.kerneltype = 0;
  opts.labels = [1 2 3 4];
  opts.doweight = 0.001;
  opts.doscale = 0;
  opts.getweights = 0;
  opts.libsvmdir = '/home/dinoj/libsvm-2.8';             % change for your system
  opts.nfolds = 5;
  opts.libsvmflags = '-m 1000';
  opts.doprobest = 1;

  PElin = load('probest_lin.dat') ;

  highestconf_lin = max(PElin');          % per-syllable confidence = largest of the four class probabilities

  [bestelemwise,bestseqwise] = choosebestscore(highestconf_lin,[1:3:11520],'last');

  linconfpred_pip=zeros(11520,1);
  for i=1:3840,
    b=bestseqwise(i);
    for j=1:b-1,
       linconfpred_pip(3*(i-1)+j)=1;
    end;
    linconfpred_pip(3*(i-1)+b)=2;
    for j=b+1:3,
       linconfpred_pip(3*(i-1)+j)=3;
    end;
  end;

 [FORFOLDS_lin,FORSPLITS_lin,DETAILS_lin,COMBINED_lin] = splitbysomething (X,L,train4,test4,linconfpred_pip,{1,2,3},opts);
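splitbysomething (also used in the focus-known sections above) is a local helper; judging from its arguments (this is an assumption, not its documented behaviour), it trains and tests a separate tone classifier, within each fold, on the syllables whose predicted focus condition falls in each of the groups {1,2,3}, and then pools the results. A rough sketch:

  % Hypothetical sketch of splitbysomething with these arguments.
  for f = 1:4,
    for g = 1:3,
      tr = {train4{f}(linconfpred_pip(train4{f}) == g)};
      te = {test4{f}(linconfpred_pip(test4{f}) == g)};
      [p,w,nc,cm,acc,pe] = batchtest(X, L, tr, te, opts);   % one-fold run on this group's subset
      % ... the real helper presumably collects these per-group results into
      %     the FORFOLDS / FORSPLITS / DETAILS / COMBINED outputs.
    end;
  end;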



Results of running the lines below are in splitbyconftonepred3RBF.mat.

  opts.kerneltype = 2;
  opts.kernelparam = 'findeach';                % other opts as before

  PErbf = load('probest_rbf.dat');

  highestconf_rbf = max(PErbf');

  [bestelemwise,bestseqwise] = choosebestscore(highestconf_rbf,[1:3:11520],'last');

  rbfconfpred_pip=zeros(11520,1);
  for i=1:3840,
    b=bestseqwise(i);
    for j=1:b-1
      rbfconfpred_pip(3*(i-1)+j)=1;
    end;
    rbfconfpred_pip(3*(i-1)+b)=2;
    for j=b+1:3
      rbfconfpred_pip(3*(i-1)+j)=3;
    end;
  end;

  [FORFOLDS_rbf,FORSPLITS_rbf,DETAILS_rbf,COMBINED_rbf] = splitbysomething (X,L,train4,test4,rbfconfpred_pip,{1,2,3},opts);