Some notes on using Kevin Murphy's CRF Toolbox

Firstly, if you want to run this on the uChicago CS Dept network, you need to run Matlab 7; type /opt/cs-matlab/7.1/bin/matlab to do so.

We have N sequences of D-dimensional vectors, and the n-th sequence is of length Ln. Each vector is to be classified into one of K classes.

   Y and X are both cell arrays with N elements
     Y{n} is a 1 x Ln vector of labels (each label is a number between 1 and K inclusive)
     X{n} is a D x Ln matrix 
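
For concreteness, here is a tiny synthetic example of this format (all values here are made up, just to illustrate the shapes):

        N = 5; D = 20; L = 3; K = 4;     % 5 sequences of length 3, 20-dim vectors, 4 classes
        X = cell(1, N);  Y = cell(1, N);
        for n = 1:N
            X{n} = randn(D, L);          % D x Ln matrix of feature vectors
            Y{n} = ceil(K * rand(1, L)); % 1 x Ln labels, each between 1 and K
        end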

I am not entirely sure how this works, but looking at CRF1D/crfchainDemo.m resulted in the wrapper files below. Unfortunately these only work for the special case where all sequences are of the same length, i.e. Ln = L for all n.
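
If your data has sequences of mixed lengths, one crude workaround (just a sketch, using the X/Y format above) is to keep only the sequences of one chosen length:

        L = 3;                                % chosen common length
        keep = (cellfun('length', Y) == L);   % which sequences have Ln == L
        Xsub = X(keep);
        Ysub = Y(keep);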

Place the following files in a directory on Matlab's path, on a setup with the CRF Toolbox installed. xtrain and xtest have the same format as X above, while ytrain and ytest have the same format as Y.

train file : crf1dtrain.m
Usage: ch = crf1dtrain(xtrain,ytrain,number_of_classes);

test file : crf1dtest.m (this uses getcm.m)
Usage: result = crf1dtest(xtest,ytest,ch);

result.cm has a confusion matrix and result.ac has overall accuracy.
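
In case it helps to know what those contain: given flattened true labels yflat and predicted labels predflat (crf1dtest returns both, see the example output below), the confusion matrix and accuracy amount to something like the following. This is just a sketch, not necessarily how getcm.m computes them, and getcm may use the transposed row/column convention.

        K  = 4;                               % number of classes (K in the notation above)
        cm = zeros(K, K);
        for i = 1:length(yflat)
            cm(yflat(i), predflat(i)) = cm(yflat(i), predflat(i)) + 1;
        end
        nc = sum(diag(cm));                   % number of correct answers
        ac = nc / sum(cm(:));                 % overall accuracy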

I tested this on some of my data and got good results: 93% on a problem where I expected about 85-90%... it could have been a data-subsetting fluke, but at least it's on the right track. Training used 480 sequences, each consisting of three 20-dimensional vectors, representing lab speech by a Chinese speaker. Each vector represented a syllable with one of four tones. Testing was on a different speaker. Training took 5 minutes.


Example data: xtf99syll.mat (zipped, 2Mb) with the following fields

  spkr         1x3840                  30720  double array
  x            1x3840                2073600  cell array
  yfocus       1x3840                 322560  cell array
  ytone        1x3840                 322560  cell array

For full details of this data, click here; in summary, it consists of 3840 phrases of Mandarin lab speech by 8 native speakers, recorded by Yi Xu for a 1999 paper. Each phrase here has three syllables. Each syllable has a 20-dimensional vector (representing its normalized pitch contour), one of 4 possible tones, and one of 2 possible focus conditions (focused or not focused).

For the j-th syllable in the i-th phrase, the 20-dimensional pitch vector is in x{i}(:,j), its tone is in ytone{i}(j) (1=high, 2=rising, 3=low, 4=falling), its focus condition is in yfocus{i}(j) (1 = focused, 2 = not focused), and its speaker is in spkr(i) (an integer from 1 to 8).

The pitch contour is z-normalized by speaker, so you shouldn't have to worry about scaling issues; values are between -4 and 4.
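
For example, after unzipping and loading the file, the values for one syllable can be read off like this (i and j are just arbitrary example indices):

        load xtf99syll               % provides spkr, x, yfocus, ytone
        i = 17;  j = 2;              % pick a phrase and a syllable
        pitch = x{i}(:,j);           % 20-dim normalized pitch contour
        tone  = ytone{i}(j);         % 1=high, 2=rising, 3=low, 4=falling
        focus = yfocus{i}(j);        % 1=focused, 2=not focused
        spk   = spkr(i);             % speaker id, 1 to 8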


Example: Suppose we want to do tone classification, and want to train with the first 240 phrases (=sequences) of the first speaker and test on the remaining 240 phrases of the first speaker. Then we'd say
         tr = [1:240];
         te = [241:480];
         ch = crf1dtrain(x(tr),ytone(tr),4);   % 4 = number of tone classes
         result = crf1dtest (x(te),ytone(te),ch);

The training call produces no output for about 90 seconds, then starts printing iteration results, and finishes after another 30 seconds. The test function returns almost instantly; its output mixes quantities from the CRF functions with classification statistics produced by getcm.

>> result

result = 
       ----------- PREDICTIONS -------------------------

            pred: [240x3 double]  % labels predicted by CRF; 
                                  % pred(i,j) = predicted tone for 
                                  %   j-th syllable of i-th sequence
           yflat: [720x1 double]  % yflat(3*(i-1)+j) = actual tone of 
                                  %   j-th syllable of i-th sequence
        predflat: [1x720 double]  % predflat(3*(i-1)+j) = predicted tone of 
                                  %   j-th syllable of i-th sequence

       ------------ PROBABILITIES ESTIMATED BY CRF -------
                          (I THINK!)

         probest: {1x240 cell}    %   probest{i}(j,k) = prob that
                                  %   j-th syllable of i-th sequence has tone k
                                  %   equals bel' returned by bpchaininfer
     probestflat: [720x4 double]  % probestflat(3*(i-1)+j,:) = probest{i}(j,:)
       probtrans: {1x240 cell}    % probtrans{i}(c1,c2,j) = 
                                  %   probability that in i-th sequence, 
                                  %     j-th syllable has tone c1 and
                                  %     (j+1)-th syllable has tone c2
                                  %   equals belE returned by bpchaininfer    
    mostconfflat: [1x720 double]  % mostconfflat(3*(i-1)+j) = max(probest{i}(j,:))
                                  % 
            logZ: [1x240 double]  % logZ(i) = normalization factor 
                                  %    increases with likelihood of sequence (?)
       ------------ CLASSIFICATION RESULTS -------------

              cm: [4x4 double]   % confusion matrix
              nc: 525            % number of correct answers
              ac: 0.7292         % classification accuracy

>> result.cm

ans =

   191     6     0     3
     2    76     0     2
     3    66   164     7
    99     4     3    94
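
As a sanity check, nc and ac can be recomputed directly from the confusion matrix:

        nc = sum(diag(result.cm));       % 525 correct syllables out of 720
        ac = nc / sum(result.cm(:));     % 525/720 = 0.7292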


Btw, the main reason for the poor accuracy here is the difference in class distributions between the training and test sets (be careful with subsets, since the distribution of tones and focus is not random over subsets of [1:480]). If we train on all 480 sequences of the first speaker and test on all 480 sequences of the second speaker, the accuracy is about 93%.

Overall distribution of the four tones:

       4160   2240   2880   2240

Distribution of tones in ytone(1:240)

        320    200    120     80

Distribution of tones in ytone(241:480)

        200     80     240   200
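
These counts can be reproduced by flattening ytone and counting each label, e.g.:

        yt = [ytone{:}];                 % flatten all tone labels (1 x 11520)
        counts = zeros(1, 4);
        for k = 1:4
            counts(k) = sum(yt == k);    % number of syllables with tone k
        end
        % for a subset, flatten just that part, e.g. yt = [ytone{1:240}];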