Firstly, if you want to run this on the uChicago CS Dept network, you need to run Matlab 7; type /opt/cs-matlab/7.1/bin/matlab to do so.
We have N sequences of D-dimensional vectors, and the n-th sequence is of length Ln. Each vector is to be classified into one of K classes.
Y and X are both cell arrays with N elements:
Y{n} is a 1 x Ln vector of labels (each label is a number between 1 and K inclusive)
X{n} is a D x Ln matrix
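As an illustration of the layout just described, here is a minimal sketch that builds toy X and Y cell arrays. The sizes and the random values are made up for the example and have nothing to do with the real data.

```matlab
% Toy data in the X / Y layout described above (all values are made up).
N = 3;            % number of sequences
D = 2;            % dimension of each vector
K = 4;            % number of classes
L = [5 3 4];      % L(n) = length of the n-th sequence

X = cell(1, N);
Y = cell(1, N);
for n = 1:N
    X{n} = randn(D, L(n));            % D x Ln matrix of feature vectors
    Y{n} = ceil(K * rand(1, L(n)));   % 1 x Ln vector of labels in 1..K
end

disp(size(X{2}));   % D x L(2)
disp(size(Y{2}));   % 1 x L(2)
```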
I am not entirely sure how this works, but looking at CRF1D/crfchainDemo.m resulted in the wrapper files below. Unfortunately, these only work in the special case where all sequences have the same length, i.e. Ln = L for all n.
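Since the wrappers assume equal-length sequences, it may be worth checking this before training. The following is a hypothetical pre-check, not part of the wrapper files; the labels are made up.

```matlab
% Hypothetical pre-check: do all sequences have the same length?
Y = {[1 2 3], [4 4 1], [2 2 2]};   % made-up labels, three sequences of length 3
len = cellfun(@numel, Y);          % len(n) = Ln
sameLength = all(len == len(1));
if ~sameLength
    error('Sequences differ in length; the wrappers assume Ln = L for all n.');
end
```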
Place the following files in a directory on matlab's path with the CRF Toolbox installed. xtrain and xtest have a format like X, while ytrain and ytest have a format like Y above.
train file : crf1dtrain.m
Usage: ch = crf1dtrain(xtrain,ytrain,number_of_classes);
test file : crf1dtest.m
(this uses getcm.m)
Usage: result = crf1dtest(xtest,ytest,ch);
result.cm has a confusion matrix and result.ac has overall accuracy.
I tested this on some of my data and got good results: 93% on a problem where I expected about 85-90%. This could be an artifact of how the data was subsetted, but at least it's on the right track. The model was trained on 480 sequences, each with 3 20-dimensional vectors, representing lab speech by a Chinese speaker. Each vector represented a syllable with one of four tones. Testing was on a different speaker. The training time was 5 minutes.
Example data: xtf99syll.mat (zipped, 2Mb) with the following fields
Name      Size      Bytes      Class
spkr      1x3840    30720      double array
x         1x3840    2073600    cell array
yfocus    1x3840    322560     cell array
ytone     1x3840    322560     cell array
For full details of this data, click here. In summary, it contains 3840 phrases of Mandarin lab speech by 8 native speakers, recorded by Yi Xu for a 1999 paper. Each phrase here has three syllables. Each syllable has a 20-dimensional vector (representing its normalized pitch contour), one of 4 possible tones, and one of 2 possible focus conditions (focused or not focused).
For the j-th syllable in the i-th phrase, the 20-dimensional pitch vector is in x{i}(:,j), its tone is in ytone{i}(j) (1 = high, 2 = rising, 3 = low, 4 = falling), its focus condition is in yfocus{i}(j) (1 = focused, 2 = not focused), and its speaker is in spkr(i) (an integer from 1 to 8).
The pitch contour is z-normalized by speaker, so you shouldn't have to worry about scaling issues; values are between -4 and 4.
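The indexing scheme can be tried out on a small stand-in for the real file. In this sketch the four phrases, two speakers and all values are invented; only the field names and shapes follow the description above.

```matlab
% Made-up stand-in for the fields of xtf99syll.mat (2 speakers, 2 phrases each)
spkr   = [1 1 2 2];
x      = cell(1, 4);
ytone  = cell(1, 4);
yfocus = cell(1, 4);
for i = 1:4
    x{i}      = zeros(20, 3);   % one 20-dimensional pitch vector per syllable
    ytone{i}  = [1 2 4];        % tone of each of the three syllables
    yfocus{i} = [2 1 2];        % focus condition of each syllable
end

i = 3; j = 2;                   % j-th syllable of the i-th phrase
pitch   = x{i}(:, j);           % 20 x 1 pitch contour
tone    = ytone{i}(j);          % 1=high, 2=rising, 3=low, 4=falling
focus   = yfocus{i}(j);         % 1 = focused, 2 = not focused
speaker = spkr(i);              % speaker index

idx = find(spkr == 2);          % indices of all phrases by speaker 2
```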
load xtf99syll                             % assumes the unzipped xtf99syll.mat is on the path
tr = 1:240;                                % training phrases
te = 241:480;                              % test phrases
ch = crf1dtrain(x(tr),ytone(tr),4);        % 4 = number of tone classes
result = crf1dtest(x(te),ytone(te),ch);
This produces no output for about 90 seconds, then starts producing iteration results, and finishes after another 30 seconds. The test function works instantly, and returns several results, some of which are from the crf functions and others are classification statistics produced by getcm.
>> result
result =
----------- PREDICTIONS -------------------------
pred: [240x3 double] % labels predicted by CRF;
% pred(i,j) = predicted tone for
% j-th syllable of i-th sequence
yflat: [720x1 double] % yflat(3*(i-1)+j) = actual tone of
% j-th syllable of i-th sequence
predflat: [1x720 double] % predflat(3*(i-1)+j) = predicted tone of
% j-th syllable of i-th sequence
------------ PROBABILITIES ESTIMATED BY CRF -------
(I THINK!)
probest: {1x240 cell} % probest{i}(j,k) = prob that
% j-th syllable of i-th sequence has tone k
% equals bel' returned by bpchaininfer
probestflat: [720x4 double] % probestflat(3*(i-1)+j,:) = probest{i}(j,:)
probtrans: {1x240 cell} % probtrans{i}(c1,c2,j) =
% probability that in i-th sequence,
% j-th syllable has tone c1
% (j+1)-th syllable has tone c2
% equals belE returned by bpchaininfer
mostconfflat: [1x720 double] % mostconfflat(3*(i-1)+j) = max(probest{i}(j,:))
%
logZ: [1x240 double] % logZ(i) = log of the normalization factor Z
% increases with likelihood of sequence (?)
------------ CLASSIFICATION RESULTS -------------
cm: [4x4 double] % confusion matrix
nc: 525 % number of correct answers
ac: 0.7292 % classification accuracy
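The flat indexing described in the comments above, e.g. predflat(3*(i-1)+j) for pred(i,j), is a row-by-row flattening. A minimal sketch with made-up predictions follows; how the wrapper itself builds predflat is not shown here, so treat this only as an equivalent construction.

```matlab
% Made-up predictions: 2 sequences, 3 syllables each
pred = [1 2 3;
        4 1 2];

% Row-by-row flattening: transpose first, because reshape reads column by column
predflat = reshape(pred', 1, []);

i = 2; j = 1;
sameValue = (predflat(3*(i-1) + j) == pred(i, j));
```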
>> result.cm
ans =
191 6 0 3
2 76 0 2
3 66 164 7
99 4 3 94
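The summary numbers can be recomputed from this matrix. The sketch below is not the code in getcm.m, just the arithmetic. The row totals match the distribution of tones in ytone(241:480) reported further down, which suggests that rows are actual tones and columns are predicted tones.

```matlab
cm = [191   6   0   3;
        2  76   0   2;
        3  66 164   7;
       99   4   3  94];

nc = trace(cm);                      % correct answers are on the diagonal
ac = nc / sum(cm(:));                % overall accuracy
rowTotals = sum(cm, 2)';             % number of test syllables per row
perClass  = diag(cm)' ./ rowTotals;  % fraction correct within each row

fprintf('nc = %d, ac = %.4f\n', nc, ac);
```

The per-row fractions show that most of the errors come from rows 3 and 4.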
Btw, the main reason for the poor accuracy is the difference in class distributions between the training and test sets (be careful with subsets: the tones and focus conditions are not randomly distributed over subsets of [1:480]). If we train on all 480 sequences of the first speaker and test on all 480 sequences of the second speaker, the accuracy is about 93%.
Distribution of the four tones (counts for tones 1 to 4):
overall:           4160  2240  2880  2240
ytone(1:240):       320   200   120    80
ytone(241:480):     200    80   240   200
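Counts like these are easy to check before choosing a split. Here is a minimal sketch on made-up labels; countTones is a hypothetical helper, not part of the toolbox or the wrapper files.

```matlab
% Made-up labels: 4 sequences of 3 syllables each
ytone = {[1 1 3], [2 4 1], [3 3 4], [1 2 2]};
K  = 4;        % number of tone classes
tr = 1:2;      % candidate training subset
te = 3:4;      % candidate test subset

% Hypothetical helper: count how often each tone 1..K occurs in a cell array of labels
countTones = @(y) accumarray([y{:}]', 1, [K 1])';

trCounts = countTones(ytone(tr));
teCounts = countTones(ytone(te));
disp(trCounts);
disp(teCounts);
```

On the real data, a speaker-based split such as tr = find(spkr == 1) and te = find(spkr == 2) would select the subsets used for the 93% result.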