Sequence Classification

Data format

We always assume that we are dealing with HMM-type data, i.e. there is a one-to-one correspondence between observations and labels. Thus we can speak of Sequential (or Seq11) and Flat format.

Sequential : Suppose you have a collection of N sequences, with the k-th sequence having Nk elements, each element being a D-dimensional vector. Each vector is labelled with an integer.

Flat : this is just a collection of sum_k Nk vectors (the vectors of all the sequences pooled together) with the same number of labels.

Matlab format

Sequential data: The Matlab struct X has four fields

    X.data
    X.labels
    X.origlabels
    X.comments

X.origlabels is a vector of the integers the user supplied as class labels. The labels in X.labels{:} always run from 1 to L=length(X.origlabels), whereas the user's labels may be arbitrary integers. X.origlabels keeps track of the mapping, so a label k in X.labels actually refers to X.origlabels(k).
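
The mapping between user labels and internal labels can be sketched with Matlab's built-in unique. This is only an illustration of the idea, not the toolbox's own code: the variable names are hypothetical, and it is an assumption that X.origlabels is stored in ascending order.

```matlab
% User-supplied labels for one sequence (values taken from the file examples).
userlabels = [1 -1 1];

% unique returns the sorted distinct labels and, as its third output,
% the position of each element in that sorted list.
[origlabels, ~, idx] = unique(userlabels);

% Internal labels in 1..L, forced to a row vector (1 x Ni).
labels = reshape(idx, 1, []);
L = length(origlabels);

% Going back: internal label k stands for origlabels(k).
recovered = origlabels(labels);
```

With this ordering the user label -1 becomes internal label 1 and the user label 1 becomes internal label 2.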

X.comments is empty or a cell array of comments.

X.data is a cell array, with X.data{i} being an Ni x D (sparse or dense) matrix that holds the Ni vectors of the i-th sequence.
X.labels{i} is a 1 x Ni matrix representing the corresponding labels.
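
As a small sketch, the struct for the two-sequence example used in the file-format section (3 and 4 two-dimensional vectors, labelled -1 or 1) could be built by hand as follows, and then pooled into Flat format. The ascending order of X.origlabels and the names flatdata, flatlabels and seqlen are assumptions made for this illustration.

```matlab
% Sequential format: one cell per sequence.
X = struct();
X.data = { [0.335 0.312; 0.121 -0.112; 0.954 0], ...
           [-0.45 -0.1; -1.21 0.24; 0.00 4.21; -0.12 -1.2141] };
X.labels = { [2 1 2], [1 1 2 1] };   % internal labels in 1..L
X.origlabels = [-1 1];               % internal 1 -> -1, internal 2 -> 1
X.comments = {};

% Flat format: stack all vectors and concatenate all labels.
flatdata   = vertcat(X.data{:});     % 7 x 2 matrix
flatlabels = horzcat(X.labels{:});   % 1 x 7 row vector

% Sequence lengths, needed to go back from Flat to Sequential.
seqlen = cellfun(@(d) size(d, 1), X.data);
```

Keeping seqlen around is what makes the flattening reversible; the Flat format by itself does not record where one sequence ends and the next begins.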

File Format for sequential data

This has the following options (see examples below)

Tag/Space-separated. In the first, sequences are delimited by <sequence length=..>...</sequence> tags. In the second, they are separated by blank lines.

Sparse/Dense. In the first, data is stored as index-value pairs. In the second, all D values are stored.

DimHeader (Present or absent) If present, <dimension=D> is the first line of the file.

LabHeader (Present or absent) If present, the line <labels L1 L2 ... Ln> appears after the dimension header, if there is one.

Examples : the data below shows a collection of two sequences with 3 and 4 2-dimensional vectors respectively. Each of the 7 vectors in the collection is labelled -1 or 1.

Tag-separated, dense, DimHeader, LabHeader

<dimension=2>
<labels -1 1>
<sequence length=3>
1 0.335 0.312 
-1 0.121  -0.112
1 0.954 0
</sequence>
<sequence length=4>
-1 -0.45 -0.1
-1 -1.21 0.24
1 0.00 4.21
-1 -0.12 -1.2141
</sequence>

Tag-separated, dense, no headers

<sequence length=3>
1 0.335 0.312 
-1 0.121  -0.112
1 0.954 0
</sequence>
<sequence length=4>
-1 -0.45 -0.1
-1 -1.21 0.24
1 0.00 4.21
-1 -0.12 -1.2141
</sequence>

Space-separated, dense, LabHeader

<labels -1 1>
1 0.335 0.312 
-1 0.121  -0.112
1 0.954 0

-1 -0.45 -0.1
-1 -1.21 0.24
1 0.00 4.21
-1 -0.12 -1.2141

Space-separated, sparse, DimHeader

<dimension=2>
1 1:0.335 2:0.312 
-1 1:0.121  2:-0.112
1 1:0.954 

-1 1:-0.45 2:-0.1
-1 1:-1.21 2:0.24
1 2:4.21
-1 1:-0.12 2:-1.2141

Space-separated, sparse, no headers

1 1:0.335 2:0.312 
-1 1:0.121  2:-0.112
1 1:0.954 

-1 1:-0.45 2:-0.1
-1 1:-1.21 2:0.24
1 2:4.21
-1 1:-0.12 2:-1.2141
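
To make the space-separated, sparse layout concrete, here is a rough sketch of how such text could be turned into per-sequence matrices and label vectors. It is not the code of seqread.m. It assumes that the dimension D is known in advance (there is no DimHeader), that indices are 1-based, and that omitted indices mean zero. It keeps the raw user labels rather than remapping them to 1..L, and all variable names are hypothetical.

```matlab
% The headerless sparse example, held in a string instead of a file.
txt = sprintf(['1 1:0.335 2:0.312\n-1 1:0.121 2:-0.112\n1 1:0.954\n\n' ...
               '-1 1:-0.45 2:-0.1\n-1 1:-1.21 2:0.24\n1 2:4.21\n' ...
               '-1 1:-0.12 2:-1.2141\n']);
D = 2;  % assumed known, since the text carries no dimension header

textlines = regexp(txt, '\n', 'split');
data = {}; labels = {};
rows = zeros(0, D); labs = zeros(1, 0);
for i = 1:numel(textlines)
    ln = strtrim(textlines{i});
    if isempty(ln)
        % A blank line (or the end of the text) closes the current sequence.
        if ~isempty(labs)
            data{end+1} = rows; labels{end+1} = labs;
            rows = zeros(0, D); labs = zeros(1, 0);
        end
        continue;
    end
    toks = regexp(ln, '\s+', 'split');
    labs(end+1) = str2double(toks{1});     % first token is the label
    row = zeros(1, D);                     % indices not listed stay zero
    for j = 2:numel(toks)
        iv = sscanf(toks{j}, '%d:%f');     % [index; value]
        row(iv(1)) = iv(2);
    end
    rows(end+1, :) = row;
end
if ~isempty(labs)
    % Text that does not end in a newline still yields its last sequence.
    data{end+1} = rows; labels{end+1} = labs;
end
```

The result has data{1} of size 3 x 2 and data{2} of size 4 x 2. Dense rows are used here for simplicity; for large D, calling sparse on each matrix would be the natural next step.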

Files

seqread.m. Reads from a text file in sequential or flat format.

seqwrite.m. Writes to a text file in sequential or flat format.
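
For orientation, the following sketch produces tag-separated, dense text without headers from a struct X, building the output in a string. It is an independent illustration and not the contents of seqwrite.m, whose interface is not described here. The %g number format and the variable names are assumptions.

```matlab
% Example struct: two sequences, internal labels 1..2 mapped via origlabels.
X.data = { [0.335 0.312; 0.121 -0.112; 0.954 0], ...
           [-0.45 -0.1; -1.21 0.24; 0 4.21; -0.12 -1.2141] };
X.labels = { [2 1 2], [1 1 2 1] };
X.origlabels = [-1 1];

out = '';
for i = 1:numel(X.data)
    Ni = size(X.data{i}, 1);
    out = [out sprintf('<sequence length=%d>\n', Ni)];
    for t = 1:Ni
        % Write the user's original label, not the internal one.
        lab = X.origlabels(X.labels{i}(t));
        % full() makes the row dense in case X.data{i} is sparse.
        vals = sprintf(' %g', full(X.data{i}(t, :)));
        out = [out sprintf('%d%s\n', lab, vals)];
    end
    out = [out sprintf('</sequence>\n')];
end
```

The string out could then be written to disk with fopen, fprintf(fid, '%s', out) and fclose.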