Dialog Act Classification

Combining Text and Prosodic Features with Support Vector Machines

Dinoj Surendran, Gina-Anne Levow

Each point represents a dialog act in the HCRC Maptask data set; dialog acts of the same type are shown in the same color. Points that are close together were classified very similarly by a linear SVM using text and prosodic features. Classification accuracy was 64.5%.

Papers: There are two versions of this paper. The first version was accepted at ASRU 2005 but had to be withdrawn when the first author could not travel owing to passport/visa issues. (The second author couldn't go due to scheduling conflicts, and some author had to be there to present it.) The second version, just submitted to ICSLP 2006, is like the first, but with the five pages squeezed down to four.

Interactive 3D model from which the above picture was made.
Dialog Act     Description
instruct       Commands partner to carry out an action
explain        States information that partner did not elicit
align          Checks attention and agreement of partner, or their readiness for the next dialog act
check          Requests partner to confirm information that the checker is partially sure of
query-yn       Yes/no question other than a check or align
query-w        Any other question
acknowledge    Minimal verbal response showing that the speaker has heard the preceding dialog act
clarify        Repetition of information already stated by the speaker, often in response to a check dialog act
reply-y        Affirmative reply to any query
reply-n        Negative reply to any query
reply-w        Any other reply
ready          Dialog act that occurs after the end of a dialog game and prepares the conversation for a new game

Data

If you are a machine learning person who just wants numbers to crunch, you need the processed feature files described below (textfeats.zip and the Matlab workspace maptasknc_textfeats.mat).

Rawer data

moves_normpitch.zip (91 MB; link not available owing to bandwidth and copyright restrictions; if you already have access to the raw data via the LDC and want this processed version, email me): this has 64 files of the form q[1-8]nc[1-8].moves.normpitch, representing 64 conversations, each between two speakers, a giver and a follower, who are having a task-oriented dialogue with no eye contact.

(Not publicly available: raw wav files, in wavfiles12/34/56/78.zip, with contents like q1nc2_wav_43.dat holding the samples for the 43rd utterance of conversation q1nc2. Each zip file is about 180 MB.)

Each file is of the form

%   zvalue_of_momelized_pitch   momelized_pitch  zvalue_of_raw_pitch   raw_pitch_from_esps   prob_voicing   zvalue_of_local_intensity  local_intensity
%      (actually prob_voicing is the normalized crosscorrelation that ESPS uses to determine the output pitch value -
%         officially ESPS gives it as 1 if the raw pitch is 0 and 0 otherwise)
%     (z values are computed by subtracting the mean and then dividing by the standard deviation...  with the
%       exception of z value for raw pitch, which remains 0 if the raw pitch is 0 i.e. not found)

<move num=1 conv=q1nc1  speaker=G  label=ready  numFrames=103  duration=1.03000   words=[ehm SILENCE right]  startframe_esps=0 endframe_esps=102 startframe_wav=0 endframe_wav=20532>
1.03657002837298 239.78 0.0 0.0 0.767700254917 -0.806086071602533 0.0
1.02391681965019 239.12 0.0 0.0 0.33472892642 -0.662385096566971 50.5230903625
1.00282813844553 238.02 0.0 0.0 0.535148739815 -0.308826713784724 174.828872681
0.973303984759003 236.48 0.0 0.0 0.197349309921 0.108718274569066 321.631378174
0.935344358590615 234.5 0.0 0.0 0.472838759422 -0.449386645306214 125.410125732
0.888949259940363 232.08 0.0 0.0 0.609551012516 0.009582051313656 286.776580811
0.834118688808247 229.22 0.977752764603535 236.712051392 0.669899046421 0.198904807185349 353.339599609
...
-0.511358343577427 159.039 -1.74353578208344 94.7677841187 0.707227408886 -0.197000152621103 214.145401001
-0.75276622878568 146.447 -1.66435873898597 98.8977127075 0.389831155539 -0.182989673063866 219.071273804
</move>
<move num=2 conv=q1nc1  speaker=G  label=instruct  numFrames=130  duration=1.30000   words=[you start at the caravan park]  startframe_esps=102 endframe_esps=231 startframe_wav=20532 endframe_wav=46376>
-1.52206214760331 106.32 -1.66435873898597 98.8977127075 0.389831155539 -0.182989673063866 219.071273804
...

The first few lines (those starting with % or containing only whitespace) are comments. After that, each dialog act is enclosed between 'move' tags. (Jargon note: 'move' equals 'dialog act'.)

For example, consider the first dialog act above: move 1 of conversation q1nc1 is a 'ready' move by the giver (speaker=G), with words [ehm SILENCE right]; its 103 frames of seven feature values each cover 1.03 seconds of speech, i.e. one frame per 10 ms.
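If you want to read these files with your own code, something along the following lines works. This is a minimal Perl sketch of my own, not part of the released scripts, and the hash field names are mine; it assumes exactly the format shown above.

   # read_normpitch: parse one .moves.normpitch file into an array of moves.
   # Illustrative sketch only; assumes the header/frame layout shown above.
   use strict;
   use warnings;

   sub read_normpitch {
       my ($path) = @_;
       open my $fh, '<', $path or die "cannot open $path: $!";
       my (@moves, $cur);
       while (my $line = <$fh>) {
           chomp $line;
           next if $line =~ /^\s*(%|$)/;        # skip comments and blank lines
           if ($line =~ /^<move\b/) {           # header line of a dialog act
               $cur = { frames => [] };
               ($cur->{words}) = $line =~ /words=\[([^\]]*)\]/;
               while ($line =~ /(\w+)=([^\s>]+)/g) {
                   $cur->{$1} = $2 unless $1 eq 'words';
               }
           } elsif ($line =~ m{^</move>}) {     # end of the dialog act
               push @moves, $cur;
               undef $cur;
           } elsif ($cur) {                     # a frame: seven numeric columns
               push @{ $cur->{frames} }, [ split ' ', $line ];
           }
       }
       close $fh;
       return \@moves;
   }

   my $moves = read_normpitch('q1nc1.moves.normpitch');
   printf "%s move, %d frames, words [%s]\n",
          $moves->[0]{label}, scalar @{ $moves->[0]{frames} }, $moves->[0]{words};

Run on q1nc1, this would print "ready move, 103 frames, words [ehm SILENCE right]".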

Processing Text Features

The file you want to work with is ngrams.pl.

This script can be used in one of two ways. The first creates a file of ngrams from training data; the second uses that file to create a sparse vector of word features for a new dialog act. In other words, the first usage is for training data and the second is for test data.

Training Usage

   perl ngrams.pl [-o] [-n] [-w] [-s featuresfile to save] [-c listofconvfiles (or conv1 ... convN)] 
         [-e maximum_relative_entropy] [-f minimum_frequency] [-x maximum_number_of_n_for_which_to_get_ngrams] 

This is best explained by example. Suppose you say

   perl ngrams.pl -s wordseg.txt -c q1nc1 q1nc2 -x 2

This means that only the files q1nc1.moves.normpitch and q1nc2.moves.normpitch are analyzed. You could also have listed

q1nc1
q1nc2

in a file called, say, trainfiles.txt and said

   perl ngrams.pl -s wordseg.txt -c trainfiles.txt -x 2

This would be equivalent. Anyway, this command creates the file wordseg.txt with details of all unigrams and bigrams in conversations q1nc1 and q1nc2. If you had wanted trigrams as well, you would have said "-x 3", and if you'd wanted just unigrams, "-x 1". The first few entries of wordseg.txt look like this:
  7 0.372599906634556 # [and_SILENCE]
  6 0.518396725840501 # [SILENCE_at]
  7 0.49787893696661 # [go_straight]
  5 0.195092515991592 # [along_the]
  15 0.385252917924331 # [SILENCE_just]
  8 0.514976383181526 # [mill_SILENCE]
  16 0.608035847461469 # [what]
  13 0.451076213641237 # [and_you]
  23 0.499099348179897 # [SILENCE_NOISE(NONVOCAL)]
  56 0.678022537022683 # [going]
  8 0.489476457508413 # [down_from]
  ...

This means, for instance, that the bigram "along the" occurred 5 times in q1nc1 and q1nc2, and that the word "going" occurred 56 times. The second column is the 'relative entropy' of the distribution of the ngram over the different classes of dialog acts. It is high for ngrams that occur uniformly across classes, and low for ngrams concentrated in particular classes. Useful ngrams have high frequency (so they are more likely to occur in test examples) and low relative entropy (so that their occurrence gives more classification-useful information).
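The exact formula is not spelled out here, but a reading consistent with the values above (all between 0 and 1) is the entropy of the ngram's distribution over the dialog act classes, normalized by the log of the number of classes. A sketch of that assumed definition, not verified against ngrams.pl:

   # Assumed definition: entropy of an ngram's class distribution, normalized
   # so that 0 means 'all occurrences in one class' and 1 means 'spread
   # uniformly over all classes'. May differ from what ngrams.pl computes.
   use strict;
   use warnings;
   use List::Util qw(sum);

   sub relative_entropy {
       my ($counts, $num_classes) = @_;   # $counts: per-class occurrence counts
       my $total = sum(@$counts);
       my $h = 0;
       for my $c (grep { $_ > 0 } @$counts) {
           my $p = $c / $total;
           $h -= $p * log($p);
       }
       return $h / log($num_classes);
   }

   # An ngram seen 5 times, all in one of 12 classes, scores 0:
   print relative_entropy([5, (0) x 11], 12), "\n";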

Suppose you only want those ngrams that have frequency at least 5 and relative entropy at most 0.5. Then you'd say

   perl ngrams.pl -s wordseg1.txt -c trainfiles.txt -x 2 -e 0.5 -f 5

The order in which you give the flags makes no difference. Anyway, wordseg1.txt looks like:

  7 0.372599906634556 # [and_SILENCE]
  7 0.49787893696661 # [go_straight]
  5 0.195092515991592 # [along_the]
  15 0.385252917924331 # [SILENCE_just]
  13 0.451076213641237 # [and_you]
  23 0.499099348179897 # [SILENCE_NOISE(NONVOCAL)]
  8 0.489476457508413 # [down_from]
  ...

Note that entries like [SILENCE_at] (frequency 6, but relative entropy 0.518 > 0.5) have disappeared.
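Incidentally, if you already have an unfiltered featuresfile, the same cut can be applied after the fact. As far as I can tell, this one-liner (my own, not part of ngrams.pl) produces the same wordseg1.txt from wordseg.txt, since the first two columns are frequency and relative entropy:

   perl -ane 'print if $F[0] >= 5 && $F[1] <= 0.5' wordseg.txt > wordseg1.txt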

There are some more flags you can add that take no arguments (-o, -n, and -w in the synopsis above).

Test Usage

   perl ngrams.pl [-r featuresfile] [-c listofconvfiles (or conv1 ... convN)] 

featuresfile is simply the file (e.g. wordseg.txt) created using the -s flag in the training usage.

The -c flag has the same usage as before.

So for example, if you had said in training

  perl ngrams.pl -s wordseg5.txt -c q1nc1 q1nc2 -x 2 -f 5 -e 0.5 -o -n -w

and got wordseg5.txt, you could then say

  perl ngrams.pl -r wordseg5.txt -c q1nc3 q1nc4

to produce the files q1nc3_text.dat and q1nc4_text.dat.

If you would like your output files to be called, say, q1nc3_text_blah.dat instead, just add the flag -t blah.

q1nc3_text.dat looks like

12 1:1
12 1:1
1 129:1 1:2 80:1
2 1:8 148:1 80:1
4 71:1 1:5
9 1:2 43:1
4 1:2 27:1
...

The structure of each line is <classlabel> <index1>:<val1> <index2>:<val2> ... <indexk>:<valk>, the sparse vector format used by LIBSVM. Index 1 stands for the feature 'this word did not occur in featuresfile', while index j+1 stands for the j-th ngram in featuresfile.

For instance, here, the first two dialog acts of q1nc3 are both of class 12, and each contains a single word that was not among the ngrams in wordseg5.txt. The third dialog act has class label 1; it contains the 128th and 79th ngrams of wordseg5.txt once each, plus two words that do not appear there.
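To make the indexing concrete, here is a sketch of how such a line could be built from a featuresfile. This is my own illustration of the format (unigrams only), with no claim that ngrams.pl orders the index:value pairs the same way:

   # Sketch of the sparse format, not the actual ngrams.pl logic. Assumes
   # the featuresfile has one [ngram] per line, in order, as shown above.
   use strict;
   use warnings;

   my %index;                                   # ngram -> feature index
   open my $fh, '<', 'wordseg5.txt' or die $!;
   while (<$fh>) {
       $index{$1} = $. + 1 if /\[([^\]]*)\]/;   # j-th ngram gets index j+1
   }
   close $fh;

   sub sparse_line {
       my ($class, @words) = @_;
       my %val;
       $val{ $index{$_} // 1 }++ for @words;    # index 1 = not in featuresfile
       return join ' ', $class,
              map { "$_:$val{$_}" } sort { $a <=> $b } keys %val;
   }

   print sparse_line(12, qw(ehm right)), "\n";  # "12 1:2" if neither word matches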

A Matlab script

makedataset.m

    [X,L,trainindices,testindices] = makedataset ;

The result of this is in the workspace maptasknc_textfeats.mat.

  L              14810x1                    118480  double array
  X                  1x4                   6386636  cell array
  testindices        1x4                    118720  cell array
  trainindices       1x4                    355680  cell array

This represents four splits of the 14810 data points into non-overlapping training and test sets. The i-th split (i=1,2,3,4) uses X{i}(trainindices{i},:) as training data and X{i}(testindices{i},:) as test data. The X{i} differ from each other because the ngrams consist of all unigrams, bigrams, and trigrams that occur at least twice in the training data (plus unigram_1 features), and thus depend on which training set is used.

The text features are in textfeats.zip.

If you know about Maptask's structure: the data comes from 64 conversations, and the test sets of the four splits form a partition of the 14810 moves (each data point is one move).

Classification code

Please see revhmm.m and the LIBSVM-related functions on my Matlab scripts page.