Each point represents a dialog act in the HCRC Maptask data set, with dialog acts of the same type colored the same. Points that are close together were classified very similarly by a linear SVM using text and prosodic features. Classification accuracy was 64.5%.
Papers: There are two versions of this paper. The first version was accepted at ASRU 2005 but had to be withdrawn when the first author could not travel owing to passport/visa issues. (The second author couldn't go due to scheduling conflicts, and some author had to be there to present it.) The second version, just submitted to ICSLP 2006, is like the first, but with the five pages squeezed down to four pages.
Interactive 3d model from which the above picture was made.
| Dialog Act | Description |
| instruction | Commands partner to carry out action |
| explanation | States information that partner did not elicit |
| align | Checks attention & agreement of partner, or their readiness for next dialog act |
| check | Requests partner to confirm information that checker is partially sure of |
| query-yn | Yes/No question other than a check or align |
| query-w | Any other question |
| acknowledge | Minimal verbal response showing that speaker has heard the preceding dialog act |
| clarify | Repetition of information already stated by speaker, often in response to a check dialog act |
| reply-y | Affirmative reply to any query |
| reply-n | Negative reply to any query |
| reply-w | Any other reply |
| ready | Dialog act that occurs after end of a dialog game and prepares conversation for a new game |
If you are a machine learning person who just wants numbers to crunch, you need the files
| Name | Size | Bytes | Class |
| L | 14810x1 | 118480 | double array |
| X | 1x4 | 6386636 | cell array |
| testindices | 1x4 | 118720 | cell array |
| trainindices | 1x4 | 355680 | cell array |
This represents four splits of the 14810 data points into non-overlapping test and training sets. The i-th split (i=1,2,3,4) uses as training data X{i}(trainindices{i},:) and as test data X{i}(testindices{i},:). There are different X{i} because the n-grams consist of all unigrams, bigrams, and trigrams that occur at least twice in the training data (plus unigram_1 features) and thus depend on the training data.
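As a rough illustration of the indexing above, here is a Python sketch on toy data. It assumes the MATLAB arrays have already been converted to plain Python lists (how you load the .mat file is up to you), and the helper name `split_rows` is made up for this example. MATLAB indices are 1-based, so they are shifted by one.

```python
def split_rows(X_i, train_idx, test_idx):
    """Return (train_rows, test_rows) for one split.

    X_i       : list of feature rows for split i (one row per dialog act)
    train_idx : 1-based row numbers, as in MATLAB's trainindices{i}
    test_idx  : 1-based row numbers, as in MATLAB's testindices{i}
    """
    # The splits are described as non-overlapping, so check that first.
    overlap = set(train_idx) & set(test_idx)
    if overlap:
        raise ValueError(f"train and test overlap on rows {sorted(overlap)}")
    train = [X_i[k - 1] for k in train_idx]  # shift to 0-based indexing
    test = [X_i[k - 1] for k in test_idx]
    return train, test


# Toy stand-in for X{i}: 5 dialog acts, 3 features each (hypothetical values).
X_toy = [[1, 0, 0], [0, 2, 0], [0, 0, 3], [4, 0, 0], [0, 5, 0]]
train_rows, test_rows = split_rows(X_toy, train_idx=[1, 2, 4], test_idx=[3, 5])
print(len(train_rows), len(test_rows))
```

The overlap check is only a sanity test; with the real index lists it should never fire.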
If you want to know what the features actually are, check textfeats.zip. Note that the first feature X{i}(:,1) is 'not a known unigram', while the (j+1)-th feature X{i}(:,j+1) is the j-th feature in textfeats{i}.txt.
| Name | Size | Bytes | Class |
| ProsodyFeaturesB4norm | 14810x49 | 5805520 | double array |
| ProsodyFeaturesNormalizedByConv | 14810x49 | 5805520 | double array |
| prosody_feature_names | 1x49 | 4078 | cell array |
You want to use the 49 features in ProsodyFeaturesNormalizedByConv (note that additional scaling of the data may still be needed).
moves_normpitch.zip (91 Mb - link not available owing to bandwidth and copyright regulations - if you already have access via the LDC to the raw data and want this processed version, email me) : this has 64 files of the form q[1-8]nc[1-8].moves.normpitch representing 64 conversations of two speakers, a giver and a follower, who are having a task-oriented dialogue with no eye contact.
(Not publicly available : raw wav files : in wavfiles12/34/56/78.zip, with contents like q1nc2_wav_43.dat having samples for the 43rd utterance in conversation q1nc2. Each zip file is about 180 Mb.)
Each file is of the form
% zvalue_of_momelized_pitch momelized_pitch zvalue_of_raw_pitch raw_pitch_from_esps prob_voicing zvalue_of_local_intensity local_intensity
% (actually prob_voicing is the normalized crosscorrelation that ESPS uses to determine the output pitch value -
% officiially ESPS gives it as 1 if the raw pitch is 0 and 0 otherwise)
% (z values are computed by subtracting the mean and then dividing by the standard deviation... with the
% exception of z value for raw pitch, which remains 0 if the raw pitch is 0 i.e. not found)
<move num=1 conv=q1nc1 speaker=G label=ready numFrames=103 duration=1.03000 words=[ehm SILENCE right] startframe_esps=0 endframe_esps=102 startframe_wav=0 endframe_wav=20532>
1.03657002837298 239.78 0.0 0.0 0.767700254917 -0.806086071602533 0.0
1.02391681965019 239.12 0.0 0.0 0.33472892642 -0.662385096566971 50.5230903625
1.00282813844553 238.02 0.0 0.0 0.535148739815 -0.308826713784724 174.828872681
0.973303984759003 236.48 0.0 0.0 0.197349309921 0.108718274569066 321.631378174
0.935344358590615 234.5 0.0 0.0 0.472838759422 -0.449386645306214 125.410125732
0.888949259940363 232.08 0.0 0.0 0.609551012516 0.009582051313656 286.776580811
0.834118688808247 229.22 0.977752764603535 236.712051392 0.669899046421 0.198904807185349 353.339599609
...
-0.511358343577427 159.039 -1.74353578208344 94.7677841187 0.707227408886 -0.197000152621103 214.145401001
-0.75276622878568 146.447 -1.66435873898597 98.8977127075 0.389831155539 -0.182989673063866 219.071273804
</move>
<move num=2 conv=q1nc1 speaker=G label=instruct numFrames=130 duration=1.30000 words=[you start at the caravan park] startframe_esps=102 endframe_esps=231 startframe_wav=20532 endframe_wav=46376>
-1.52206214760331 106.32 -1.66435873898597 98.8977127075 0.389831155539 -0.182989673063866 219.071273804
...
The first few lines (starting with % or with only whitespace) are comments. After this, each dialog act is enclosed between 'move' tags. Jargon note : 'move' equals 'dialog act'.
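If you would rather read these files from Python, the layout described above (comment lines, then 'move' tags wrapping rows of seven numbers) can be parsed along the following lines. This is only a sketch: the function name `parse_moves` and the dictionary layout are my own choices, and the sample header is a shortened, made-up variant of the one shown above.

```python
import re

MOVE_OPEN = re.compile(r"<move\s+(.*)>")
# An attribute value is either a bracketed word list or a run of non-spaces.
ATTR = re.compile(r"(\w+)=(\[[^\]]*\]|\S+)")


def parse_moves(lines):
    """Parse the lines of a .moves.normpitch file into a list of dicts."""
    moves, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # comment or whitespace-only line
        m = MOVE_OPEN.match(line)
        if m:
            attrs = dict(ATTR.findall(m.group(1)))
            # words=[a b c] becomes the token list ['a', 'b', 'c']
            attrs["words"] = attrs.get("words", "[]")[1:-1].split()
            current = {"attrs": attrs, "frames": []}
        elif line == "</move>":
            moves.append(current)
            current = None
        elif current is not None:
            # One frame per line, seven whitespace-separated numbers.
            current["frames"].append([float(v) for v in line.split()])
    return moves


# Shortened, hypothetical sample in the same layout as the real files.
sample = """\
% zvalue_of_momelized_pitch momelized_pitch ...
<move num=1 conv=q1nc1 speaker=G label=ready numFrames=2 duration=0.02000 words=[ehm SILENCE right] startframe_esps=0 endframe_esps=1 startframe_wav=0 endframe_wav=399>
1.03657002837298 239.78 0.0 0.0 0.767700254917 -0.806086071602533 0.0
1.02391681965019 239.12 0.0 0.0 0.33472892642 -0.662385096566971 50.5230903625
</move>
""".splitlines()

moves = parse_moves(sample)
print(moves[0]["attrs"]["label"], moves[0]["attrs"]["words"], len(moves[0]["frames"]))
```

Attribute values are kept as strings (except `words`), so convert `numFrames`, `duration` and the frame offsets yourself if you need them as numbers.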
For example, consider the first dialog act
The file you want to work with is ngrams.pl.
This file can be used in one of two ways. The first is to create a file of words that can then (in the second usage) be used to create a sparse vector of words for a new dialog act. The first usage is for training data, the second is for test data.
perl ngrams.pl [-o] [-n] [-w] [-s featuresfile to save] [-c listofconvfiles (or conv1 ... convN)]
[-e maximum_relative_entropy] [-f minimum_frequency] [-x maximum_number_of_n_for_which_to_get_ngrams]
This is best explained by example. Suppose you say
perl ngrams.pl -s wordseg.txt -c q1nc1 q1nc2 -x 2
This means that only the files q1nc1.moves.normpitch and q1nc2.moves.normpitch were analyzed. You could also have listed
q1nc1 q1nc2
in a file called, say, trainfiles.txt and said
perl ngrams.pl -s wordseg.txt -c trainfiles.txt -x 2
This would be equivalent. Anyway, this command would create the file wordseg.txt with details of all unigrams and bigrams in conversations q1nc1 and q1nc2. If you had wanted trigrams as well, you would have said "-x 3" and if you'd wanted just unigrams you'd have said "-x 1". Anyway, the first few entries of wordseg.txt look like this:
7 0.372599906634556 # [and_SILENCE]
6 0.518396725840501 # [SILENCE_at]
7 0.49787893696661 # [go_straight]
5 0.195092515991592 # [along_the]
15 0.385252917924331 # [SILENCE_just]
8 0.514976383181526 # [mill_SILENCE]
16 0.608035847461469 # [what]
13 0.451076213641237 # [and_you]
23 0.499099348179897 # [SILENCE_NOISE(NONVOCAL)]
56 0.678022537022683 # [going]
8 0.489476457508413 # [down_from]
...
This means, for instance, that the bigram "along the" occurred 5 times in q1nc1 and q1nc2 and that the word "going" occurred 56 times. The second column is the 'relative entropy' of the distribution of the ngram over dialog acts of different classes. It is high for ngrams that occur uniformly across the classes of dialog acts, and low for ngrams that occur mostly in certain classes. Useful ngrams have high frequency (so they are more likely to occur in test examples) and low relative entropy (so they give more classification-useful information when they occur).
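To make the two columns concrete, here is a small Python sketch that counts n-grams and scores how evenly each one is spread over the classes. The exact formula that ngrams.pl uses is not stated here, so the score below (Shannon entropy of the class distribution divided by the log of the number of classes) is an assumption; it is merely consistent with values that lie between 0 and 1 and are 0 for an n-gram seen in only one class. The function names and toy dialog acts are made up.

```python
import math
from collections import Counter, defaultdict


def ngrams(words, max_n):
    """Yield all n-grams of length 1..max_n, joined with '_'."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield "_".join(words[i:i + n])


def ngram_stats(acts, max_n, num_classes):
    """acts: list of (label, [words]). Returns {ngram: (count, score)}."""
    per_class = defaultdict(Counter)
    for label, words in acts:
        for g in ngrams(words, max_n):
            per_class[g][label] += 1
    stats = {}
    for g, counts in per_class.items():
        total = sum(counts.values())
        # ASSUMPTION: normalized Shannon entropy, not necessarily what
        # ngrams.pl computes.
        h = sum(-(c / total) * math.log(c / total) for c in counts.values())
        stats[g] = (total, h / math.log(num_classes))
    return stats


# Toy dialog acts (hypothetical), scored against the 12 classes listed above.
acts = [
    ("ready", ["well"]),
    ("ready", ["well"]),
    ("instruction", ["go", "straight"]),
    ("explanation", ["go", "left"]),
]
stats = ngram_stats(acts, max_n=2, num_classes=12)
print(stats["well"], stats["go"])
```

Here "well" only ever occurs in one class and so scores 0, while "go" is split evenly over two classes and scores higher.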
Suppose you only want those ngrams that have frequency at least 5 and relative entropy at most 0.5. Then you'd say
perl ngrams.pl -s wordseg1.txt -c trainfiles.txt -x 2 -e 0.5 -f 5
The order in which you give the flags makes no difference. Anyway, wordseg1.txt looks like :
7 0.372599906634556 # [and_SILENCE]
7 0.49787893696661 # [go_straight]
5 0.195092515991592 # [along_the]
15 0.385252917924331 # [SILENCE_just]
13 0.451076213641237 # [and_you]
23 0.499099348179897 # [SILENCE_NOISE(NONVOCAL)]
8 0.489476457508413 # [down_from]
...
Note that entries like "SILENCE_at" (relative entropy 0.518) have disappeared, since they exceed the entropy threshold.
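The thresholding can be mimicked outside Perl if you already have a features file and want to try other cutoffs. The following Python sketch parses lines of the form shown above and keeps those meeting both thresholds; `parse_feature_line` and `filter_features` are illustrative names, not part of ngrams.pl.

```python
def parse_feature_line(line):
    """Parse '<count> <relative_entropy> # [<ngram>]' into a tuple."""
    left, _, right = line.partition("#")
    count, entropy = left.split()
    return int(count), float(entropy), right.strip()[1:-1]  # drop [ and ]


def filter_features(lines, min_freq, max_entropy):
    """Keep n-grams with count >= min_freq and entropy <= max_entropy."""
    kept = []
    for line in lines:
        count, entropy, ngram = parse_feature_line(line)
        if count >= min_freq and entropy <= max_entropy:
            kept.append(ngram)
    return kept


# The first entries of wordseg.txt from the example above.
entries = [
    "7 0.372599906634556 # [and_SILENCE]",
    "6 0.518396725840501 # [SILENCE_at]",
    "7 0.49787893696661 # [go_straight]",
    "5 0.195092515991592 # [along_the]",
    "15 0.385252917924331 # [SILENCE_just]",
    "8 0.514976383181526 # [mill_SILENCE]",
    "16 0.608035847461469 # [what]",
    "13 0.451076213641237 # [and_you]",
    "23 0.499099348179897 # [SILENCE_NOISE(NONVOCAL)]",
    "56 0.678022537022683 # [going]",
    "8 0.489476457508413 # [down_from]",
]
kept = filter_features(entries, min_freq=5, max_entropy=0.5)
print(kept)
```

With a minimum frequency of 5 and a maximum relative entropy of 0.5, this keeps the same seven n-grams that appear in wordseg1.txt above.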
There are some more flags you can add that take no arguments. For example, if you say
perl ngrams.pl -s wordseg2.txt -c trainfiles.txt -x 2 -e 0.5 -f 5 -o
then wordseg2.txt is just like wordseg1.txt but with the additional lines (scattered in there somewhere)
38 0.256578160355738 # [mmhmm_1]
82 0.418453349387199 # [right_1]
31 0.245150802788593 # [uh-huh_1]
8 0 # [ehm_1]
25 0 # [well_1]
8 0.286797841687923 # [okay_1]
31 0.214567049942517 # [no_1]
40 0.146891851942573 # [yeah_1]
For example, when "well" occurred in a dialog act by itself, that act was always of the same class. The presence of the line
44 0.334643074788771 # [well]
tells us that in the 44-25=19 other instances when "well" occurred in a dialog act, it did not provide information that was as useful for classification. Note that only a few of the features created by the -o flag got through the thresholding done by the -f and -e flags.
perl ngrams.pl [-r featuresfile] [-c listofconvfiles (or conv1 ... convN)]
featuresfile is simply the file (e.g. wordseg.txt) created using the -s flag in the training usage.
The -c flag has the same usage as before.
So for example, if you had said in training
perl ngrams.pl -s wordseg5.txt -c q1nc1 q1nc2 -x 2 -f 5 -e 0.5 -o -n -w
and got wordseg5.txt, you could then say
perl ngrams.pl -r wordseg5.txt -c q1nc3 q1nc4
to produce the files q1nc3_text.dat and q1nc4_text.dat.
If you would like your output files to be called, say, q1nc3_text_blah.dat instead, just add the flag -t blah.
q1nc3_text.dat looks like
12 1:1
12 1:1
1 129:1 1:2 80:1
2 1:8 148:1 80:1
4 71:1 1:5
9 1:2 43:1
4 1:2 27:1
...
The structure of this is <classlabel> <index1>:<val1> <index2>:<val2> ... <indexk>:<valk> . Index 1 stands for the feature 'this word did not occur in featuresfile' while otherwise index j+1 stands for the j-th word in featuresfile.
For instance, here, the first two dialog acts of q1nc3 are both of class 12, and each has a single word that was not among the 'words' specified by wordseg5.txt. The third dialog act has class label 1, and has the 128th and 79th features of featuresfile and two words not in featuresfile.
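Reading one of these lines back is straightforward. The Python sketch below parses a single line into a class label and an index-to-value mapping, then applies the index convention just described (index 1 counts unknown words, index j+1 is the j-th entry of featuresfile). The function name is my own.

```python
def parse_sparse_line(line):
    """Parse '<classlabel> <index>:<val> ...' into (label, {index: value})."""
    fields = line.split()
    label = int(fields[0])
    features = {}
    for field in fields[1:]:
        index, value = field.split(":")
        features[int(index)] = float(value)
    return label, features


# Third dialog act of q1nc3_text.dat from the example above.
label, feats = parse_sparse_line("1 129:1 1:2 80:1")

# Index 1 counts words not in featuresfile; index j+1 is the j-th feature.
unknown_words = feats.get(1, 0)
known_features = sorted(i - 1 for i in feats if i != 1)
print(label, unknown_words, known_features)
```

For this line the sketch recovers class 1, two unknown words, and features 79 and 128 of featuresfile, matching the reading given above.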
[X,L,trainindices,testindices] = makedataset ;
The result of this is in the workspace maptasknc_textfeats.mat.
| Name | Size | Bytes | Class |
| L | 14810x1 | 118480 | double array |
| X | 1x4 | 6386636 | cell array |
| testindices | 1x4 | 118720 | cell array |
| trainindices | 1x4 | 355680 | cell array |
This represents four splits of the 14810 data points into non-overlapping test and training sets. The i-th split (i=1,2,3,4) uses as training data X{i}(trainindices{i},:) and as test data X{i}(testindices{i},:). There are different X{i} because the n-grams consist of all unigrams, bigrams, and trigrams that occur at least twice in the training data (plus unigram_1 features) and thus depend on the training data.
The text features are in textfeats.zip.
If you know about Maptask's structure in conversations
The data comes from 64 conversations. The test data in the four splits form a partition of the 14810 moves (each data point = a move):