Tone Recognition
Pictures
Shown below is a screenshot of some Mandarin tones from news broadcast speech using Praat.
Data format
PWS
We assume that a sound file has several phrases, each of which has several words, each of which has several syllables, and that there is one tone per syllable. We store this in a format we call PWS (phrase-word-syllable) format. By 'phrase' we mean something corresponding to a breath group or intermediate phrase - for example, if we wish to do pitch contour regression normalization (as in Levow's Eurospeech '05 paper), we would do it on a phrase. If X is a PWS structure, then it has the fields below:
.filename : adding 'wav' to this should produce an existing wav file.
.phrases : a 1xP cell array if the file has P phrases.
.sr : sample rate of the wav file
Suppose the i-th phrase has Wi words; then X.phrases{i} has the fields below. Note that a word can be 'valid' or 'invalid'. Invalid words don't have enough information, and should be ignored - they can even consist of multiple words. Note that if the whole phrase is invalid, then it should contain one invalid word whose start and end times are those of the phrase. There can be silence (SIL) between words, even if they are within the same phrase.
.validword : a 1 x Wi matrix that is 1 if the word really is a word and all information is available for its syllables and phonemes, and 0 otherwise.
.startwords : a 1 x Wi matrix with the start times of the Wi words (available for both valid and invalid words)
.endwords : a 1 x Wi matrix with the end times of the Wi words (available for both valid and invalid words)
.wordnames : a 1 x Wi cell array of strings with the word names ('0' for invalid words)
.words : a 1 x Wi cell array of structs (defined below, and only present for valid words)
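As a minimal sketch of how the phrase-level fields fit together, the Matlab snippet below builds a hypothetical one-phrase PWS structure and skips the invalid word when totalling word durations. The file name, word names, and times are invented for illustration and do not come from the VOA data.

```matlab
% Hypothetical PWS structure: one phrase with three words, the second invalid.
% All names and times below are invented for illustration.
X.filename = 'example';                 % hypothetical base name
X.sr = 8000;                            % sample rate of the wav file
ph.validword  = [1 0 1];
ph.startwords = [0.10 0.45 0.80];
ph.endwords   = [0.40 0.75 1.20];
ph.wordnames  = {'w1', '0', 'w3'};      % '0' marks the invalid word
ph.words      = {struct(), [], struct()};  % word structs omitted in this sketch
X.phrases = {ph};

% Walk over all phrases and use only the valid words.
nvalid = 0;
totaldur = 0;
for i = 1:numel(X.phrases)
    p = X.phrases{i};
    for j = 1:numel(p.validword)
        if p.validword(j)
            nvalid = nvalid + 1;
            totaldur = totaldur + (p.endwords(j) - p.startwords(j));
        end
    end
end
fprintf('%d valid words, %.2f s in total\n', nvalid, totaldur);
```

Checking validword before touching words{j} matters because the word struct is only filled in for valid words.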
Suppose the j-th word of the i-th phrase has Sij syllables and Pij phones (=phonemes); then X.phrases{i}.words{j} has the following fields (unless X.phrases{i}.validword(j) is 0, in which case the struct will be empty)
.name : a string with the current word (= X.phrases{i}.wordnames{j})
.numwinphrase : number of words in this phrase (= Wi)
.poswinphrase : position of this word in the phrase (= j)
.sylls : a 1 x Sij cell array with strings of this word's syllables
.tones : a 1 x Sij matrix with tones of this word's syllables
.startsylls : a 1 x Sij matrix with start times of this word's syllables
.endsylls : a 1 x Sij matrix with end times of this word's syllables
.phones : a 1 x Pij cell array with this word's phones
.startphones : a 1 x Pij matrix with start times of this word's phones
.endphones : a 1 x Pij matrix with end times of this word's phones
.startphonesPI : a 1 x Sij matrix with the index (in the phones field) of the first phone of each of this word's syllables (see conditions below)
.endphonesPI : a 1 x Sij matrix with the index (in the phones field) of the last phone of each of this word's syllables (see conditions below)
We have the following conditions, which make the storing of startsylls and endsylls redundant:
(1) the phones of the s-th syllable are in phones(startphonesPI(s):endphonesPI(s))
(2) startsylls(s) = startphones(startphonesPI(s)) and
(3) endsylls(s) = endphones(endphonesPI(s))
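The following Matlab sketch checks these conditions on a hypothetical two-syllable word, reading conditions (2) and (3) as indexing startphones and endphones through startphonesPI and endphonesPI. The syllables, phones, tones, and times are invented for illustration; the phone labels in the real alignments may differ.

```matlab
% Hypothetical word struct with two syllables and four phones.
% All labels and times are invented for illustration.
w.name          = 'w1';
w.numwinphrase  = 1;
w.poswinphrase  = 1;
w.sylls         = {'ni', 'hao'};
w.tones         = [3 3];
w.startsylls    = [0.10 0.25];
w.endsylls      = [0.25 0.40];
w.phones        = {'n', 'i', 'h', 'ao'};
w.startphones   = [0.10 0.16 0.25 0.31];
w.endphones     = [0.16 0.25 0.31 0.40];
w.startphonesPI = [1 3];   % index of the first phone of each syllable
w.endphonesPI   = [2 4];   % index of the last phone of each syllable

% Condition (1): the phones of syllable s are a contiguous slice of w.phones.
sylphones = cell(1, numel(w.sylls));
for s = 1:numel(w.sylls)
    sylphones{s} = w.phones(w.startphonesPI(s):w.endphonesPI(s));
end

% Conditions (2) and (3): syllable times follow from the phone times.
cond2 = isequal(w.startsylls, w.startphones(w.startphonesPI));
cond3 = isequal(w.endsylls,   w.endphones(w.endphonesPI));
fprintf('cond2 = %d, cond3 = %d\n', cond2, cond3);
```

A check like this can be useful after manual boundary corrections, where syllable and phone times may drift apart.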
TVS
TVS objects hold values for one type of value e.g. pitch or intensity or spectral quality, for a sound file (or several sound files) and details of how to use those values to produce fixed-length measures for each syllable. If Y is a TVS object then it has the fields below:
.tv : This is either
- a 2-column matrix of times (in seconds) and NONZERO values, OR
- a cell array of such matrices (each matrix corresponding to the entry for one PWS object)
.name : name of the value e.g. 'pitch'
.k : nonnegative integer, to be used with features 'contour' and 'diffcontour' below.
.minnumseg : minimum number of segments with measured values for a syllable to be valid
.mindurseg : minimum duration of segments with measured values for a syllable to be valid
.useonset : 1 or 0 (default) - take into account measurements of each syll's onset
.features : a cell array with features, e.g. {'contour','diffcontour','maximum','mean','slope','slope2','range'}
the default value is {'contour'} for each syllable
'contour' : create a k-length feature that has a contour with k samples (only works if k>1)
'diff' or 'diffcontour': create a (k-1)-length feature with a contour of differences (only works if k>1)
'mean' : create a scalar feature with the mean of this value for each syllable. The mean is taken over
a 20-sample interpolation of the syllable (including the onset part of the syllable if useonset=1)
'maximum' : create a scalar feature with the maximum of this value for each syllable
'minimum' : create a scalar feature with the minimum of this value for each syllable
'range' : create a scalar feature with the maximum-minus-minimum of this value for each syllable
'slope' : create a scalar feature with the slope of the value contour for this syllable
'slope2' : create a scalar feature with the slope of the value contour for the second half of this syllable
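To make the feature definitions concrete, here is a minimal Matlab sketch that computes them for a single syllable from a hypothetical pitch track. It is not the code in pwstv2seq.m: the use of linear interpolation for the contour, the least-squares line fit for the slopes, and all data values and syllable boundaries are assumptions made for illustration.

```matlab
% Hypothetical TVS-like struct: column 1 = time (s), column 2 = NONZERO value (Hz).
% All values are invented for illustration.
Y.tv = [0.10 200; 0.12 205; 0.14 215; 0.16 230; 0.18 240; 0.20 245];
Y.name = 'pitch';
Y.k = 5;
Y.features = {'contour', 'diffcontour', 'mean', 'maximum', 'minimum', 'range', 'slope', 'slope2'};

sylstart = 0.10;  sylend = 0.20;        % hypothetical syllable boundaries
t = Y.tv(:, 1);   v = Y.tv(:, 2);
in = (t >= sylstart) & (t <= sylend);   % measurements inside the syllable
t = t(in);        v = v(in);

% 'contour': k equally spaced samples (linear interpolation is an assumption).
tk = linspace(t(1), t(end), Y.k);
cont = interp1(t, v, tk, 'linear', 'extrap');

% 'diffcontour': the k-1 differences between neighbouring contour samples.
diffcontour = diff(cont);

% 'mean': mean over a 20-sample interpolation of the syllable.
t20 = linspace(t(1), t(end), 20);
meanval = mean(interp1(t, v, t20, 'linear', 'extrap'));

% 'maximum', 'minimum', 'range'.
maxval = max(v);
minval = min(v);
rangeval = maxval - minval;

% 'slope': slope of a least-squares line (assumed definition), in Hz per second.
p = polyfit(t, v, 1);
slope = p(1);

% 'slope2': the same fit restricted to the second half of the syllable.
half = t >= (t(1) + t(end)) / 2;
p2 = polyfit(t(half), v(half), 1);
slope2 = p2(1);
```

Concatenating the requested features in a fixed order gives one fixed-length vector per syllable, which is the purpose the k and features fields serve.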
All but the first two fields above have to do specifically with how to use values for each syllable.
Example Data
The following zip files are for twenty VOA Mandarin files.
TextGrid from Levow's forced alignments (not checked)
TextGrid, modified after manual checking and adjustments of forced alignments. Missing words in the middle of a phrase are marked with validity 0.
Link to documentation page, including data and processing files, used in the paper Additional Features for Mandarin Speech Recognition.
Matlab files
phwd2pws.m : converts a file of phoneme and word alignments (e.g. 19980630.0700.0032.ph, 19980630.0700.0032.wd - use with sample rate = 8000) to a PWS structure.
pws2textgrid.m : creates Praat-formatted TextGrid files, with three tiers for Phrase, Word, and Syllable, of a PWS structure (or a cell array thereof). Needs findcell.m.
textgrid2pws.m : reads Praat-formatted TextGrid files into a PWS structure. Good for replacing automatic alignments with manually corrected alignments. (Remember to press SHIFT while doing the manual alignment in Praat so that boundaries across different tiers remain aligned!)
pwsnormtv.m : Do phrase-by-phrase normalization (in any of several different ways) of a 2-column matrix of times and values for a sound file using a PWS struct.
pwstv2seq.m : Given a PWS struct and a corresponding TVS struct, create a SEQ struct.
makepitchtier.m : Given a 2-column matrix of times and values, a Praat-format PitchTier is created.
pwsseqfeatspitchtier.m : Given a PWS struct and a SEQ struct and a features struct (as output by pwstv2seq.m), create a Praat-format PitchTier file for each feature.
Pictures
The following pictures show some problems with recognizing coarticulated tones in continuous speech. These were made with Praat (Boersma == God), Gina's forced phoneme-syllable alignments, and the Matlab files above.
Target Delay
Sometimes the correct tone target is reached, but not at the end of the syllable.
Changes in pitch range