DCN Syllabification

by Jeremy O’Brien

University of Chicago

 

based on the research of

John Goldsmith, University of Chicago

 and Gary Larson, Wheaton College

 

Documentation and How-To

 

 

Introduction:

DCN Syllabification is a computational model of sonority and syllables. The model is built on the idea of Dynamic Computational Networks, or DCNs. A DCN is a simple connectionist-style network that can be used to model linguistic phenomena, including stress (DCN Stress) and sonority/syllabification. Both DCN Stress and DCN Syllabification are built into Linguistica.

For more information on DCNs, see John Goldsmith’s website. If you are interested in further reading, visit Max Bane’s DCN website. Max has been doing some fantastic recent research on DCNs, in collaboration with me and Jason Riggle, among others.

 

Downloading the program:

     DCN Syllabification is part of Linguistica. Download the Linguistica application for your operating system (Windows, Mac OS X, or Linux).

 

How to use:

(Windows version: make sure the Qt DLL is in the same folder as the executable file.)

            Open the application. In the main window, there will be a set of tabs, labeled Command Line, Graphic Display, DCN Stress, and DCN Syllabification. Go to the fourth tab, DCN Syllabification.

 

Training and Testing Corpora:

The training and testing corpora must be encoded in a specific manner. Each line must contain a word in phonetic/phonological transcription, followed by a tab, followed by a code that tells which segments are syllable nuclei, followed by a tab, followed by anything else (optional). For instance, the token for ‘computer’ might be:

 

phonetics      syllable information      (optional) orthography or notes
kəmpjutər      LHOLOHLHL                 computer

 

The transcription does not have to be in IPA. The most important thing to remember is that the transcription must be consistent, with a one-to-one correspondence between grapheme and phone/phoneme. A very narrow transcription might be unhelpful; a semi-phonemic transcription is probably the easiest to prepare and may well give the best results.

The syllable information is the most unusual part of the corpus. In this model of sonority, the high points of the sonority wave represent syllable nuclei, and the low points represent syllable boundaries. An ‘H’ marks a high point, that is, the nucleus of a syllable. So a word like ‘computer’ starts out low (‘L’) at the k, has a local maximum (‘H’) at the schwa, goes down a little (‘O’ for other) at the m, and hits a local minimum at the p. When you encode a corpus, the only part that matters is the ‘H’: the network pays attention only to local maxima when it is training, and only to capital H’s when it is reading in the corpus. You can use any other letter or punctuation mark for non-maxima, such as ‘O’ for other. Just make sure that the transcription has the same number of characters as the syllable information, and that the H’s match the nuclei of the word’s syllables. (Notice how the schwas and the u correspond to the H’s in the example.)
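
To make the encoding concrete, here is a minimal Python sketch (not part of Linguistica; the function name parse_corpus_line is made up for illustration) that parses one corpus line and reports which positions are marked as nuclei:

def parse_corpus_line(line):
    """Return (word, nucleus positions) for one tab-separated corpus line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("need at least word<TAB>syllable code")
    word, code = fields[0], fields[1]          # any third field is ignored
    if len(word) != len(code):
        raise ValueError("word and syllable code differ in length")
    # Only the capital H's matter; everything else counts as 'other'.
    return word, [i for i, c in enumerate(code) if c == "H"]

word, positions = parse_corpus_line("kəmpjutər\tLHOLOHLHL\tcomputer")
print(word, positions)                         # kəmpjutər [1, 5, 7]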

The training corpus will be used to train the network to recognize how to syllabify in a particular language variety. The testing corpus will be used to show the output of the network. For this reason, it is probably a good idea to divide up your total corpus into two parts—one part the network will train on, and the other part will be used to make sure it can syllabify new data.
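
For instance, a quick way to make such a split in Python (a sketch: the file names corpus.txt, train.txt, and test.txt are made up, and the 80/20 ratio is just a common choice):

import random

lines = open("corpus.txt", encoding="utf-8").read().splitlines()
random.shuffle(lines)                # so both parts sample the whole corpus
cut = int(0.8 * len(lines))          # 80% for training, 20% for testing
open("train.txt", "w", encoding="utf-8").write("\n".join(lines[:cut]) + "\n")
open("test.txt", "w", encoding="utf-8").write("\n".join(lines[cut:]) + "\n")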

 

Training:

Click on the ‘Train Network’ button. The network will train on the training corpus; this might take a few moments. Eventually a dialog box will tell you whether the algorithm was successful. If it was unsuccessful, you may want to alter the parameters and try again, or it may be that your corpus is unlearnable by this version of the training algorithm.

Regardless of the success of the training algorithm, the values that it learned will be displayed in the Values Learned text box. The values for alpha and beta will be shown, as well as the inherent sonority of each character it encountered in the training corpus. In the Results of Training text box, the syllabification of each word in the testing corpus will be shown. For each token, the first line will show the word and the syllable information (if any) given in the testing corpus. The second line will show the word, with periods (.) marking where the model thinks the syllable boundaries should be. The syllable information, with local minima and maxima, is also given.

            The default Parameters might be sufficient for you, but you will probably want to alter them. A discussion of what each value corresponds to is in the Discussion section.

            A log file (DCNsyllog.txt) is created in the same folder as Linguistica. This file contains information on the most recent run of the training algorithm. It lists all the appropriate values for each iteration of the training algorithm. This information gives a great deal of insight into how the algorithm behaves in particular situations. The file can also be very large, so an application for opening large files might be necessary (e.g. WordPad for Windows or TextEdit for Mac OS X).

 

Discussion:

            The training algorithm is a simplified version of Gary Larson’s DCN learning algorithm, as explained in his 1992 dissertation. The algorithm uses simulated annealing, a metaphor from metallurgy that provides a search strategy for finding (with some luck) the global optimum instead of getting stuck at a merely local one. Temperature is a quantitative measure of how uncertain we are that we have the right values: the higher the temperature, the more likely we are to make large changes to the values. As the temperature nears zero, we cool off, allowing only tiny changes to the values.

            Each segment is assigned its own “inherent sonority”. This is a learned value that represents how much the segment “wants” to be a syllable nucleus. In most languages vowels are more likely than stops to be syllable nuclei, so in general the training algorithm should give the vowels of the training corpus high sonority values.

            The “inherent sonority” is then used as the base level for the sonority of the segment. A segment’s derived sonority is affected by the sonority of its neighboring segments, by the values and signs of alpha and beta, and even by the values of segments on the other side of the word. The network settles at an equilibrium in which the segments of the word have sonorities that differ from the inherent sonorities of their characters. The peaks and troughs of this derived sonority curve represent the syllable nuclei and the syllable boundaries, respectively.
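
            In code, the relaxation to equilibrium might look like the sketch below. This is my own illustration of the general Goldsmith/Larson style of update, not the exact formulation inside Linguistica; in particular, which neighbor alpha weights and which beta weights is an assumption here, and convergence assumes alpha and beta are small in magnitude.

def equilibrium(inherent, alpha, beta, iterations=100):
    """Relax the network: each segment's derived sonority is its inherent
    sonority plus alpha times its right neighbor and beta times its left
    (a synchronous update, repeated until the values settle)."""
    a = list(inherent)
    for _ in range(iterations):
        a = [inherent[i]
             + (alpha * a[i + 1] if i + 1 < len(a) else 0.0)
             + (beta * a[i - 1] if i > 0 else 0.0)
             for i in range(len(a))]
    return a

def nuclei(a):
    """Indices of the local maxima of the derived sonority curve."""
    return [i for i in range(len(a))
            if (i == 0 or a[i] > a[i - 1])
            and (i == len(a) - 1 or a[i] > a[i + 1])]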

            Below is the pseudocode for the learning algorithm, using the variables from the Parameters fields in the program window.

 

Repeat for # of Trials:
  Start with Starting alpha, Starting beta, and Initial Temperature
  Repeat for Max Steps Per Trial, or until T is very small:
    Take a word from the corpus
    If any of the characters are new, assign each a random value between 0.5 and 1
    Using alpha and beta, see if the network predicts the syllable nuclei
    If incorrectly predicted:
      If a segment is predicted to be a nucleus, but should not be:
        Decrease that segment’s sonority by 0.1
      If a segment is not predicted to be a nucleus, but should be:
        Increase that segment’s sonority by 0.1
      Change alpha and beta each by (T * random number from -0.5 to 0.5)
      T := T + Add When Wrong
    Otherwise, if correctly predicted:
      T := T * Multiply by When Correct
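
            For readers who prefer running code, here is the same loop rendered in Python. This is an illustration only, not the actual Linguistica source: it reuses the equilibrium and nuclei sketches from the discussion above, assumes each corpus entry is a (word, nucleus-indices) pair such as parse_corpus_line produces, and the default parameter values are placeholders rather than Linguistica’s defaults.

import random

def train(corpus, trials=10, max_steps=10000,
          start_alpha=0.0, start_beta=0.0, initial_T=1.0,
          add_when_wrong=0.05, multiply_when_correct=0.9):
    for _ in range(trials):
        alpha, beta, T = start_alpha, start_beta, initial_T
        sonority = {}                                  # learned inherent sonorities

        def predict(word):                             # nuclei the network expects
            inh = [sonority.setdefault(c, random.uniform(0.5, 1.0)) for c in word]
            return nuclei(equilibrium(inh, alpha, beta))

        for step in range(max_steps):
            if T < 1e-6:                               # T is very small: stop
                break
            word, target = corpus[step % len(corpus)]  # take a word from the corpus
            predicted = predict(word)
            if predicted != target:
                for i in predicted:
                    if i not in target:                # false nucleus: lower sonority
                        sonority[word[i]] -= 0.1
                for i in target:
                    if i not in predicted:             # missed nucleus: raise sonority
                        sonority[word[i]] += 0.1
                alpha += T * random.uniform(-0.5, 0.5)
                beta += T * random.uniform(-0.5, 0.5)
                T += add_when_wrong
            else:
                T *= multiply_when_correct
        if all(predict(w) == t for w, t in corpus):    # did this trial learn it?
            return alpha, beta, sonority
    return None                                        # corpus not learned

            On success, train() returns the learned alpha, beta, and per-character sonorities, mirroring what the Values Learned box displays.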

 

            By exposing all of the above parameters in the program window, the user can manipulate the fine points of the algorithm without having to alter the source code.

 

Further Questions:

            Please feel free to contact me by email (address at the top of the page). I am open to comments and suggestions from anyone interested in DCNs and computational models of human language.