Phonological and Syntactic Information

The following graph plots phonological information, -log(probability that the phonemes produce the word under an n-th order Markov process), against syntactic information, -log(probability that a random word is this word), for n = 0. The language is English. The words considered in each calculation have at least n phonemes.
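The two quantities can be written out as follows. This is a sketch, not the author's stated formula: the symbols I_syn and I_phon are introduced here for illustration, and the product over single-phoneme transitions is the usual Markov factorisation, assumed rather than confirmed by the text. The base of the logarithm is not specified in the text.

```latex
% Word w = a_1 a_2 \dots a_m with m \ge n phonemes.
\begin{align*}
I_{\mathrm{syn}}(w) &= -\log P(w) \\
I_{\mathrm{phon}}^{(n)}(w) &= -\log \Bigl( P(a_1 \dots a_n)
    \prod_{i=n+1}^{m} P(a_i \mid a_{i-n} \dots a_{i-1}) \Bigr)
\end{align*}
% P(w): probability that a randomly drawn word is w.
% P(a_1 \dots a_n): probability that a word begins with a_1 \dots a_n.
% For n = 0 the initial factor is 1 and each remaining factor is the
% unconditional phoneme probability P(a_i).
```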

This seems to indicate that words of high syntactic information (infrequent words) tend to have high phonological information (that is, a lower probability of being formed); conversely, words with a higher probability of being formed tend to be frequent words. The correlation is low, however: 0.276. The plot says very little about words of high frequency or words with low probability of being formed.
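The text does not say which correlation coefficient was used. The sketch below assumes the Pearson coefficient, computed over one (syntactic information, phonological information) pair per word. The function name is hypothetical and the sample values are made up for illustration; they are not data from the plot.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers.

    Raises ValueError on mismatched or too-short input, and
    ZeroDivisionError if either sequence is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up information values (one pair per word), for illustration only.
syntactic = [3.1, 7.4, 9.0, 12.5]
phonological = [8.2, 6.9, 11.3, 10.1]
print(pearson_correlation(syntactic, phonological))
```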

Note that syntactic information need not be either larger or smaller than phonological information. Consider P(ab) and P(a)P(b|a), where P(a) is the probability that a word begins with a and P(b|a) is the probability that an a is followed by b, in the two languages (represented by (word,frequency) pairs) { (b,1), (ab,2), (bac,3) } and { (b,1), (ab,2), (abc,3) }. In the first language P(ab) = 2/6 > 2/6 x 2/5 = P(a)P(b|a), and in the second P(ab) = 2/6 < 5/6 x 1 = P(a)P(b|a).
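The arithmetic for the two toy languages can be checked with a short script. This is a minimal sketch under stated assumptions, not the author's code: each symbol is treated as a single character, there is no end-of-word symbol, and a transition probability is estimated from frequency-weighted counts of the context wherever it is followed by another symbol in a word. The function names are hypothetical.

```python
from fractions import Fraction


def model_probability(word, lexicon, n):
    """Probability that an order-n Markov model emits `word`.

    `lexicon` maps each word to its frequency. Assumption: no
    end-of-word symbol; contexts are counted anywhere in a word.
    """
    if len(word) < n:
        raise ValueError("word must have at least n symbols")
    total = sum(lexicon.values())
    # Initial factor: share of words that begin with the first n symbols.
    starts = sum(f for w, f in lexicon.items() if w.startswith(word[:n]))
    prob = Fraction(starts, total)
    # One transition factor for each remaining symbol.
    for i in range(n, len(word)):
        context, nxt = word[i - n:i], word[i]
        seen = followed = 0
        for w, f in lexicon.items():
            for j in range(len(w) - n):  # contexts that have a successor
                if w[j:j + n] == context:
                    seen += f
                    if w[j + n] == nxt:
                        followed += f
        if seen == 0:
            return Fraction(0)
        prob *= Fraction(followed, seen)
    return prob


def word_probability(word, lexicon):
    """Probability that a randomly drawn word is `word`."""
    return Fraction(lexicon.get(word, 0), sum(lexicon.values()))


first = {"b": 1, "ab": 2, "bac": 3}
second = {"b": 1, "ab": 2, "abc": 3}

print(word_probability("ab", first), model_probability("ab", first, 1))    # 1/3 2/15
print(word_probability("ab", second), model_probability("ab", second, 1))  # 1/3 5/6
```

Exact fractions are used so the results can be compared directly with 2/6 x 2/5 and 5/6 x 1; taking the negative logarithm of each value gives the corresponding information.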

The following graph shows the same plot, restricted to the 1000 most frequent words of English rather than all words of the language.

The most frequent words generally seem to have the same distribution as the rest of the words.

Next, we assume that n=1.

n=2:

n=3:

n=4.

For comparison, we repeat the calculation for n=4 for the 1000 most common words (of length at least 4). This time the difference from the main plot is more visible: more points lie close to the line "phonological information = syntactic information".

n=8. Note that the number of data points in each plot is decreasing, as we consider only words with at least n phonemes.

n=12.

Finally, we assume that n=16.

As n increases, the phonological information approaches the syntactic information. This is because, for words of length n, Prob(phonemes form word) = Prob(initial n phonemes are those of word) = Prob(word); the last equality is exact when no longer word begins with the same n phonemes. The probabilities are similar for words of length close to n, and the percentage of words of length at least n whose length is close to n increases as n increases. A further consequence is that, since more frequent words are generally shorter, the closeness of phonological information to syntactic information is more pronounced for words of higher frequency (and lower syntactic information).

As n (the order of the Markov model) increases, the probability of forming a word tends to be at most the probability of the word itself, so the phonological information tends to be at least the syntactic information. This is because, for words of the form a1a2a3...am (m > n), P(a1a2...an), the probability that a word begins with a1...an, tends to be only a little larger than P(a1a2...am), while P(a(n+1)...am | a1...an) remains relatively small, since a1...an here is an n-gram that can occur anywhere in the word, not just at the start. Therefore P(a1a2...an) x P(a(n+1)...am | a1a2...an) tends to be smaller than P(a1a2...am).

Similar results hold for Dutch and German. Below is the plot for German and n=8. There are more data points since German words are longer, but no other apparent differences.

Conclusion: Words with a higher probability of being formed tend to be frequent words. As the frequency of words and the order of the Markov process used to generate words from phonemes increase, the phonological information approaches the syntactic information, indicating that words with a low probability of being formed are more likely to be infrequent. However, sparsity becomes an issue as the order grows, particularly as n approaches the maximum length of words.