Statistical Models for Text Segmentation

Doug Beeferman, Adam Berger, John Lafferty, 1999.

Dinoj Surendran's working notes for presenting this paper in Gina Levow's Discourse & Dialogue class on 26 Oct 2004

This paper can be found on citeseer at http://citeseer.ist.psu.edu/beeferman99statistical.html. The images on this page are from there as well.


Problem: Given stream of text (e.g. newswire), find topic boundaries.

Available training data: large corpus of newswire text with annotated story boundaries. Can also use larger corpus of text without any story boundaries to estimate trigram probabilities.


Outline

Framework of Algorithm

Remember that text is coming through in a stream of tokens t1, t2,..., tn. These tokens can be, for example, words and punctuation.

There is also a stream of labels (known in training, to be predicted in testing) b1, b2,..., bn. The i-th label bi is 1 if the i-th token is at the start of a new topic, and 0 otherwise.

To each token is a context, consisting of it and the K (here, K=500) words before and after it. Xi denotes the 2K+1-token window surrounding token ti. C is the set of all contexts.

The aim is to create a function q:C-> R that predicts based on Xi that is high if the algorithm thinks bi is 1 and low if it thinks bi is 0.

q is created by combining several features. Each feature is also a function from C to {0,1}. To combine features in q, we give each feature a weight. We give the k-th feature Fk weight wk. The combination procedure is shown below.

q(X) = (1/Z) productk:Fk(X)=1 wk

Z is a normalizing constant - don't worry about it (unless you have to implement this algorithm, in which case this will be only one of many other things to worry about.)

All weights are positive. An uninformative feature has weight 1. A feature suggesting a boundary has weight more than 1. A feature suggesting the absence of a boundary has weight between 0 and 1 (exclusive).

Once you have q, remember that it produces values on each token. (Or, if sentence boundaries are known, on each sentence.) A peak-picking procedure must be defined to posit the presence of a boundary. This is of the form "place a boundary at ti if q(Xi) > A and there is no larger value of q on the surrounding E positions. A and E are constants determined from training data.

Algorithm 1 in the paper shows how f features can be selected from a larger set of candidate features, and how good weights (i.e. the w's or L) for these features can be estimated. It is not very efficient, so Algorithm 2 speeds it up by adding several candidate features at a time. Both are greedy algorithms. Algorithm 2 also does more work, removing features that are highly correlated with other features and thus provide little additional information.

They select f=100 features out of hundreds of thousands of candidate features. Their procedure takes hours. (It's surprising that's it's not days.)

We skip the technical discussions of these feature selection and weight optimization algorithms, and move on to the question of what kind of features are considered.

Cue-based features

W is the set of all words. In practice it's the set of several thousand common words, and all other words are lumped into a single dummy word.

We have a large set of candidate features: W x [-10,10] x words/sentence x Presence/Absence. Examples"

For each feature we have a score.

Below is a table from the paper with some example cue features generated after training with the Wall Street Journal corpus.

The first feature shows that if the word "incorporated" appears in the first sentence (after the current token) then the probability of a sentence boundary at the token (i.e. the sentence with "incorporated" is the first sentence of a new topic) is multiplied by 4.5. If this is the i-th feature, then wi=4.5 (and Li = loge 4.5 = 1.5041). This makes sense when you realize that words like 'incorporated' tend to only appear the first time a company is named.

Another feature (fourth from bottom) shows that if the next five words contains 'he', then the probability of a boundary goes down about 12 times (i.e. it's multiplied by 0.082). This makes sense because "he" is anaphora - whatever "he" is referring to must have have enough time to occur, and five words isnt really enough time.

If sentence boundaries are known, cues can be defined on sentences as well, by applying them to just the first word of the sentence. I'm not sure if that's what they do.

Topicality features

These take into account vocabulary changes across topics.

They define a measure on tokens (it can be generalized to work on sentences) that tends to increase the further one goes into a story, and suddenly drop when a new story starts. The call this measure topicality, and define features of the form "topicality is between a and b over the next k sentences".

The topicality measures how differently two language models M1 and M2 predict the next word based on previous words. M1 and M2 adapt at varying rates to new vocabulary - the assumption being that two successive stories always come from domains with very different vocabularies.

Recall that the text is assumed to be a sequence of tokens t1...tN. Suppose you have seen n-1 tokens so far, and are trying to predict the n-th token using them. The two models give different probabilities that the next token is whatever it is. Pi(tn | t1...tn-1) is defined to be the probability according to model Mi.

Topicality is defined to be their ratio:

T(tn | t1...tn-1) := P1(tn | t1...tn-1) / P2(tn | t1...tn-1)

Language Model 1

A second-order Markov model.

P1(tn | t1...tn-1) = P1(tn | tn-2tn-1)

This is about equal (it would be equal if there was no smoothing) to the number of times tn-2tn-1tn appears in the training corpus divided by the number of times tn-2tn-1 appears.

To predict a new word, only information within a sentence (and from the training corpus) is used. Therefore this model will do just as well predicting words in sentences at the start of a topic as it will on predicting words in sentences well into the topic.

This model is referred to in the paper as the trigram or short-range model.

Language Model 2

The trigram model, but extended with triggers from words in recent history (in this paper, that's the previous 500 words).

Suppose the recent history of a word w in the datastream is X. Q(w|X) is the amount by which w is more likely to appear given X (as compared to just the probability given by the trigram model). If there are words in X that are typically highly correlated with w (e.g. w = "Sox" and X contains "Yankees") then Q(w|X) is higher than 1.

P2(tn | t1...tn-1) = Q(tn | X) P1(tn | t1...tn-1) / [ sumw in W Q(w | X) P1(w | t1...tn-1) ]
= P2(tn | X, tn-2tn-1) = Q(tn | X) P1(tn | tn-2tn-1) / [ sumw in W Q(w | X) P1(w | tn-2tn-1) ]

Note the really annoying normalization term that ensures that P2 is a probability.

Now to define Q(w|X). For every u in W, there is a parameter Luw that defines how much an appearance of u in X increases the chance of w appearing after X. Thus

Q(w|X) = exp (sumu in W : u appears in XLuw) = exp (sumu in WnX Luw)

The point is that P2(tn | t1...tn-1) tends to be higher the further tn is in a topic.

Model 1 / Model 2

Their measure of topicality T of the n-th word is

T(tn|X) = log (P2(tn|t1...tn-1) /P1(tn|t1...tn-1))

(That's all you need to know, but if you want to work it out in full, it's this :

T(tn|X) = log ((P2(tn|X,tn-2tn-1) / P1(tn|tn-2tn-1)) = log Q(tn|X) - log [sumw in W Q(w | X) P1(w | tn-2tn-1)]

)

T(tn|X) increases the further into a topic tn is, and drops when a new topic starts.

We make the further assumption that sentence boundaries are known, and define T on sentences rather than words. This is done by taking its mean (geometric?) across the 3-rd (presumably), 4-th, 5th,..., and final word in the sentence.

Evaluation Metric

Suppose we know from training data that the average length of a story is 2k. Suppose an algorithm A partitions a test corpus with n tokens into what it thinks are topics, and we have an external source of information that knows what the correct topic boundaries are.

The paper uses Doddington's Pk metric to evaluate the perfomance of A as follows. It considers all n-k pairs of words in the test corpus that are k words apart, and counts how many of them are actually in the same topic, and how many of them A says are in the same topic.

Pk(A) is the fraction of the n-k pairs where the correct answer agrees with A.

Pk(A) can be split into two terms that measure when A says two words are in the same topic when they arent, and vice versa.

In practice Pk(A) produces values that are somewhere between the false positive and false negative rate.

Comparison with other methods

The investigate their method using cue-only & topicality-only features. Topicality-only does badly. Cue-only does much better, but is still improved by adding topicality.

They compare their method with their implementations of Text Tiling, HMMs and Decision Trees. It beats them all (of course) but can be improved further by combining it with the decision tree.