Language Models

Language models (LMs) calculate the probability of seeing a given sequence of words, as defined by a tokenization algorithm, in a given language or sub-language/domain/genre. For example, an English language model may assign a higher probability to the sequence "How are you?" than to "Wassup' dawg?", while for a hip-hop language model this relation may be reversed.

There are several use cases for such models:

  • To filter out bad translations in machine translation.
  • To rank speech recognition output.
  • In concept-to-text generation.

More formally, a language model is a stochastic process that models the probability p(w_1, …, w_d) of observing a sequence of words w_1, …, w_d. Without loss of generality, we can decompose the probability of such sequences into

p(w_1, \ldots, w_d) = p(w_1) \prod_{i=2}^{d} p(w_i \mid w_1, \ldots, w_{i-1}).

This means that a language model can be defined by how it models the conditional probability p(w_i | w_1, …, w_{i-1}) of seeing a word w_i after having seen the history of previous words w_1, …, w_{i-1}. We also have to model the prior probability p(w_1), but it is easy to reduce this prior to a conditional probability as well, so in effect we only need to model how words are generated based on a history.

In practice it is common to define language models based on equivalence classes of histories instead of having different conditional distributions for each possible history. This overcomes the sparsity and efficiency problems that arise when working with full histories.

N-gram Language Models

The most common type of equivalence class relies on truncating histories w_1, …, w_{i-1} to length n−1:

p(w_i \mid w_1, \ldots, w_{i-1}) = p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}).
That is, the probability of a word only depends on the previous n−1 words. We will refer to such a model as an n-gram language model.

A Uniform Baseline LM

The simplest n-gram language models are unigram (1-gram) models. That is, they model the conditional probability of a word using just the prior probability of seeing that word:

p(w_i \mid w_1, \ldots, w_{i-1}) = p(w_i).

To set up the datasets and to provide a baseline for more complex language models, we first introduce the simplest instantiation of a unigram model: a uniform language model which assigns the same prior probability to each word. That is, given a vocabulary of words V, the uniform LM is defined as:

p(w_i \mid w_1, \ldots, w_{i-1}) = \frac{1}{|V|}.

Let us "train" and test such a language model on the OHHLA corpus. First we need to load this corpus. Below we focus on a subset to make our code more responsive and to allow us to test models more quickly.

[BAR] Can 't even call this a blues song [/BAR] [BAR] It 's been so long [/BAR] [BAR] Neither one of us was wrong or anything like that [/BAR] [BAR] It seems like yesterday [/BAR]

We can now create a uniform language model using a built-in constructor. Language models in this book implement the LanguageModel trait.

The most important method we have to provide is probability(word, history), which returns the probability of a word given a history. Let us implement a uniform LM using this trait.
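A minimal sketch of what the trait and the uniform LM might look like is given below; the vocab member and the vararg history signature are assumptions of this sketch rather than the book's exact code.

    // Sketch of the LanguageModel trait; vocab and the vararg history
    // parameter are assumptions of this sketch.
    trait LanguageModel {
      def vocab: Set[String]
      def probability(word: String, history: String*): Double
    }

    // A uniform LM: every word in the vocabulary gets probability 1/|V|.
    class UniformLM(val vocab: Set[String]) extends LanguageModel {
      def probability(word: String, history: String*): Double =
        if (vocab(word)) 1.0 / vocab.size else 0.0
    }

For any in-vocabulary word this simply returns 1/|V|, which for our training vocabulary gives the value shown below.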

2.86368843069874E-4

Sampling

It is instructive and easy to sample language from a language model. In many, but not all, cases, the more natural the generated language of an LM looks, the better the LM is; models with poorer-looking samples can still be useful.

To sample from an LM one simply needs to iteratively sample the next word from the LM's conditional distribution over words, and append each newly sampled word to the history used for the next step. The only challenge in implementing this is sampling from a categorical distribution over words; here we provide this functionality via sampleCategorical.
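A possible sketch of this sampling loop follows. Here sampleCategorical is a simple inverse-transform sampler over (word, probability) pairs; the helper sample and its argument names are illustrative choices of this sketch.

    import scala.util.Random

    // Draw one word from a categorical distribution given as (word, probability) pairs.
    def sampleCategorical(dist: Seq[(String, Double)], rng: Random): String = {
      val r = rng.nextDouble() * dist.map(_._2).sum
      // walk the cumulative distribution until it passes r
      dist.scanLeft(("", 0.0)) { case ((_, acc), (w, p)) => (w, acc + p) }
        .drop(1)
        .find(_._2 >= r)
        .map(_._1)
        .getOrElse(dist.last._1)
    }

    // Sample `length` words, each conditioned on the previously sampled history.
    def sample(lm: LanguageModel, init: Seq[String], length: Int,
               rng: Random = new Random): Seq[String] =
      (1 to length).foldLeft(init) { (history, _) =>
        val dist = lm.vocab.toSeq.map(w => w -> lm.probability(w, history: _*))
        history :+ sampleCategorical(dist, rng)
      }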

" cassettes fuckin' "Yes" zone dirty set breeze glint Conduct

Evaluation

How do we determine the quality of an (n-gram) LM? One way is extrinsic evaluation: assess how much the LM improves performance on downstream tasks such as machine translation or speech recognition. Arguably this is the most important measure of LM quality, but it can be costly, as re-training such systems may take days, and when we seek to develop general-purpose LMs we may have to evaluate performance on several tasks. This is problematic when one wants to iteratively improve LMs and test new models and parameters. It is hence useful to find intrinsic means of evaluation that assess the stand-alone quality of LMs with minimal overhead.

One intrinsic measure is how well the LM plays the "Shannon Game": predict what the next word in an actual context should be, and win if your predictions match the words in an actual corpus. This can be formalized using the notion of the perplexity of the LM on a given dataset. Given a test sequence w_1, …, w_T of T words, we calculate the perplexity PP as follows:

PP(w_1, \ldots, w_T) = p(w_1, \ldots, w_T)^{-\frac{1}{T}} = \sqrt[T]{\prod_{i=1}^{T} \frac{1}{p(w_i \mid w_{i-n+1}, \ldots, w_{i-1})}}

We can implement a perplexity function based on the LanguageModel interface.
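A sketch of such a function, under the assumptions of the trait above, could look as follows; working in log-space avoids numerical underflow on long test sequences.

    // Perplexity of an order-n LM on a tokenised test sequence.
    def perplexity(lm: LanguageModel, data: Seq[String], order: Int): Double = {
      // score each word given the previous order-1 words, skipping the first
      // few positions that lack a full history
      val indices = (order - 1) until data.size
      val logProb = indices.map { i =>
        math.log(lm.probability(data(i), data.slice(i - (order - 1), i): _*))
      }.sum
      math.exp(-logProb / indices.size)
    }

Note that a single zero-probability word drives the log-probability sum to negative infinity, and hence the perplexity to infinity.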

Let's see how the uniform model does on our test set.

Infinity

Out-of-Vocabulary Words

The problem in the above example is that the baseline model assigns zero probability to words that are not in its vocabulary. Test sets will usually contain such words, and this leads to the infinite perplexity above.

Vector((underground, 0.0), (metaphors, 0.0), (scrape, 0.0))

The Long Tail

The fact that we regularly encounter new words is not specific to our corpus but a general phenomenon of language. Typically we see a few words that appear very frequently, and a long tail of words that each appear only a few times. While each individual long-tail word is rare, the probability of seeing some long-tail word is quite high, as the long tail covers a substantial share of the total frequency mass.

Let us observe this phenomenon for our data: we will rank the words according to their frequency, and plot this frequency against the rank.

[Plot: word frequency against frequency rank; ranks run from 1 to about 3492 and frequencies from 1 to about 1859.]
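The ranking itself is easy to compute. A brief sketch, with train assumed to hold the tokenised training corpus:

    // Count each word type and sort by descending frequency; the position in the
    // resulting sequence (plus 1) is the word's rank.
    def rankedFrequencies(train: Seq[String]): Seq[(String, Int)] =
      train.groupBy(identity)
        .map { case (w, occurrences) => (w, occurrences.size) }
        .toSeq
        .sortBy { case (_, count) => -count }

    // points to plot: (rank, frequency)
    // val points = rankedFrequencies(train).zipWithIndex.map { case ((_, f), r) => (r + 1, f) }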

In log-space such rank-vs-frequency graphs resemble linear functions. This observation is known as Zipf's Law and can be formalized as follows: let r_w be the rank of a word w and f_w its frequency; then

f_w \propto \frac{1}{r_w}.

Inserting Out-of-Vocabulary Tokens

The long tail of infrequent words is a problem for LMs because it means there will always be words with zero counts in the training set. There are various solutions to this problem. For example, when calculating the LM perplexity we could simply remove test words that do not appear in the training set. This avoids infinite perplexity but does not solve the actual issue: the LM assigns too little probability to unseen words. Moreover, the problem only gets worse for n-gram models with larger n, because these encounter many unseen n-grams which, when removed, leave only small fragments of the original sentences.

The principled solution to this problem is smoothing, which we discuss in more detail later. Before we get there, we present a simple preprocessing step that generally simplifies the handling of unseen words and gives rise to a simple smoothing heuristic: we replace unseen words in the test corpus with an out-of-vocabulary token, say [OOV]. This means that LMs can still work with a fixed vocabulary consisting of all training words plus the [OOV] token. Now we just need a way to estimate the probability of the [OOV] token to avoid the infinite perplexity problem.

Vector([BAR], [J-Live], [/BAR], [BAR], For, [OOV], [OOV], [/BAR], [BAR], You)

A simple way to (heuristically) estimate the [OOV] probability is to replace the first encounter of each word in the training set with the [OOV] token. We can then estimate LMs as before and automatically get an estimate of the [OOV] probability. The underlying assumption of this heuristic is that the probability of an unseen word equals the probability of encountering a new word. We illustrate the two operations of this method in the code below.
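A sketch of these two operations is given below; the names injectOOVs and replaceOOVs are illustrative rather than taken from the accompanying code.

    // Replace the first occurrence of every word type in the training data with OOV.
    def injectOOVs(oov: String, data: Seq[String]): Seq[String] = {
      val seen = scala.collection.mutable.Set[String]()
      data.map(w => if (seen.add(w)) oov else w)
    }

    // Replace words that are not in the training vocabulary with OOV.
    def replaceOOVs(oov: String, vocab: Set[String], data: Seq[String]): Seq[String] =
      data.map(w => if (vocab(w)) w else oov)

    // e.g. injectOOVs("[OOV]", Seq("A", "A", "B", "B", "A"))
    //      == Seq("[OOV]", "A", "[OOV]", "B", "A")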

Vector([OOV], A, [OOV], B, A)
Now we can apply this to our training and test set, and create a new uniform model.

1244.9999999982847

Training Language Models

The uniform LM is obviously not good at modelling actual language. To improve upon this baseline, we can estimate the conditional n-gram distributions from the training data. To this end, let us first introduce one parameter θ_{w,h} for each word w and history h of length n−1, and define a parametrized language model p_θ:

p_{\theta}(w \mid h) = \theta_{w,h}

Training an n-gram LM amounts to estimating θ from a training set D_train = (w_1, …, w_d). One way to do this is to choose the θ that maximizes the log-likelihood of D_train:

\theta^* = \arg\max_{\theta} \log p_{\theta}(D_{\text{train}})

As it turns out, this maximum-likelihood estimate (MLE) can be calculated in closed form, simply by counting:

\theta^*_{w,h} = \frac{\#_{D_{\text{train}}}(h, w)}{\#_{D_{\text{train}}}(h)}

where

\#_{D}(e) = \text{count of } e \text{ in } D

Many LM variants can be implemented simply by estimating the counts in the numerator and denominator differently. We therefore introduce a trait for such count-based LMs. This will later help us implement LM variants by modifying the counts of a base LM.
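A possible sketch of such a trait is shown below; the member names counts, normalizer and order are assumptions of this sketch.

    // A count-based LM is defined by an n-gram count (numerator) and a
    // history count (denominator); the probability is simply their ratio.
    trait CountLM extends LanguageModel {
      def order: Int
      def counts(word: String, history: Seq[String]): Double   // #(h, w)
      def normalizer(history: Seq[String]): Double              // #(h)
      def probability(word: String, history: String*): Double = {
        val h = history.takeRight(order - 1)
        val norm = normalizer(h)
        if (norm == 0.0) 0.0 else counts(word, h) / norm
      }
    }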

Let us use this trait to code up a generic n-gram model.
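One way to do this is to precompute the n-gram and history counts from the training data, roughly as follows; the class name NGramLM and the special handling of the unigram case are choices of this sketch.

    // Maximum-likelihood n-gram LM: counts are taken directly from the training data.
    class NGramLM(train: Seq[String], val order: Int) extends CountLM {
      val vocab: Set[String] = train.toSet
      private val ngramCounts: Map[Seq[String], Double] =
        train.sliding(order).toSeq
          .groupBy(identity).map { case (ng, occ) => ng -> occ.size.toDouble }
      private val historyCounts: Map[Seq[String], Double] =
        if (order == 1) Map(Seq.empty[String] -> train.size.toDouble)
        else train.sliding(order - 1).toSeq
          .groupBy(identity).map { case (h, occ) => h -> occ.size.toDouble }
      def counts(word: String, history: Seq[String]): Double =
        ngramCounts.getOrElse(history :+ word, 0.0)
      def normalizer(history: Seq[String]): Double =
        historyCounts.getOrElse(history, 0.0)
    }

    // a unigram model is then new NGramLM(oovTrain, 1),
    // where oovTrain (an assumed name) holds the OOV-injected training tokens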

Let us train a unigram model.

[Bar chart: unigram probabilities of the ten most frequent tokens — [OOV], [/BAR], [BAR], ",", the, I, you, to, ".", a — with probabilities ranging from about 0.18 down to 0.01.]

The unigram LM has substantially reduced (and hence better) perplexity:

88.20631586915746

Let us also look at the language the unigram LM generates.

me [OOV] "The Ayo , [OOV] [BAR] ? style on

Bigram LM

The unigram model ignores any correlation between consecutive words in a sentence. The next best model to overcome this shortcoming is a bigram model, which conditions the probability of the current word on the previous word. Let us construct such a model from the training data.
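With the NGramLM sketch from above, constructing the bigram model is a one-liner (oovTrain again names the assumed OOV-injected training tokens):

    // Bigram model (order 2) and the ten most likely words following "I".
    val bigram = new NGramLM(oovTrain, 2)
    val afterI = bigram.vocab.toSeq
      .map(w => w -> bigram.probability(w, "I"))
      .sortBy { case (_, p) => -p }
      .take(10)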

[Bar chart: bigram probabilities p(w | "I") for the ten most likely next words — [OOV], 'm, can, get, got, was, gotta, just, don, do — with probabilities ranging from about 0.22 down to 0.01.]

You can see a more peaked distribution conditioned on "I" than in the unigram case. Let us see how the bigram LM generates language.

[BAR] But if I got [/BAR] [BAR] But now to the

Does the bigram model improve perplexity?

Infinity

Unfortunately the bigram model suffers from the very problem the OOV preprocessing above was meant to avoid: there are contexts in which the [OOV] token (and other words) has not been seen, and in these contexts it receives zero probability. That is, while every word has been seen at least once, not every word has been seen in every context.

0.0

Smoothing

The general problem is that maximum-likelihood estimates will always underestimate the true probability of some words, and in turn overestimate the (context-dependent) probabilities of others. To overcome this issue we aim to smooth the probabilities and move mass from seen events to unseen events.

Laplace Smoothing

The easiest way to overcome the problem of zero probabilities is to simply add pseudo-counts to each event in the dataset (in a Bayesian setting this amounts to a maximum a posteriori estimate under a Dirichlet prior on the parameters).

\theta^{\alpha}_{w,h} = \frac{\#_{D_{\text{train}}}(h, w) + \alpha}{\#_{D_{\text{train}}}(h) + \alpha |V|}

Let us implement this in Scala.
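A sketch of add-α smoothing as a wrapper around the count-based model from above (the class name and constructor are choices of this sketch, and oovTrain/oovTest name the assumed OOV-processed data):

    // Laplace (add-alpha) smoothing: add alpha pseudo-counts to every n-gram
    // and alpha * |V| to every history count.
    class LaplaceLM(base: NGramLM, alpha: Double) extends CountLM {
      val vocab: Set[String] = base.vocab
      val order: Int = base.order
      def counts(word: String, history: Seq[String]): Double =
        base.counts(word, history) + alpha
      def normalizer(history: Seq[String]): Double =
        base.normalizer(history) + alpha * vocab.size
    }

    // e.g. perplexity(new LaplaceLM(new NGramLM(oovTrain, 2), 0.1), oovTest, 2)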

7.968127490039841E-4

This should give a better perplexity value:

68.15799924036281
Exercise 2
Can you find a better pseudo-count number?

Adjusted Counts

It is often useful to think of smoothing algorithms as un-smoothed maximum-likelihood estimators that work with adjusted n-gram counts in the numerator and fixed history counts in the denominator. This lets us see how much the counts of high-frequency words are reduced and the counts of unseen words increased. If these changes are too large, the smoothing method is likely not very effective.

Let us reformulate the Laplace LM using adjusted counts. Note that since we have histories with count 0, we need to increase the original denominator by a small ϵ to avoid division by zero.

\#_{D_{\text{train}},\alpha}(h, w) = \theta^{\alpha}_{w,h} \left( \#_{D_{\text{train}}}(h) + \epsilon \right), \qquad \#_{D_{\text{train}},\alpha}(h) = \#_{D_{\text{train}}}(h) + \epsilon
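A sketch of the adjusted-count computation, using the LaplaceLM and NGramLM sketches above and a small assumed ϵ:

    // Adjusted count of (h, w): the smoothed probability rescaled by the
    // (epsilon-padded) original history count.
    def adjustedCount(smoothed: LaplaceLM, base: NGramLM,
                      word: String, history: Seq[String],
                      epsilon: Double = 1e-10): Double =
      smoothed.probability(word, history: _*) * (base.normalizer(history) + epsilon)

    // compare a raw count with its adjusted counterpart, e.g.
    // (base.counts(word, history), adjustedCount(smoothed, base, word, history))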

(625.0, 603.5748098741531)

We see above that for high-frequency words the absolute counts are altered quite substantially. This is unfortunate because for high-frequency words we would expect the counts to be relatively accurate. Can we test more generally whether our adjusted counts are sensible?

One option is to compare the adjusted counts to average counts in a held-out set. For example, for words with count 0 in the training set, how does their average count in the held-out set compare to their adjusted count in the smoothed model?
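One way to carry out this comparison is sketched below, for simplicity with unigram counts; train and heldOut are assumed token sequences, and adjusted is assumed to map a word to its adjusted count under the smoothed model.

    // For each raw training count c (0 .. maxCount), average the held-out counts and
    // the adjusted counts of all words that occurred exactly c times in training.
    // (Assumes each count class is non-empty.)
    def countComparison(train: Seq[String], heldOut: Seq[String],
                        adjusted: String => Double,
                        maxCount: Int = 7): Seq[(Int, Double, Double)] = {
      val trainCounts = train.groupBy(identity).map { case (w, o) => w -> o.size }.withDefaultValue(0)
      val heldOutCounts = heldOut.groupBy(identity).map { case (w, o) => w -> o.size }.withDefaultValue(0)
      val vocab = (train ++ heldOut).toSet
      (0 to maxCount).map { c =>
        val words = vocab.filter(w => trainCounts(w) == c).toSeq
        val avgHeldOut = words.map(w => heldOutCounts(w).toDouble).sum / words.size
        val avgAdjusted = words.map(adjusted).sum / words.size
        (c, avgHeldOut, avgAdjusted)
      }
    }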

train-count test-count smoothed-count
0.0 0.003623312785682409 0.005007694407734
1.0 0.4578539563277765 0.3189338695509523
2.0 1.2009237875288683 0.7719376450920822
3.0 1.8906666666666667 1.2309497704843773
4.0 3.0714285714285716 1.6623618386822872
5.0 4.366336633663367 2.124690216711694
6.0 4.068965517241379 3.0709006718308984
7.0 5.365384615384615 3.7380632119752257

Interpolation

For a given context, the smoothing methods discussed above shift mass uniformly across the words that have not been seen in that context. This makes sense when the words are not in the vocabulary at all. However, when words are in the vocabulary but simply have not been seen in the given context, we can do better: we can leverage statistics about the word from other contexts. In particular, we can back off to the statistics of (n−1)-grams.

(([BAR], serious), 0.09532115739790684)

A simple technique to make use of the (n−1)-gram statistics is interpolation. Here we compose the probability of a word as a weighted sum of its probability under the n-gram model p and under a back-off (n−1)-gram model p′:

p_{\alpha}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \alpha \, p(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) + (1 - \alpha) \, p'(w_i \mid w_{i-n+2}, \ldots, w_{i-1})

Let us interpolate between a bigram and uniform language model, varying the interpolation parameter α.
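A sketch of such an interpolated model, built on the LanguageModel trait from above (the class name is a choice of this sketch):

    // Linear interpolation of a main LM with a back-off LM.
    class InterpolatedLM(main: LanguageModel, backoff: LanguageModel, alpha: Double)
        extends LanguageModel {
      val vocab: Set[String] = main.vocab
      def probability(word: String, history: String*): Double =
        alpha * main.probability(word, history: _*) +
          (1.0 - alpha) * backoff.probability(word, history: _*)
    }

    // sweep alpha and record the test perplexity of the bigram/uniform mixture:
    // val pps = (0 to 20).map(_ / 20.0).map(a =>
    //   a -> perplexity(new InterpolatedLM(bigram, new UniformLM(bigram.vocab), a), oovTest, 2))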

[Plot: test perplexity of the interpolated LM as a function of the interpolation parameter α (0 to 0.95); perplexity ranges between roughly 49.3 and 88.2.]

Backoff

Instead of combining probabilities for all words given a context, it makes sense to back off only when no counts for a given event are available, and to rely on the available counts where possible.

A particularly simple, not to say stupid, backoff method is Stupid Backoff. Let w be a word and h_n an n-gram history of length n:

p_{\text{Stupid}}(w \mid h_n) =
\begin{cases}
  \frac{\#_{D_{\text{train}}}(h_n, w)}{\#_{D_{\text{train}}}(h_n)} & \text{if } \#_{D_{\text{train}}}(h_n, w) > 0 \\
  p_{\text{Stupid}}(w \mid h_{n-1}) & \text{otherwise}
\end{cases}

It turns out that the Stupid Backoff LM is very effective in extrinsic evaluations, but it does not represent a valid probability distribution: when you sum the probabilities of all words given a history, the result may be larger than 1. This is because the main n-gram model probabilities already sum to 1 over all words with non-zero counts. Probabilities summing to more than 1 make perplexity values meaningless. The code below illustrates the problem.
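A sketch of Stupid Backoff over the count-based models from above, together with the check that the scores can exceed 1 when summed over the vocabulary (bigram and unigram name the models built earlier):

    // Stupid Backoff: use the raw relative frequency where the n-gram has been seen,
    // otherwise fall back to the lower-order model. The scores are not renormalised.
    class StupidBackoffLM(main: CountLM, backoff: LanguageModel) extends LanguageModel {
      val vocab: Set[String] = main.vocab
      def probability(word: String, history: String*): Double = {
        val h = history.takeRight(main.order - 1)
        if (main.counts(word, h) > 0)
          main.counts(word, h) / main.normalizer(h)
        else
          backoff.probability(word, history: _*)
      }
    }

    // val stupid = new StupidBackoffLM(bigram, unigram)
    // stupid.vocab.toSeq.map(w => stupid.probability(w, "I")).sum   // may exceed 1.0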

1.0709880976810933

There are several "proper" backoff models that do not have this problem, e.g. the Katz backoff method. We refer to the background reading below for a deeper discussion of these.

Exercise 3
Develop and implement a version of the stupid language model that provides probabilities summing up to 1.

Background Reading

  • Jurafsky & Martin, Speech and Language Processing: Chapter 4, N-Grams.
  • Bill MacCartney, Stanford NLP Lunch Tutorial: Smoothing