Language models (LMs) calculate the probability of seeing a given sequence of words.
There are several use cases for such models:
Without loss of generality, we can decompose the probability of a word sequence $w_1,\ldots,w_d$ using the chain rule:

$$
p(w_1,\ldots,w_d) = \prod_{i=1}^{d} p(w_i \mid w_1,\ldots,w_{i-1})
$$

We therefore only need to model how a word is generated based on its history of previous words.
In practice it is common to define language models based on equivalence classes of histories.
The most common type of equivalence class relies on truncating histories to length $n-1$:

$$
p(w_i \mid w_1,\ldots,w_{i-1}) \approx p(w_i \mid w_{i-n+1},\ldots,w_{i-1})
$$
The simplest n-gram language models are unigram models ($n=1$). That is, they model the conditional probability of a word using the prior probability of seeing that word:

$$
p(w_i \mid w_1,\ldots,w_{i-1}) \approx p(w_i)
$$
Given a vocabulary of words $\mathcal{V}$, the uniform LM is defined as:

$$
p(w_i \mid w_1,\ldots,w_{i-1}) = \frac{1}{|\mathcal{V}|}
$$
Let us "train" and test such a language model on the OHHLA corpus. First we need to load this corpus. Below we focus on a subset to make our code more responsive and to allow us to test models more quickly.
We can now create a uniform language model using a built-in constructor. Language models in this book implement the LanguageModel
trait.
Let us implement a uniform LM using this trait.
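A minimal sketch of what such a trait and a uniform LM could look like; the member names (`vocab`, `order`, `probability`) are assumptions rather than a fixed API.

```scala
// A minimal LanguageModel trait (member names are illustrative): an LM knows
// its vocabulary, its order (history length + 1), and assigns a probability
// to a word given a history.
trait LanguageModel {
  def vocab: Set[String]
  def order: Int
  def probability(word: String, history: String*): Double
}

// A uniform LM spreads probability mass evenly over the vocabulary and
// ignores the history completely.
case class UniformLM(vocab: Set[String]) extends LanguageModel {
  val order = 1
  def probability(word: String, history: String*): Double =
    if (vocab(word)) 1.0 / vocab.size else 0.0
}
```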
The quality of an LM can often be gauged by looking at its samples, but models with poorer samples can still be useful.
You can sample a sequence word by word, drawing the current word conditioned on the previously sampled ones.
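A sketch of such a sampler, assuming the `LanguageModel` trait above; it walks the cumulative distribution over the vocabulary for each position.

```scala
import scala.util.Random

// Sample a sequence word by word: each new word is drawn from the LM's
// distribution conditioned on the previously sampled words.
def sampleSequence(lm: LanguageModel, length: Int, rng: Random = new Random()): Seq[String] = {
  def sampleWord(history: Seq[String]): String = {
    val r = rng.nextDouble()
    val words = lm.vocab.toSeq
    // cumulative distribution over the vocabulary given the history
    val cdf = words.scanLeft(0.0)((acc, w) => acc + lm.probability(w, history: _*)).tail
    val index = cdf.indexWhere(_ >= r)
    // fall back to an arbitrary word if the distribution carries no mass
    // (e.g. an unseen history) or rounding leaves the cdf slightly below 1
    if (index >= 0) words(index) else words.last
  }
  (1 to length).foldLeft(Seq.empty[String]) { (sampled, _) =>
    sampled :+ sampleWord(sampled.takeRight(lm.order - 1))
  }
}
```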
Most important for LM quality is its impact on downstream tasks, measured in extrinsic evaluations. These can be expensive to run, hence intrinsic evaluations are a useful proxy.
The perplexity of an LM on a sample $w_1,\ldots,w_T$ is a measure of intrinsic quality:

$$
\mathrm{PP}(w_1,\ldots,w_T) = p(w_1,\ldots,w_T)^{-\frac{1}{T}} = \left( \prod_{i=1}^{T} \frac{1}{p(w_i \mid w_1,\ldots,w_{i-1})} \right)^{\frac{1}{T}}
$$
We can implement a perplexity function based on the LanguageModel
interface.
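A sketch of such a function under the same assumed trait; it works in log space to avoid numerical underflow and truncates each history to the model's order.

```scala
// Perplexity of an LM on a sequence: the inverse probability of the data,
// normalised by the number of words.
def perplexity(lm: LanguageModel, data: Seq[String]): Double = {
  val logProbs = data.indices.map { i =>
    val history = data.slice(math.max(0, i - lm.order + 1), i)
    math.log(lm.probability(data(i), history: _*))
  }
  math.exp(-logProbs.sum / data.size)
}
```

A single zero-probability word drives the log-probability sum to negative infinity and hence the perplexity to infinity.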
Let's see how the uniform model does on our test set.
The model assigns probability 0 to test words that are not in the training vocabulary, which makes the perplexity infinite.
A few words appear very frequently, while a long tail of words each appear only a few times but account for a large share of the text in aggregate.
Let us observe this phenomenon for our data: we will rank the words according to their frequency, and plot this frequency against the rank.
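A sketch of how the rank–frequency pairs could be computed, assuming `train` holds the tokenised training corpus; the plotting itself is omitted.

```scala
// Count word frequencies and sort in decreasing order of frequency.
val frequencies: Seq[(String, Int)] =
  train.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }
    .toSeq.sortBy(-_._2)

// Pair each word's rank (1-based) with its frequency; plotting log-rank
// against log-frequency should give a roughly straight line under Zipf's Law.
val rankVsFreq: Seq[(Int, Int)] =
  frequencies.zipWithIndex.map { case ((_, freq), index) => (index + 1, freq) }

rankVsFreq.take(10).foreach { case (rank, freq) => println(s"$rank\t$freq") }
```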
Zipf's Law states that the frequency $f$ of a word is inversely proportional to its rank $r$ in the frequency-ordered vocabulary:

$$
f \propto \frac{1}{r}
$$
The long tail of infrequent words is a problem for LMs and NLP in general. It gets even worse for higher-order n-grams, whose counts are sparser still.
One solution is to replace unseen words with an OOV
token in the test set.
To get probability estimates for these tokens we can replace the first encounter of each word in the training set with OOV
.
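A sketch of the two replacement steps; the helper names and the `OOV` token string are placeholders.

```scala
val OOV = "[OOV]"

// Replace the first occurrence of each word in the training data with the
// OOV token, so the model learns a probability estimate for unseen words.
def injectOOVs(oov: String, words: Seq[String]): Seq[String] = {
  val seen = scala.collection.mutable.HashSet.empty[String]
  words.map { word =>
    if (seen(word)) word else { seen += word; oov }
  }
}

// Replace words outside the given vocabulary with the OOV token (for test data).
def replaceOOVs(oov: String, vocab: Set[String], words: Seq[String]): Seq[String] =
  words.map(word => if (vocab(word)) word else oov)
```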
Now we can apply this to our training and test set, and create a new uniform model.
Let us define a parametrized language model $p_\theta(w \mid h)$ with parameters $\theta$.
Training an n-gram LM amounts to estimating $\theta$ from some training set $\mathcal{D}$. One way to do this is to choose the $\theta$ that maximizes the log-likelihood of $\mathcal{D}$:

$$
\theta^{*} = \arg\max_{\theta} \log p_\theta(\mathcal{D})
$$
As it turns out, this maximum-log-likelihood estimate (MLE) can be calculated in closed form, simply by counting:

$$
p_{\theta^{*}}(w \mid h) = \frac{\#_{\mathcal{D}}(h, w)}{\#_{\mathcal{D}}(h)}
$$

where $\#_{\mathcal{D}}(e)$ denotes the number of times the event $e$ appears in the training set $\mathcal{D}$.
Many LMs can be implemented by estimating the counts in the numerator and denominator differently, so let's define a corresponding trait.

Let us use this to code up a generic n-gram model.
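The trait and the generic model might look as follows; the names `CountLM` and `NGramLM`, and the (ignored) handling of sequence starts, are simplifications.

```scala
// A count-based LM: the probability is a numerator count for the
// (history, word) event divided by a normaliser count for the history.
trait CountLM extends LanguageModel {
  def counts(word: String, history: Seq[String]): Double
  def norm(history: Seq[String]): Double
  def probability(word: String, history: String*): Double = {
    val normaliser = norm(history)
    if (normaliser == 0.0) 0.0 else counts(word, history) / normaliser
  }
}

// A maximum-likelihood n-gram model: numerator counts are raw n-gram counts,
// the normaliser is the count of the history ((n-1)-gram). For brevity this
// sketch does not pad the start of sequences.
class NGramLM(train: Seq[String], val order: Int) extends CountLM {
  val vocab: Set[String] = train.toSet
  val ngramCounts: Map[Seq[String], Int] =
    train.sliding(order).toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
  val historyCounts: Map[Seq[String], Int] =
    if (order == 1) Map(Seq.empty[String] -> train.size)
    else train.sliding(order - 1).toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
  def counts(word: String, history: Seq[String]): Double =
    ngramCounts.getOrElse(history :+ word, 0).toDouble
  def norm(history: Seq[String]): Double =
    historyCounts.getOrElse(history, 0).toDouble
}
```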
Let us train a unigram model.
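Hypothetical usage, assuming `train` and `test` are the OOV-processed token sequences and that the sketches above are in scope:

```scala
val unigram = new NGramLM(train, 1)
perplexity(unigram, test)
```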
The unigram LM has substantially reduced (and hence better) perplexity:
Let us also look at the language the unigram LM generates.
The bigram model conditions the probability of the current word on the previous word: $p(w_i \mid w_1,\ldots,w_{i-1}) \approx p(w_i \mid w_{i-1})$.
Let us see how the bigram LM generates language.
Does the bigram model improve perplexity?
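Again as hypothetical usage of the sketches above: build a bigram model, sample from it, and measure its perplexity on the test data.

```scala
val bigram = new NGramLM(train, 2)
sampleSequence(bigram, 20) // generate some language
perplexity(bigram, test)   // evaluate intrinsically
```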
While every word has been seen, it may not have been seen in every context.
The general problem is that maximum likelihood estimates will always underestimate the true probability of some words, and in turn overestimate the (context-dependent) probabilities of others. To overcome this issue we aim to smooth the probabilities and move mass from seen events to unseen events.
Laplace smoothing adds pseudo-counts $\alpha$ to each event in the dataset:

$$
p_{\alpha}(w \mid h) = \frac{\#_{\mathcal{D}}(h, w) + \alpha}{\#_{\mathcal{D}}(h) + \alpha |\mathcal{V}|}
$$
Let us implement this in Scala.
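A sketch built on the illustrative `CountLM`/`NGramLM` classes above; the pseudo-count value in the usage line is arbitrary.

```scala
// Laplace (add-alpha) smoothing: add a pseudo-count alpha to every numerator
// count and alpha * |V| to the normaliser.
class LaplaceLM(base: NGramLM, alpha: Double) extends CountLM {
  val vocab: Set[String] = base.vocab
  val order: Int = base.order
  def counts(word: String, history: Seq[String]): Double =
    base.counts(word, history) + alpha
  def norm(history: Seq[String]): Double =
    base.norm(history) + alpha * vocab.size
}

perplexity(new LaplaceLM(bigram, 0.1), test)
```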
This should give a better perplexity value:
It is useful to think of smoothing as moving probability mass from seen to unseen events, and hence as adjusting the counts in the numerator.
Let us reformulate the Laplace LM using adjusted counts:

$$
c_{\alpha}(h, w) = \frac{(\#_{\mathcal{D}}(h, w) + \alpha) \, \#_{\mathcal{D}}(h)}{\#_{\mathcal{D}}(h) + \alpha |\mathcal{V}|}
$$

Note that since there are histories with count 0, we need to increase the original denominator $\#_{\mathcal{D}}(h)$ by a small $\epsilon$ to avoid division by zero.
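A sketch of this reformulation under the same assumptions; the class name and the default $\epsilon$ are illustrative.

```scala
// Laplace smoothing via adjusted counts: dividing the adjusted count
// c_alpha(h, w) = (#(h,w) + alpha) * #(h) / (#(h) + alpha * |V|)
// by the original normaliser #(h) recovers the Laplace probability.
// A small eps keeps the normaliser non-zero for unseen histories.
class AdjustedLaplaceLM(base: NGramLM, alpha: Double, eps: Double = 1e-10) extends CountLM {
  val vocab: Set[String] = base.vocab
  val order: Int = base.order
  def counts(word: String, history: Seq[String]): Double = {
    val historyCount = base.norm(history)
    (base.counts(word, history) + alpha) * historyCount / (historyCount + alpha * vocab.size)
  }
  def norm(history: Seq[String]): Double = base.norm(history) + eps
}
```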
Can we test more generally whether our adjusted counts are sensible?
We can compare the adjusted counts to the counts observed in a held-out set:
| train-count | test-count | smoothed-count |
|---|---|---|
| 0.0 | 0.003623312785682409 | 0.005007694407734 |
| 1.0 | 0.4578539563277765 | 0.3189338695509523 |
| 2.0 | 1.2009237875288683 | 0.7719376450920822 |
| 3.0 | 1.8906666666666667 | 1.2309497704843773 |
| 4.0 | 3.0714285714285716 | 1.6623618386822872 |
| 5.0 | 4.366336633663367 | 2.124690216711694 |
| 6.0 | 4.068965517241379 | 3.0709006718308984 |
| 7.0 | 5.365384615384615 | 3.7380632119752257 |
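One way such a comparison could be computed, as a rough sketch under the assumptions above: group bigrams by their training count, then average their held-out counts and their adjusted counts per group. The function name, the held-out split and the use of the `AdjustedLaplaceLM` sketch are all illustrative.

```scala
// For each training count c, report (c, average held-out count, average
// adjusted count) over all n-grams with training count c.
def countComparison(trainData: Seq[String], heldOut: Seq[String],
                    smoothed: CountLM, order: Int = 2): Seq[(Double, Double, Double)] = {
  def ngramCounts(data: Seq[String]): Map[Seq[String], Int] =
    data.sliding(order).toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
  val trainCounts = ngramCounts(trainData)
  val heldOutCounts = ngramCounts(heldOut)
  val allNGrams = (trainCounts.keySet ++ heldOutCounts.keySet).toSeq
  allNGrams
    .groupBy(ngram => trainCounts.getOrElse(ngram, 0))
    .toSeq.sortBy(_._1)
    .map { case (trainCount, ngrams) =>
      val avgHeldOut  = ngrams.map(n => heldOutCounts.getOrElse(n, 0).toDouble).sum / ngrams.size
      val avgAdjusted = ngrams.map(n => smoothed.counts(n.last, n.init)).sum / ngrams.size
      (trainCount.toDouble, avgHeldOut, avgAdjusted)
    }
}
```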
When we don't have reliable $n$-gram counts, we should fall back to $(n-1)$-gram counts.

A simple technique to use the $(n-1)$-gram statistics is interpolation. Here we compose the probability of a word as the weighted sum of the probability under an $n$-gram model $p'$ and a back-off model $p''$:

$$
p_{\alpha}(w \mid h) = \alpha \, p'(w \mid h) + (1 - \alpha) \, p''(w \mid h)
$$
Let us interpolate between a bigram and a uniform language model, varying the interpolation parameter $\alpha$.
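A sketch of such an interpolated model, built on the trait and helpers above; the grid of $\alpha$ values is arbitrary.

```scala
// Linear interpolation: p(w|h) = alpha * p_main(w|h) + (1 - alpha) * p_backoff(w|h).
class InterpolatedLM(main: LanguageModel, backoff: LanguageModel, alpha: Double)
  extends LanguageModel {
  val vocab: Set[String] = main.vocab
  val order: Int = main.order
  def probability(word: String, history: String*): Double =
    alpha * main.probability(word, history: _*) +
      (1.0 - alpha) * backoff.probability(word, history: _*)
}

// Interpolate the bigram model with a uniform model and sweep alpha.
val uniform = UniformLM(train.toSet)
for (alpha <- Seq(0.0, 0.25, 0.5, 0.75, 1.0))
  println(s"alpha=$alpha perplexity=${perplexity(new InterpolatedLM(bigram, uniform, alpha), test)}")
```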
Instead of combining the probabilities for all words given a context, it makes sense to back off only when no counts for a given event are available, and to rely on the available counts where possible.
A particularly simple, if not to say stupid, backoff method is Stupid Backoff. Let $w$ be a word and $h_n = w_{i-n},\ldots,w_{i-1}$ an n-gram history of length $n$; then

$$
p_{\text{Stupid}}(w \mid h_n) =
\begin{cases}
\dfrac{\#_{\mathcal{D}}(h_n, w)}{\#_{\mathcal{D}}(h_n)} & \text{if } \#_{\mathcal{D}}(h_n, w) > 0 \\[1ex]
\alpha \, p_{\text{Stupid}}(w \mid h_{n-1}) & \text{otherwise}
\end{cases}
$$

where $h_{n-1}$ drops the oldest word from $h_n$, and the recursion bottoms out at the unigram relative frequency.
It turns out that Stupid Backoff is very effective in extrinsic evaluations, but it does not represent a valid probability distribution: when you sum the scores of all words given a history, the result may be larger than 1. This is because the main n-gram model's probabilities for the words with non-zero counts already sum to 1, and the backoff terms for the remaining words add further mass. The fact that the scores can sum to more than 1 makes perplexity values meaningless. The code below illustrates the problem.
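A sketch of Stupid Backoff on top of the classes above, followed by a sum of its scores over the vocabulary for a single history; the scaling factor, the backoff model and the chosen history are illustrative.

```scala
// Stupid Backoff: use the relative n-gram frequency when the n-gram has been
// seen, otherwise back off to a shorter history scaled by a constant alpha.
// The result is a score, not a normalised probability.
class StupidBackoffLM(main: NGramLM, backoff: LanguageModel, alpha: Double)
  extends LanguageModel {
  val vocab: Set[String] = main.vocab
  val order: Int = main.order
  def probability(word: String, history: String*): Double =
    if (main.counts(word, history) > 0)
      main.counts(word, history) / main.norm(history)
    else
      alpha * backoff.probability(word, history.drop(1): _*)
}

// Summing the scores of all words given one history typically exceeds 1:
// the maximum-likelihood probabilities of the seen words already sum to 1,
// and the backoff terms for the unseen words add extra mass on top.
val stupid = new StupidBackoffLM(bigram, unigram, 0.4)
val someHistory = Seq("the") // any frequent word will do as a bigram history
stupid.vocab.toSeq.map(word => stupid.probability(word, someHistory: _*)).sum
```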
There are several "proper backoff models" that do not have this problem, e.g. the Katz backoff method. We refer to other material below for a deeper discussion of these.