Language models (LMs) calculate the probability of seeing a given sequence of words.
There are several use cases for such models:
Without loss of generality, we can decompose the probability of a word sequence $w_1,\ldots,w_d$ using the chain rule:

$$
p(w_1,\ldots,w_d) = \prod_{i=1}^{d} p(w_i \mid w_1,\ldots,w_{i-1})
$$

We therefore only need to model how a word is generated based on its history of previous words.
In practice it is common to define language models based on equivalence classes of histories.
The most common type of equivalence class relies on truncating histories to length $n-1$:

$$
p(w_i \mid w_1,\ldots,w_{i-1}) \approx p(w_i \mid w_{i-n+1},\ldots,w_{i-1})
$$
The simplest n-gram language models are unigram models ($n=1$). That is, they model the conditional probability of a word using the prior probability of seeing that word:

$$
p(w_i \mid w_1,\ldots,w_{i-1}) \approx p(w_i)
$$
Given a vocabulary of words $\mathcal{V}$, the uniform LM is defined as:

$$
p(w_i \mid w_1,\ldots,w_{i-1}) = \frac{1}{|\mathcal{V}|}
$$
Let us "train" and test such a language model on the OHHLA corpus. First we need to load this corpus. Below we focus on a subset to make our code more responsive and to allow us to test models more quickly.
We can now create a uniform language model using a built-in constructor. Language models in this book implement the LanguageModel
trait.
Let us implement a uniform LM using this trait.
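A minimal sketch of what such a trait and a uniform LM could look like; the member names (`vocab`, `order`, `probability`) are assumptions rather than a fixed API.

```scala
// A minimal LanguageModel trait (member names are illustrative): an LM knows
// its vocabulary, its order (history length + 1), and assigns a probability
// to a word given a history.
trait LanguageModel {
  def vocab: Set[String]
  def order: Int
  def probability(word: String, history: String*): Double
}

// A uniform LM spreads probability mass evenly over the vocabulary and
// ignores the history completely.
case class UniformLM(vocab: Set[String]) extends LanguageModel {
  val order = 1
  def probability(word: String, history: String*): Double =
    if (vocab(word)) 1.0 / vocab.size else 0.0
}
```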
The quality of an LM can often be gauged by looking at its samples, but models with poorer samples can still be useful.
You can sample a sequence word by word, drawing the current word conditioned on the previously sampled ones.
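A sketch of such a sampler, assuming the `LanguageModel` trait above; it walks the cumulative distribution over the vocabulary for each position.

```scala
import scala.util.Random

// Sample a sequence word by word: each new word is drawn from the LM's
// distribution conditioned on the previously sampled words.
def sampleSequence(lm: LanguageModel, length: Int, rng: Random = new Random()): Seq[String] = {
  def sampleWord(history: Seq[String]): String = {
    val r = rng.nextDouble()
    val words = lm.vocab.toSeq
    // cumulative distribution over the vocabulary given the history
    val cdf = words.scanLeft(0.0)((acc, w) => acc + lm.probability(w, history: _*)).tail
    val index = cdf.indexWhere(_ >= r)
    // fall back to an arbitrary word if the distribution carries no mass
    // (e.g. an unseen history) or rounding leaves the cdf slightly below 1
    if (index >= 0) words(index) else words.last
  }
  (1 to length).foldLeft(Seq.empty[String]) { (sampled, _) =>
    sampled :+ sampleWord(sampled.takeRight(lm.order - 1))
  }
}
```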
Most important for LM quality is its impact on downstream tasks, measured in extrinsic evaluations. These can be expensive to run, hence intrinsic evaluations are a useful proxy.
The perplexity of an LM on a sample $w_1,\ldots,w_T$ is a measure of intrinsic quality:

$$
\mathrm{PP}(w_1,\ldots,w_T) = p(w_1,\ldots,w_T)^{-\frac{1}{T}} = \left( \prod_{i=1}^{T} \frac{1}{p(w_i \mid w_1,\ldots,w_{i-1})} \right)^{\frac{1}{T}}
$$
We can implement a perplexity function based on the LanguageModel
interface.
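A sketch of such a function under the same assumed trait; it works in log space to avoid numerical underflow and truncates each history to the model's order.

```scala
// Perplexity of an LM on a sequence: the inverse probability of the data,
// normalised by the number of words.
def perplexity(lm: LanguageModel, data: Seq[String]): Double = {
  val logProbs = data.indices.map { i =>
    val history = data.slice(math.max(0, i - lm.order + 1), i)
    math.log(lm.probability(data(i), history: _*))
  }
  math.exp(-logProbs.sum / data.size)
}
```

A single zero-probability word drives the log-probability sum to negative infinity and hence the perplexity to infinity.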
Let's see how the uniform model does on our test set.
The model assigns probability 0 to test words that are not in the training vocabulary, which makes the perplexity infinite.
A few words appear very frequently, while a long tail of words each appear only a few times but account for a large share of the text in aggregate.
Let us observe this phenomenon for our data: we will rank the words according to their frequency, and plot this frequency against the rank.
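A sketch of how the rank–frequency pairs could be computed, assuming `train` holds the tokenised training corpus; the plotting itself is omitted.

```scala
// Count word frequencies and sort in decreasing order of frequency.
val frequencies: Seq[(String, Int)] =
  train.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }
    .toSeq.sortBy(-_._2)

// Pair each word's rank (1-based) with its frequency; plotting log-rank
// against log-frequency should give a roughly straight line under Zipf's Law.
val rankVsFreq: Seq[(Int, Int)] =
  frequencies.zipWithIndex.map { case ((_, freq), index) => (index + 1, freq) }

rankVsFreq.take(10).foreach { case (rank, freq) => println(s"$rank\t$freq") }
```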
Zipf's Law states that the frequency $f$ of a word is inversely proportional to its rank $r$ in the frequency-ordered vocabulary:

$$
f \propto \frac{1}{r}
$$
The long tail of infrequent words is a problem for LMs and NLP in general. It gets even worse for higher-order n-grams, whose counts are sparser still.
One solution is to replace unseen words with an OOV
token in the test set.
To get probability estimates for these tokens we can replace the first encounter of each word in the training set with OOV
.
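A sketch of the two replacement steps; the helper names and the `OOV` token string are placeholders.

```scala
val OOV = "[OOV]"

// Replace the first occurrence of each word in the training data with the
// OOV token, so the model learns a probability estimate for unseen words.
def injectOOVs(oov: String, words: Seq[String]): Seq[String] = {
  val seen = scala.collection.mutable.HashSet.empty[String]
  words.map { word =>
    if (seen(word)) word else { seen += word; oov }
  }
}

// Replace words outside the given vocabulary with the OOV token (for test data).
def replaceOOVs(oov: String, vocab: Set[String], words: Seq[String]): Seq[String] =
  words.map(word => if (vocab(word)) word else oov)
```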
Now we can apply this to our training and test set, and create a new uniform model.
Let us define a parametrized language model $p_\theta(w \mid h)$ with parameters $\theta$.
Training an n-gram LM amounts to estimating $\theta$ from some training set $\mathcal{D}$. One way to do this is to choose the $\theta$ that maximizes the log-likelihood of $\mathcal{D}$:

$$
\theta^{*} = \arg\max_{\theta} \log p_\theta(\mathcal{D})
$$
As it turns out, this maximum-log-likelihood estimate (MLE) can be calculated in closed form, simply by counting:

$$
p_{\theta^{*}}(w \mid h) = \frac{\#_{\mathcal{D}}(h, w)}{\#_{\mathcal{D}}(h)}
$$

where $\#_{\mathcal{D}}(e)$ denotes the number of times the event $e$ appears in the training set $\mathcal{D}$.
Many LMs can be implemented by estimating the counts in the numerator and denominator differently, so let's define a corresponding trait.

Let us use this to code up a generic n-gram model.
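The trait and the generic model might look as follows; the names `CountLM` and `NGramLM`, and the (ignored) handling of sequence starts, are simplifications.

```scala
// A count-based LM: the probability is a numerator count for the
// (history, word) event divided by a normaliser count for the history.
trait CountLM extends LanguageModel {
  def counts(word: String, history: Seq[String]): Double
  def norm(history: Seq[String]): Double
  def probability(word: String, history: String*): Double = {
    val normaliser = norm(history)
    if (normaliser == 0.0) 0.0 else counts(word, history) / normaliser
  }
}

// A maximum-likelihood n-gram model: numerator counts are raw n-gram counts,
// the normaliser is the count of the history ((n-1)-gram). For brevity this
// sketch does not pad the start of sequences.
class NGramLM(train: Seq[String], val order: Int) extends CountLM {
  val vocab: Set[String] = train.toSet
  val ngramCounts: Map[Seq[String], Int] =
    train.sliding(order).toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
  val historyCounts: Map[Seq[String], Int] =
    if (order == 1) Map(Seq.empty[String] -> train.size)
    else train.sliding(order - 1).toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
  def counts(word: String, history: Seq[String]): Double =
    ngramCounts.getOrElse(history :+ word, 0).toDouble
  def norm(history: Seq[String]): Double =
    historyCounts.getOrElse(history, 0).toDouble
}
```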
Let us train a unigram model.
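Hypothetical usage, assuming `train` and `test` are the OOV-processed token sequences and that the sketches above are in scope:

```scala
val unigram = new NGramLM(train, 1)
perplexity(unigram, test)
```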
The unigram LM has substantially reduced (and hence better) perplexity:
Let us also look at the language the unigram LM generates.
The bigram model conditions the probability of the current word on the previous word: $p(w_i \mid w_1,\ldots,w_{i-1}) \approx p(w_i \mid w_{i-1})$.
Let us see how the bigram LM generates language.
Does the bigram model improve perplexity?
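Again as hypothetical usage of the sketches above: build a bigram model, sample from it, and measure its perplexity on the test data.

```scala
val bigram = new NGramLM(train, 2)
sampleSequence(bigram, 20) // generate some language
perplexity(bigram, test)   // evaluate intrinsically
```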
While every word has been seen, it may not have been seen in every context.
The general problem is that maximum likelihood estimates will always underestimate the true probability of some words, and in turn overestimate the (context-dependent) probabilities of others. To overcome this issue we aim to smooth the probabilities and move mass from seen events to unseen events.
Laplace smoothing adds pseudo-counts $\alpha$ to each event in the dataset:

$$
p_{\alpha}(w \mid h) = \frac{\#_{\mathcal{D}}(h, w) + \alpha}{\#_{\mathcal{D}}(h) + \alpha |\mathcal{V}|}
$$
Let us implement this in Scala.
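A sketch built on the illustrative `CountLM`/`NGramLM` classes above; the pseudo-count value in the usage line is arbitrary.

```scala
// Laplace (add-alpha) smoothing: add a pseudo-count alpha to every numerator
// count and alpha * |V| to the normaliser.
class LaplaceLM(base: NGramLM, alpha: Double) extends CountLM {
  val vocab: Set[String] = base.vocab
  val order: Int = base.order
  def counts(word: String, history: Seq[String]): Double =
    base.counts(word, history) + alpha
  def norm(history: Seq[String]): Double =
    base.norm(history) + alpha * vocab.size
}

perplexity(new LaplaceLM(bigram, 0.1), test)
```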
This should give a better perplexity value:
It is useful to think of smoothing as moving probability mass from seen to unseen events, and hence as adjusting the counts in the numerator.
Let us reformulate the Laplace LM using adjusted counts:

$$
c_{\alpha}(h, w) = \frac{(\#_{\mathcal{D}}(h, w) + \alpha) \, \#_{\mathcal{D}}(h)}{\#_{\mathcal{D}}(h) + \alpha |\mathcal{V}|}
$$

Note that since there are histories with count 0, we need to increase the original denominator $\#_{\mathcal{D}}(h)$ by a small $\epsilon$ to avoid division by zero.
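A sketch of this reformulation under the same assumptions; the class name and the default $\epsilon$ are illustrative.

```scala
// Laplace smoothing via adjusted counts: dividing the adjusted count
// c_alpha(h, w) = (#(h,w) + alpha) * #(h) / (#(h) + alpha * |V|)
// by the original normaliser #(h) recovers the Laplace probability.
// A small eps keeps the normaliser non-zero for unseen histories.
class AdjustedLaplaceLM(base: NGramLM, alpha: Double, eps: Double = 1e-10) extends CountLM {
  val vocab: Set[String] = base.vocab
  val order: Int = base.order
  def counts(word: String, history: Seq[String]): Double = {
    val historyCount = base.norm(history)
    (base.counts(word, history) + alpha) * historyCount / (historyCount + alpha * vocab.size)
  }
  def norm(history: Seq[String]): Double = base.norm(history) + eps
}
```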
Can we test more generally whether our adjusted counts are sensible?
We can compare the adjusted counts to the counts observed in a held-out set:
| train-count | test-count | smoothed-count |
|---|---|---|
| 0.0 | 0.003623312785682409 | 0.005007694407734 |
| 1.0 | 0.4578539563277765 | 0.3189338695509523 |
| 2.0 | 1.2009237875288683 | 0.7719376450920822 |
| 3.0 | 1.8906666666666667 | 1.2309497704843773 |
| 4.0 | 3.0714285714285716 | 1.6623618386822872 |
| 5.0 | 4.366336633663367 | 2.124690216711694 |
| 6.0 | 4.068965517241379 | 3.0709006718308984 |
| 7.0 | 5.365384615384615 | 3.7380632119752257 |
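One way such a comparison could be computed, as a rough sketch under the assumptions above: group bigrams by their training count, then average their held-out counts and their adjusted counts per group. The function name, the held-out split and the use of the `AdjustedLaplaceLM` sketch are all illustrative.

```scala
// For each training count c, report (c, average held-out count, average
// adjusted count) over all n-grams with training count c.
def countComparison(trainData: Seq[String], heldOut: Seq[String],
                    smoothed: CountLM, order: Int = 2): Seq[(Double, Double, Double)] = {
  def ngramCounts(data: Seq[String]): Map[Seq[String], Int] =
    data.sliding(order).toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
  val trainCounts = ngramCounts(trainData)
  val heldOutCounts = ngramCounts(heldOut)
  val allNGrams = (trainCounts.keySet ++ heldOutCounts.keySet).toSeq
  allNGrams
    .groupBy(ngram => trainCounts.getOrElse(ngram, 0))
    .toSeq.sortBy(_._1)
    .map { case (trainCount, ngrams) =>
      val avgHeldOut  = ngrams.map(n => heldOutCounts.getOrElse(n, 0).toDouble).sum / ngrams.size
      val avgAdjusted = ngrams.map(n => smoothed.counts(n.last, n.init)).sum / ngrams.size
      (trainCount.toDouble, avgHeldOut, avgAdjusted)
    }
}
```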
When we don't have reliable $n$-gram counts, we should fall back to $(n-1)$-gram counts.

A simple technique to use the $(n-1)$-gram statistics is interpolation. Here we compose the probability of a word as the weighted sum of the probability under an $n$-gram model $p'$ and a back-off model $p''$:

$$
p_{\alpha}(w \mid h) = \alpha \, p'(w \mid h) + (1 - \alpha) \, p''(w \mid h)
$$
Let us interpolate between a bigram and a uniform language model, varying the interpolation parameter $\alpha$.
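A sketch of such an interpolated model, built on the trait and helpers above; the grid of $\alpha$ values is arbitrary.

```scala
// Linear interpolation: p(w|h) = alpha * p_main(w|h) + (1 - alpha) * p_backoff(w|h).
class InterpolatedLM(main: LanguageModel, backoff: LanguageModel, alpha: Double)
  extends LanguageModel {
  val vocab: Set[String] = main.vocab
  val order: Int = main.order
  def probability(word: String, history: String*): Double =
    alpha * main.probability(word, history: _*) +
      (1.0 - alpha) * backoff.probability(word, history: _*)
}

// Interpolate the bigram model with a uniform model and sweep alpha.
val uniform = UniformLM(train.toSet)
for (alpha <- Seq(0.0, 0.25, 0.5, 0.75, 1.0))
  println(s"alpha=$alpha perplexity=${perplexity(new InterpolatedLM(bigram, uniform, alpha), test)}")
```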
Instead of combining the probabilities for all words given a context, it makes sense to back off only when no counts for a given event are available, and to rely on the available counts where possible.
A particularly simple, if not to say stupid, backoff method is Stupid Backoff. Let $w$ be a word and $h_n = w_{i-n},\ldots,w_{i-1}$ an n-gram history of length $n$; then

$$
p_{\text{Stupid}}(w \mid h_n) =
\begin{cases}
\dfrac{\#_{\mathcal{D}}(h_n, w)}{\#_{\mathcal{D}}(h_n)} & \text{if } \#_{\mathcal{D}}(h_n, w) > 0 \\[1ex]
\alpha \, p_{\text{Stupid}}(w \mid h_{n-1}) & \text{otherwise}
\end{cases}
$$

where $h_{n-1}$ drops the oldest word from $h_n$, and the recursion bottoms out at the unigram relative frequency.
It turns out that Stupid Backoff is very effective in extrinsic evaluations, but it does not represent a valid probability distribution: when you sum the scores of all words given a history, the result may be larger than 1. This is because the main n-gram model's probabilities for the words with non-zero counts already sum to 1, and the backoff terms for the remaining words add further mass. The fact that the scores can sum to more than 1 makes perplexity values meaningless. The code below illustrates the problem.
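A sketch of Stupid Backoff on top of the classes above, followed by a sum of its scores over the vocabulary for a single history; the scaling factor, the backoff model and the chosen history are illustrative.

```scala
// Stupid Backoff: use the relative n-gram frequency when the n-gram has been
// seen, otherwise back off to a shorter history scaled by a constant alpha.
// The result is a score, not a normalised probability.
class StupidBackoffLM(main: NGramLM, backoff: LanguageModel, alpha: Double)
  extends LanguageModel {
  val vocab: Set[String] = main.vocab
  val order: Int = main.order
  def probability(word: String, history: String*): Double =
    if (main.counts(word, history) > 0)
      main.counts(word, history) / main.norm(history)
    else
      alpha * backoff.probability(word, history.drop(1): _*)
}

// Summing the scores of all words given one history typically exceeds 1:
// the maximum-likelihood probabilities of the seen words already sum to 1,
// and the backoff terms for the unseen words add extra mass on top.
val stupid = new StupidBackoffLM(bigram, unigram, 0.4)
val someHistory = Seq("the") // any frequent word will do as a bigram history
stupid.vocab.toSeq.map(word => stupid.probability(word, someHistory: _*)).sum
```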
There are several "proper backoff models" that do not have this problem, e.g. the Katz backoff method. We refer to other material below for a deeper discussion of these.