Machine Translation (MT) is one of the most widely used NLP applications. It is an example of end-to-end NLP, and MT-like architectures can be found in many other applications.
We focus on word-based MT, which is no longer the state of the art but serves as a foundation and blueprint for modern approaches.
The structured prediction recipe applies: define a parametrised model $s_{\boldsymbol{\theta}}(\mathbf{t},\mathbf{s})$ measuring how well a target sentence $\mathbf{t}$ matches a source sentence $\mathbf{s}$, learn the parameters $\boldsymbol{\theta}$ from training data, and find $\mathbf{t}^* = \arg\max_{\mathbf{t}} s_{\boldsymbol{\theta}}(\mathbf{t},\mathbf{s})$ at test time.
A generative process (the speaker's brain): a target sentence $\mathbf{t}$ is first generated from a language model $p_{\boldsymbol{\theta}}(\mathbf{t})$ and then passed through a noisy channel $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})$ that turns it into the observed source sentence $\mathbf{s}$.
For a graphical view on the noisy channel see my slides.
The distribution over target sentences $p_{\boldsymbol{\theta}}(\mathbf{t})$ and the channel noise model $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})$ define a joint distribution $p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{t}) = p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{t})$. To use it for translation we operate it backwards: $\mathbf{t}^* = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t} \mid \mathbf{s}) = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{t})$.
For the structured prediction recipe this means setting $s_{\boldsymbol{\theta}}(\mathbf{t},\mathbf{s}) = p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{t})$.
In Machine Translation, $p_{\boldsymbol{\theta}}(\mathbf{t})$ is called the language model and $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})$ the translation model.
Here we focus on translation models.
The most straightforward translation model translates words one-by-one, in the order of appearance: $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t}) = \prod_{i=1}^{l_{\mathbf{s}}} \beta(s_i \mid t_i)$, where $\beta(s \mid t)$ is the probability of translating target word $t$ as source word $s$ (this assumes source and target sentences have the same length).
Using parallel data of paired source and target sentences, we can train the model using the Maximum Likelihood Estimator: $\beta(s \mid t) = \frac{\text{count}(t,s)}{\text{count}(t)}$.
Here $\text{count}(t,s)$ is the number of times we see target word $t$ translated as source word $s$, and $\text{count}(t)$ the number of times we see the target word $t$ in total.
Let us prepare some toy data to show how to train this naive model.
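The original data cell is not reproduced here; the following is a minimal stand-in, a few same-length German–English pairs chosen to match the examples used later in this chapter.

```scala
// Hypothetical toy parallel corpus (a stand-in, not the notebook's original data).
// Each pair has the same length and word-for-word correspondence, as the naive model requires.
case class SentencePair(source: IndexedSeq[String], target: IndexedSeq[String])

val toyData = IndexedSeq(
  SentencePair(IndexedSeq("ein", "Mann", "ist", "groß"), IndexedSeq("a", "man", "is", "tall")),
  SentencePair(IndexedSeq("der", "Mann", "ist", "groß"), IndexedSeq("the", "man", "is", "big")),
  SentencePair(IndexedSeq("mein", "Haus", "ist", "klein"), IndexedSeq("my", "house", "is", "small"))
)
```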
We can train the naive model as follows.
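A possible implementation, sketched under the assumption of the SentencePair representation above: it counts position-wise word pairs and normalizes by the target word counts.

```scala
// Sketch of the MLE for the naive model: beta(s|t) = count(t, s) / count(t),
// where counts are taken over position-wise (target, source) word pairs.
def trainNaive(data: Seq[SentencePair]): Map[(String, String), Double] = {
  val pairCounts = data
    .flatMap(pair => pair.target.zip(pair.source))   // (t, s) pairs, position by position
    .groupBy(identity)
    .map { case (ts, occ) => ts -> occ.size.toDouble }
  val targetCounts = data
    .flatMap(_.target)
    .groupBy(identity)
    .map { case (t, occ) => t -> occ.size.toDouble }
  pairCounts.map { case ((t, s), count) => (t, s) -> count / targetCounts(t) } // beta keyed by (t, s)
}
```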
Let us train on the toy dataset:
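Running the sketch above on the stand-in corpus:

```scala
// Train on the toy data and print the estimated translation table, highest probabilities first.
val naiveBeta = trainNaive(toyData)
naiveBeta.toSeq.sortBy(-_._2).foreach { case ((t, s), p) =>
  println(f"beta($s%-6s | $t%-6s) = $p%.2f")
}
```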
Translation is often called decoding and means finding $\mathbf{t}^* = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t} \mid \mathbf{s}) = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})$.
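With a uniform language model this reduces to picking, for each source word independently, the target word with the highest translation probability. A small sketch, assuming the naiveBeta table trained above:

```scala
// Naive decoding: independently choose argmax_t beta(s|t) for every source word.
def decodeNaive(source: Seq[String], beta: Map[(String, String), Double]): Seq[String] =
  source.map { s =>
    beta.collect { case ((t, src), p) if src == s => t -> p } // candidate target words for s
        .maxBy(_._2)._1
  }

decodeNaive(Seq("mein", "Haus", "ist", "klein"), naiveBeta) // on the toy data: my house is small
```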
The naive model is broken in several ways. Most severely, it ignores the fact that word order can differ and still yield (roughly) the same meaning.
IBM Model 2 is one of the most influential translation models, even though these days it is only indirectly used in actual MT systems, for example to initialize translation and alignment models.
The core difference of Model 2 is the introduction of latent auxiliary variables, the alignments $\mathbf{a} = (a_1,\ldots,a_{l_{\mathbf{s}}})$: the alignment $a_i = j$ indicates that source word $s_i$ was translated from target word $t_j$.
Below you see a simple example of an alignment.
Model 2 defines a distribution $p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})$. A translation model can be derived by marginalizing out the alignments: $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t}) = \sum_{\mathbf{a}} p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})$.
IBM Model 2 has two sets of parameters: the translation table $\boldsymbol{\beta}$ with word-to-word translation probabilities $\beta(s \mid t)$, and the distortion table $\boldsymbol{\alpha}$ with probabilities $\alpha(j \mid i, l_{\mathbf{s}}, l_{\mathbf{t}})$ of aligning source position $i$ to target position $j$, given the source length $l_{\mathbf{s}}$ and target length $l_{\mathbf{t}}$.
Model 2 defines a conditional distribution over source sentences and alignments, conditioned on a target sentence and a desired source sentence length: $p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t}, l_{\mathbf{s}}) = \prod_{i=1}^{l_{\mathbf{s}}} \alpha(a_i \mid i, l_{\mathbf{s}}, l_{\mathbf{t}})\,\beta(s_i \mid t_{a_i})$.
Training Model 2 is challenging: the alignments $\mathbf{a}$ are latent, that is, they are not observed in the training data, so we cannot simply count word-to-word translations as we did for the naive model.
The Expectation Maximization (EM) Algorithm is a standard training algorithm for partially observed data. It maximizes a lower bound of the log-likelihood: for any distribution $q(\mathbf{a})$ over the latent alignments, $\log p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t}) \geq \sum_{\mathbf{a}} q(\mathbf{a}) \log \frac{p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})}{q(\mathbf{a})}$.
EM can be seen as block coordinate ascent on this bound.
EM for Model 2 iterates between an E-Step, which computes a distribution $q(\mathbf{a})$ over alignments given the current parameters, and an M-Step, which re-estimates the parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ from the (soft) alignments produced by the E-Step.
First we create some toy data:
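Again the original data cell is not shown; as a stand-in we reuse the SentencePair type from above and prepend the special NULL token to every target sentence, so that source words can remain "unaligned" by aligning to NULL.

```scala
// Hypothetical training data for the IBM models (a stand-in). Every target sentence
// starts with NULL; "ist"/"is" occurs in every pair, other words only in some pairs.
val trainingData = IndexedSeq(
  SentencePair(IndexedSeq("ein", "Mann", "ist", "groß"), IndexedSeq("NULL", "a", "man", "is", "tall")),
  SentencePair(IndexedSeq("der", "Mann", "ist", "groß"), IndexedSeq("NULL", "the", "man", "is", "big")),
  SentencePair(IndexedSeq("mein", "Haus", "ist", "klein"), IndexedSeq("NULL", "my", "house", "is", "small")),
  SentencePair(IndexedSeq("das", "Haus", "ist", "groß"), IndexedSeq("NULL", "the", "house", "is", "big"))
)
```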
We can now implement the E-Step.
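A sketch of the E-Step under the representation above: for every sentence pair $n$ and source position $i$ it computes $q_n(a_i = j) \propto \alpha(j \mid i, l_{\mathbf{s}}, l_{\mathbf{t}})\,\beta(s_i \mid t_j)$, normalized over the target positions $j$. The function signatures are assumptions, not the notebook's originals.

```scala
// E-Step: returns q, where q(n)(i)(j) is the posterior probability that source
// word i of sentence pair n is aligned to target word j.
def eStep(data: Seq[SentencePair],
          beta: (String, String) => Double,        // beta(s | t)
          alpha: (Int, Int, Int, Int) => Double    // alpha(j | i, lSource, lTarget)
         ): Seq[IndexedSeq[IndexedSeq[Double]]] =
  data.map { pair =>
    val (lS, lT) = (pair.source.length, pair.target.length)
    pair.source.indices.map { i =>
      val unnormalized =
        pair.target.indices.map(j => alpha(j, i, lS, lT) * beta(pair.source(i), pair.target(j)))
      val z = unnormalized.sum
      unnormalized.map(p => if (z > 0.0) p / z else 0.0)
    }
  }
```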
Let us run this code:
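One way to run it, with uniform distortions and a nearly uniform translation table that only slightly prefers $\beta(\text{ist} \mid \text{is})$ (the concrete numbers are arbitrary):

```scala
// Uniform distortions and an almost-uniform initial translation table.
val uniformAlpha = (j: Int, i: Int, lS: Int, lT: Int) => 1.0 / lT
val initBeta = (s: String, t: String) => if (s == "ist" && t == "is") 0.2 else 0.1

val q = eStep(trainingData, initBeta, uniformAlpha)
// posterior alignment distribution of the first source word of the first sentence pair
println(q(0)(0))
```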
You can play around with the initialization of $\boldsymbol{\beta}$ to see how the alignments react to changes of the word-to-word translation probabilities.
In the M-Step the parameters are re-estimated from the soft counts of the E-Step; for the translation table this amounts to $\beta(s \mid t) \propto \sum_{n} \sum_{i,j} q_n(a_i = j)\,\delta(s^{(n)}_i, s)\,\delta(t^{(n)}_j, t)$, where $\delta(x,y)$ is 1 if $x = y$ and 0 otherwise.
Let us implement the M-Step now. In this step we estimate the parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ from a given set of (soft) alignments $q$.
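A sketch of the M-Step for the translation table (the distortion table is re-estimated analogously; see the full Model 2 EM sketch further below). Again the names and types are assumptions.

```scala
import scala.collection.mutable

// M-Step: turn the soft alignments q into expected counts and normalize them,
// beta(s|t) = softCount(t, s) / softCount(t).
def mStep(data: Seq[SentencePair],
          q: Seq[IndexedSeq[IndexedSeq[Double]]]): Map[(String, String), Double] = {
  val pairCounts = mutable.Map.empty[(String, String), Double].withDefaultValue(0.0)
  val targetCounts = mutable.Map.empty[String, Double].withDefaultValue(0.0)
  for ((pair, qn) <- data.zip(q); i <- pair.source.indices; j <- pair.target.indices) {
    pairCounts((pair.target(j), pair.source(i))) += qn(i)(j)
    targetCounts(pair.target(j)) += qn(i)(j)
  }
  pairCounts.map { case ((t, s), c) => (t, s) -> c / targetCounts(t) }.toMap
}
```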
Let us run one M-Step on the alignments we estimated earlier.
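Running one M-Step on the soft alignments from above:

```scala
// One M-Step on the q estimated earlier; print the most confident entries.
val betaAfterOneStep = mStep(trainingData, q)
betaAfterOneStep.toSeq.sortBy(-_._2).take(5).foreach { case ((t, s), p) =>
  println(f"beta($s%-6s | $t%-6s) = $p%.2f")
}
```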
Notice that the algorithm has already figured out that "is" is most likely translated to "ist": "ist" is (softly) aligned with "is" in every sentence, whereas other German words only appear in a subset of the sentences.
Due to non-convexity of the EM bound, good initialization is crucial.
We can train IBM Model 1, which fixes the distortion parameters to be uniform over the target positions: $\alpha(j \mid i, l_{\mathbf{s}}, l_{\mathbf{t}}) = \frac{1}{l_{\mathbf{t}}}$.
After training, the resulting translation table $\boldsymbol{\beta}$ can be used to initialize Model 2.
EM for IBM Model 1 converges to a global optimum.
Let us train IBM Model 1 now. This amounts to using our previous eStep and mStep methods, keeping the distortion parameters fixed to uniform values and only re-estimating the translation table $\boldsymbol{\beta}$.
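A possible training loop under the sketches above:

```scala
// EM for IBM Model 1: alternate eStep and mStep with distortions fixed to 1/lT.
def uniformDistortion(j: Int, i: Int, lS: Int, lT: Int): Double = 1.0 / lT

def trainModel1(data: Seq[SentencePair], iterations: Int): Map[(String, String), Double] = {
  var beta = Map.empty[(String, String), Double].withDefaultValue(1.0) // flat (unnormalized) start
  for (_ <- 1 to iterations) {
    val q = eStep(data, (s, t) => beta((t, s)), uniformDistortion)
    beta = mStep(data, q).withDefaultValue(0.0)
  }
  beta
}

val model1Beta = trainModel1(trainingData, iterations = 20)
```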
You can see below that the alignments converge relatively quickly.
Let us have a look at the translation table.
We can also inspect the alignments generated during EM.
Now that we have a reasonable initial model we can use it to initialize EM for IBM Model 2. Here is the EM code in full.
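The original code cell is not reproduced; the sketch below captures the same idea, re-estimating both tables in each iteration and starting from the Model 1 translation table (model1Beta from the sketch above).

```scala
import scala.collection.mutable

// EM for IBM Model 2: the E-Step is as before; the M-Step now also re-estimates
// the distortion table alpha(j | i, lS, lT) from the same soft counts.
def trainModel2(data: Seq[SentencePair],
                initBeta: Map[(String, String), Double],
                iterations: Int): (Map[(String, String), Double], Map[(Int, Int, Int, Int), Double]) = {
  var beta = initBeta.withDefaultValue(0.0)
  var alpha = Map.empty[(Int, Int, Int, Int), Double]  // empty means "still uniform"
  def alphaFun(j: Int, i: Int, lS: Int, lT: Int): Double = alpha.getOrElse((j, i, lS, lT), 1.0 / lT)

  for (_ <- 1 to iterations) {
    val q = eStep(data, (s, t) => beta((t, s)), alphaFun)
    beta = mStep(data, q).withDefaultValue(0.0)        // translation table update, as before
    // distortion update: normalized expected counts of target position j given (i, lS, lT)
    val jointCounts = mutable.Map.empty[(Int, Int, Int, Int), Double].withDefaultValue(0.0)
    val marginCounts = mutable.Map.empty[(Int, Int, Int), Double].withDefaultValue(0.0)
    for ((pair, qn) <- data.zip(q); i <- pair.source.indices; j <- pair.target.indices) {
      val (lS, lT) = (pair.source.length, pair.target.length)
      jointCounts((j, i, lS, lT)) += qn(i)(j)
      marginCounts((i, lS, lT)) += qn(i)(j)
    }
    alpha = jointCounts.map { case ((j, i, lS, lT), c) => (j, i, lS, lT) -> c / marginCounts((i, lS, lT)) }.toMap
  }
  (beta, alpha)
}

val (model2Beta, model2Alpha) = trainModel2(trainingData, model1Beta, iterations = 20)
```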
Initializing with the IBM Model 1 result gives us:
For alignments we get:
Let us look at the distortion probabilities for a given source position and source and target lengths.
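For example, one can print the learned distortion distribution for a fixed source position and length pair (the positions and lengths below refer to the stand-in data, where targets have length 5 including NULL):

```scala
// Distortion probabilities alpha(j | i = 2, lS = 4, lT = 5): where does the third
// source word tend to align in 4-word sources with 5-word (NULL-prefixed) targets?
model2Alpha.collect { case ((j, 2, 4, 5), p) => j -> p }
  .toSeq.sortBy(_._1)
  .foreach { case (j, p) => println(f"alpha(j=$j | i=2, lS=4, lT=5) = $p%.2f") }
```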
Decoding IBM Model 2 ideally means finding $\arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t} \mid \mathbf{s}) = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t}) \sum_{\mathbf{a}} p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})$.
This nested argmax/sum is generally computationally very hard. It is easier to jointly maximize over targets and alignments: $\arg\max_{\mathbf{t},\mathbf{a}} p_{\boldsymbol{\theta}}(\mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})$.
A simplified same-length decoder in Scala, using a beam:
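The original decoder code is not reproduced here; the sketch below follows the same idea under a few assumptions: the hypothesis starts with NULL, every appended target word is aligned with exactly one still-uncovered source position (hence "same length"), the distortion probabilities are ignored for brevity, and a beam keeps only the best partial hypotheses per step.

```scala
// A simplified same-length beam decoder (a sketch; the interface names are assumptions).
case class Hypothesis(target: List[String],          // target words so far, newest first
                      covered: Map[Int, String],     // covered source position -> its source word
                      logProb: Double)

def beamDecode(source: IndexedSeq[String],
               targetVocab: Seq[String],
               beta: (String, String) => Double,      // beta(s | t)
               lm: (String, List[String]) => Double,  // p(next target word | history)
               beamSize: Int = 2): Hypothesis = {
  var beam = Seq(Hypothesis(List("NULL"), Map.empty, 0.0))
  for (_ <- source.indices) {
    val expanded = for {
      hyp <- beam
      word <- targetVocab                               // candidate next target word
      i <- source.indices if !hyp.covered.contains(i)   // uncovered source position it aligns to
    } yield Hypothesis(
      word :: hyp.target,
      hyp.covered + (i -> source(i)),
      hyp.logProb + math.log(beta(source(i), word)) + math.log(lm(word, hyp.target))
    )
    beam = expanded.sortBy(-_.logProb).take(beamSize)   // prune to the beam size
  }
  beam.maxBy(_.logProb)
}
```

Keeping covered as a map from source position to source word makes it easy to print the partially covered source sentence, as in the beam traces below.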
Let us test this decoder on a simple sentence, using a uniform language model.
| target hypothesis | covered source words | uncovered | log probability |
|---|---|---|---|
| NULL | _ _ _ _ | 4 | 0.0 |
| NULL a | _ _ ein _ | 3 | -2.3978952727983707 |
| NULL man | groß _ _ _ | 3 | -Infinity |
| NULL a man | _ _ ein Mann | 2 | -4.795790545596741 |
| NULL a my | groß _ ein _ | 2 | -Infinity |
| NULL a man is | _ ist ein Mann | 1 | -7.886832998955057 |
| NULL a man man | _ ist ein Mann | 1 | -Infinity |
| NULL a man is big | groß ist ein Mann | 0 | -10.284728271753428 |
| NULL a man is tall | groß ist ein Mann | 0 | -10.284728271753428 |
We can reduce ambiguity by using a better language model:
| target hypothesis | covered source words | uncovered | log probability |
|---|---|---|---|
| NULL | _ _ _ _ | 4 | 0.0 |
| NULL a | _ _ ein _ | 3 | -2.4849066497880004 |
| NULL man | groß _ _ _ | 3 | -Infinity |
| NULL a man | _ _ ein Mann | 2 | -3.1780538303479458 |
| NULL a tall | _ ist ein _ | 2 | -Infinity |
| NULL a man is | _ ist ein Mann | 1 | -4.564348191467836 |
| NULL a man NULL | _ ist ein Mann | 1 | -6.962243464266207 |
| NULL a man is tall | groß ist ein Mann | 0 | -5.2574953720277815 |
| NULL a man is big | groß ist ein Mann | 0 | -7.655390644826152 |
Note that "a man is tall" is also more likely in the Google N-grams corpus.
There are a few high-level messages to take away from this chapter.