Word-based Machine Translation

Machine Translation (MT) is one of the canonical NLP applications, and one that nowadays most people are familiar with, primarily through the online translation services of the major search engine providers. While there is still some way to go before machines can provide fluent and flawless translations, in particular for more distant language pairs like English and Japanese, progress in this field has been remarkable. MT is one of the most widely used NLP applications. It is an example of end-to-end NLP, and MT-like architectures can be found in many other applications.

In this chapter we will illustrate the foundations of this progress and focus on word-based machine translation models, in which words are the basic unit of translation. Nowadays the field has mostly moved to phrase- and syntax-based approaches, but the word-based approach is still important, both from a foundational point of view and as a sub-component in more complex approaches: it is no longer state-of-the-art, but it serves as a foundation and blueprint for modern mechanisms.

MT as Structured Prediction

Formally we will see MT as the task of translating a source sentence s into a target sentence t. We can tackle the problem using the structured prediction recipe: define a parametrised model s_θ(t,s) that measures how well a target sentence t matches a source sentence s, learn the parameters θ from training data, and then find

(1)   \arg\max_t s_\theta(t, s)

as the translation of s. Different statistical MT approaches, in this view, differ primarily in how s_θ is defined, how the parameters θ are learned, and how the argmax is found.

Noisy Channel Model for MT

Many word-based MT systems, as well as those based on more advanced representations, rely on a Noisy Channel model as the choice for the scoring function s_θ. In this approach to MT we effectively model the translation process in reverse. That is, we assume that a probabilistic process (the speaker's brain) first generates the target sentence t according to a distribution p(t). The target sentence t is then transmitted through a noisy channel p(s|t) that translates t into s.

Hence translation is seen as adding noise to a clean t. This generative story defines a joint distribution over target and source sentences, p(s,t) = p(t) p(s|t). We can in turn operate this distribution in the direction we actually care about: to infer a target sentence t given a source sentence s we find the maximum a posteriori sentence

(2)   t^* = \arg\max_t p(t \mid s) = \arg\max_t p(t) \, p(s \mid t).

For the structured prediction recipe this means setting

s_\theta(t, s) = p(t) \, p(s \mid t).

In the noisy channel approach to MT the distribution p(t) that generates the target sentence is usually referred to as the language model, and the noisy channel p(s|t) is called the translation model. As we have discussed language models in an earlier chapter, here we focus on the translation model p(s|t).
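To make the decomposition concrete, here is a minimal sketch (in Scala, the language used for code later in this chapter) of how such a score could be composed from a language model and a translation model. The lm and tm functions are hypothetical placeholders, not part of the notebook's code:

```scala
// Minimal sketch: composing the noisy channel score from a language model
// and a translation model, working in log space to avoid underflow.
type Sentence = Seq[String]

def noisyChannelScore(lm: Sentence => Double,            // log p(t), assumed given
                      tm: (Sentence, Sentence) => Double // log p(s|t), assumed given
                     )(t: Sentence, s: Sentence): Double =
  lm(t) + tm(s, t)
```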

A Naive Baseline Translation Model

The most straightforward translation model translates words one-by-one, in the order of appearance:

p^{\text{Naive}}_\theta(s \mid t) = \prod_{i=1}^{\text{length}(s)} \theta_{s_i, t_i}
where θ_{s,t} is the probability of translating t as s. θ is often referred to as the translation table.

For many language pairs one can acquire training sets D_train = ((s_i, t_i))_{i=1}^{n} of paired source and target sentences. For example, for French and English the Aligned Hansards of the Parliament of Canada can be used. Given such a training set D_train we can learn the parameters θ using the Maximum Likelihood estimator. In the case of our naive model this amounts to setting

\theta_{s,t} = \frac{\#_{D_\text{train}}(s,t)}{\#_{D_\text{train}}(t)}

Here #_{D_train}(s,t) is the number of times we see the target word t translated as the source word s, and #_{D_train}(t) is the number of times we see the target word t in total.

Training the Naive Model

Let us prepare some toy data to show how to train this naive model.


Notice how we transformed raw strings into Document objects via segment, and how we then fill the training set with Sentence objects by extracting the head sentence from each document. This dataset can be used to train the naive model as follows.
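Since the notebook's code cells are not reproduced here, the following is a self-contained sketch of the same idea in plain Scala: a hypothetical toy corpus of tokenized sentence pairs and a maximum likelihood estimate of the translation table obtained by counting word pairs at equal positions. The sentence pairs are made up for illustration and need not match the notebook's toy data:

```scala
// A hypothetical toy corpus of pre-tokenized (source, target) sentence pairs;
// the original notebook builds its training set with its own Document/Sentence utilities.
val train: Seq[(Seq[String], Seq[String])] = Seq(
  (Seq("das", "Haus", "ist", "klein"), Seq("the", "house", "is", "small")),
  (Seq("mein", "Haus", "ist", "klein"), Seq("my", "house", "is", "small")),
  (Seq("das", "Haus", "ist", "groß"), Seq("the", "house", "is", "big"))
)

// Maximum likelihood estimate of the translation table:
// theta(s, t) = #(t translated as s) / #(t), counting word pairs at the same position.
def trainNaive(data: Seq[(Seq[String], Seq[String])]): Map[(String, String), Double] = {
  val pairCounts   = scala.collection.mutable.Map[(String, String), Double]().withDefaultValue(0.0)
  val targetCounts = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
  for ((src, tgt) <- data; (s, t) <- src zip tgt) {
    pairCounts((s, t)) += 1.0
    targetCounts(t)    += 1.0
  }
  pairCounts.map { case ((s, t), c) => (s, t) -> c / targetCounts(t) }.toMap
}

val theta = trainNaive(train) // e.g. inspect theta(("ist", "is"))
```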

Let us train on the toy dataset:

[Bar chart: translation table entries for the target word "is": θ(das|is) = 0.25, θ(mein|is) = 0.25, θ(ist|is) = 0.50]

Decoding with the Naive Model

Decoding in MT is the task of finding the solution to equation 1. That is, we need to find the target sentence with maximum a posteriori probability, which is equivalent to finding the target sentence with maximum likelihood as per equation 2. The term "decoding" relates to the noisy channel analogy: somebody generated a message, the channel encodes (translates) this message, and the receiver needs to find out what the original message was.

In the naive model decoding is trivial if we assume a unigram language model. We need to choose, for each source word, the target word with maximal product of translation and language model probability. For more complex models this is not sufficient, and we discuss a more powerful decoding method later.
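A sketch of this per-word decoding strategy, assuming a translation table theta as estimated above and a hypothetical unigramLM function, might look as follows:

```scala
// Sketch of naive decoding with a unigram language model: for every source word,
// pick the target word maximizing translation probability * unigram LM probability.
def decodeNaive(src: Seq[String],
                theta: Map[(String, String), Double],
                unigramLM: String => Double): Seq[String] = {
  val targetVocab = theta.keys.map(_._2).toSet
  src.map { s =>
    targetVocab.maxBy(t => theta.getOrElse((s, t), 0.0) * unigramLM(t))
  }
}
```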

my house the small

The naive model is broken in several ways. Most severely, it ignores the fact that word order can differ and still yield (roughly) the same meaning.

IBM Model 2

IBM Model 2 is one of the most influential translation models, even though these days it is only indirectly used in actual MT systems, for example to initialize translation and alignment models. As IBM Model 2 can be understood as a generalization of IBM Model 1, we omit the latter for now and briefly illustrate it after our introduction of Model 2. Notice that parts of this exposition are based on the excellent lecture notes on IBM Model 1 and 2 by Mike Collins.

Alignment

The core difference between Model 2 and our naive baseline model is the introduction of latent auxiliary variables: the word-to-word alignment a between source and target words. In particular, we introduce a variable a_i ∈ [0, length(t)] for each source sentence index i ∈ [1, length(s)]. The word alignment a_i = j means that the source word at token i is aligned with the target word at index j.

Notice that a_i can be 0. This corresponds to an imaginary NULL token t_0 in the target sentence and allows source words to be omitted in an alignment.

Below you see a simple example of an alignment.

[Alignment diagrams: the German sentence "klein ist das Haus" aligned to the English sentence "NULL the house is small", and a Japanese sentence ending in 小さいです aligned to the English sentence "The house is small"]

IBM Model 2 defines a conditional distribution p(s,a|t) over both the source sentence s and its alignment a to the target sentence t. Such a model can be used as a translation model p(s|t), as defined above, by marginalizing out the alignment

p(s \mid t) = \sum_a p(s, a \mid t).

Model Parametrization

IBM Model 2 defines its conditional distribution over source sentences and alignments using two sets of parameters θ = (α, β). Here α(s|t) is a parameter defining the probability of translating the target word t into the source word s, and β(j|i,l_t,l_s) is a parameter defining the probability of aligning the source word at token i with the target word at token j, conditioned on the length l_t of the target sentence and the length l_s of the source sentence.

With the above parameters, IBM Model 2 defines a conditional distribution over source sentences and alignments, conditioned on a target sentence and a desired source sentence length l_s:

(3)   p^{\text{IBM2}}_\theta(s_1 \dots s_{l_s}, a_1 \dots a_{l_s} \mid t_1 \dots t_{l_t}, l_s) = \prod_{i=1}^{l_s} \alpha(s_i \mid t_{a_i}) \, \beta(a_i \mid i, l_t, l_s)
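For concreteness, here is a small sketch of how this quantity could be computed in log space, assuming α and β are given as plain functions and the target sequence already contains the NULL token at index 0 (these conventions are ours, not necessarily the notebook's):

```scala
// Sketch: log p(s, a | t, l_s) under IBM Model 2. We assume `target` already
// contains the NULL token at index 0, and a(i) gives the target index aligned
// to source token i (0 meaning NULL).
def logProbIBM2(source: IndexedSeq[String],
                target: IndexedSeq[String],
                a: IndexedSeq[Int],
                alpha: (String, String) => Double,   // alpha(s | t)
                beta: (Int, Int, Int, Int) => Double // beta(j | i, l_t, l_s)
               ): Double = {
  val ls = source.length
  val lt = target.length - 1 // excluding NULL
  (0 until ls).map { i =>
    val j = a(i)
    math.log(alpha(source(i), target(j))) + math.log(beta(j, i, lt, ls))
  }.sum
}
```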

Training IBM Model 2 with the EM Algorithm

Training IBM Model 2 is less straightforward than training our naive baseline. The main reason is the lack of gold alignments in the training data. That is, while we can quite easily find, or heuristically construct, sentence-aligned corpora like our toy dataset, we generally do not have word-aligned sentences.

To overcome this problem, IBM Model 2 can be trained using the Expectation Maximization (EM) Algorithm, a general recipe for learning with partially observed data. In our case the data is partially observed because we observe the source and target sentences, but not their alignments. The EM algorithm maximizes a lower bound of the log-likelihood of the data. The log-likelihood of the data is

\sum_{(t_i, s_i) \in D_\text{train}} \log p^{\text{IBM2}}_\theta(s_i \mid t_i) = \sum_{(t_i, s_i) \in D_\text{train}} \log \sum_a p^{\text{IBM2}}_\theta(s_i, a \mid t_i)

EM can be seen as block coordinate ascent on this bound.

The EM algorithm is an iterative method that alternates between two steps, the E-step (Expectation) and the M-step (Maximization), until convergence. For IBM Model 2 the E and M steps are instantiated as follows:

  • E-Step: given the current parameters θ, calculate the expectations π of the latent alignment variables under the model p^IBM2_θ; this amounts to estimating a soft alignment for each sentence pair.
  • M-Step: given a training set of soft alignments π, find new parameters θ that maximize the log-likelihood of this (weighted) training set. This amounts to soft counting.

E-Step

The E-Step calculates the distribution

\pi(a \mid s, t) = p^{\text{IBM2}}_\theta(a \mid s, t)

for the current parameters θ. For Model 2 this distribution has a very simple form:

\pi(a \mid s, t) = \prod_{i=1}^{l_s} \pi(a_i \mid s, t, i) = \prod_{i=1}^{l_s} \frac{\alpha(s_i \mid t_{a_i}) \, \beta(a_i \mid i, l_t, l_s)}{\sum_{j=0}^{l_t} \alpha(s_i \mid t_j) \, \beta(j \mid i, l_t, l_s)}

Importantly, the distribution over alignments factorizes in a per-source-token fashion, and hence we only need to calculate, for each source token i and each possible alignment a_i, the probability (or expectation) π(a_i | s, t, i).

Before we look at the implementation of this algorithm we will set up the training data to be compatible with our formulation. This involves introducing a 'NULL' token to each target sentence to allow source tokens to remain unaligned. We also gather a few statistics that will be useful in our implementation later on. First we create some toy data:

We can now implement the E-Step.
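The notebook's eStep implementation is not reproduced here; the following sketch computes the same per-source-token soft alignments for a single sentence pair, under the convention that the target sequence contains NULL at index 0:

```scala
// Sketch of the E-Step for one sentence pair (not the notebook's exact eStep):
// for each source position i we compute the soft alignment pi(a_i = j | s, t, i)
// over all target positions j, including the NULL token at index 0.
def eStepSentence(source: IndexedSeq[String],
                  target: IndexedSeq[String], // target(0) == "NULL"
                  alpha: (String, String) => Double,
                  beta: (Int, Int, Int, Int) => Double): IndexedSeq[IndexedSeq[Double]] = {
  val ls = source.length
  val lt = target.length - 1
  for (i <- 0 until ls) yield {
    // unnormalized alignment scores alpha(s_i | t_j) * beta(j | i, l_t, l_s)
    val unnorm = for (j <- 0 to lt) yield alpha(source(i), target(j)) * beta(j, i, lt, ls)
    val z = unnorm.sum
    unnorm.map(_ / z)
  }
}
```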

Let us run this code:

[Soft alignment matrix for the sentence pair "klein ist das Haus" / "NULL the house is small"]

You can play around with the initialization of β to see how the alignments react to changes of the word-to-word translation probabilities.

M-Step

The M-Step optimizes a weighted or expected version of the log-likelihood of the data, using the distribution π from the last E-Step:

\theta^* = \arg\max_\theta \sum_{(t,s) \in D_\text{train}} \sum_a \pi(a \mid t, s) \log p^{\text{IBM2}}_\theta(s, a \mid t)

The summing over hidden alignments seems daunting, but because π factorizes as we discussed above, we again have a simple closed-form solution:

\alpha(s \mid t) = \frac{\sum_{(t,s)} \sum_{i=1}^{l_s} \sum_{j=0}^{l_t} \pi(j \mid i) \, \delta(s, s_i) \, \delta(t, t_j)}{\sum_{(t,s)} \sum_{j=0}^{l_t} \delta(t, t_j)}

where δ(x,y) is 1 if x=y and 0 otherwise. The updates for β are similar.

Let us implement the M-Step now. In this step we estimate parameters θ from a given set of (soft) alignments a.
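Again as a sketch rather than the notebook's exact mStep: the update for α can be implemented as soft counting over the E-Step output, normalizing the counts so that α(·|t) sums to one for every target word t:

```scala
// Sketch of the M-Step (soft counting): re-estimate alpha from the per-position
// soft alignments produced by the E-Step, normalizing so that alpha(. | t) sums to one.
// `data` pairs each (source, target-with-NULL) sentence pair with its soft alignment matrix.
def mStepAlpha(data: Seq[((IndexedSeq[String], IndexedSeq[String]), IndexedSeq[IndexedSeq[Double]])])
    : Map[(String, String), Double] = {
  val pairCounts   = scala.collection.mutable.Map[(String, String), Double]().withDefaultValue(0.0)
  val targetCounts = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
  for (((source, target), pi) <- data; i <- source.indices; j <- target.indices) {
    pairCounts((source(i), target(j))) += pi(i)(j)
    targetCounts(target(j))            += pi(i)(j)
  }
  pairCounts.map { case ((s, t), c) => (s, t) -> c / targetCounts(t) }.toMap
}
```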

Let us run one M-Step on the alignments we estimated earlier.

[Bar chart: estimated translation probabilities α(·|is) after one M-Step: Haus 0.10, ein 0.05, groß 0.10, mein 0.05, das 0.15, klein 0.10, lang 0.05, Gebäude 0.10, Mann 0.05, ist 0.25]

Notice that the algorithm has already figured out that "is" is most likely translated as "ist". This is because "ist" is (softly) aligned with "is" in every sentence, whereas other German words only appear in a subset of the sentences.

Initialization (IBM Model 1)

We could already iteratively call eStep and mStep until convergence. However, a crucial question is how to initialize the model parameters for the first call to eStep. So far we used a uniform initialization, but because the EM bound is non-convex and the algorithm's results usually depend significantly on initialization, a more informed starting point can be crucial.

A common way to initialize EM for IBM Model 2 training is to first train the so-called IBM Model 1 using EM. This model is really an instantiation of Model 2 with a specific and fixed set of alignment parameters β. Instead of estimating β, it is set to assign uniform probability to all target tokens given the target length:

\beta(a_i \mid i, l_t, l_s) = \frac{1}{l_t + 1}

After training, the parameters θ of Model 1 can be used to initialize EM for Model 2.

Training Model 1 using EM could suffer from the same initialization problem. Fortunately, with β fixed in this way it can be shown, under mild conditions, that EM converges to a global optimum, making IBM Model 1 robust to the choice of initialization.

Let us train IBM Model 1 now. This amounts to using our previous eStep and mStep methods, initializing β as above and not updating it during mStep.
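In terms of the sketches above (eStepSentence and mStepAlpha standing in for the notebook's eStep and mStep), a Model 1 training loop could look like this:

```scala
// Sketch: training IBM Model 1 with EM, i.e. running the E-Step and M-Step sketches
// above with the distortion parameters fixed to the uniform distribution 1 / (l_t + 1).
// Target sentences are assumed to include NULL at index 0.
def trainModel1(data: Seq[(IndexedSeq[String], IndexedSeq[String])],
                iterations: Int): Map[(String, String), Double] = {
  val uniformBeta = (j: Int, i: Int, lt: Int, ls: Int) => 1.0 / (lt + 1)
  // start from an (implicitly uniform) translation table over all observed word pairs
  var alphaMap: Map[(String, String), Double] =
    (for ((s, t) <- data; sw <- s; tw <- t) yield (sw, tw) -> 1.0).toMap
  for (_ <- 1 to iterations) {
    val alpha = (s: String, t: String) => alphaMap.getOrElse((s, t), 0.0)
    val softAligned = data.map { case (s, t) => ((s, t), eStepSentence(s, t, alpha, uniformBeta)) }
    alphaMap = mStepAlpha(softAligned)
  }
  alphaMap
}
```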

You can see below that the alignments converge relatively quickly.

[Line plot: a convergence measure (change per EM iteration, y-axis roughly 0.00 to 0.20) against the iteration number (x-axis, 0 to 99), decreasing towards zero]
Exercise 1
Can you think of other reasonable measures for convergence? Hint: consider the formal derivation of EM.

Let us have a look at the translation table.

[Bar chart: IBM Model 1 translation probabilities α(·|house): mein 0.00, das 0.00, Haus 0.50, ist 0.00, klein 0.50]

We can also inspect the alignments generated during EM.

[Soft alignment matrix under IBM Model 1 for "klein ist das Haus" / "NULL the house is small"]

Training IBM Model 2

Now that we have a reasonable initial model we can use it to initialize EM for IBM Model 2. Here is the EM code in full.
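Since the notebook code is not reproduced here, the following sketch shows one way to write the full EM loop for Model 2, re-estimating β with the analogous soft-counting update and initializing α with the Model 1 result:

```scala
// Sketch: EM for IBM Model 2, re-estimating both the translation table alpha and the
// distortion table beta by soft counting, initialized with the alpha from IBM Model 1.
def trainModel2(data: Seq[(IndexedSeq[String], IndexedSeq[String])],
                initAlpha: Map[(String, String), Double],
                iterations: Int)
    : (Map[(String, String), Double], Map[(Int, Int, Int, Int), Double]) = {
  var alphaMap = initAlpha
  var betaMap  = Map.empty[(Int, Int, Int, Int), Double]
  // distortion lookup that backs off to uniform before the first update
  def beta(j: Int, i: Int, lt: Int, ls: Int): Double =
    betaMap.getOrElse((j, i, lt, ls), 1.0 / (lt + 1))
  for (_ <- 1 to iterations) {
    val alpha = (s: String, t: String) => alphaMap.getOrElse((s, t), 0.0)
    // E-Step: soft alignments under the current parameters
    val softAligned = data.map { case (s, t) => ((s, t), eStepSentence(s, t, alpha, beta)) }
    // M-Step: soft counts for alpha ...
    alphaMap = mStepAlpha(softAligned)
    // ... and for beta(j | i, l_t, l_s), normalized over j
    val betaCounts = scala.collection.mutable.Map[(Int, Int, Int, Int), Double]().withDefaultValue(0.0)
    for (((source, target), pi) <- softAligned; i <- source.indices; j <- target.indices)
      betaCounts((j, i, target.length - 1, source.length)) += pi(i)(j)
    betaMap = betaCounts.toSeq
      .groupBy { case ((_, i, lt, ls), _) => (i, lt, ls) }
      .values
      .flatMap { group =>
        val z = group.map(_._2).sum
        group.map { case (key, c) => key -> c / z }
      }.toMap
  }
  (alphaMap, betaMap)
}
```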

Initializing with the IBM Model 1 result gives us:

[Bar chart: IBM Model 2 translation probabilities α(·|house): mein 0.00, das 0.00, Haus 1.00, ist 0.00, klein 0.00]

For alignments we get:

[Soft alignment matrix under IBM Model 2 for "klein ist das Haus" / "NULL the house is small"]

Let us look at the distortion probabilities for a given source position and source and target lengths.

[Bar chart: distortion probabilities β(j | i, l_t, l_s) over target positions j = 0 to 4 for a fixed source position and sentence lengths, with all mass (1.00) on j = 4]

Decoding for IBM Model 2

Decoding IBM Model 2 requires us to solve the argmax problem in equation 2, this time using the conditional probability from equation 3 with the hidden alignments marginalized out:

(4)   \arg\max_t p^{\text{IBM2}}_\theta(s \mid t) = \arg\max_t \sum_a p^{\text{IBM2}}_\theta(s, a \mid t)

This nested argmax and sum is generally computationally very hard (see Park and Darwiche), and is often replaced with the simpler problem of finding the best combination of target sequence and corresponding alignment:

(5)   \arg\max_{t,a} p^{\text{IBM2}}_\theta(s, a \mid t)

As it turns out, for IBM Model 2 the sum can be calculated efficiently, and Wang and Waibel present a stack-based decoder that takes this into account.

However, both for simplicity of exposition and because for most real-world models this marginalization is not possible, we present a decoder that searches over both target and alignment. To simplify the algorithm further we assume that target and source sentences have the same length. This is of course a major restriction, and it is not necessary, but it makes the algorithm easier to explain while maintaining the core mechanism. Here we only show the Scala code and refer the reader to our slides for an illustration of how stack- and beam-based decoders work.
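The notebook's decoder itself is not reproduced here; the following is a heavily simplified sketch of the same idea under the same-length assumption. It builds the target left to right, aligns one uncovered source token to each new target word, scores hypotheses with α, β and a hypothetical bigram log language model lm, and keeps only the beamSize best partial hypotheses at every step:

```scala
// Simplified beam decoder sketch (same source and target length assumed; not the
// notebook's exact decoder). Each hypothesis stores the partial target (with NULL
// at position 0), the set of covered source positions, and the accumulated log score.
case class Hypothesis(target: Vector[String], covered: Set[Int], logScore: Double)

def beamDecode(source: IndexedSeq[String],
               targetVocab: Set[String],
               alpha: (String, String) => Double,    // alpha(s | t)
               beta: (Int, Int, Int, Int) => Double, // beta(j | i, l_t, l_s)
               lm: (String, String) => Double,       // log p(word | previous word)
               beamSize: Int = 10): Hypothesis = {
  val ls = source.length
  val lt = ls // same-length assumption
  var beam = Seq(Hypothesis(Vector("NULL"), Set.empty, 0.0))
  for (_ <- 1 to ls) {
    val expanded = for {
      hyp <- beam
      t   <- targetVocab.toSeq                   // candidate next target word
      i   <- (0 until ls).filterNot(hyp.covered) // uncovered source token to align to it
    } yield {
      val j = hyp.target.length // position of the new target word (NULL sits at 0)
      val step = math.log(alpha(source(i), t)) +
                 math.log(beta(j, i, lt, ls)) +
                 lm(t, hyp.target.last)
      Hypothesis(hyp.target :+ t, hyp.covered + i, hyp.logScore + step)
    }
    beam = expanded.sortBy(-_.logScore).take(beamSize) // keep the best partial hypotheses
  }
  beam.maxBy(_.logScore)
}
```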

Let us test this decoder on a simple sentence, using a uniform language model.

NULL _ _ _ _ 4 0.0
NULL a _ _ ein _ 3 -2.3978952727983707
NULL man groß _ _ _ 3 -Infinity
NULL a man _ _ ein Mann 2 -4.795790545596741
NULL a my groß _ ein _ 2 -Infinity
NULL a man is _ ist ein Mann 1 -7.886832998955057
NULL a man man _ ist ein Mann 1 -Infinity
NULL a man is big groß ist ein Mann 0 -10.284728271753428
NULL a man is tall groß ist ein Mann 0 -10.284728271753428

There are currently two contenders for the most likely translation. This is because the translation model is uncertain about the translation of "groß", which can be "tall" in the context of the height of humans, and "big" in most other settings. To resolve this uncertainty we can use a language model that captures the fact that "man is big" is a little less likely than "man is tall".

NULL _ _ _ _ 4 0.0
NULL a _ _ ein _ 3 -2.4849066497880004
NULL man groß _ _ _ 3 -Infinity
NULL a man _ _ ein Mann 2 -3.1780538303479458
NULL a tall _ ist ein _ 2 -Infinity
NULL a man is _ ist ein Mann 1 -4.564348191467836
NULL a man NULL _ ist ein Mann 1 -6.962243464266207
NULL a man is tall groß ist ein Mann 0 -5.2574953720277815
NULL a man is big groß ist ein Mann 0 -7.655390644826152

Note that "a man is tall" is also more likely in the Google N-grams corpus.

Summary

There are a few high level messages to take away from this chapter.

  • MT is an instance of the structured prediction recipe
  • The noisy channel is one modeling framework
  • Word-based MT is a foundation and blueprint for more complex models
  • Training with EM
  • NLP tricks:
    • introducing latent alignment variables to simplify the problem
    • decoding with beams

Background Material