Machine Translation (MT) is one of the canonical NLP applications, and one that most people are nowadays familiar with, primarily through the online translation services of the major search engine providers. While there is still some way to go before machines can provide fluent and flawless translations, in particular for more distant language pairs like English and Japanese, progress in this field has been remarkable. MT is one of the most widely used NLP applications; it is an example of end-to-end NLP, and MT-like architectures can be found in many other applications.
In this chapter we will illustrate the foundations of this progress and focus on word-based machine translation models, in which words are the basic unit of translation. Nowadays the field has mostly moved to phrase-based and syntax-based approaches, and word-based MT is no longer the state of the art, but it remains important both from a foundational point of view and as a sub-component of more complex approaches: it is a blueprint for modern mechanisms.
Formally, we will see MT as the task of translating a source sentence $\mathbf{x}$ into a target sentence $\mathbf{y}$. We can tackle the problem using the structured prediction recipe: we define a parametrised model $s_{\boldsymbol{\theta}}(\mathbf{y},\mathbf{x})$ that measures how well a target sentence $\mathbf{y}$ matches a source sentence $\mathbf{x}$, learn the parameters $\boldsymbol{\theta}$ from training data, and then find

$$
\mathbf{y}^* = \operatorname{argmax}_{\mathbf{y}} s_{\boldsymbol{\theta}}(\mathbf{y},\mathbf{x})
$$

as the translation of $\mathbf{x}$. Different statistical MT approaches, in this view, differ primarily in how $s_{\boldsymbol{\theta}}$ is defined, how the parameters $\boldsymbol{\theta}$ are learned, and how the $\operatorname{argmax}$ is found.
Many word-based MT systems, as well as those based on more advanced representations, rely on a Noisy Channel model as the choice for the scoring function $s_{\boldsymbol{\theta}}$. In this approach to MT we effectively model the translation process in reverse. That is, we assume that a probabilistic process (the speaker's brain) first generates the target sentence $\mathbf{y}$ according to the distribution $p(\mathbf{y})$. Then the target sentence is transmitted through a noisy channel $p(\mathbf{x}\mid\mathbf{y})$ that translates $\mathbf{y}$ into $\mathbf{x}$.
Hence translation is seen as adding noise to a clean $\mathbf{y}$. This generative story defines a joint distribution over target and source sentences, $p(\mathbf{x},\mathbf{y}) = p(\mathbf{y})\,p(\mathbf{x}\mid\mathbf{y})$. We can in turn operate this distribution in the direction we actually care about: to infer a target sentence $\mathbf{y}$ given a source sentence $\mathbf{x}$, we find the maximum a posteriori sentence

$$
\mathbf{y}^* = \operatorname{argmax}_{\mathbf{y}} p(\mathbf{y}\mid\mathbf{x}) = \operatorname{argmax}_{\mathbf{y}} p(\mathbf{y})\,p(\mathbf{x}\mid\mathbf{y}).
$$
For the structured prediction recipe this means setting

$$
s_{\boldsymbol{\theta}}(\mathbf{y},\mathbf{x}) = p(\mathbf{y})\,p(\mathbf{x}\mid\mathbf{y}).
$$
In the noisy channel approach to MT the distribution $p(\mathbf{y})$ that generates the target sentence is usually referred to as the language model, and the noisy channel $p(\mathbf{x}\mid\mathbf{y})$ is called the translation model. As we have discussed language models earlier, in this chapter we focus on the translation model $p(\mathbf{x}\mid\mathbf{y})$.
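To make this division of labour concrete, here is a minimal sketch (an illustration, not this chapter's implementation) of the noisy channel score in code, assuming both models are given as functions that return log probabilities:

```scala
/** Noisy channel score in log space: log p(y) + log p(x | y).
  * languageModel and translationModel are assumed to return log probabilities. */
def noisyChannelScore(languageModel: Seq[String] => Double,
                      translationModel: (Seq[String], Seq[String]) => Double)
                     (target: Seq[String], source: Seq[String]): Double =
  languageModel(target) + translationModel(source, target)
```

Working in log space turns the product into a sum and avoids numerical underflow for longer sentences.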
The most straightforward translation model translates words one-by-one, in the order of appearance:

$$
p(\mathbf{x}\mid\mathbf{y}) = \prod_{i=1}^{l_x} t(x_i\mid y_i)
$$

where $l_x$ is the sentence length (this naive model assumes that source and target sentences have the same length) and $t(x\mid y)$ is the probability of translating target word $y$ into source word $x$.
For many language pairs one can acquire training sets of paired source and target sentences. For example, for French and English the Aligned Hansards of the Parliament of Canada can be used. Given such a training set we can learn the parameters using the Maximum Likelihood estimator. In the case of our naive model this amounts to setting

$$
t(x\mid y) = \frac{\operatorname{count}(y \rightarrow x)}{\operatorname{count}(y)}.
$$

Here $\operatorname{count}(y \rightarrow x)$ is the number of times we see target word $y$ translated as source word $x$, and $\operatorname{count}(y)$ the number of times we see the target word $y$ in total.
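As a minimal, self-contained sketch of this estimator (independent of the Document and Sentence classes used below), assuming the training set is given as pairs of equal-length token sequences:

```scala
/** MLE for the naive word-by-word model: t(x | y) = count(y -> x) / count(y).
  * Assumes the source and target sentence of each pair have the same length. */
def trainNaive(pairs: Seq[(Seq[String], Seq[String])]): Map[(String, String), Double] = {
  val pairCounts = scala.collection.mutable.Map[(String, String), Double]().withDefaultValue(0.0)
  val targetCounts = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
  for ((source, target) <- pairs; (x, y) <- source.zip(target)) {
    pairCounts((x, y)) += 1.0   // how often target word y was translated as source word x
    targetCounts(y) += 1.0      // how often target word y was seen in total
  }
  pairCounts.map { case ((x, y), c) => (x, y) -> c / targetCounts(y) }.toMap
}
```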
Let us prepare some toy data to show how to train this naive model.
Notice how we transformed raw strings into Document objects via segment
, and how we then filled the training set with Sentence objects by extracting the head
sentence from each document. This dataset can be used to train the naive model as follows.
Let us train on the toy dataset:
Decoding in MT is the task of finding the solution to the $\operatorname{argmax}$ problem above. That is, we need to find the target sentence with maximum a posteriori probability, which is equivalent to finding the target sentence that maximizes the joint probability $p(\mathbf{y})\,p(\mathbf{x}\mid\mathbf{y})$. The phrase "decoding" relates to the noisy channel analogy: somebody generated a message, the channel encoded (translated) this message, and the receiver needs to find out what the original message was.
In the naive model decoding is trivial if we assume a unigram language model. We need to choose, for each source word, the target word with maximal product of translation and language model probability. For more complex models this is not sufficient, and we discuss a more powerful decoding method later.
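A minimal sketch of this per-word decoder, assuming a translation table of the form produced by the hypothetical trainNaive sketch above and a unigram language model given as a function from words to probabilities:

```scala
/** Naive decoding: for each source word pick the target word that maximises the
  * product of translation probability t(x | y) and unigram LM probability p(y). */
def decodeNaive(source: Seq[String],
                t: Map[(String, String), Double],
                unigramLM: String => Double): Seq[String] =
  source.map { x =>
    // all target words y the table knows as translations of this source word x
    val candidates = t.collect { case ((`x`, y), prob) => (y, prob) }
    if (candidates.isEmpty) x   // unseen source word: pass it through unchanged
    else candidates.maxBy { case (y, prob) => prob * unigramLM(y) }._1
  }
```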
The naive model is broken in several ways. Most severely, it ignores the fact that word order can differ between source and target sentence while still conveying (roughly) the same meaning.
IBM Model 2 is one of the most influential translation models, even though these days it is only indirectly used in actual MT systems, for example to initialize translation and alignment models. As IBM Model 2 can be understood as a generalization of IBM Model 1, we omit the latter for now and briefly illustrate it after our introduction of Model 2. Notice that parts of this exposition are based on the excellent lecture notes on IBM Model 1 and 2 by Mike Collins.
The core difference between Model 2 and our naive baseline model is the introduction of latent auxiliary variables: the word-to-word alignment between source and target words. In particular, we introduce an alignment variable $a_i$ for each source sentence index $i \in \{1,\ldots,l_x\}$. The word alignment $a_i = j$ means that the source word at index $i$ is aligned with the target word at index $j$.
Notice that $a_i$ can be $0$. This corresponds to an imaginary NULL token $y_0$ in the target sentence and allows source words to remain unaligned to any actual target word.
Below you see a simple example of an alignment.
IBM Model 2 defines a conditional distribution $p(\mathbf{x},\mathbf{a}\mid\mathbf{y})$ over both the source sentence $\mathbf{x}$ and its alignment $\mathbf{a}$ to the target sentence $\mathbf{y}$. Such a model can be used as a translation model $p(\mathbf{x}\mid\mathbf{y})$, as defined above, by marginalizing out the alignment:

$$
p(\mathbf{x}\mid\mathbf{y}) = \sum_{\mathbf{a}} p(\mathbf{x},\mathbf{a}\mid\mathbf{y}).
$$
IBM Model 2 defines its conditional distribution over source sentences and alignments using two sets of parameters. Here $t(x\mid y)$ is a parameter defining the probability of translating target word $y$ into source word $x$, and $q(j\mid i,l_x,l_y)$ is a parameter that defines the probability of aligning the source word at index $i$ with the target word at index $j$, conditioned on the length $l_y$ of the target sentence and the length $l_x$ of the source sentence.
With the above parameters, IBM Model 2 defines a conditional distribution over source sentences and alignments, conditioned on a target sentence $\mathbf{y}$ and a desired source sentence length $l_x$:

$$
p(\mathbf{x},\mathbf{a}\mid\mathbf{y},l_x) = \prod_{i=1}^{l_x} q(a_i\mid i,l_x,l_y)\; t(x_i\mid y_{a_i}).
$$
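As a sketch, this probability can be computed as follows, assuming the target sequence starts with the NULL token at index 0, alignment(i) gives the aligned target position for source position i, and the parameters t and q are given as functions:

```scala
/** p(x, a | y, lx) under IBM Model 2: a product over source positions of the
  * distortion probability q(a_i | i, lx, ly) and the translation probability t(x_i | y_{a_i}). */
def model2Prob(source: IndexedSeq[String], target: IndexedSeq[String],
               alignment: IndexedSeq[Int],
               t: (String, String) => Double,
               q: (Int, Int, Int, Int) => Double): Double = {
  val lx = source.length
  val ly = target.length - 1  // number of real target tokens, excluding NULL at index 0
  source.indices.map(i => q(alignment(i), i, lx, ly) * t(source(i), target(alignment(i)))).product
}
```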
Training IBM Model 2 is less straightforward than training our naive baseline. The main reason is the lack of gold alignments in the training data. That is, while we can quite easily find, or heuristically construct, sentence-aligned corpora like our toy dataset, we generally do not have word-aligned sentences.
To overcome this problem, IBM Model 2 can be trained using the Expectation Maximization (EM) algorithm, a general recipe for learning with partially observed data. In our case the data is partially observed because we observe the source and target sentences, but not their alignments. The EM algorithm maximizes a lower bound of the log-likelihood of the data. For a training set of sentence pairs $(\mathbf{x}^{(1)},\mathbf{y}^{(1)}),\ldots,(\mathbf{x}^{(n)},\mathbf{y}^{(n)})$ the log-likelihood is

$$
\sum_{k=1}^{n} \log p_{\boldsymbol{\theta}}(\mathbf{x}^{(k)}\mid\mathbf{y}^{(k)}) = \sum_{k=1}^{n} \log \sum_{\mathbf{a}} p_{\boldsymbol{\theta}}(\mathbf{x}^{(k)},\mathbf{a}\mid\mathbf{y}^{(k)}).
$$
EM can be seen as block coordinate ascent on this bound.
The EM algorithm is an iterative method that alternates between two steps, the E-Step (Expectation) and the M-Step (Maximization), until convergence. For the case of IBM Model 2 the E and M steps are instantiated as follows:
The E-Step calculates the distribution

$$
\pi(\mathbf{a}) = p_{\boldsymbol{\theta}}(\mathbf{a}\mid\mathbf{x},\mathbf{y})
$$

for the current parameters $\boldsymbol{\theta}$. For Model 2 this distribution has a very simple form:

$$
p_{\boldsymbol{\theta}}(\mathbf{a}\mid\mathbf{x},\mathbf{y}) = \prod_{i=1}^{l_x} p_{\boldsymbol{\theta}}(a_i\mid\mathbf{x},\mathbf{y})
\qquad\text{with}\qquad
p_{\boldsymbol{\theta}}(a_i = j\mid\mathbf{x},\mathbf{y}) = \frac{q(j\mid i,l_x,l_y)\; t(x_i\mid y_j)}{\sum_{j'=0}^{l_y} q(j'\mid i,l_x,l_y)\; t(x_i\mid y_{j'})}.
$$

Importantly, the distribution over alignments factorizes in a per-source-token fashion, and hence we only need to calculate, for each source token $i$ and each possible alignment $j$, the probability (or expectation) $p_{\boldsymbol{\theta}}(a_i = j\mid\mathbf{x},\mathbf{y})$.
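To make this concrete, the following is a minimal sketch of such an E-Step (not the implementation used in this chapter), assuming sentences are plain token sequences, each target sentence already starts with the NULL token, and t and q are given as functions:

```scala
/** Soft alignments pi(i)(j) = p(a_i = j | x, y) for one sentence pair. */
def expectationStep(source: IndexedSeq[String], target: IndexedSeq[String],
                    t: (String, String) => Double,
                    q: (Int, Int, Int, Int) => Double): IndexedSeq[IndexedSeq[Double]] = {
  val lx = source.length
  val ly = target.length - 1  // real target tokens, excluding NULL
  for (i <- source.indices) yield {
    // unnormalised score for aligning source token i to each target position j
    val scores = target.indices.map(j => q(j, i, lx, ly) * t(source(i), target(j)))
    val z = scores.sum          // per-token normaliser, thanks to the factorization
    scores.map(_ / z)
  }
}
```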
Before we look at the implementation of this algorithm we will set up the training data to be compatible with our formulation. This involves introducing a 'NULL' token to each target sentence to allow source tokens to remain unaligned. We also gather a few statistics that will be useful in our implementation later on. First we create some toy data:
We can now implement the E-Step.
Let us run this code:
You can play around with the initialization of the translation parameters $t$ to see how the alignments react to changes of the word-to-word translation probabilities.
The M-Step optimizes a weighted or expected version of the log-likelihood of the data, using the alignment distribution $\pi$ from the last E-Step:

$$
\boldsymbol{\theta}^* = \operatorname{argmax}_{\boldsymbol{\theta}} \sum_{k=1}^{n} \sum_{\mathbf{a}} \pi^{(k)}(\mathbf{a}) \log p_{\boldsymbol{\theta}}(\mathbf{x}^{(k)},\mathbf{a}\mid\mathbf{y}^{(k)}).
$$
The summing over hidden alignments seems daunting, but because $\pi$ factorizes as we discussed above, we again have a simple closed-form solution:

$$
t(x\mid y) = \frac{\sum_{k=1}^{n} \sum_{i,j} \pi^{(k)}(a_i = j)\,\delta(x^{(k)}_i, x)\,\delta(y^{(k)}_j, y)}{\sum_{k=1}^{n} \sum_{i,j} \pi^{(k)}(a_i = j)\,\delta(y^{(k)}_j, y)}
$$

where $\delta(u,v)$ is 1 if $u = v$ and 0 otherwise. The updates for the alignment parameters $q(j\mid i,l_x,l_y)$ are similar.
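A corresponding sketch of the M-Step for the translation parameters, taking one soft alignment matrix per sentence pair as produced by the expectationStep sketch above (the update for q is analogous and omitted here):

```scala
/** One M-Step: re-estimate t(x | y) from soft alignments pi, one matrix per sentence pair. */
def maximizationStep(pairs: Seq[(IndexedSeq[String], IndexedSeq[String])],
                     pi: Seq[IndexedSeq[IndexedSeq[Double]]]): Map[(String, String), Double] = {
  val counts = scala.collection.mutable.Map[(String, String), Double]().withDefaultValue(0.0)
  val totals = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
  for (((source, target), gamma) <- pairs.zip(pi); i <- source.indices; j <- target.indices) {
    counts((source(i), target(j))) += gamma(i)(j)  // expected count of y_j translated as x_i
    totals(target(j)) += gamma(i)(j)               // expected count of y_j being aligned at all
  }
  counts.map { case ((x, y), c) => (x, y) -> c / totals(y) }.toMap
}
```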
Let us implement the M-Step now. In this step we estimate the parameters $t$ and $q$ from a given set of (soft) alignments $\pi$.
Let us run one M-Step on the alignments we estimated earlier.
Notice that the algorithm already figured out that "is" is most likely translated to "ist". This is because "ist" is (softly) aligned with "is" in every sentence, whereas other German words only appear in a subset of the sentences.
We could already iteratively call eStep
and mStep
until convergence. However, a crucial question is how to initialize the model parameters for the first call to 'eStep'. So far we used a uniform initialization, but because the EM bound is non-convex and the algorithm's results usually depend significantly on initialization, a more informed starting point can be useful.
A common way to initialize EM for IBM Model 2 training is to first train the so-called IBM Model 1 using EM. This model really is an instantiation of Model 2 with a specific and fixed set of alignment parameters $q$. Instead of estimating $q$, it is set to assign uniform probability to all target positions (including the NULL position) for a given pair of lengths:

$$
q(j\mid i,l_x,l_y) = \frac{1}{l_y + 1}.
$$
After training, the translation parameters $t$ of Model 1 can be used to initialize EM for Model 2.
Training Model 1 using EM could suffer from the same initialization problem. Fortunately, it turns out that with $q$ fixed in this way, EM can be shown, under mild conditions, to converge to a global optimum, making IBM Model 1 robust to the choice of initialization.
Let us train IBM Model 1 now. This amounts to using our previous eStep
and mStep
methods, initializing $q$ as above and not updating it during mStep
.
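A sketch of such a Model 1 training loop, reusing the hypothetical expectationStep and maximizationStep functions sketched above and keeping q fixed to the uniform distribution:

```scala
/** Sketch of IBM Model 1 training: EM with the distortion parameters fixed to be
  * uniform over the ly + 1 target positions (including NULL). */
def trainModel1(pairs: Seq[(IndexedSeq[String], IndexedSeq[String])],
                iterations: Int = 20): Map[(String, String), Double] = {
  val uniformQ = (j: Int, i: Int, lx: Int, ly: Int) => 1.0 / (ly + 1)
  // constant table over all co-occurring word pairs; the E-Step normalises per token,
  // so this behaves like a uniform initialisation of t
  var t = (for ((x, y) <- pairs; xi <- x; yj <- y) yield (xi, yj) -> 1.0).toMap
  for (_ <- 1 to iterations) {
    val pi = pairs.map { case (x, y) =>
      expectationStep(x, y, (s, u) => t.getOrElse((s, u), 1e-6), uniformQ)
    }
    t = maximizationStep(pairs, pi)  // q stays fixed throughout
  }
  t
}
```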
You can see below that the alignments converge relatively quickly.
Let us have a look at the translation table.
We can also inspect the alignments generated during EM.
Now that we have a reasonable initial model we can use it to initialize EM for IBM Model 2. Here is the EM code in full.
Initializing with the IBM Model 1 result gives us:
For alignments we get:
Let us look at the distortion probabilities for a given source position and source and target lengths.
Decoding IBM Model 2 requires us to solve the $\operatorname{argmax}$ problem of the structured prediction recipe, this time using the Model 2 conditional probability with the hidden alignments marginalized out:

$$
\mathbf{y}^* = \operatorname{argmax}_{\mathbf{y}}\; p(\mathbf{y}) \sum_{\mathbf{a}} p(\mathbf{x},\mathbf{a}\mid\mathbf{y}).
$$
This nested argmax-and-sum is generally computationally very hard (see Park and Darwiche), and it is often replaced with the simpler problem of finding the best combination of target sequence and corresponding alignment:

$$
(\mathbf{y}^*,\mathbf{a}^*) = \operatorname{argmax}_{\mathbf{y},\mathbf{a}}\; p(\mathbf{y})\,p(\mathbf{x},\mathbf{a}\mid\mathbf{y}).
$$
As it turns out, for IBM Model 2 the sum can be calculated efficiently, and Wang and Waibel present a stack-based decoder that takes this into account.
However, both for simplicity of exposition and because for most real-world models this marginalization is not possible, we present a decoder that searches over both target and alignment. To simplify the algorithm further we assume that target and source sentences have the same length. Of course this is a major and unnecessary restriction, but it makes the algorithm easier to explain while maintaining the core mechanism. Here we only show the Scala code and refer the reader to our slides for an illustration of how stack-based and beam-based decoders work.
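The following is a minimal sketch of such a same-length beam decoder (a simplified stand-in, not this chapter's implementation). It appends one target word per step, aligns it to one still-uncovered source position, and keeps only the highest-scoring partial hypotheses in a beam; the candidate vocabulary, the bigram language model lm, the translation table t (both returning log probabilities) and the beam width are assumptions of this sketch, and the distortion probabilities q are left out of the score for clarity:

```scala
/** A partial hypothesis: target words so far, covered source positions, log score. */
case class Hypothesis(target: Vector[String], covered: Set[Int], score: Double)

/** Simplified same-length beam decoder: every appended target word is aligned to exactly
  * one previously uncovered source position. lm(prev, word) and t(x, y) return log probs. */
def beamDecode(source: IndexedSeq[String], vocab: Seq[String],
               t: (String, String) => Double,
               lm: (String, String) => Double,
               beamWidth: Int = 10): Hypothesis = {
  var beam = Seq(Hypothesis(Vector("NULL"), Set.empty[Int], 0.0))
  for (_ <- source.indices) {
    val expanded = for {
      hyp  <- beam
      word <- vocab
      pos  <- source.indices if !hyp.covered(pos)  // align the new word to an uncovered source token
    } yield Hypothesis(
      hyp.target :+ word,
      hyp.covered + pos,
      hyp.score + lm(hyp.target.last, word) + t(source(pos), word))
    beam = expanded.sortBy(-_.score).take(beamWidth)  // keep only the best partial hypotheses
  }
  beam.maxBy(_.score)
}
```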
Let us test this decoder on a simple sentence, using a uniform language model.
NULL | _ _ _ _ | 4 | 0.0 |
NULL a | _ _ ein _ | 3 | -2.3978952727983707 |
NULL man | groß _ _ _ | 3 | -Infinity |
NULL a man | _ _ ein Mann | 2 | -4.795790545596741 |
NULL a my | groß _ ein _ | 2 | -Infinity |
NULL a man is | _ ist ein Mann | 1 | -7.886832998955057 |
NULL a man man | _ ist ein Mann | 1 | -Infinity |
NULL a man is big | groß ist ein Mann | 0 | -10.284728271753428 |
NULL a man is tall | groß ist ein Mann | 0 | -10.284728271753428 |
There are currently two contenders for the most likely translation. This is because the translation model is uncertain about the translation of "groß", which can be "tall" in the context of the height of humans, and "big" in most other settings. To reduce this uncertainty we can use a better language model that captures the fact that "man is big" is a little less likely than "man is tall".
NULL | _ _ _ _ | 4 | 0.0 |
NULL a | _ _ ein _ | 3 | -2.4849066497880004 |
NULL man | groß _ _ _ | 3 | -Infinity |
NULL a man | _ _ ein Mann | 2 | -3.1780538303479458 |
NULL a tall | _ ist ein _ | 2 | -Infinity |
NULL a man is | _ ist ein Mann | 1 | -4.564348191467836 |
NULL a man NULL | _ ist ein Mann | 1 | -6.962243464266207 |
NULL a man is tall | groß ist ein Mann | 0 | -5.2574953720277815 |
NULL a man is big | groß ist ein Mann | 0 | -7.655390644826152 |
Note that "a man is tall" is also more likely in the Google N-grams corpus.
There are a few high-level messages to take away from this chapter.