Machine Translation (MT) is one of the most widely used NLP applications. It is an example of end-to-end NLP, and MT-like architectures can be found in many other applications.
We focus on word-based MT, which is no longer the state of the art but serves as a foundation and blueprint for modern approaches.
The structured prediction recipe applies: define a parametrised model $s_{\boldsymbol{\theta}}(\mathbf{t},\mathbf{s})$ measuring how well a target sentence $\mathbf{t}$ matches a source sentence $\mathbf{s}$, learn the parameters $\boldsymbol{\theta}$ from training data, and find $\mathbf{t}^* = \arg\max_{\mathbf{t}} s_{\boldsymbol{\theta}}(\mathbf{t},\mathbf{s})$ at test time.
A generative process (the speaker's brain): a target sentence $\mathbf{t}$ is first generated from a language model $p_{\boldsymbol{\theta}}(\mathbf{t})$ and then passed through a noisy channel $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})$ that turns it into the observed source sentence $\mathbf{s}$.
For a graphical view on the noisy channel see my slides.
The distribution over target sentences $p_{\boldsymbol{\theta}}(\mathbf{t})$ and the channel noise model $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})$ define a joint distribution $p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{t}) = p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{t})$. To use it for translation we operate it backwards: $\mathbf{t}^* = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t} \mid \mathbf{s}) = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{t})$.
For the structured prediction recipe this means setting $s_{\boldsymbol{\theta}}(\mathbf{t},\mathbf{s}) = p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{t})$.
In Machine Translation, $p_{\boldsymbol{\theta}}(\mathbf{t})$ is called the language model and $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})$ the translation model.
Here we focus on translation models.
The most straightforward translation model translates words one-by-one, in the order of appearance: $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t}) = \prod_{i=1}^{l_{\mathbf{s}}} \beta(s_i \mid t_i)$, where $\beta(s \mid t)$ is the probability of translating target word $t$ as source word $s$ (this assumes source and target sentences have the same length).
Using parallel data of paired source and target sentences, we can train the model using the Maximum Likelihood Estimator: $\beta(s \mid t) = \frac{\text{count}(t,s)}{\text{count}(t)}$.
Here $\text{count}(t,s)$ is the number of times we see target word $t$ translated as source word $s$, and $\text{count}(t)$ the number of times we see the target word $t$ in total.
Let us prepare some toy data to show how to train this naive model.
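The original data cell is not reproduced here; the following is a minimal stand-in, a few same-length German–English pairs chosen to match the examples used later in this chapter.

```scala
// Hypothetical toy parallel corpus (a stand-in, not the notebook's original data).
// Each pair has the same length and word-for-word correspondence, as the naive model requires.
case class SentencePair(source: IndexedSeq[String], target: IndexedSeq[String])

val toyData = IndexedSeq(
  SentencePair(IndexedSeq("ein", "Mann", "ist", "groß"), IndexedSeq("a", "man", "is", "tall")),
  SentencePair(IndexedSeq("der", "Mann", "ist", "groß"), IndexedSeq("the", "man", "is", "big")),
  SentencePair(IndexedSeq("mein", "Haus", "ist", "klein"), IndexedSeq("my", "house", "is", "small"))
)
```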
We can train the naive model as follows.
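A possible implementation, sketched under the assumption of the SentencePair representation above: it counts position-wise word pairs and normalizes by the target word counts.

```scala
// Sketch of the MLE for the naive model: beta(s|t) = count(t, s) / count(t),
// where counts are taken over position-wise (target, source) word pairs.
def trainNaive(data: Seq[SentencePair]): Map[(String, String), Double] = {
  val pairCounts = data
    .flatMap(pair => pair.target.zip(pair.source))   // (t, s) pairs, position by position
    .groupBy(identity)
    .map { case (ts, occ) => ts -> occ.size.toDouble }
  val targetCounts = data
    .flatMap(_.target)
    .groupBy(identity)
    .map { case (t, occ) => t -> occ.size.toDouble }
  pairCounts.map { case ((t, s), count) => (t, s) -> count / targetCounts(t) } // beta keyed by (t, s)
}
```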
Let us train on the toy dataset:
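Running the sketch above on the stand-in corpus:

```scala
// Train on the toy data and print the estimated translation table, highest probabilities first.
val naiveBeta = trainNaive(toyData)
naiveBeta.toSeq.sortBy(-_._2).foreach { case ((t, s), p) =>
  println(f"beta($s%-6s | $t%-6s) = $p%.2f")
}
```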
Translation is often called decoding and means finding $\mathbf{t}^* = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t} \mid \mathbf{s}) = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t})$.
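With a uniform language model this reduces to picking, for each source word independently, the target word with the highest translation probability. A small sketch, assuming the naiveBeta table trained above:

```scala
// Naive decoding: independently choose argmax_t beta(s|t) for every source word.
def decodeNaive(source: Seq[String], beta: Map[(String, String), Double]): Seq[String] =
  source.map { s =>
    beta.collect { case ((t, src), p) if src == s => t -> p } // candidate target words for s
        .maxBy(_._2)._1
  }

decodeNaive(Seq("mein", "Haus", "ist", "klein"), naiveBeta) // on the toy data: my house is small
```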
The naive model is broken in several ways. Most severely, it ignores the fact that word order can differ and still yield (roughly) the same meaning.
IBM Model 2 is one of the most influential translation models, even though these days it is only indirectly used in actual MT systems, for example to initialize translation and alignment models.
The core difference of Model 2 is the introduction of latent auxiliary variables, the alignments $\mathbf{a} = (a_1,\ldots,a_{l_{\mathbf{s}}})$: the alignment $a_i = j$ indicates that source word $s_i$ was translated from target word $t_j$.
Below you see a simple example of an alignment.
Model 2 defines a distribution $p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})$. A translation model can be derived by marginalizing out the alignments: $p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t}) = \sum_{\mathbf{a}} p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})$.
IBM Model 2 has two sets of parameters: the translation table $\boldsymbol{\beta}$ with word-to-word translation probabilities $\beta(s \mid t)$, and the distortion table $\boldsymbol{\alpha}$ with probabilities $\alpha(j \mid i, l_{\mathbf{s}}, l_{\mathbf{t}})$ of aligning source position $i$ to target position $j$, given the source length $l_{\mathbf{s}}$ and target length $l_{\mathbf{t}}$.
Model 2 defines a conditional distribution over source sentences and alignments, conditioned on a target sentence and a desired source sentence length: $p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t}, l_{\mathbf{s}}) = \prod_{i=1}^{l_{\mathbf{s}}} \alpha(a_i \mid i, l_{\mathbf{s}}, l_{\mathbf{t}})\,\beta(s_i \mid t_{a_i})$.
Training Model 2 is challenging: the alignments $\mathbf{a}$ are latent, that is, they are not observed in the training data, so we cannot simply count word-to-word translations as we did for the naive model.
The Expectation Maximization (EM) Algorithm is a standard training algorithm for partially observed data. It maximizes a lower bound of the log-likelihood: for any distribution $q(\mathbf{a})$ over the latent alignments, $\log p_{\boldsymbol{\theta}}(\mathbf{s} \mid \mathbf{t}) \geq \sum_{\mathbf{a}} q(\mathbf{a}) \log \frac{p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})}{q(\mathbf{a})}$.
EM can be seen as block coordinate ascent on this bound.
EM for Model 2 iterates between an E-Step, which computes a distribution $q(\mathbf{a})$ over alignments given the current parameters, and an M-Step, which re-estimates the parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ from the (soft) alignments produced by the E-Step.
First we create some toy data:
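Again the original data cell is not shown; as a stand-in we reuse the SentencePair type from above and prepend the special NULL token to every target sentence, so that source words can remain "unaligned" by aligning to NULL.

```scala
// Hypothetical training data for the IBM models (a stand-in). Every target sentence
// starts with NULL; "ist"/"is" occurs in every pair, other words only in some pairs.
val trainingData = IndexedSeq(
  SentencePair(IndexedSeq("ein", "Mann", "ist", "groß"), IndexedSeq("NULL", "a", "man", "is", "tall")),
  SentencePair(IndexedSeq("der", "Mann", "ist", "groß"), IndexedSeq("NULL", "the", "man", "is", "big")),
  SentencePair(IndexedSeq("mein", "Haus", "ist", "klein"), IndexedSeq("NULL", "my", "house", "is", "small")),
  SentencePair(IndexedSeq("das", "Haus", "ist", "groß"), IndexedSeq("NULL", "the", "house", "is", "big"))
)
```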
We can now implement the E-Step.
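A sketch of the E-Step under the representation above: for every sentence pair $n$ and source position $i$ it computes $q_n(a_i = j) \propto \alpha(j \mid i, l_{\mathbf{s}}, l_{\mathbf{t}})\,\beta(s_i \mid t_j)$, normalized over the target positions $j$. The function signatures are assumptions, not the notebook's originals.

```scala
// E-Step: returns q, where q(n)(i)(j) is the posterior probability that source
// word i of sentence pair n is aligned to target word j.
def eStep(data: Seq[SentencePair],
          beta: (String, String) => Double,        // beta(s | t)
          alpha: (Int, Int, Int, Int) => Double    // alpha(j | i, lSource, lTarget)
         ): Seq[IndexedSeq[IndexedSeq[Double]]] =
  data.map { pair =>
    val (lS, lT) = (pair.source.length, pair.target.length)
    pair.source.indices.map { i =>
      val unnormalized =
        pair.target.indices.map(j => alpha(j, i, lS, lT) * beta(pair.source(i), pair.target(j)))
      val z = unnormalized.sum
      unnormalized.map(p => if (z > 0.0) p / z else 0.0)
    }
  }
```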
Let us run this code:
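One way to run it, with uniform distortions and a nearly uniform translation table that only slightly prefers $\beta(\text{ist} \mid \text{is})$ (the concrete numbers are arbitrary):

```scala
// Uniform distortions and an almost-uniform initial translation table.
val uniformAlpha = (j: Int, i: Int, lS: Int, lT: Int) => 1.0 / lT
val initBeta = (s: String, t: String) => if (s == "ist" && t == "is") 0.2 else 0.1

val q = eStep(trainingData, initBeta, uniformAlpha)
// posterior alignment distribution of the first source word of the first sentence pair
println(q(0)(0))
```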
You can play around with the initialization of $\boldsymbol{\beta}$ to see how the alignments react to changes of the word-to-word translation probabilities.
In the M-Step the parameters are re-estimated from the soft counts of the E-Step; for the translation table this amounts to $\beta(s \mid t) \propto \sum_{n} \sum_{i,j} q_n(a_i = j)\,\delta(s^{(n)}_i, s)\,\delta(t^{(n)}_j, t)$, where $\delta(x,y)$ is 1 if $x = y$ and 0 otherwise.
Let us implement the M-Step now. In this step we estimate the parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ from a given set of (soft) alignments $q$.
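A sketch of the M-Step for the translation table (the distortion table is re-estimated analogously; see the full Model 2 EM sketch further below). Again the names and types are assumptions.

```scala
import scala.collection.mutable

// M-Step: turn the soft alignments q into expected counts and normalize them,
// beta(s|t) = softCount(t, s) / softCount(t).
def mStep(data: Seq[SentencePair],
          q: Seq[IndexedSeq[IndexedSeq[Double]]]): Map[(String, String), Double] = {
  val pairCounts = mutable.Map.empty[(String, String), Double].withDefaultValue(0.0)
  val targetCounts = mutable.Map.empty[String, Double].withDefaultValue(0.0)
  for ((pair, qn) <- data.zip(q); i <- pair.source.indices; j <- pair.target.indices) {
    pairCounts((pair.target(j), pair.source(i))) += qn(i)(j)
    targetCounts(pair.target(j)) += qn(i)(j)
  }
  pairCounts.map { case ((t, s), c) => (t, s) -> c / targetCounts(t) }.toMap
}
```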
Let us run one M-Step on the alignments we estimated earlier.
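Running one M-Step on the soft alignments from above:

```scala
// One M-Step on the q estimated earlier; print the most confident entries.
val betaAfterOneStep = mStep(trainingData, q)
betaAfterOneStep.toSeq.sortBy(-_._2).take(5).foreach { case ((t, s), p) =>
  println(f"beta($s%-6s | $t%-6s) = $p%.2f")
}
```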
Notice that the algorithm has already figured out that "is" is most likely translated to "ist": "ist" is (softly) aligned with "is" in every sentence, whereas other German words only appear in a subset of the sentences.
Due to non-convexity of the EM bound, good initialization is crucial.
We can train IBM Model 1, which fixes the distortion parameters to be uniform over the target positions: $\alpha(j \mid i, l_{\mathbf{s}}, l_{\mathbf{t}}) = \frac{1}{l_{\mathbf{t}}}$.
After training, the resulting translation table $\boldsymbol{\beta}$ can be used to initialize Model 2.
EM for IBM Model 1 converges to a global optimum.
Let us train IBM Model 1 now. This amounts to using our previous eStep and mStep methods, keeping the distortion parameters fixed to uniform values and only re-estimating the translation table $\boldsymbol{\beta}$.
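A possible training loop under the sketches above:

```scala
// EM for IBM Model 1: alternate eStep and mStep with distortions fixed to 1/lT.
def uniformDistortion(j: Int, i: Int, lS: Int, lT: Int): Double = 1.0 / lT

def trainModel1(data: Seq[SentencePair], iterations: Int): Map[(String, String), Double] = {
  var beta = Map.empty[(String, String), Double].withDefaultValue(1.0) // flat (unnormalized) start
  for (_ <- 1 to iterations) {
    val q = eStep(data, (s, t) => beta((t, s)), uniformDistortion)
    beta = mStep(data, q).withDefaultValue(0.0)
  }
  beta
}

val model1Beta = trainModel1(trainingData, iterations = 20)
```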
You can see below that the alignments converge relatively quickly.
Let us have a look at the translation table.
We can also inspect the alignments generated during EM.
Now that we have a reasonable initial model we can use it to initialize EM for IBM Model 2. Here is the EM code in full.
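The original code cell is not reproduced; the sketch below captures the same idea, re-estimating both tables in each iteration and starting from the Model 1 translation table (model1Beta from the sketch above).

```scala
import scala.collection.mutable

// EM for IBM Model 2: the E-Step is as before; the M-Step now also re-estimates
// the distortion table alpha(j | i, lS, lT) from the same soft counts.
def trainModel2(data: Seq[SentencePair],
                initBeta: Map[(String, String), Double],
                iterations: Int): (Map[(String, String), Double], Map[(Int, Int, Int, Int), Double]) = {
  var beta = initBeta.withDefaultValue(0.0)
  var alpha = Map.empty[(Int, Int, Int, Int), Double]  // empty means "still uniform"
  def alphaFun(j: Int, i: Int, lS: Int, lT: Int): Double = alpha.getOrElse((j, i, lS, lT), 1.0 / lT)

  for (_ <- 1 to iterations) {
    val q = eStep(data, (s, t) => beta((t, s)), alphaFun)
    beta = mStep(data, q).withDefaultValue(0.0)        // translation table update, as before
    // distortion update: normalized expected counts of target position j given (i, lS, lT)
    val jointCounts = mutable.Map.empty[(Int, Int, Int, Int), Double].withDefaultValue(0.0)
    val marginCounts = mutable.Map.empty[(Int, Int, Int), Double].withDefaultValue(0.0)
    for ((pair, qn) <- data.zip(q); i <- pair.source.indices; j <- pair.target.indices) {
      val (lS, lT) = (pair.source.length, pair.target.length)
      jointCounts((j, i, lS, lT)) += qn(i)(j)
      marginCounts((i, lS, lT)) += qn(i)(j)
    }
    alpha = jointCounts.map { case ((j, i, lS, lT), c) => (j, i, lS, lT) -> c / marginCounts((i, lS, lT)) }.toMap
  }
  (beta, alpha)
}

val (model2Beta, model2Alpha) = trainModel2(trainingData, model1Beta, iterations = 20)
```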
Initializing with the IBM Model 1 result gives us:
For alignments we get:
Let us look at the distortion probabilities for a given source position and source and target lengths.
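For example, one can print the learned distortion distribution for a fixed source position and length pair (the positions and lengths below refer to the stand-in data, where targets have length 5 including NULL):

```scala
// Distortion probabilities alpha(j | i = 2, lS = 4, lT = 5): where does the third
// source word tend to align in 4-word sources with 5-word (NULL-prefixed) targets?
model2Alpha.collect { case ((j, 2, 4, 5), p) => j -> p }
  .toSeq.sortBy(_._1)
  .foreach { case (j, p) => println(f"alpha(j=$j | i=2, lS=4, lT=5) = $p%.2f") }
```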
Decoding IBM Model 2 ideally means finding $\arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t} \mid \mathbf{s}) = \arg\max_{\mathbf{t}} p_{\boldsymbol{\theta}}(\mathbf{t}) \sum_{\mathbf{a}} p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})$.
This nested argmax/sum is generally computationally very hard. It is easier to jointly maximize over targets and alignments: $\arg\max_{\mathbf{t},\mathbf{a}} p_{\boldsymbol{\theta}}(\mathbf{t})\,p_{\boldsymbol{\theta}}(\mathbf{s},\mathbf{a} \mid \mathbf{t})$.
A simplified same-length decoder in Scala, using a beam:
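The original decoder code is not reproduced here; the sketch below follows the same idea under a few assumptions: the hypothesis starts with NULL, every appended target word is aligned with exactly one still-uncovered source position (hence "same length"), the distortion probabilities are ignored for brevity, and a beam keeps only the best partial hypotheses per step.

```scala
// A simplified same-length beam decoder (a sketch; the interface names are assumptions).
case class Hypothesis(target: List[String],          // target words so far, newest first
                      covered: Map[Int, String],     // covered source position -> its source word
                      logProb: Double)

def beamDecode(source: IndexedSeq[String],
               targetVocab: Seq[String],
               beta: (String, String) => Double,      // beta(s | t)
               lm: (String, List[String]) => Double,  // p(next target word | history)
               beamSize: Int = 2): Hypothesis = {
  var beam = Seq(Hypothesis(List("NULL"), Map.empty, 0.0))
  for (_ <- source.indices) {
    val expanded = for {
      hyp <- beam
      word <- targetVocab                               // candidate next target word
      i <- source.indices if !hyp.covered.contains(i)   // uncovered source position it aligns to
    } yield Hypothesis(
      word :: hyp.target,
      hyp.covered + (i -> source(i)),
      hyp.logProb + math.log(beta(source(i), word)) + math.log(lm(word, hyp.target))
    )
    beam = expanded.sortBy(-_.logProb).take(beamSize)   // prune to the beam size
  }
  beam.maxBy(_.logProb)
}
```

Keeping covered as a map from source position to source word makes it easy to print the partially covered source sentence, as in the beam traces below.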
Let us test this decoder on a simple sentence, using a uniform language model.
| target hypothesis | covered source words | uncovered | log probability |
|---|---|---|---|
| NULL | _ _ _ _ | 4 | 0.0 |
| NULL a | _ _ ein _ | 3 | -2.3978952727983707 |
| NULL man | groß _ _ _ | 3 | -Infinity |
| NULL a man | _ _ ein Mann | 2 | -4.795790545596741 |
| NULL a my | groß _ ein _ | 2 | -Infinity |
| NULL a man is | _ ist ein Mann | 1 | -7.886832998955057 |
| NULL a man man | _ ist ein Mann | 1 | -Infinity |
| NULL a man is big | groß ist ein Mann | 0 | -10.284728271753428 |
| NULL a man is tall | groß ist ein Mann | 0 | -10.284728271753428 |
We can reduce ambiguity by using a better language model:
| target hypothesis | covered source words | uncovered | log probability |
|---|---|---|---|
| NULL | _ _ _ _ | 4 | 0.0 |
| NULL a | _ _ ein _ | 3 | -2.4849066497880004 |
| NULL man | groß _ _ _ | 3 | -Infinity |
| NULL a man | _ _ ein Mann | 2 | -3.1780538303479458 |
| NULL a tall | _ ist ein _ | 2 | -Infinity |
| NULL a man is | _ ist ein Mann | 1 | -4.564348191467836 |
| NULL a man NULL | _ ist ein Mann | 1 | -6.962243464266207 |
| NULL a man is tall | groß ist ein Mann | 0 | -5.2574953720277815 |
| NULL a man is big | groß ist ein Mann | 0 | -7.655390644826152 |
Note that "a man is tall" is also more likely in the Google N-grams corpus.
There are a few high-level messages to take away from this chapter.