Word-based Machine Translation

MT is one of the most widely used NLP applications. It is an example of end-to-end NLP, and MT-like architectures can be found in many other applications.

We focus on word-based MT, which is no longer state-of-the-art but remains a foundation and blueprint for modern mechanisms.

MT as Structured Prediction

Define a parametrised model $s_\theta(\mathbf{t},\mathbf{s})$ that measures how well a target sentence $\mathbf{t}$ matches a source sentence $\mathbf{s}$, learn the parameters $\theta$ from training data, and find

$$
\arg\max_{\mathbf{t}} \; s_\theta(\mathbf{t},\mathbf{s}) \tag{1}
$$

Noisy Channel Model for MT

A generative process (the speaker's brain):

  • Generate a target sentence $\mathbf{t}$ according to $p(\mathbf{t})$.
  • Transmit $\mathbf{t}$ through a noisy channel $p(\mathbf{s}|\mathbf{t})$ to generate the source sentence $\mathbf{s}$.

For a graphical view of the noisy channel, see my slides.

The distribution over target sentences and the channel noise model define a joint distribution $p(\mathbf{s},\mathbf{t})=p(\mathbf{t})\,p(\mathbf{s}|\mathbf{t})$. To use it for translation we operate it backwards:

$$
\mathbf{t}^* = \arg\max_{\mathbf{t}} p(\mathbf{t}|\mathbf{s}) = \arg\max_{\mathbf{t}} p(\mathbf{t})\,p(\mathbf{s}|\mathbf{t}). \tag{2}
$$

For the structured prediction recipe this means setting

$$
s_\theta(\mathbf{t},\mathbf{s}) = p(\mathbf{t})\,p(\mathbf{s}|\mathbf{t}).
$$

In Machine Translation:

  • $p(\mathbf{t})$ is usually referred to as the language model.
  • $p(\mathbf{s}|\mathbf{t})$ is referred to as the translation model.

Here we focus on translation models.

A Naive Baseline Translation Model

The most straightforward translation model translates words one-by-one, in the order of appearance:

$$
p^{\text{Naive}}_\theta(\mathbf{s}|\mathbf{t}) = \prod_{i=1}^{\text{length}(\mathbf{s})} \theta_{s_i,t_i}
$$

where $\theta_{s,t}$ is the probability of translating target word $t$ as source word $s$. The table $\theta$ is often referred to as the translation table.
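To make this concrete, here is a minimal Scala sketch of the naive probability (hypothetical code, not the notebook's original cell); `theta` stands for the translation table as a map from (source word, target word) pairs to probabilities:

```scala
// Naive word-by-word translation probability: p(s|t) = product over i of theta(s_i, t_i).
// `theta` is a hypothetical translation table; unseen word pairs get probability 0.
def pNaive(source: Seq[String], target: Seq[String],
           theta: Map[(String, String), Double]): Double =
  (source zip target).map { case (s, t) => theta.getOrElse((s, t), 0.0) }.product
```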

Using parallel data of paired source and target sentences, we can train the model using the Maximum Likelihood Estimator:

$$
\theta_{s,t} = \frac{\#_{D_{\text{train}}}(s,t)}{\#_{D_{\text{train}}}(t)}
$$

Here $\#_{D_{\text{train}}}(s,t)$ is the number of times we see target word $t$ translated as source word $s$, and $\#_{D_{\text{train}}}(t)$ is the number of times we see the target word $t$ in total.

Training the Naive Model

Let us prepare some toy data to show how to train this naive model.
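The concrete sentences below are an illustrative assumption rather than the original notebook's dataset; note that some pairs deliberately differ in word order, which will matter later.

```scala
// A sentence pair holds a source (German) and a target (English) sentence as token sequences.
case class SentencePair(source: Seq[String], target: Seq[String])

// Hypothetical toy parallel corpus.
val train = Seq(
  SentencePair(Seq("das", "Haus", "ist", "klein"),  Seq("the", "house", "is", "small")),
  SentencePair(Seq("mein", "Haus", "ist", "klein"), Seq("my", "house", "is", "small")),
  SentencePair(Seq("klein", "ist", "das", "Haus"),  Seq("the", "house", "is", "small")),
  SentencePair(Seq("groß", "ist", "mein", "Mann"),  Seq("my", "man", "is", "big"))
)
train.length
```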

4

We can train the naive model as follows.
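A minimal sketch of such a training routine (hypothetical `trainNaive`, using the `SentencePair` representation from above): count how often source word $s$ appears at the same position as target word $t$, and normalize by the count of $t$.

```scala
// MLE for the naive model: theta(s, t) = #(s at the same position as t) / #(t).
def trainNaive(data: Seq[SentencePair]): Map[(String, String), Double] = {
  val pairCounts   = scala.collection.mutable.Map[(String, String), Double]().withDefaultValue(0.0)
  val targetCounts = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
  for (SentencePair(source, target) <- data; (s, t) <- source zip target) {
    pairCounts((s, t)) += 1.0
    targetCounts(t)    += 1.0
  }
  pairCounts.map { case ((s, t), count) => (s, t) -> count / targetCounts(t) }.toMap
}
```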

Let us train on the toy dataset:
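For example, a hypothetical call that trains on the toy data and inspects the probabilities for the target word "is":

```scala
// Train on the toy data and look at theta(., "is").
val theta = trainNaive(train)
theta.collect { case ((s, "is"), p) => s -> p }
```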

[Bar chart of the learned translation table for the target word "is": θ(das, is) = 0.25, θ(mein, is) = 0.25, θ(ist, is) = 0.50]

Decoding with the Naive Model

Translation is often called decoding and means:

$$
\arg\max_{\mathbf{t}} \; p(\mathbf{t})\,p(\mathbf{s}|\mathbf{t})
$$
How can we achieve this for the Naive Model?
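Below is a minimal sketch, assuming a unigram language model: with $p(\mathbf{t}) = \prod_i \mathrm{lm}(t_i)$ the argmax decomposes into independent per-word decisions, so we can pick the best target word for each source word separately. The line after the code is the decoded output from the original notebook's run.

```scala
// Naive decoding under a unigram language model (an assumption made here):
// for each source word, pick the target word maximizing lm(t) * theta(s, t).
def decodeNaive(source: Seq[String],
                theta: Map[(String, String), Double],
                lm: String => Double,
                targetVocab: Set[String]): Seq[String] =
  source.map(s => targetVocab.maxBy(t => lm(t) * theta.getOrElse((s, t), 0.0)))
```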

my house the small

The naive model is broken in several ways. Most severely, it ignores the fact that word order can differ and still yield (roughly) the same meaning.

IBM Model 2

IBM Model 2 is one of the most influential translation models, even though these days it is only indirectly used in actual MT systems, for example to initialize translation and alignment models.

Alignment

The core difference of Model 2 is the introduction of latent auxiliary variables a:

  • $a_i \in [0 \ldots \text{length}(\mathbf{t})]$ for each source token index $i \in [1 \ldots \text{length}(\mathbf{s})]$.
  • $a_i = j$ means that source token $i$ is aligned with target token $j$.
  • $a_i$ can be $0$, pointing to a special NULL token, which allows source words to have no real target counterpart.

Below you see a simple example of an alignment.

[Figure: word alignment of the German source "das Haus ist klein" to the target "NULL the house is small"]
[Figure: word alignment of a Japanese source sentence ("… 小さいです") to the target "NULL The house is small"]

Model 2 defines a distribution p(s,a|t). A translation model can be derived by marginalizing out the alignments

$$
p(\mathbf{s}|\mathbf{t}) = \sum_{\mathbf{a}} p(\mathbf{s},\mathbf{a}|\mathbf{t}).
$$

Model Parametrization

IBM Model 2 has two sets of parameters $\theta = (\alpha, \beta)$:

  • $\alpha(s|t)$: the probability of translating target word $t$ into source word $s$.
  • $\beta(j|i,l_t,l_s)$: the probability of aligning source token $i$ with target token $j$, conditioned on the target length $l_t$ and the source length $l_s$.

Model 2 defines a conditional distribution over source sentences and alignments, conditioned on a target sentence and a desired source sentence length

$$
p^{\text{IBM2}}_\theta(s_1 \ldots s_{l_s}, a_1 \ldots a_{l_s} \mid t_1 \ldots t_{l_t}, l_s) = \prod_{i=1}^{l_s} \alpha(s_i \mid t_{a_i}) \, \beta(a_i \mid i, l_t, l_s) \tag{3}
$$
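As a concrete (hypothetical) Scala representation, we can store $\alpha$ and $\beta$ as maps and compute the joint probability of a source sentence and an alignment directly from equation (3):

```scala
// IBM Model 2 parameters: alpha(s | t) stored as ((s, t) -> prob) and
// beta(j | i, l_t, l_s) stored as ((j, i, l_t, l_s) -> prob).
case class IBM2(alpha: Map[(String, String), Double],
                beta: Map[(Int, Int, Int, Int), Double]) {
  val Null = "NULL" // target position 0 is the special NULL token

  // p(s, a | t, l_s) = product over source positions i of alpha(s_i | t_{a_i}) * beta(a_i | i, l_t, l_s)
  def prob(source: Seq[String], alignment: Seq[Int], target: Seq[String]): Double = {
    val (lt, ls) = (target.length, source.length)
    (source zip alignment).zipWithIndex.map { case ((s, a), idx) =>
      val i = idx + 1                             // source positions are 1-based
      val t = if (a == 0) Null else target(a - 1) // alignment value 0 points to NULL
      alpha.getOrElse((s, t), 0.0) * beta.getOrElse((a, i, lt, ls), 0.0)
    }.product
  }
}
```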

Training IBM Model 2 with the EM Algorithm

Training Model 2 is challenging:

  • we have gold sentence alignments
  • but no gold word alignments!

The Expectation Maximization (EM) Algorithm is a standard training algorithm for partially observed data. It maximizes a lower bound of the likelihood

$$
\sum_{(\mathbf{t}_i,\mathbf{s}_i) \in D_{\text{train}}} \log p^{\text{IBM2}}_\theta(\mathbf{s}_i \mid \mathbf{t}_i) = \sum_{(\mathbf{t}_i,\mathbf{s}_i) \in D_{\text{train}}} \log \sum_{\mathbf{a}} p^{\text{IBM2}}_\theta(\mathbf{s}_i,\mathbf{a} \mid \mathbf{t}_i)
$$

EM can be seen as block coordinate ascent on this lower bound.

EM for Model 2 iterates between:

  • E-Step: given a current set of parameters θ, calculate the expectations π of the latent alignment variables under the model pθIBM2 — this amounts to estimating a soft alignment for each sentence.
  • M-Step: given a training set of soft alignments $\pi$, find new parameters $\theta$ that maximize the log-likelihood of this (weighted) training set. This amounts to soft counting.

E-Step

  • Calculate distribution over alignments

$$
\pi(\mathbf{a} \mid \mathbf{s},\mathbf{t}) = p^{\text{IBM2}}_\theta(\mathbf{a} \mid \mathbf{s},\mathbf{t})
$$

  • Distribution factorizes:

$$
\pi(\mathbf{a} \mid \mathbf{s},\mathbf{t}) = \prod_{i=1}^{l_s} \pi(a_i \mid \mathbf{s},\mathbf{t},i) = \prod_{i=1}^{l_s} \frac{\alpha(s_i \mid t_{a_i}) \, \beta(a_i \mid i,l_t,l_s)}{\sum_{j=0}^{l_t} \alpha(s_i \mid t_j) \, \beta(j \mid i,l_t,l_s)}
$$

First we create some toy data:
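The sentences below are again an illustrative assumption, not the original notebook's data; what matters is that word order differs across pairs and that "ist"/"is" appear in every pair.

```scala
// Hypothetical toy data for the alignment models.
val alignTrain = Seq(
  SentencePair(Seq("klein", "ist", "das", "Haus"),          Seq("the", "house", "is", "small")),
  SentencePair(Seq("das", "Haus", "ist", "klein"),          Seq("the", "house", "is", "small")),
  SentencePair(Seq("groß", "ist", "mein", "Mann"),          Seq("my", "man", "is", "big")),
  SentencePair(Seq("mein", "Gebäude", "ist", "lang"),       Seq("my", "building", "is", "long")),
  SentencePair(Seq("ein", "Haus", "ist", "ein", "Gebäude"), Seq("a", "house", "is", "a", "building"))
)
```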

We can now implement the E-Step.
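A minimal sketch of the E-Step (hypothetical `eStep`, following the factorization above): for each sentence pair and each source position $i$ it returns a normalized distribution over target positions $j = 0 \ldots l_t$, where $j = 0$ is NULL. To run it we need initial parameters; a uniform initialization is given in the IBM Model 1 section below.

```scala
// E-Step: posterior pi(a_i = j | s, t) for every source position i, proportional to
// alpha(s_i | t_j) * beta(j | i, l_t, l_s) and normalized over j = 0 .. l_t.
def eStep(model: IBM2, data: Seq[SentencePair]): Seq[Seq[Seq[Double]]] =
  data.map { case SentencePair(source, target) =>
    val (lt, ls) = (target.length, source.length)
    source.zipWithIndex.map { case (s, idx) =>
      val i = idx + 1
      val scores = (0 to lt).map { j =>
        val t = if (j == 0) model.Null else target(j - 1)
        model.alpha.getOrElse((s, t), 0.0) * model.beta.getOrElse((j, i, lt, ls), 0.0)
      }
      val z = scores.sum
      if (z == 0.0) Seq.fill(lt + 1)(1.0 / (lt + 1)) else scores.map(_ / z).toSeq
    }
  }
```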

Let us run this code:

[Figure: soft alignment matrix between the source "das Haus ist klein" and the target "NULL the house is small" estimated by the E-Step]

You can play around with the initialization of $\alpha$ and $\beta$ to see how the alignments react to changes of the word-to-word translation and distortion probabilities.

M-Step

  • Calculate weighted maximum likelihood estimate:

$$
\theta^* = \arg\max_{\theta} \sum_{(\mathbf{t},\mathbf{s}) \in D_{\text{train}}} \sum_{\mathbf{a}} \pi(\mathbf{a} \mid \mathbf{t},\mathbf{s}) \log p^{\text{IBM2}}_\theta(\mathbf{s},\mathbf{a} \mid \mathbf{t})
$$

  • Closed form solution (weighted counting):

$$
\alpha(s \mid t) = \frac{\sum_{(\mathbf{t},\mathbf{s})} \sum_{i=1}^{l_s} \sum_{j=0}^{l_t} \pi(j \mid i) \, \delta(s,s_i) \, \delta(t,t_j)}{\sum_{(\mathbf{t},\mathbf{s})} \sum_{j=0}^{l_t} \delta(t,t_j)}
$$

where $\delta(x,y)$ is $1$ if $x=y$ and $0$ otherwise.

Let us implement the M-Step now. In this step we estimate new parameters $\theta$ from a given set of soft alignments $\pi$.
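A matching sketch of the M-Step (hypothetical `mStep`): accumulate soft counts from the E-Step posteriors and normalize them to obtain new $\alpha$ and $\beta$ tables.

```scala
// M-Step: re-estimate alpha and beta by weighted (soft) counting.
def mStep(data: Seq[SentencePair], pi: Seq[Seq[Seq[Double]]]): IBM2 = {
  import scala.collection.mutable
  val countST = mutable.Map[(String, String), Double]().withDefaultValue(0.0)      // soft count(s, t)
  val countT  = mutable.Map[String, Double]().withDefaultValue(0.0)                // soft count(t)
  val countJI = mutable.Map[(Int, Int, Int, Int), Double]().withDefaultValue(0.0)  // soft count(j, i, l_t, l_s)
  val countI  = mutable.Map[(Int, Int, Int), Double]().withDefaultValue(0.0)       // soft count(i, l_t, l_s)
  for ((SentencePair(source, target), post) <- data zip pi) {
    val (lt, ls) = (target.length, source.length)
    for (idx <- source.indices; j <- 0 to lt) {
      val i = idx + 1
      val t = if (j == 0) "NULL" else target(j - 1)
      val p = post(idx)(j)
      countST((source(idx), t)) += p
      countT(t)                 += p
      countJI((j, i, lt, ls))   += p
      countI((i, lt, ls))       += p
    }
  }
  IBM2(
    countST.map { case ((s, t), c) => (s, t) -> c / countT(t) }.toMap,
    countJI.map { case ((j, i, lt, ls), c) => (j, i, lt, ls) -> c / countI((i, lt, ls)) }.toMap
  )
}
```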

Let us run one M-Step on the alignments we estimated earlier.

[Bar chart of α(·|is) after one M-Step: Haus 0.10, ein 0.05, groß 0.10, mein 0.05, das 0.15, klein 0.10, lang 0.05, Gebäude 0.10, Mann 0.05, ist 0.25]

Notice that the algorithm has already figured out that "is" is most likely translated as "ist". This is because "ist" is (softly) aligned with "is" in every sentence, whereas other German words only appear in a subset of the sentences.

Initialization (IBM Model 1)

Due to non-convexity of the EM bound, good initialization is crucial.

We can train IBM Model 1, which fixes the distortion parameters $\beta$ to be uniform:

$$
\beta(a_i \mid i, l_t, l_s) = \frac{1}{l_t + 1}
$$

After training, the parameters $\theta$ of Model 1 can be used to initialize EM for Model 2.

EM for IBM Model 1 converges to a global optimum.

Let us train IBM Model 1 now. This amounts to using our previous eStep and mStep methods, initializing $\beta$ as above and not updating it during mStep.
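A sketch of this procedure; the helpers `uniformBeta`, `uniformAlpha` and `trainModel1` are hypothetical names, and $\beta$ is simply restored to the uniform table after every M-Step.

```scala
// Uniform distortion table: beta(j | i, l_t, l_s) = 1 / (l_t + 1) for all positions seen in the data.
def uniformBeta(data: Seq[SentencePair]): Map[(Int, Int, Int, Int), Double] =
  (for (SentencePair(source, target) <- data;
        i <- 1 to source.length;
        j <- 0 to target.length)
    yield (j, i, target.length, source.length) -> 1.0 / (target.length + 1)).toMap

// Uniform translation table over the source vocabulary, for every target word (including NULL).
def uniformAlpha(data: Seq[SentencePair]): Map[(String, String), Double] = {
  val sources = data.flatMap(_.source).toSet
  val targets = data.flatMap(_.target).toSet + "NULL"
  (for (t <- targets; s <- sources) yield (s, t) -> 1.0 / sources.size).toMap
}

// IBM Model 1 training: EM with beta held fixed at the uniform distribution.
def trainModel1(data: Seq[SentencePair], iterations: Int): IBM2 = {
  val beta  = uniformBeta(data)
  var model = IBM2(uniformAlpha(data), beta)
  for (_ <- 1 to iterations) {
    val pi = eStep(model, data)               // E-Step: soft alignments
    model = mStep(data, pi).copy(beta = beta) // M-Step, then restore the uniform beta
  }
  model
}

val model1 = trainModel1(alignTrain, 100)
```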

You can see below that the alignments converge relatively quickly.

[Plot: convergence of the alignments over EM iterations; x axis: iteration (0–99), y axis: 0.00–0.20]
Exercise 1
Can you think of other reasonable measures for convergence? Hint: consider the formal derivation of EM.

Let us have a look at the translation table.

[Bar chart of α(·|house) after IBM Model 1 training: mein 0.00, das 0.00, Haus 0.50, ist 0.00, klein 0.50]

We can also inspect the alignments generated during EM.

[Figure: soft alignment matrix between "das Haus ist klein" and "NULL the house is small" after IBM Model 1 training]

Training IBM Model 2

Now that we have a reasonable initial model we can use it to initialize EM for IBM Model 2. Here is the EM code in full.
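A sketch of what that loop may look like, reusing the hypothetical `eStep` and `mStep` from above and starting from the Model 1 parameters; the iteration count is arbitrary.

```scala
// EM for IBM Model 2: alternate E-Step and M-Step, this time updating both alpha and beta.
def trainModel2(data: Seq[SentencePair], init: IBM2, iterations: Int): IBM2 = {
  var model = init
  for (_ <- 1 to iterations) {
    val pi = eStep(model, data) // E-Step: soft alignments under the current model
    model = mStep(data, pi)     // M-Step: re-estimate alpha and beta by soft counting
  }
  model
}

val model2 = trainModel2(alignTrain, model1, 50)
```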

Initializing with the IBM Model 1 result gives us:

[Bar chart of α(·|house) after IBM Model 2 training: mein 0.00, das 0.00, Haus 1.00, ist 0.00, klein 0.00]

For alignments we get:

[Figure: alignment matrix between "das Haus ist klein" and "NULL the house is small" under IBM Model 2]

Let us look at the distortion probabilities for a given source position and source and target lengths.

[Bar chart of distortion probabilities β(j|i,l_t,l_s) over target positions j = 0…4: all probability mass (1.00) on j = 4]

Decoding for IBM Model 2

Decoding IBM Model 2 ideally means:

$$
\arg\max_{\mathbf{t}} p^{\text{IBM2}}_\theta(\mathbf{s} \mid \mathbf{t}) = \arg\max_{\mathbf{t}} \sum_{\mathbf{a}} p^{\text{IBM2}}_\theta(\mathbf{s},\mathbf{a} \mid \mathbf{t}) \tag{4}
$$

This nested argmax/sum is generally computationally very hard. It is easier to instead maximize jointly over targets and alignments:

$$
\arg\max_{\mathbf{t},\mathbf{a}} p^{\text{IBM2}}_\theta(\mathbf{s},\mathbf{a} \mid \mathbf{t}) \tag{5}
$$

A simplified same-length decoder in Scala, using a beam:
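The original decoder is not reproduced here; the following is a minimal sketch under simplifying assumptions: the target has the same length as the source, each step appends one target word and aligns it to one still-uncovered source position, distortion probabilities are ignored, and `lm` is a hypothetical function that returns a log-score for the extended target prefix.

```scala
// A partial hypothesis: target prefix (starting with NULL), covered source positions, log score.
case class Hypothesis(target: Seq[String], covered: Set[Int], logScore: Double)

// Same-length beam decoder sketch: extend each hypothesis in the beam with one target word
// aligned to one uncovered source position, keep only the best `beamSize` hypotheses per step.
def decode(source: Seq[String],
           model: IBM2,
           lm: Seq[String] => Double, // log-score contribution of the last word given the prefix
           targetVocab: Set[String],
           beamSize: Int = 2): Seq[Hypothesis] = {
  var beam = Seq(Hypothesis(Seq(model.Null), Set.empty[Int], 0.0))
  for (_ <- source.indices) {
    val expanded = for {
      hyp <- beam
      t   <- targetVocab.toSeq
      i   <- source.indices if !hyp.covered(i)
    } yield Hypothesis(
      hyp.target :+ t,
      hyp.covered + i,
      hyp.logScore + lm(hyp.target :+ t) + math.log(model.alpha.getOrElse((source(i), t), 1e-10))
    )
    beam = expanded.sortBy(-_.logScore).take(beamSize)
  }
  beam
}
```

A real implementation would also score the distortion probabilities β and allow alignment to NULL; both are left out to keep the sketch short.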

Let us test this decoder on a simple sentence, using a uniform language model.

NULL _ _ _ _ 4 0.0
NULL a _ _ ein _ 3 -2.3978952727983707
NULL man groß _ _ _ 3 -Infinity
NULL a man _ _ ein Mann 2 -4.795790545596741
NULL a my groß _ ein _ 2 -Infinity
NULL a man is _ ist ein Mann 1 -7.886832998955057
NULL a man man _ ist ein Mann 1 -Infinity
NULL a man is big groß ist ein Mann 0 -10.284728271753428
NULL a man is tall groß ist ein Mann 0 -10.284728271753428

We can reduce ambiguity by using a better language model:

NULL _ _ _ _ 4 0.0
NULL a _ _ ein _ 3 -2.4849066497880004
NULL man groß _ _ _ 3 -Infinity
NULL a man _ _ ein Mann 2 -3.1780538303479458
NULL a tall _ ist ein _ 2 -Infinity
NULL a man is _ ist ein Mann 1 -4.564348191467836
NULL a man NULL _ ist ein Mann 1 -6.962243464266207
NULL a man is tall groß ist ein Mann 0 -5.2574953720277815
NULL a man is big groß ist ein Mann 0 -7.655390644826152

Note that "a man is tall" is also more likely in the Google N-grams corpus.

Summary

There are a few high level messages to take away from this chapter.

  • MT is an instance of the structured prediction recipe
  • The noisy channel is one modeling framework
  • Word-based MT is a foundation and blueprint for more complex models
  • Training with EM
  • NLP tricks:
    • introducing latent alignment variables to simplify the problem
    • decoding with beams

Background Material