Maximum Likelihood Estimation

The Maximum Likelihood Estimator (MLE) is one of the simplest, and often most intuitive, ways to determine the parameters of a probabilistic model based on some training data. Under favourable conditions the MLE has several useful properties. One such property is consistency: if you sample enough data from a distribution with certain parameters, the MLE will recover these parameters with arbitrary precision. In our structured prediction recipe, MLE can be seen as the most basic form of continuous optimization for parameter estimation.

In this section we will focus on MLE for discrete distributions and continuous parameters. We will assume a distribution $p_\theta(\mathbf{x})$ with $\mathbf{x}=(x_1,\ldots,x_n)$ that factorizes in the following way:

$$
p_\theta(\mathbf{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid \phi_i(\mathbf{x})) = \prod_{i=1}^{n} \theta_{x_i \mid \phi_i(\mathbf{x})} \tag{1}
$$

Here the functions $\phi_i$ provide a context to condition the probability of $x_i$ on. For example, in a trigram language model this could be the bigram history for word $i$, and hence $\phi_i(\mathbf{x})=(x_{i-1},x_{i-2})$. Notice that this function should not consider the variable $x_i$ itself.
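To make the factorization concrete, here is a minimal Python sketch of equation (1) for the trigram case. The names `phi`, `prob` and the parameter table `theta` are illustrative assumptions, not part of the text:

```python
def phi(x, i):
    # Context of word i: its bigram history (x[i-1], x[i-2]),
    # padded with a start symbol at the beginning of the sequence.
    pad = lambda j: x[j] if j >= 0 else "<s>"
    return (pad(i - 1), pad(i - 2))

def prob(x, theta):
    # Equation (1): p_theta(x) = prod_i theta[x_i | phi_i(x)].
    # theta maps (value, context) pairs to conditional probabilities.
    p = 1.0
    for i in range(len(x)):
        p *= theta.get((x[i], phi(x, i)), 0.0)
    return p
```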

The Maximum Likelihood estimate $\theta^*$ for this model, given some training data $\mathcal{D}_\text{train}=(\mathbf{x}_1,\ldots,\mathbf{x}_n)$, is defined as the solution to the following optimization problem:

$$
\theta^* = \arg\max_\theta \, p_\theta(\mathcal{D}_\text{train}) = \arg\max_\theta \, \log p_\theta(\mathcal{D}_\text{train}) \tag{2}
$$

Here the second equality stems from the monotonicity of the log function, and is useful because the log expression is easier to optimize. In words, the maximum likelihood estimate is the set of parameters that assigns maximal probability to the training sample.

As it turns out, the solution to equation (2) has a closed form: we can write the result as a direct function of $\mathcal{D}_\text{train}$, without the need for any iterative optimization algorithm. The result is simply:

$$
\theta^*_{x \mid \phi} = \frac{\#_{\mathcal{D}_\text{train}}(x,\phi)}{\#_{\mathcal{D}_\text{train}}(\phi)} \tag{3}
$$

where $\#_{\mathcal{D}_\text{train}}(x,\phi)$ is the number of times we have seen the value $x$ paired with the context $\phi$ in the data $\mathcal{D}_\text{train}$, and $\#_{\mathcal{D}_\text{train}}(\phi)$ the number of times we have seen the context $\phi$.
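As a sketch of how equation (3) could be implemented, the following counts events and contexts over a corpus of token sequences, reusing the hypothetical `phi` from above:

```python
from collections import Counter

def mle(corpus):
    # Estimate theta[x | phi] = #(x, phi) / #(phi), as in equation (3).
    pair_counts, context_counts = Counter(), Counter()
    for x in corpus:
        for i in range(len(x)):
            context = phi(x, i)
            pair_counts[x[i], context] += 1
            context_counts[context] += 1
    return {(word, context): count / context_counts[context]
            for (word, context), count in pair_counts.items()}
```

For example, `mle([["the", "cat", "sat"]])` assigns probability 1 to each word given its observed history, since every context in this tiny corpus occurs exactly once.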

Notice that in the same way we can represent the context of a variable using a function $\phi$, and hence map contexts to more coarse-grained equivalence classes, we can map the values $x_i$ to a more coarse-grained representation $\gamma(x_i)$. For example, in a language model we could decide to only care about the syntactic type (Verb, Noun, etc.) of a word and use $\gamma(x)=\text{syn-type}(x)$. In this case the MLE only changes in the way we count: instead of counting the times we see $x$ paired with the context $\phi$, we count how often we see $\gamma(x)$ paired with the context $\phi$. A sketch of this variant follows below.
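Assuming some user-supplied mapping `gamma` (e.g. a part-of-speech lookup), only the counted event changes relative to the estimator above:

```python
from collections import Counter

def mle_coarse(corpus, gamma):
    # As before, but count gamma(x_i) with the context instead of x_i itself.
    pair_counts, context_counts = Counter(), Counter()
    for x in corpus:
        for i in range(len(x)):
            context = phi(x, i)
            pair_counts[gamma(x[i]), context] += 1
            context_counts[context] += 1
    return {(g, context): count / context_counts[context]
            for (g, context), count in pair_counts.items()}
```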

Derivation

It is easy to derive the estimate for the discrete distributions described above. First let us reformulate the log-likelihood $L$ in terms of dataset counts:

$$
L(\mathcal{D}_\text{train}, \theta) = \log p_\theta(\mathcal{D}_\text{train}) = \sum_{x,\phi} \#_{\mathcal{D}_\text{train}}(x,\phi) \log \theta_{x \mid \phi} \tag{4}
$$
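In code, and under the same assumptions as the sketches above, this log-likelihood can be computed token by token; grouping the terms by event then yields exactly the count-weighted sum of equation (4):

```python
import math

def log_likelihood(corpus, theta):
    # Equation (4): sum_{x,phi} #(x,phi) * log theta[x | phi],
    # accumulated per token rather than via explicit counts.
    return sum(math.log(theta[x[i], phi(x, i)])
               for x in corpus
               for i in range(len(x)))
```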

Next, remember that we want, for a given $\phi$, the parameters $\theta_{\cdot \mid \phi}$ to represent a conditional probability distribution $p_\theta(\cdot \mid \phi)$. This requires positivity (which falls out naturally later) and, crucially, a normalization constraint. In particular, we need $\sum_x \theta_{x \mid \phi} = 1$.

We hence have to solve a constrained optimization problem. A standard technique for solving such problems relies on the notion of the Lagrangian $\mathcal{L}$: a version of the objective in which the constraints are added as soft constraints, weighted by the Lagrange multipliers $\lambda$:

$$
\mathcal{L}(\theta, \lambda) = L(\mathcal{D}_\text{train}, \theta) + \sum_\phi \lambda_\phi \Big(1 - \sum_x \theta_{x \mid \phi}\Big) \tag{5}
$$

If $\theta^*$ is a solution to the original optimization problem, then there exists a set of multipliers $\lambda^*$ such that $(\theta^*, \lambda^*)$ is a stationary point of $\mathcal{L}$. By setting $\nabla_\theta \mathcal{L} = 0$ and $\nabla_\lambda \mathcal{L} = 0$ we can find such points.

We first set $\nabla_\theta \mathcal{L} = 0$:

$$
\frac{\partial \mathcal{L}}{\partial \theta_{x \mid \phi}} = \#_{\mathcal{D}_\text{train}}(x,\phi) \frac{1}{\theta_{x \mid \phi}} - \lambda_\phi = 0 \tag{6}
$$

This means that each parameter needs to be proportional to the count of its corresponding event:

$$
\theta_{x \mid \phi} = \frac{\#_{\mathcal{D}_\text{train}}(x,\phi)}{\lambda_\phi} \tag{7}
$$

Setting $\nabla_\lambda \mathcal{L} = 0$ will recover the original constraints: $\sum_x \theta_{x \mid \phi} = 1$. Plugging the above expression for $\theta_{x \mid \phi}$ into this constraint gives us $\lambda_\phi = \sum_x \#_{\mathcal{D}_\text{train}}(x,\phi) = \#_{\mathcal{D}_\text{train}}(\phi)$, and hence equation (3). Notice that there is only a single stationary point, and hence the parameters $\theta$ at this point must be the optimal ones.
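As a quick numerical sanity check of this derivation (not part of the argument itself), one can verify for a single hypothetical context that the count-based estimate dominates randomly drawn alternative distributions:

```python
import numpy as np

# Hypothetical single context phi with three events seen 3, 2 and 1 times.
counts = np.array([3.0, 2.0, 1.0])
theta_mle = counts / counts.sum()    # equation (3); here lambda_phi = 6

def ll(theta):
    # Log-likelihood of equation (4), restricted to this one context.
    return float(np.sum(counts * np.log(theta)))

rng = np.random.default_rng(0)
for _ in range(1000):
    alt = rng.dirichlet(np.ones(3))  # random alternative distribution
    assert ll(alt) <= ll(theta_mle)
```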