The Maximum Likelihood Estimator (MLE) is one of the simplest, and often most intuitive, ways to determine the parameters of a probabilistic model based on some training data. Under favourable conditions the MLE has several useful properties. One such property is consistency: if you sample enough data from a distribution with certain parameters, the MLE will recover these parameters with arbitrary precision. In our structured prediction recipe, MLE can be seen as the most basic form of continuous optimization for parameter estimation.
In this section we will focus on MLE for discrete distributions and continuous parameters. We will assume a distribution over sequences $\mathbf{x} = (x_1, \ldots, x_n)$ that factorizes in the following way:

$$
p_\theta(\mathbf{x}) = \prod_{i=1}^{n} \theta_{x_i \mid c(\mathbf{x}, i)}
$$
Here the function $c(\mathbf{x}, i)$ provides a context to condition the probability of $x_i$ on. For example, in a trigram language model this could be the bigram history for word $x_i$, and hence $c(\mathbf{x}, i) = (x_{i-2}, x_{i-1})$. Notice that this function must not consider the variable $x_i$ itself.
The Maximum Likelihood estimate for this model, given some training data $D$, is defined as the solution to the following optimization problem:

$$
\theta^* = \arg\max_\theta \prod_{\mathbf{x} \in D} p_\theta(\mathbf{x}) = \arg\max_\theta \sum_{\mathbf{x} \in D} \log p_\theta(\mathbf{x})
$$
Here the second equality stems from the monotonicity of the $\log$ function, and is useful because the logarithmic expression is easier to optimize. In words, the maximum likelihood estimate consists of the parameters that assign maximal probability to the training sample.
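Besides being easier to differentiate, the log formulation is also more robust numerically. The following minimal sketch (with made-up per-token probabilities, purely for illustration) shows that multiplying many small probabilities underflows to zero in floating point, while summing their logarithms stays finite:

```python
import math

# Hypothetical per-token probabilities of a long sequence under some model.
probs = [0.01] * 200  # 200 tokens, each with probability 0.01

likelihood = 1.0
for p in probs:
    likelihood *= p  # underflows to 0.0 long before the loop ends

log_likelihood = sum(math.log(p) for p in probs)  # stays finite

print(likelihood)      # 0.0 due to floating-point underflow
print(log_likelihood)  # -921.03..., i.e. 200 * log(0.01)
```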
As it turns out, the solution for $\theta^*$ has a closed form: we can write the result as a direct function of the data $D$ without the need for any iterative optimization algorithm. The result is simply:

$$
\theta^*_{x \mid c} = \frac{\#_D(x, c)}{\#_D(c)}
$$
where $\#_D(x, c)$ is the number of times we have seen the value $x$ paired with the context $c$ in the data $D$, and $\#_D(c)$ the number of times we have seen the context $c$.
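As a concrete illustration, here is a minimal sketch of this count-based estimate for a bigram language model, where the context of a word is simply its predecessor, i.e. $c(\mathbf{x}, i) = x_{i-1}$. The toy corpus and function names are our own choices for this example:

```python
from collections import Counter

def mle_bigram(corpus):
    """Closed-form MLE for a bigram model: theta[x|c] = #(x, c) / #(c)."""
    pair_counts = Counter()     # #_D(x, c): word x seen after context c
    context_counts = Counter()  # #_D(c): context c seen
    for sentence in corpus:
        padded = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            pair_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return {(prev, word): n / context_counts[prev]
            for (prev, word), n in pair_counts.items()}

# A tiny made-up corpus for illustration.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
theta = mle_bigram(corpus)
print(theta[("the", "cat")])  # 2/3: "cat" followed "the" in 2 of 3 cases
print(theta[("cat", "sat")])  # 1/2
```

Note that the resulting conditional distributions are normalized by construction: for each context $c$, the pair counts sum to $\#_D(c)$, so the probabilities $\theta_{x \mid c}$ sum to one.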
Notice that in the same way we can represent the context of a variable using a function $c(\mathbf{x}, i)$, and hence map contexts to more coarse-grained equivalence classes, we can map the values $x_i$ to a more coarse-grained representation $f(x_i)$. For example, in a language model we could decide to only care about the syntactic type (Verb, Noun, etc.) of a word and use $f(x_i) = \text{pos}(x_i)$. In this case the MLE only changes in the way we count: instead of counting the times we see $x$ paired with the context $c$, we count how often we see $f(x)$ paired with the context $c$.
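In code, this amounts to a one-line change in the counting loop. The sketch below uses a made-up `pos` lookup table as the value map $f$; in practice $f$ could be any deterministic function of the value:

```python
from collections import Counter

# A made-up value map f: word -> part-of-speech tag, for illustration only.
pos = {"the": "Det", "cat": "Noun", "dog": "Noun", "sat": "Verb", "ran": "Verb"}

def mle_coarse(corpus, f):
    """MLE over coarse-grained values: theta[f(x)|c] = #(f(x), c) / #(c)."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        for prev, word in zip(["<s>"] + sentence, sentence):
            pair_counts[(prev, f(word))] += 1  # count f(x) instead of x
            context_counts[prev] += 1
    return {k: n / context_counts[k[0]] for k, n in pair_counts.items()}

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
theta = mle_coarse(corpus, lambda w: pos[w])
print(theta[("the", "Noun")])  # 1.0: "the" is always followed by a noun here
```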
It is easy to derive the estimate for the discrete distributions described above. First, let us reformulate the log-likelihood in terms of dataset counts:

$$
\sum_{\mathbf{x} \in D} \log p_\theta(\mathbf{x}) = \sum_{x, c} \#_D(x, c) \log \theta_{x \mid c}
$$
Next, remember that we want, for a given context $c$, the parameters $\theta_{\cdot \mid c}$ to represent a conditional probability distribution over values. This requires positivity (which falls out naturally later) and, crucially, a normalization constraint. In particular, we need $\sum_x \theta_{x \mid c} = 1$ for every context $c$.
We hence have to solve a constrained optimization problem. A standard technique to solve such problems relies on the notion of the Lagrangian $L$: a version of the objective in which the constraints are added as soft penalties weighted by the Lagrange multipliers $\lambda_c$:

$$
L(\theta, \boldsymbol{\lambda}) = \sum_{x, c} \#_D(x, c) \log \theta_{x \mid c} - \sum_c \lambda_c \left( \sum_x \theta_{x \mid c} - 1 \right)
$$
If $\theta^*$ is a solution to the original optimization problem, then there exists a set of multipliers $\boldsymbol{\lambda}^*$ such that $(\theta^*, \boldsymbol{\lambda}^*)$ is a stationary point of $L$. By setting $\frac{\partial L}{\partial \theta_{x \mid c}} = 0$ and $\frac{\partial L}{\partial \lambda_c} = 0$ we can find such points.
We first set $\frac{\partial L}{\partial \theta_{x \mid c}} = 0$:

$$
\frac{\partial L}{\partial \theta_{x \mid c}} = \frac{\#_D(x, c)}{\theta_{x \mid c}} - \lambda_c = 0
$$
This means that each parameter needs to be proportional to the count of its corresponding event:

$$
\theta_{x \mid c} = \frac{\#_D(x, c)}{\lambda_c}
$$
Setting $\frac{\partial L}{\partial \lambda_c} = 0$ will recover the original constraints: $\sum_x \theta_{x \mid c} = 1$. Plugging the above expression for $\theta_{x \mid c}$ into this constraint gives $\lambda_c = \sum_x \#_D(x, c) = \#_D(c)$, and hence the closed-form solution $\theta^*_{x \mid c} = \#_D(x, c) / \#_D(c)$ stated earlier. Notice that there is only a single stationary point, and hence the parameters at this point need to be the optimal ones.
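As a sanity check of this derivation, the following sketch maximizes the count-weighted log-likelihood for a single context numerically under the normalization constraint, and compares the result to the closed-form relative frequencies. The counts are made up, and the use of scipy's SLSQP solver is just one convenient choice:

```python
import numpy as np
from scipy.optimize import minimize

counts = np.array([3.0, 1.0, 6.0])  # made-up counts #_D(x, c) for one context c

def neg_log_likelihood(theta):
    # Negative of sum_x #_D(x, c) * log(theta_{x|c}); minimized by SLSQP.
    return -np.sum(counts * np.log(theta))

result = minimize(
    neg_log_likelihood,
    x0=np.full(3, 1.0 / 3.0),  # start from the uniform distribution
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * 3,  # keep log() well-defined
    constraints=[{"type": "eq", "fun": lambda t: np.sum(t) - 1.0}],
)

print(result.x)               # approx. [0.3, 0.1, 0.6]
print(counts / counts.sum())  # closed-form MLE: [0.3, 0.1, 0.6]
```

As expected, the numerical optimum matches the relative frequencies, in line with the closed-form solution derived above.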