Maximum Likelihood Estimation

The Maximum Likelihood Estimator (MLE) is one of the simplest, and often most intuitive, ways to determine the parameters of a probabilistic model based on some training data. Under favourable conditions the MLE has several useful properties. One such property is consistency: if you sample enough data from a distribution with certain parameters, the MLE will recover these parameters with arbitrary precision. In our structured prediction recipe, MLE can be seen as the most basic form of continuous optimization for parameter estimation.

In this section we will focus on MLE for discrete distributions and continuous parameters. We will assume a distribution $p_\theta(\mathbf{x})$ with $\mathbf{x}=(x_1,\ldots,x_n)$ that factorizes in the following way:

$$
p_\theta(\mathbf{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid \phi_i(\mathbf{x})) = \prod_{i=1}^{n} \theta_{x_i \mid \phi_i(\mathbf{x})} \tag{1}
$$

Here the functions $\phi_i$ provide a context to condition the probability of $x_i$ on. For example, in a trigram language model this could be the bigram history for word $i$, and hence $\phi_i(\mathbf{x})=(x_{i-1},x_{i-2})$. Notice that this function should not consider the variable $x_i$ itself.
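To make the factorization concrete, here is a minimal Python sketch of equation (1) for the trigram case. The names `phi`, `prob` and the parameter table `theta` are illustrative assumptions, not part of the text:

```python
def phi(x, i):
    # Context of word i: its bigram history (x[i-1], x[i-2]),
    # padded with a start symbol at the beginning of the sequence.
    pad = lambda j: x[j] if j >= 0 else "<s>"
    return (pad(i - 1), pad(i - 2))

def prob(x, theta):
    # Equation (1): p_theta(x) = prod_i theta[x_i | phi_i(x)].
    # theta maps (value, context) pairs to conditional probabilities.
    p = 1.0
    for i in range(len(x)):
        p *= theta.get((x[i], phi(x, i)), 0.0)
    return p
```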

The Maximum Likelihood estimate $\theta^*$ for this model, given some training data $\mathcal{D}_\text{train}=(\mathbf{x}_1,\ldots,\mathbf{x}_n)$, is defined as the solution to the following optimization problem:

$$
\theta^* = \arg\max_\theta \, p_\theta(\mathcal{D}_\text{train}) = \arg\max_\theta \, \log p_\theta(\mathcal{D}_\text{train}) \tag{2}
$$

Here the second equality stems from the monotonicity of the log function, and is useful because the log expression is easier to optimize. In words, the maximum likelihood estimate is the set of parameters that assigns maximal probability to the training sample.

As it turns out, the solution to equation (2) has a closed form: we can write the result as a direct function of $\mathcal{D}_\text{train}$, without the need for any iterative optimization algorithm. The result is simply:

$$
\theta^*_{x \mid \phi} = \frac{\#_{\mathcal{D}_\text{train}}(x,\phi)}{\#_{\mathcal{D}_\text{train}}(\phi)} \tag{3}
$$

where $\#_{\mathcal{D}_\text{train}}(x,\phi)$ is the number of times we have seen the value $x$ paired with the context $\phi$ in the data $\mathcal{D}_\text{train}$, and $\#_{\mathcal{D}_\text{train}}(\phi)$ the number of times we have seen the context $\phi$.
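As a sketch of how equation (3) could be implemented, the following counts events and contexts over a corpus of token sequences, reusing the hypothetical `phi` from above:

```python
from collections import Counter

def mle(corpus):
    # Estimate theta[x | phi] = #(x, phi) / #(phi), as in equation (3).
    pair_counts, context_counts = Counter(), Counter()
    for x in corpus:
        for i in range(len(x)):
            context = phi(x, i)
            pair_counts[x[i], context] += 1
            context_counts[context] += 1
    return {(word, context): count / context_counts[context]
            for (word, context), count in pair_counts.items()}
```

For example, `mle([["the", "cat", "sat"]])` assigns probability 1 to each word given its observed history, since every context in this tiny corpus occurs exactly once.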

Notice that in the same way we can represent the context of a variable using a function $\phi$, and hence map contexts to more coarse-grained equivalence classes, we can map the values $x_i$ to a more coarse-grained representation $\gamma(x_i)$. For example, in a language model we could decide to only care about the syntactic type (Verb, Noun, etc.) of a word and use $\gamma(x)=\text{syn-type}(x)$. In this case the MLE only changes in the way we count: instead of counting the times we see $x$ paired with the context $\phi$, we count how often we see $\gamma(x)$ paired with the context $\phi$. A sketch of this variant follows below.
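Assuming some user-supplied mapping `gamma` (e.g. a part-of-speech lookup), only the counted event changes relative to the estimator above:

```python
from collections import Counter

def mle_coarse(corpus, gamma):
    # As before, but count gamma(x_i) with the context instead of x_i itself.
    pair_counts, context_counts = Counter(), Counter()
    for x in corpus:
        for i in range(len(x)):
            context = phi(x, i)
            pair_counts[gamma(x[i]), context] += 1
            context_counts[context] += 1
    return {(g, context): count / context_counts[context]
            for (g, context), count in pair_counts.items()}
```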

Derivation

It is easy to derive the estimate for the discrete distributions described above. First let us reformulate the log-likelihood $L$ in terms of dataset counts:

$$
L(\mathcal{D}_\text{train}, \theta) = \log p_\theta(\mathcal{D}_\text{train}) = \sum_{x,\phi} \#_{\mathcal{D}_\text{train}}(x,\phi) \log \theta_{x \mid \phi} \tag{4}
$$
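In code, and under the same assumptions as the sketches above, this log-likelihood can be computed token by token; grouping the terms by event then yields exactly the count-weighted sum of equation (4):

```python
import math

def log_likelihood(corpus, theta):
    # Equation (4): sum_{x,phi} #(x,phi) * log theta[x | phi],
    # accumulated per token rather than via explicit counts.
    return sum(math.log(theta[x[i], phi(x, i)])
               for x in corpus
               for i in range(len(x)))
```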

Next, remember that we want, for a given $\phi$, the parameters $\theta_{\cdot \mid \phi}$ to represent a conditional probability distribution $p_\theta(\cdot \mid \phi)$. This requires positivity (which falls out naturally later) and, crucially, a normalization constraint. In particular, we need $\sum_x \theta_{x \mid \phi} = 1$.

We hence have to solve a constrained optimization problem. A standard technique for solving such problems relies on the notion of the Lagrangian $\mathcal{L}$: a version of the objective in which the constraints are added as soft constraints, weighted by the Lagrange multipliers $\lambda$:

$$
\mathcal{L}(\theta, \lambda) = L(\mathcal{D}_\text{train}, \theta) + \sum_\phi \lambda_\phi \Big(1 - \sum_x \theta_{x \mid \phi}\Big) \tag{5}
$$

If $\theta^*$ is a solution to the original optimization problem, then there exists a set of multipliers $\lambda^*$ such that $(\theta^*, \lambda^*)$ is a stationary point of $\mathcal{L}$. By setting $\nabla_\theta \mathcal{L} = 0$ and $\nabla_\lambda \mathcal{L} = 0$ we can find such points.

We first set $\nabla_\theta \mathcal{L} = 0$:

$$
\frac{\partial \mathcal{L}}{\partial \theta_{x \mid \phi}} = \#_{\mathcal{D}_\text{train}}(x,\phi) \frac{1}{\theta_{x \mid \phi}} - \lambda_\phi = 0 \tag{6}
$$

This means that each parameter needs to be proportional to the count of its corresponding event:

$$
\theta_{x \mid \phi} = \frac{\#_{\mathcal{D}_\text{train}}(x,\phi)}{\lambda_\phi} \tag{7}
$$

Setting $\nabla_\lambda \mathcal{L} = 0$ will recover the original constraints: $\sum_x \theta_{x \mid \phi} = 1$. Plugging the above expression for $\theta_{x \mid \phi}$ into this constraint gives us $\lambda_\phi = \sum_x \#_{\mathcal{D}_\text{train}}(x,\phi) = \#_{\mathcal{D}_\text{train}}(\phi)$, and hence equation (3). Notice that there is only a single stationary point, and hence the parameters $\theta$ at this point must be the optimal ones.
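As a quick numerical sanity check of this derivation (not part of the argument itself), one can verify for a single hypothetical context that the count-based estimate dominates randomly drawn alternative distributions:

```python
import numpy as np

# Hypothetical single context phi with three events seen 3, 2 and 1 times.
counts = np.array([3.0, 2.0, 1.0])
theta_mle = counts / counts.sum()    # equation (3); here lambda_phi = 6

def ll(theta):
    # Log-likelihood of equation (4), restricted to this one context.
    return float(np.sum(counts * np.log(theta)))

rng = np.random.default_rng(0)
for _ in range(1000):
    alt = rng.dirichlet(np.ones(3))  # random alternative distribution
    assert ll(alt) <= ll(theta_mle)
```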