Text Classification

In many applications we need to automatically classify some input text with respect to a set of classes or labels. For example,

  • for information retrieval it is useful to classify documents into a set of topics, such as "sports" or "business",
  • for sentiment analysis we classify tweets as "positive" or "negative", and
  • for spam filters we need to distinguish between ham and spam.

Text Classification as Structured Prediction

We can formalize text classification as the simplest instance of structured prediction: the input space $X$ consists of sequences of words, and the output space $Y$ is a set of labels such as $Y=\{\text{sports},\text{business}\}$. At a high level, our goal is to define a model $s_\theta(x,y)$ that assigns a high score to the label $y$ that fits the text $x$, and lower scores otherwise. The model is parametrized by $\theta$, and we learn these parameters from a training set $D_{\text{train}}$ of $(x,y)$ pairs. To classify a text $x$ we solve the maximization problem $\operatorname{argmax}_y s_\theta(x,y)$, which is trivial when the number of classes is small.
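
To make this concrete, here is a minimal sketch of the classification-as-argmax view in Python. The label set, keyword lists, and the scoring rule are illustrative placeholders rather than part of any model defined in this section; they only show the shape of the $\operatorname{argmax}_y s_\theta(x,y)$ decision.

```python
# Minimal sketch of classification as argmax over a small label set.
# The keyword-based scoring rule below is a toy placeholder for s_theta(x, y).
LABELS = ["sports", "business"]
KEYWORDS = {"sports": {"game", "ball"}, "business": {"market", "stocks"}}

def score(x, y):
    """Toy stand-in for s_theta(x, y): count label-specific keywords in x."""
    return sum(1.0 for word in x if word in KEYWORDS[y])

def predict(x):
    """Solve argmax_y score(x, y) by enumerating the (small) label set."""
    return max(LABELS, key=lambda y: score(x, y))

print(predict(["the", "game", "was", "close"]))  # -> "sports"
```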

In the following we will present two typical approaches to text classification: Naive Bayes and discriminative linear classifiers. We will also see that both can in fact use the same model structure, and differ only in how the model parameters are trained.

Naive Bayes

One of the most widely used approaches to text classification relies on the so-called Naive Bayes (NB) model. In NB we use a distribution $p^{\text{NB}}_\theta$ for $s_\theta$. In particular, we use the a posteriori probability of a label $y$ given the input text $x$ as the score for that label given the text.

(1)   $s_\theta(x,y) = p^{\text{NB}}_\theta(y \mid x)$

By Bayes' rule we get

(2)   $p^{\text{NB}}_\theta(y \mid x) = \frac{p^{\text{NB}}_\theta(x \mid y)\, p^{\text{NB}}_\theta(y)}{p^{\text{NB}}_\theta(x)}$

and when an input $x$ is fixed we can focus on

(3)   $p^{\text{NB}}_\theta(x,y) = p^{\text{NB}}_\theta(x \mid y)\, p^{\text{NB}}_\theta(y)$

because in this case $p^{\text{NB}}_\theta(x)$ is a constant factor. In the above, $p^{\text{NB}}_\theta(x \mid y)$ is the likelihood and $p^{\text{NB}}_\theta(y)$ is the prior.

The "naivity" of NB stems from a certain conditional independence assumption we make for the likelihood pθNB(x|y). Note that conditional independence of two events a and b given a third event c requires that p(a,b|c)=p(a|c)p(b|c). In particular, for the likelihood in NB we have:

(4)   $p^{\text{NB}}_\theta(x \mid y) = \prod_{i=1}^{\text{length}(x)} p^{\text{NB}}_\theta(x_i \mid y)$

That is, NB makes the assumption that the observed words are independent of each other when conditioned on the label $y$.

Parametrization

The NB model has the parameters $\theta = (\alpha, \beta)$ where

$p^{\text{NB}}_\theta(f \mid y) = \alpha_{f,y} \qquad p^{\text{NB}}_\theta(y) = \beta_y$

for each word (feature) $f$ and label $y$. That is, $\alpha$ captures the per-class word probabilities, and $\beta$ the class priors.
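
As a small sketch of how these parameters are used, the code below (with hypothetical toy values for $\alpha$ and $\beta$) computes the log of the joint probability in equation (3) under the independence assumption (4).

```python
import math

# Hypothetical toy parameters: alpha[y][w] = p(w | y), beta[y] = p(y).
alpha = {"sports": {"game": 0.5, "ball": 0.5},
         "business": {"market": 0.6, "stocks": 0.4}}
beta = {"sports": 0.5, "business": 0.5}

def log_joint(x, y, unseen=1e-10):
    """log p(x, y) = log p(y) + sum_i log p(x_i | y) under the NB assumption."""
    result = math.log(beta[y])
    for word in x:
        result += math.log(alpha[y].get(word, unseen))  # tiny floor for unseen words
    return result

print(log_joint(["game", "ball"], "sports"))
```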

Training the Naive Bayes Model

The NB model can be trained using Maximum Likelihood Estimation (MLE). This amounts to setting

$\alpha_{x,y} = \frac{\#_{D_{\text{train}}}(x,y)}{\#_{D_{\text{train}}}(y)} \qquad \beta_y = \frac{\#_{D_{\text{train}}}(y)}{|D_{\text{train}}|}$
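
A sketch of this counting procedure in Python is given below. It assumes the training set is a list of (word list, label) pairs, and it follows one common reading of the counts: $\#_{D_{\text{train}}}(y)$ in the denominator of $\alpha_{x,y}$ is taken to be the number of word tokens observed with label $y$, so that the per-class word probabilities normalize.

```python
from collections import defaultdict

def train_nb(train):
    """MLE by counting; `train` is a list of (list_of_words, label) pairs."""
    word_label = defaultdict(float)        # #(x, y): occurrences of word x with label y
    tokens_per_label = defaultdict(float)  # word tokens observed with label y
    docs_per_label = defaultdict(float)    # documents with label y
    for words, label in train:
        docs_per_label[label] += 1
        for word in words:
            word_label[(word, label)] += 1
            tokens_per_label[label] += 1
    alpha = {(w, y): c / tokens_per_label[y] for (w, y), c in word_label.items()}
    beta = {y: c / len(train) for y, c in docs_per_label.items()}
    return alpha, beta

alpha, beta = train_nb([(["game", "ball"], "sports"),
                        (["market", "stocks"], "business")])
print(beta["sports"], alpha[("game", "sports")])  # 0.5 0.5
```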

Log-linear Representation

It will be convenient to represent the NB model in log-linear form. This form lets us understand the MLE-trained NB model simply as one particular way of training a (log-)linear model. It also makes the approach comparable to other approaches such as Conditional Log-Likelihood (aka logistic regression or maximum entropy) or SVM-style training, which can operate on the same parametrization but estimate the parameters using different objectives. Finally, the log-linear representation will make it easy to implement different training algorithms that work with the same representation and hence can easily be plugged in and out.

In log-linear form the joint NB distribution $p^{\text{NB}}_\theta(x,y)$ can be written as:

(5)   $p^{\text{NB}}_\theta(x,y) = \exp\left(\sum_{i \in I} f_i(x,y)\, w_i\right) = \exp \langle \mathbf{f}(x,y), \mathbf{w} \rangle$

Here the $f_i$ are so-called (joint) feature functions. The index set $I$ is the union of all labels $y' \in Y$ and all word-label pairs $(x', y')$, and the corresponding feature functions are defined as follows:

$f_{y'}(x,y) = \delta(y, y') \qquad f_{x',y'}(x,y) = \delta(y, y') \sum_{i=1}^{\text{length}(x)} \delta(x', x_i)$

In words, the first feature function $f_{y'}$ returns 1 if the input label $y$ equals $y'$, and 0 otherwise. The second feature function $f_{x',y'}$ returns the number of times the word $x'$ appears in $x$ if $y$ equals $y'$, and 0 otherwise.
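
The sketch below implements this joint feature map as a sparse dictionary; the tuple-valued indices ('label', y') and ('word', x', y') are one possible encoding, chosen here for illustration.

```python
from collections import defaultdict

def feats(x, y):
    """Sparse joint feature map f(x, y) from equation (5)."""
    f = defaultdict(float)
    f[("label", y)] = 1.0              # f_{y'}(x, y) = delta(y, y')
    for word in x:
        f[("word", word, y)] += 1.0    # counts how often word x' occurs in x, under label y
    return f

print(dict(feats(["game", "game", "ball"], "sports")))
# {('label', 'sports'): 1.0, ('word', 'game', 'sports'): 2.0, ('word', 'ball', 'sports'): 1.0}
```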

If one now sets the weights according to

$w_{y'} = \log \beta_{y'} \qquad w_{x',y'} = \log \alpha_{x',y'}$

it is easy to show that equation (5) is equivalent to the original NB formulation in equation (3).
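
Spelling this out: for a given pair $(x,y)$ only the features with $y' = y$ are active, so the exponent in equation (5) reduces to

$\langle \mathbf{f}(x,y), \mathbf{w} \rangle = w_y + \sum_{i=1}^{\text{length}(x)} w_{x_i,y} = \log \beta_y + \sum_{i=1}^{\text{length}(x)} \log \alpha_{x_i,y}$

and therefore

$\exp \langle \mathbf{f}(x,y), \mathbf{w} \rangle = \beta_y \prod_{i=1}^{\text{length}(x)} \alpha_{x_i,y} = p^{\text{NB}}_\theta(y)\, p^{\text{NB}}_\theta(x \mid y),$

which is exactly equation (3) combined with the independence assumption (4).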

Feature Templates

Note that in the above case we have $|Y|$ label features and $|V| \times |Y|$ word-label features, where $V$ is the vocabulary. The corresponding two types of features are often called feature templates that generate sets of actual feature functions. That is, $f_{y'}$ is a feature template with one template argument, $y'$, and $f_{x',y'}$ is a template with two arguments. It is common to augment the feature templates with a template name to distinguish templates that have the same argument space but different semantics. For example, we may have a template that combines the label of a document with the number of times word stems have been seen. For words $x'$ that are their own stem this would create duplicate indices, and hence we use $f_{\text{word},x',y'}$ and $f_{\text{stem},x',y'}$ to distinguish the two types of features.
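
A sketch of such named templates, using a deliberately crude, hypothetical stemmer, could look as follows; the template name in the feature index keeps the 'word' and 'stem' features apart even when a word is its own stem.

```python
from collections import defaultdict

def stem(word):
    """Toy stemmer, for illustration only."""
    return word.rstrip("s")

def feats(x, y):
    f = defaultdict(float)
    f[("label", y)] = 1.0
    for word in x:
        f[("word", word, y)] += 1.0        # template f_{word, x', y'}
        f[("stem", stem(word), y)] += 1.0  # template f_{stem, x', y'}
    return f

print(dict(feats(["games", "game"], "sports")))
# The two 'word' features stay separate while the shared 'stem' feature fires twice.
```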

Conditional Model

The conditional probability $p^{\text{NB}}_\theta(y \mid x)$ can be written as

$p^{\text{NB}}_\theta(y \mid x) = \frac{\exp \langle \mathbf{f}(x,y), \mathbf{w} \rangle}{\sum_{y'} \exp \langle \mathbf{f}(x,y'), \mathbf{w} \rangle} = \exp\left(s_\theta(x,y) - A_{\theta,x}\right)$

where

$s_\theta(x,y) = \langle \mathbf{f}(x,y), \mathbf{w} \rangle \qquad A_{\theta,x} = \log \sum_{y'} \exp \langle \mathbf{f}(x,y'), \mathbf{w} \rangle$

and $A_{\theta,x}$ is the log-partition function.
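
A direct (not numerically stabilized) sketch of this conditional model, reusing the feature map from above with some arbitrary, hypothetical weights:

```python
import math
from collections import defaultdict

LABELS = ["sports", "business"]

def feats(x, y):
    f = defaultdict(float)
    f[("label", y)] = 1.0
    for word in x:
        f[("word", word, y)] += 1.0
    return f

def score(weights, x, y):
    """s_theta(x, y) = <f(x, y), w>, summed over the active features only."""
    return sum(weights[k] * v for k, v in feats(x, y).items())

def conditional(weights, x, y):
    """p(y | x) = exp(s(x, y) - A_x) with A_x = log sum_y' exp s(x, y')."""
    scores = {label: score(weights, x, label) for label in LABELS}
    log_partition = math.log(sum(math.exp(s) for s in scores.values()))
    return math.exp(scores[y] - log_partition)

weights = defaultdict(float, {("word", "game", "sports"): 1.0})
print(conditional(weights, ["game", "ball"], "sports"))  # about 0.73
```

In practice one would compute the log-partition term with a max-shifted log-sum-exp to avoid numerical overflow for large scores.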

Joint vs Input Features

In contrast to the standard formulation of linear and log-linear models, we define feature functions over input-output pairs instead of over inputs only. That is, we could also have used a representation in which $\langle \mathbf{f}(x,y), \mathbf{w} \rangle$ is replaced by $\langle \mathbf{f}(x), \mathbf{w}_y \rangle$, where each class $y$ receives its own weight vector.

The benefit of the joint feature function is two-fold. First, it enables us to easily define features that break up the output label into sub-labels. Say you have the labels "sports_baseball" and "sports_football"; then you can define one feature that tests whether a label starts with a certain prefix ("sports"). This allows the model to learn commonalities between both labels. Second, the one-weight-vector-per-class approach breaks down when the output space is structured (and exponentially sized): you simply cannot maintain a different weight vector for each possible output structure $y$, both for computational and statistical reasons.
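
As a small illustration of the first point, a joint feature map can add a coarse-grained feature shared by all labels with the same prefix; the label names below are the hypothetical ones from the example above.

```python
from collections import defaultdict

def feats(x, y):
    f = defaultdict(float)
    f[("label", y)] = 1.0
    f[("label-prefix", y.split("_")[0])] = 1.0  # shared by e.g. all "sports_*" labels
    for word in x:
        f[("word", word, y)] += 1.0
    return f

print(dict(feats(["homerun"], "sports_baseball")))
```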

Conditional Log-likelihood

Training the NB model using MLE is efficient and easy to implement. In many cases it also leads to good results. However, one can argue that the MLE objective is generally not the optimal choice when the task is to predict the best output given some input. Let us recall the MLE objective:

$L(D_{\text{train}}, \theta) = \sum_{(x,y) \in D_{\text{train}}} \log p^{\text{NB}}_\theta(x,y)$

Using the product rule of probability we can reformulate this objective as follows:

$L(D_{\text{train}}, \theta) = \sum_{(x,y) \in D_{\text{train}}} \log p^{\text{NB}}_\theta(y \mid x) + \log p^{\text{NB}}_\theta(x)$

Notice that $p^{\text{NB}}_\theta(x)$ here is not the class prior used in the forward definition of the NB model, $p^{\text{NB}}_\theta(x \mid y)\, p^{\text{NB}}_\theta(y)$. Instead it is the marginal probability $\sum_{y'} p^{\text{NB}}_\theta(x, y')$ of seeing a given input text $x$.

This view on the MLE objective shows that, for every training instance, MLE means both maximizing the conditional probability of $y$ given $x$ and maximizing the marginal probability of $x$. Now consider how we use the NB model to make a prediction for a given $x$: we search for the label $y$ with maximum conditional probability $p^{\text{NB}}_\theta(y \mid x)$. Since $x$ is fixed, the marginal probability $p^{\text{NB}}_\theta(x)$ has no impact on this decision. This means that part of what we optimize during training is completely irrelevant at test time.

The above problem may lead to errors: for a given training instance we may increase the marginal probability of $x$ at the price of reducing $p^{\text{NB}}_\theta(y \mid x)$, if this still increases their product. As a result the true $y$ may end up with a smaller conditional probability than incorrect labels.

One way to overcome this problem is to directly optimize the conditional log-likelihood (CL). This objective is defined as follows:

$CL(D_{\text{train}}, \theta) = \sum_{(x,y) \in D_{\text{train}}} \log p^{\text{NB}}_\theta(y \mid x) = \sum_{(x,y) \in D_{\text{train}}} s_\theta(x,y) - A_{\theta,x}$

This objective directly aims to maximize the conditional class probability given the inputs, ignoring the marginal probability of $x$.
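
A sketch of the CL objective for a toy training set, reusing the same (hypothetical) feature map and label set as in the earlier sketches:

```python
import math
from collections import defaultdict

LABELS = ["sports", "business"]

def feats(x, y):
    f = defaultdict(float)
    f[("label", y)] = 1.0
    for word in x:
        f[("word", word, y)] += 1.0
    return f

def score(weights, x, y):
    return sum(weights[k] * v for k, v in feats(x, y).items())

def cl_objective(weights, train):
    """CL(D, theta) = sum over (x, y) of s(x, y) - log sum_y' exp s(x, y')."""
    total = 0.0
    for x, y in train:
        log_partition = math.log(sum(math.exp(score(weights, x, label))
                                     for label in LABELS))
        total += score(weights, x, y) - log_partition
    return total

train = [(["game"], "sports"), (["market"], "business")]
print(cl_objective(defaultdict(float), train))  # 2 * log(0.5) for all-zero weights
```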

Optimizing the Conditional Log-likelihood

Due to the normalization term $A_{\theta,x}$ there is no closed-form solution to the CL problem: in contrast to the MLE solution, we cannot simply count. Instead we need to iteratively optimize the objective. One popular method for optimizing the CL objective is gradient ascent, or its stochastic version, stochastic gradient ascent. Often the term gradient descent is used even when ascent is meant, and we then speak of stochastic gradient descent (SGD).

The underlying idea of such gradient-based methods is to iteratively move along the gradient of the function until this gradient disappears. That is, for a function $f(\theta)$ to be optimized, in each step $i$ a gradient ascent algorithm performs the following update to the parameter vector $\theta$:

$\theta_i \leftarrow \theta_{i-1} + \alpha_i \nabla f(\theta_{i-1})$

There are various ways of choosing the learning rate $\alpha_i$ dynamically depending on the iteration $i$, but for simplicity we will set it to 1 in the following.

To apply gradient ascent methods to the conditional log-likelihood we need its gradient with respect to the parameters. This gradient has a very intuitive form:

$\nabla CL(\theta) = \sum_{(x,y) \in D_{\text{train}}} \mathbf{f}(x,y) - \mathbb{E}_{p_\theta(y' \mid x)}\left[\mathbf{f}(x,y')\right]$

That is, for each training instance the gradient points towards the feature representation $\mathbf{f}(x,y)$ of the gold solution, and away from the expectation of the feature representation under the current model, $\mathbb{E}_{p_\theta(y' \mid x)}\left[\mathbf{f}(x,y')\right]$.
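
The sketch below implements one epoch of stochastic gradient ascent with this gradient, again using the hypothetical feature map and label set from the earlier sketches; the learning rate and training data are placeholders.

```python
import math
from collections import defaultdict

LABELS = ["sports", "business"]

def feats(x, y):
    f = defaultdict(float)
    f[("label", y)] = 1.0
    for word in x:
        f[("word", word, y)] += 1.0
    return f

def score(weights, x, y):
    return sum(weights[k] * v for k, v in feats(x, y).items())

def sga_epoch(weights, train, learning_rate=1.0):
    """One pass of stochastic gradient ascent on the CL objective."""
    for x, y in train:
        # conditional distribution p(y' | x) under the current weights
        scores = {label: score(weights, x, label) for label in LABELS}
        z = sum(math.exp(s) for s in scores.values())
        probs = {label: math.exp(s) / z for label, s in scores.items()}
        # gradient step: add gold features, subtract expected features
        for k, v in feats(x, y).items():
            weights[k] += learning_rate * v
        for label, p in probs.items():
            for k, v in feats(x, label).items():
                weights[k] -= learning_rate * p * v
    return weights

weights = sga_epoch(defaultdict(float),
                    [(["game"], "sports"), (["market"], "business")])
print(weights[("word", "game", "sports")])  # positive after one epoch
```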

The gradient is zero (and hence we are at an optimal solution) when the empirical expectation of the feature function under the training set is identical to the (conditional) model expectation. This gives rise to a dual view of the conditional likelihood objective: we can see solutions to this problem as parameters that force the empirical and model moments to match. In fact, one can also arrive at the log-linear formulation and the CL objective by searching for a distribution that matches the given feature moments and has maximal entropy. Particularly in the context of text classification, the CL-based approach is therefore often referred to as the Maximum Entropy approach. On many text classification datasets it can be shown to outperform the MLE approach substantially.

Note that the CL objective is concave. This means that when the gradient becomes zero we have found a global optimum of this function.