In many applications we need to automatically classify some input text with respect to a set of classes or labels. For example, we may want to decide whether a movie review is positive or negative, or assign a news article to a topic such as sports or politics.
We can formalize text classification as the simplest instance of structured prediction: the input space $\mathcal{X}$ consists of sequences of words $\mathbf{x} = (x_1, \ldots, x_n)$, and the output space is a finite set of labels $\mathcal{Y}$, for example topic labels such as sports or politics. On a high level, our goal is to define a model $s_{\boldsymbol{\theta}}(\mathbf{x}, y)$ that assigns a high score to the label $y$ that fits the text $\mathbf{x}$, and lower scores otherwise. The model is parametrized by $\boldsymbol{\theta}$, and these parameters we will learn from a training set $\mathcal{D}$ of $(\mathbf{x}, y)$ pairs. When we need to classify a text $\mathbf{x}$ we have to solve the trivial (if the number of classes is low) maximization problem $\hat{y} = \arg\max_{y \in \mathcal{Y}} s_{\boldsymbol{\theta}}(\mathbf{x}, y)$.
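To make the setup concrete, here is a minimal Python sketch of the prediction rule $\arg\max_{y} s_{\boldsymbol{\theta}}(\mathbf{x}, y)$. The label set, the sparse parameter dictionary, and the particular scoring function are illustrative placeholders, not part of the formal model.

```python
# A minimal sketch of the prediction rule: classify x by maximizing the score
# over a small label set. The label set, the sparse parameter dictionary and
# the scoring function are illustrative placeholders.

LABELS = ["sports", "politics", "business"]

def score(x, y, theta):
    # Illustrative linear score: sum the weights of (word, label) pairs
    # that appear in the text x.
    return sum(theta.get((word, y), 0.0) for word in x)

def predict(x, theta):
    # Solve the (trivial, for few classes) maximization over the label set.
    return max(LABELS, key=lambda y: score(x, y, theta))

theta = {("goal", "sports"): 1.0, ("election", "politics"): 1.0}
print(predict("the late goal decided the match".split(), theta))  # -> sports
```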
In the following we will present two typical approaches to text classification: Naive Bayes and discriminative linear classifiers. We will also see that both can in fact use the same model structure, and differ only in how the model parameters are trained.
One of the most widely used approaches to text classification relies on the so-called Naive Bayes (NB) model. In NB we use a joint distribution $p_{\boldsymbol{\theta}}(\mathbf{x}, y)$ over texts and labels. In particular, we use the a posteriori probability $p_{\boldsymbol{\theta}}(y \mid \mathbf{x})$ of a label given the input text as the score for that label given the text.
By Bayes' law we get
$$p_{\boldsymbol{\theta}}(y \mid \mathbf{x}) = \frac{p_{\boldsymbol{\theta}}(\mathbf{x} \mid y)\, p_{\boldsymbol{\theta}}(y)}{p_{\boldsymbol{\theta}}(\mathbf{x})},$$
and when an input $\mathbf{x}$ is fixed we can focus on
$$p_{\boldsymbol{\theta}}(y \mid \mathbf{x}) \propto p_{\boldsymbol{\theta}}(\mathbf{x} \mid y)\, p_{\boldsymbol{\theta}}(y),$$
because in this case $p_{\boldsymbol{\theta}}(\mathbf{x})$ is a constant factor. In the above, $p_{\boldsymbol{\theta}}(\mathbf{x} \mid y)$ is the likelihood and $p_{\boldsymbol{\theta}}(y)$ is the prior.
The "naivity" of NB stems from a certain conditional independence assumption we make for the likelihood . Note that conditional independence of two events and given a third event requires that . In particular, for the likelihood in NB we have:
That is, NB makes the assumption that the observed wors are independent of each other when conditioned on the label .
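As a small illustration of the independence assumption, the following sketch computes the factorized log-likelihood; the dictionary `alpha` of per-class word probabilities is a hypothetical data structure chosen for this example.

```python
import math

# Sketch of the factorized likelihood: log p(x|y) = sum_i log p(x_i|y).
# `alpha` is a hypothetical dictionary of per-class word probabilities,
# alpha[(word, label)] = p(word | label).
def log_likelihood(x, y, alpha):
    return sum(math.log(alpha[(word, y)]) for word in x)

alpha = {("goal", "sports"): 0.2, ("match", "sports"): 0.1}
print(log_likelihood(["goal", "match"], "sports", alpha))  # log 0.2 + log 0.1
```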
The NB model has the parameters $\boldsymbol{\theta} = (\boldsymbol{\alpha}, \boldsymbol{\beta})$ where
$$\alpha_{x, y} = p_{\boldsymbol{\theta}}(x \mid y) \quad \text{and} \quad \beta_y = p_{\boldsymbol{\theta}}(y).$$
That is, $\boldsymbol{\alpha}$ captures the per-class word probabilities, and $\boldsymbol{\beta}$ the class priors.
The NB model can be trained using Maximum Likelihood estimation (MLE). This amounts to setting
$$\alpha_{x, y} = \frac{\#_{\mathcal{D}}(x, y)}{\sum_{x'} \#_{\mathcal{D}}(x', y)} \quad \text{and} \quad \beta_y = \frac{\#_{\mathcal{D}}(y)}{|\mathcal{D}|},$$
where $\#_{\mathcal{D}}(x, y)$ is the number of times word $x$ appears in training documents with label $y$, and $\#_{\mathcal{D}}(y)$ is the number of training documents with label $y$.
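A minimal sketch of what MLE amounts to in practice: counting relative frequencies over a training set of (tokenized text, label) pairs. The data layout is assumed for illustration, and common refinements such as add-one smoothing are omitted.

```python
from collections import Counter

# Counting-based MLE for the NB parameters. `train` is assumed to be a list of
# (tokenized_text, label) pairs; add-one smoothing is left out for clarity.
def mle_estimate(train):
    label_counts = Counter()        # #_D(y)
    word_label_counts = Counter()   # #_D(x, y)
    words_per_label = Counter()     # sum_x' #_D(x', y)
    for x, y in train:
        label_counts[y] += 1
        for word in x:
            word_label_counts[(word, y)] += 1
            words_per_label[y] += 1
    beta = {y: count / len(train) for y, count in label_counts.items()}
    alpha = {(word, y): count / words_per_label[y]
             for (word, y), count in word_label_counts.items()}
    return alpha, beta

train = [("a great goal".split(), "sports"),
         ("the election result".split(), "politics")]
alpha, beta = mle_estimate(train)
print(beta["sports"], alpha[("goal", "sports")])  # 0.5 and 1/3
```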
It will be convenient to represent the NB model in log-linear form. This form will allow us to understand MLE-trained NB simply as one particular way of training a (log-)linear model. It will also make the approach comparable to other approaches such as conditional log-likelihood (aka logistic regression or maximum entropy) or SVM-style training, which can operate on the same parametrisation but estimate the parameters using different objectives. Finally, the log-linear representation will make it easy to implement different training algorithms that work with the same representation and can hence easily be plugged in and out.
In log-linear form the joint NB distribution can be written as:
$$p_{\mathbf{w}}(\mathbf{x}, y) = \exp \sum_{i \in I} w_i\, f_i(\mathbf{x}, y) = \exp\, \langle \mathbf{w}, \mathbf{f}(\mathbf{x}, y) \rangle.$$
Here the $f_i$ are so-called (joint) feature functions. The index set $I$ is the union of all labels $y' \in \mathcal{Y}$ and all word-label pairs $(x', y') \in \mathcal{V} \times \mathcal{Y}$, where $\mathcal{V}$ is the vocabulary, and the corresponding feature functions are defined as follows:
$$f_{y'}(\mathbf{x}, y) = \mathbb{1}[y = y'] \quad \text{and} \quad f_{x', y'}(\mathbf{x}, y) = \mathbb{1}[y = y']\, \#_{\mathbf{x}}(x').$$
In words, the first feature function returns 1 if the input label $y$ equals $y'$, and 0 otherwise. The second feature function returns the number of times the word $x'$ appears in $\mathbf{x}$ if $y$ equals $y'$, and 0 otherwise.
If one now sets the weights according to
$$w_{y'} = \log \beta_{y'} \quad \text{and} \quad w_{x', y'} = \log \alpha_{x', y'},$$
it is easy to show that $p_{\mathbf{w}}(\mathbf{x}, y)$ is equivalent to the original NB formulation above.
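The following sketch spells this equivalence out in code: a sparse joint feature map with label and word-label features, a linear scoring function, and weights set to the logarithms of the NB parameters. All names are illustrative; with `alpha` and `beta` as in the counting sketch above, the linear score equals the NB log joint probability for words seen during training.

```python
import math
from collections import Counter

# Sparse joint feature map: one "label" feature per class and one "word-label"
# feature per (word, class) pair, as defined above.
def features(x, y):
    f = Counter({("label", y): 1.0})
    for word in x:
        f[("word", word, y)] += 1.0
    return f

def linear_score(w, f):
    # <w, f(x, y)> over sparse dictionaries.
    return sum(w.get(k, 0.0) * v for k, v in f.items())

# Setting the weights to log-probabilities recovers the NB log joint
# probability, assuming `alpha` and `beta` as in the counting sketch above.
def nb_weights(alpha, beta):
    w = {("label", y): math.log(p) for y, p in beta.items()}
    w.update({("word", word, y): math.log(p) for (word, y), p in alpha.items()})
    return w

w = nb_weights(alpha, beta)
x, y = "a great goal".split(), "sports"
print(linear_score(w, features(x, y)))  # equals log p(x, y) under the NB model
```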
Note that in the above case we have $|\mathcal{Y}|$ label features and $|\mathcal{V}| \cdot |\mathcal{Y}|$ word-label features. The corresponding two types of features are often called feature templates that generate sets of actual feature functions. That is, $f_{y'}$ is a feature template with one template argument, $y'$, and $f_{x', y'}$ is a template with two arguments, $x'$ and $y'$. It is common to augment feature templates with a template name to distinguish templates that have the same argument space but different semantics. For example, we may have a template that combines the label of a document with the number of times a word stem has been seen. For words that are their own stem this would create duplicate indices, and hence we use names such as $f_{\text{word}, x', y'}$ and $f_{\text{stem}, x', y'}$ to distinguish both types of features.
The conditional probability $p_{\mathbf{w}}(y \mid \mathbf{x})$ can be written as
$$p_{\mathbf{w}}(y \mid \mathbf{x}) = \exp\left( \langle \mathbf{w}, \mathbf{f}(\mathbf{x}, y) \rangle - A(\mathbf{w}, \mathbf{x}) \right),$$
where
$$A(\mathbf{w}, \mathbf{x}) = \log \sum_{y' \in \mathcal{Y}} \exp\, \langle \mathbf{w}, \mathbf{f}(\mathbf{x}, y') \rangle$$
and $A(\mathbf{w}, \mathbf{x})$ is the log-partition function.
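In code, the conditional probability is a softmax over the per-label linear scores. The following sketch computes the log-partition function with a numerically stable log-sum-exp and reuses the illustrative `features` and `linear_score` helpers from the sketch above.

```python
import math

# p_w(y|x) as a softmax over per-label linear scores, with the log-partition
# function A(w, x) computed via a numerically stable log-sum-exp. Reuses the
# illustrative `features` and `linear_score` helpers from the sketch above.
def log_partition(w, x, labels):
    scores = [linear_score(w, features(x, y)) for y in labels]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def conditional(w, x, y, labels):
    # p_w(y|x) = exp(<w, f(x, y)> - A(w, x))
    return math.exp(linear_score(w, features(x, y)) - log_partition(w, x, labels))
```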
In contrast to the standard formulation of linear and log-linear models, we define feature functions over input-output pairs instead of only inputs. That is, we could also have used a representation in which $\langle \mathbf{w}, \mathbf{f}(\mathbf{x}, y) \rangle$ is replaced by $\langle \mathbf{w}_y, \mathbf{f}(\mathbf{x}) \rangle$, where each class $y$ receives its own weight vector $\mathbf{w}_y$.
The benefit of the joint feature function is twofold. First, it enables us to easily define features that break up the output label into sub-labels. Say you have the labels "sports_baseball" and "sports_football"; then you can define one feature that tests whether a label starts with a certain prefix ("sports"), as in the sketch below. This allows the model to learn commonalities between both labels. Second, the one-weight-vector-per-class approach breaks down when the output space is structured (and exponentially sized). You simply cannot maintain a different weight vector for each possible output structure $y$, both for computational and statistical reasons.
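As a small illustration of the first point, a joint feature can inspect sub-structure of the label, e.g. a shared prefix; the label names below are made up for the example.

```python
# A joint feature that looks at sub-structure of the label: it fires for every
# label that starts with the prefix "sports". The label names are made up.
def prefix_feature(x, y, prefix="sports"):
    return 1.0 if y.startswith(prefix) else 0.0

print(prefix_feature([], "sports_baseball"))  # 1.0
print(prefix_feature([], "sports_football"))  # 1.0
print(prefix_feature([], "politics_eu"))      # 0.0
```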
Training the NB model using MLE is efficient and easy to implement. In many cases it also leads to good results. However, one can argue that the MLE objective is generally not the optimal choice when the task is to predict the best output given some input. Let us recall the MLE objective:
$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \sum_{(\mathbf{x}, y) \in \mathcal{D}} \log p_{\mathbf{w}}(\mathbf{x}, y).$$
Via Bayes' law we can reformulate this objective as follows:
$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \sum_{(\mathbf{x}, y) \in \mathcal{D}} \log p_{\mathbf{w}}(y \mid \mathbf{x}) + \log p_{\mathbf{w}}(\mathbf{x}).$$
Notice that the $p_{\mathbf{w}}(\mathbf{x})$ here is not the class prior $p_{\mathbf{w}}(y)$ as used in the forward definition of the NB model. Instead it is the marginal probability of seeing a given input text $\mathbf{x}$.
This view on the MLE objective shows that, for every training instance $(\mathbf{x}, y)$, MLE means both maximizing the conditional probability of seeing $y$ given $\mathbf{x}$, and maximizing the marginal probability of $\mathbf{x}$. Now consider the way we use the NB model to make a prediction for a given $\mathbf{x}$: we search for the label $y$ with maximum conditional probability $p_{\mathbf{w}}(y \mid \mathbf{x})$. Since $\mathbf{x}$ is fixed, the marginal probability $p_{\mathbf{w}}(\mathbf{x})$ has no impact on this decision. This means part of what we optimize during training is completely irrelevant at test time.
The above problem may lead to errors: for a given training instance $(\mathbf{x}, y)$ we may increase the marginal probability $p_{\mathbf{w}}(\mathbf{x})$ at the price of reducing $p_{\mathbf{w}}(y \mid \mathbf{x})$, as long as this still increases their product. This may lead to the true label $y$ having a smaller conditional probability than wrong labels.
One way to overcome this problem is to directly optimize the conditional log-likelihood (CL). This objective is defined as follows:
$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \sum_{(\mathbf{x}, y) \in \mathcal{D}} \log p_{\mathbf{w}}(y \mid \mathbf{x}).$$
This objective directly aims to maximize the conditional class probability given the inputs, ignoring the marginal probability of $\mathbf{x}$.
Due to the normalization factor there is no closed-form solution to the CL problem, and compared to the MLE solution we cannot just count. Instead we need to iteratively optimize the objective. One popular method for optimizing the CL objective is gradient ascent, or its stochastic version, stochastic gradient ascent. (Often one uses the term gradient descent even when ascent is meant, and we then speak of stochastic gradient descent, SGD.)
The underlying idea of such gradient-based methods is to iteratively move along the gradient of the function until this gradient disappears. That is, for a function $O(\mathbf{w})$ to be optimized, in each iteration $t$ a gradient ascent algorithm performs the following update to the parameter vector $\mathbf{w}$:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \eta_t\, \nabla O(\mathbf{w}_t).$$
There are various ways of choosing the learning rate $\eta_t$ dynamically depending on the iteration $t$, but for simplicity we will set it to 1 in the following.
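A generic gradient ascent loop then looks as follows; the gradient function is a placeholder to be supplied by the objective, and the learning rate is fixed as in the text.

```python
# Generic gradient ascent over a sparse parameter dictionary. The function
# `objective_gradient` is a placeholder that returns the gradient as a
# dictionary; the learning rate is fixed to 1 as in the text.
def gradient_ascent(w, objective_gradient, iterations=100, learning_rate=1.0):
    for _ in range(iterations):
        gradient = objective_gradient(w)
        for k, v in gradient.items():
            w[k] = w.get(k, 0.0) + learning_rate * v
    return w
```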
To apply gradient ascent methods to optimizing the conditional log-likelihood we need to find its gradient with respect to the parameters. This gradient has a very intuitive form:
$$\nabla_{\mathbf{w}}\, CL(\mathbf{w}) = \sum_{(\mathbf{x}, y) \in \mathcal{D}} \mathbf{f}(\mathbf{x}, y) - \mathbb{E}_{p_{\mathbf{w}}(y' \mid \mathbf{x})}\left[ \mathbf{f}(\mathbf{x}, y') \right].$$
That is, for each training instance $(\mathbf{x}, y)$ the gradient moves towards the feature representation $\mathbf{f}(\mathbf{x}, y)$ of the gold solution, and away from the expectation of the feature representation under the current model $p_{\mathbf{w}}(\cdot \mid \mathbf{x})$.
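Putting the pieces together, the following sketch computes this gradient for a single instance and performs one epoch of stochastic gradient ascent; it reuses the illustrative `features` and `conditional` helpers from the earlier sketches.

```python
from collections import Counter

# Gradient of the CL objective for one instance (x, y): the gold feature
# vector f(x, y) minus the expected feature vector under p_w(.|x). Reuses the
# illustrative `features` and `conditional` helpers defined above.
def cl_gradient(w, x, y, labels):
    gradient = Counter(features(x, y))                    # + f(x, y)
    for y_prime in labels:
        p = conditional(w, x, y_prime, labels)            # p_w(y'|x)
        for k, v in features(x, y_prime).items():
            gradient[k] -= p * v                          # - E[f(x, y')]
    return gradient

def sgd_epoch(w, train, labels, learning_rate=1.0):
    # One pass of stochastic gradient ascent on the CL objective.
    for x, y in train:
        for k, v in cl_gradient(w, x, y, labels).items():
            w[k] = w.get(k, 0.0) + learning_rate * v
    return w
```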
In particular, this gradient is zero (and hence we are at an optimal solution) when the empirical expectation of the feature functions under the training set is identical to the (conditional) model expectation. This gives rise to a dual view of the conditional likelihood objective: we can see solutions to this problem as parameters that force the empirical and model moments to match. In fact, one can also arrive at the log-linear formulation and the CL objective by searching for the distribution that matches the given feature moments and has maximal entropy. Particularly in the context of text classification, the CL-based approach is therefore often referred to as the Maximum Entropy approach. On many text classification datasets it can be shown to outperform the MLE approach substantially.
Note that the CL objective is concave (and strictly concave when the feature functions are linearly independent). This means that any parameter vector at which the gradient becomes zero is a global optimum of this function.