Structured Prediction

In general, no unified theory of NLP seems to be emerging, and most textbooks and courses explain NLP as a

collection of problems, techniques, ideas, frameworks, etc. that really are not tied together in any reasonable way other than the fact that they have to do with NLP.

-- Hal Daumé

That's not to say, though, that there aren't cross-cutting patterns, general best practices, and recipes that recur frequently. One such recurring pattern, found in many NLP papers and systems, is what I like to refer to as the structured prediction recipe.

The general goal we address with this recipe is the following: given some input structure $\mathbf{x} \in \mathcal{X}$, predict a suitable output structure $\mathbf{y} \in \mathcal{Y}$. In the context of NLP, $\mathcal{X}$ may be a set of documents and $\mathcal{Y}$ a set of document classes (e.g. sports and business). $\mathcal{X}$ may also be the set of French sentences and $\mathcal{Y}$ the set of English sentences; in this case each $\mathbf{y} \in \mathcal{Y}$ is a structured object (hence the use of bold face). This structured-output aspect of the problem has profound consequences for the methods used to address it, chiefly because $\mathcal{Y}$ is then usually far too large to enumerate (structure in the input, by contrast, can be dealt with relatively straightforwardly). Generally we are also given some training set $\mathcal{D}_{\text{train}}$, which may contain input-output pairs $(\mathbf{x}_i,\mathbf{y}_i)$, but possibly also just input data (in unsupervised learning), data annotated for a different task (multi-task learning, distant or weak supervision, etc.), or some mixture thereof.
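As a purely hypothetical illustration of these two settings (the toy examples below are not from the book), $\mathcal{D}_{\text{train}}$ looks like a collection of input-output pairs in both cases, but only in the second are the outputs themselves structured objects:

```python
# Document classification: x is a document, y an atomic class label.
d_train_classify = [
    ("stocks fell sharply on Monday", "business"),
    ("the striker scored twice", "sports"),
]

# Machine translation: x is a French sentence, y a structured object,
# here a sequence of English words rather than a single label.
d_train_translate = [
    ("le chat dort", ("the", "cat", "sleeps")),
    ("je parle anglais", ("i", "speak", "english")),
]
```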

With the above ingredients the recipe goes as follows:

  1. Define a parametrized model $s_\theta(\mathbf{x},\mathbf{y})$ that measures the match of a given $\mathbf{x}$ and $\mathbf{y}$. This model builds in some of the background knowledge we have about the task. The model is also controlled by a set of real-valued parameters $\theta$ that are usually too numerous to be hand-tuned.
  2. Learn the parameters $\theta$ from the training data $\mathcal{D}_{\text{train}}$, ideally such that performance on the task of choice is optimized. This learning step usually involves some continuous optimization problem that serves as a surrogate for the task performance we would like to maximize.
  3. Given an input $\mathbf{x}$, find the highest-scoring (and hence best-matching) output structure
    $$\mathbf{y}^* = \operatorname{argmax}_{\mathbf{y} \in \mathcal{Y}} s_\theta(\mathbf{x},\mathbf{y})$$
    to serve as the prediction of the model. Given that most of the structures we care about in NLP are discrete, this usually involves some discrete optimization problem and is important not just at test time, when the model is applied, but often also during training (as, intuitively, we would like to train the model such that it predicts well). A minimal code sketch of all three steps follows this list.
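
To make the recipe concrete, here is a minimal, self-contained sketch in Python: a linear model $s_\theta(\mathbf{x},\mathbf{y}) = \theta \cdot \phi(\mathbf{x},\mathbf{y})$ for a toy document classification task, learned with the structured perceptron. The task, feature map, and data are illustrative assumptions, not something prescribed by this book.

```python
from collections import defaultdict

# The output space Y: small enough here to enumerate exhaustively.
LABELS = ["sports", "business"]

def phi(x, y):
    """Joint feature map phi(x, y): conjoin each token of x with the label y."""
    feats = defaultdict(float)
    for token in x.lower().split():
        feats[(token, y)] += 1.0
    return feats

def score(theta, x, y):
    """Step 1: the parametrized model s_theta(x, y) = theta . phi(x, y)."""
    return sum(theta[f] * v for f, v in phi(x, y).items())

def predict(theta, x):
    """Step 3: the discrete optimization y* = argmax_y s_theta(x, y),
    solved here by brute-force enumeration of the tiny label set."""
    return max(LABELS, key=lambda y: score(theta, x, y))

def train(data, epochs=5):
    """Step 2: learn theta with the structured perceptron, a simple
    surrogate for directly optimizing task performance."""
    theta = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = predict(theta, x)
            if y_hat != y_gold:
                # Move theta towards the gold structure's features ...
                for f, v in phi(x, y_gold).items():
                    theta[f] += v
                # ... and away from the wrongly predicted structure's features.
                for f, v in phi(x, y_hat).items():
                    theta[f] -= v
    return theta

train_data = [
    ("the team won the match", "sports"),
    ("stocks fell on weak earnings", "business"),
    ("the coach praised the striker", "sports"),
    ("the bank raised interest rates", "business"),
]

theta = train(train_data)
print(predict(theta, "the striker scored in the match"))   # expected: sports
print(predict(theta, "earnings and interest rates fell"))  # expected: business
```

With only two classes, the $\operatorname{argmax}$ in `predict` is a trivial enumeration. For genuinely structured output spaces (e.g. the set of English sentences), $\mathcal{Y}$ is exponentially large and this step requires dedicated search or dynamic programming, which is exactly where the discrete optimization skill mentioned below comes in.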

You will see examples of this recipe throughout the book, as well as frameworks and methods that make it possible. It's worth noting that good NLPers usually combine three skills in accordance with this recipe: 1. modelling, 2. continuous optimization and 3. discrete optimization. For the second and third, some basic mathematical background is generally useful; for the first, some understanding of the language phenomena you seek to model can be helpful. It's probably fair to say that modelling is the most important of the three, and in practice this shows in the fact that clever features (part of the model) quite often beat clever optimization.

In this Book

The structured prediction recipe can be found in several places within this book: