Parsing

Note that this chapter is heavily influenced by the structure and content of Mike Collins' PCFG lecture.

In many NLP applications it is useful to understand the syntactic structure of a sentence: where are the verbs, what are the subject and object of the verbs, which phrases form coherent sub-structures of the sentence? Understanding this enables the machine to more effectively translate from Japanese to English, or to understand the query "who is the president of the united states" and execute it against a database.

In linguistics these questions are asked in the field of syntax, from the Greek syntaxis (arrangement). There are three core concepts:

  • Constituency: groups of words act as single units.
  • Grammatical Relations: object, subject, direct object etc.
  • Subcategorization: restrictions on the type of phrases that go with certain words.

Context Free Grammars

A common approach to capture constituency, grammatical relations and subcategorization is based on Context Free Grammars (CFGs). At a high level, these grammars assume that legal sentences can be derived by repeatedly and independently expanding abstract symbols (such as "NounPhrase" or "Adjective") into more concrete sequences of symbols (such as "Adjective Noun" or "green") until each symbol is a concrete word.

More formally, a CFG is a 4-tuple $G=(N,\Sigma,R,S)$ where

  • $N$ is a set of non-terminal symbols.
  • $\Sigma$ is a set of terminal symbols.
  • $R$ is a finite set of rules $X \rightarrow Y_1 Y_2 \ldots Y_n$ where $X \in N$ and $Y_i \in N \cup \Sigma$.
  • $S \in N$ is a start symbol.

Before we show examples, let us define a Scala data structure for CFGs.
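The definitions below are a minimal sketch, assuming names consistent with the output printed later in this chapter (NonTerminal and Terminal appear there verbatim; Rule, CFG and the field names are our assumptions):

```scala
// A grammar symbol is either a non-terminal or a terminal word.
sealed trait Symbol
case class NonTerminal(name: scala.Symbol) extends Symbol
case class Terminal(word: String) extends Symbol

// A rule expands a non-terminal into a sequence of symbols.
case class Rule(lhs: NonTerminal, rhs: Seq[Symbol])

// A grammar is a set of rules together with a start symbol.
case class CFG(rules: Set[Rule], start: NonTerminal)
```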


Let us now create an example CFG. Notice that we implement implicit conversion methods which allow us to construct terminals, non-terminals and rules more succinctly. We omit these methods here for brevity.
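Written out without those helpers, the construction might look as follows (a hypothetical reconstruction of the omitted cell):

```scala
// Explicit construction of the example grammar; the omitted implicit
// conversions would make this considerably shorter.
val S    = NonTerminal('S)
val NP_s = NonTerminal('NP_s); val NP_p = NonTerminal('NP_p)
val VP_s = NonTerminal('VP_s); val VP_p = NonTerminal('VP_p)
val ADJ  = NonTerminal('ADJ)

val cfg = CFG(Set(
  Rule(S,    Seq(NP_s, VP_s)),
  Rule(S,    Seq(NP_p, VP_p)),
  Rule(ADJ,  Seq(Terminal("silly"))),
  Rule(NP_p, Seq(Terminal("Matko"), Terminal("raps"))),
  Rule(NP_s, Seq(Terminal("Matko"))),
  Rule(VP_p, Seq(Terminal("are"), ADJ)),
  Rule(VP_s, Seq(Terminal("raps"), Terminal("in"), Terminal("StatNLP")))
), start = S)
```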

'S → 'NP_s 'VP_s
'S → 'NP_p 'VP_p
'ADJ → silly
'NP_p → Matko raps
'NP_s → Matko
'VP_p → are 'ADJ
'VP_s → raps in StatNLP

(Left-most) Derivation

A left-most derivation given a CFG $G$ is a sequence of strings $s_1 \ldots s_n$ such that

  • $s_1 = S$, that is, the first string consists only of the start symbol.
  • $s_n \in \Sigma^*$, that is, the last string consists of only terminals.
  • Each $s_i$ for $i > 1$ is generated by replacing the left-most non-terminal $\alpha$ in $s_{i-1}$ with the right-hand side of any rule that has $\alpha$ as left-hand side.

Let us write some code that puts this definition into action and generates random derivations based on a grammar.
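A sketch of such a generator, using the CFG classes above (the name leftMostDerivation is ours, not the notebook's):

```scala
import scala.util.Random

// Randomly generate a left-most derivation: repeatedly replace the
// left-most non-terminal by the right-hand side of a random matching
// rule. Assumes every non-terminal has at least one rule and that
// random expansion terminates.
def leftMostDerivation(cfg: CFG, random: Random = new Random): List[List[Symbol]] = {
  def expand(current: List[Symbol]): List[List[Symbol]] =
    current.indexWhere(_.isInstanceOf[NonTerminal]) match {
      case -1 => Nil // only terminals left: the derivation is complete
      case i =>
        val nt = current(i).asInstanceOf[NonTerminal]
        val rules = cfg.rules.filter(_.lhs == nt).toIndexedSeq
        val rule = rules(random.nextInt(rules.size))
        val next = current.take(i) ++ rule.rhs ++ current.drop(i + 1)
        next :: expand(next)
    }
  val start: List[Symbol] = List(cfg.start)
  start :: expand(start)
}
```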

Let us generate an example derivation.

'S
'NP_s 'VP_s
Matko 'VP_s
Matko raps in StatNLP

Parse Trees

Derivations can be compactly represented as trees where each non-leaf node corresponds to an expanded left-hand side and its children to the rule's right-hand side.

A Scala data structure to represent trees:
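A minimal sketch (Node and Leaf match the output printed further below; the field names are assumptions):

```scala
// A parse tree is either an inner node labelled with a non-terminal
// and carrying child trees, or a leaf holding a terminal word.
sealed trait ParseTree
case class Node(label: NonTerminal, children: Seq[ParseTree]) extends ParseTree
case class Leaf(terminal: Terminal) extends ParseTree
```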

We can construct trees through the usual case class constructors, and render them graphically.

('S ('NP_p Matko raps) ('VP_p are silly))

In the same way we could generate derivations before, we can now generate parse trees from a CFG.
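A sketch, analogous to the derivation generator above (generateTree is an assumed helper name):

```scala
// Recursively expand a symbol into a random parse tree.
def generateTree(cfg: CFG, symbol: Symbol, random: Random = new Random): ParseTree =
  symbol match {
    case t: Terminal => Leaf(t)
    case nt: NonTerminal =>
      val rules = cfg.rules.filter(_.lhs == nt).toIndexedSeq
      val rule = rules(random.nextInt(rules.size))
      Node(nt, rule.rhs.map(generateTree(cfg, _, random)))
  }
```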

Now let us generate a tree, starting from a non-terminal in the CFG.

('S ('NP_s Matko) ('VP_s raps in StatNLP))

Finding Parse Trees

As you can see, parse trees uncover how a sentence is structured with respect to a grammar. In fact, they give us insights about both constituency and grammatical relations: in the above case the subtrees headed by "NP" nodes represent noun phrase constituents; likewise, the "NP" and "VP" subtrees under "S" represent the subject and object of a sentence. It is interesting to generate trees, and hence sentences, given a starting symbol. However, in practice it is often more useful to find a parse tree given a sentence, if one exists. This process is generally referred to as parsing.

There are several approaches to finding a legal parse tree given a sentence and a grammar:

  • Top-Down: Start with the start symbol and generate trees; backtrack if they do not match observed sentence.
  • Bottom-Up: Start with the sentence, and find rules that generate parts of it; backtrack if no start symbol can be induced.
  • Dynamic Programming: Explore several trees in parallel and re-use computations.

Bottom-Up Parsing with Backtracking

A bottom-up parser with backtracking maintains a state consisting of a stack of processed trees, a buffer of remaining words, and a history of previous states we can backtrack to if we cannot make more progress. Roughly speaking, the algorithm proceeds by iteratively reducing the stack via rules to new trees, moving forward in the sentence, and backtracking to previous states when we reach the end of the sentence but cannot create a single tree.

In Scala code the parsing algorithm can be formulated as follows. Notice the use of immutable data structures, which makes storing and recalling previous states very efficient. Also notice that we record a history of transitions. This is not needed for the algorithm to work, but allows us to inspect its behaviour afterwards.
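The following is a sketch of such a parser under the data structures assumed above; unlike the notebook's version it does not record the transition history, and backtracking is implicit in the depth-first recursion:

```scala
// Root label of a tree: the non-terminal of a node, or the terminal of a leaf.
def rootLabel(tree: ParseTree): Symbol = tree match {
  case Node(nt, _) => nt
  case Leaf(t)     => t
}

// Backtracking shift-reduce parser. The stack grows at the head of the
// list. Caution: unary non-terminal rules or empty right-hand sides
// would make the search loop; our example grammar has neither.
def parseBottomUp(cfg: CFG, words: Seq[String]): Option[ParseTree] = {
  def search(stack: List[ParseTree], buffer: List[String]): Option[ParseTree] =
    (stack, buffer) match {
      // success: all words consumed, a single tree rooted in the start symbol
      case (List(tree @ Node(label, _)), Nil) if label == cfg.start => Some(tree)
      case _ =>
        // shift the next word onto the stack, if any
        val shifted = buffer match {
          case w :: rest => search(Leaf(Terminal(w)) :: stack, rest)
          case Nil       => None
        }
        // otherwise reduce: any rule whose rhs matches the top of the stack
        def reduced = cfg.rules.view.flatMap { rule =>
          val n   = rule.rhs.length
          val top = stack.take(n).reverse
          if (top.length == n && top.map(rootLabel) == rule.rhs.toList)
            search(Node(rule.lhs, top) :: stack.drop(n), buffer)
          else None
        }.headOption
        shifted.orElse(reduced)
    }
  search(Nil, words.toList)
}

parseBottomUp(cfg, Seq("Matko", "raps", "in", "StatNLP")).get
```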

res22: ParseTree = Node(NonTerminal('S),List(Node(NonTerminal('NP_s),List(Leaf(Terminal(Matko)))), Node(NonTerminal('VP_s),List(Leaf(Terminal(raps)), Leaf(Terminal(in)), Leaf(Terminal(StatNLP))))))

Let us run an example parse.

'Init       stack:                                            buffer: MCRiedel raps are silly
'Shift      stack: MCRiedel                                   buffer: raps are silly
'Shift      stack: MCRiedel | raps                            buffer: are silly
'Shift      stack: MCRiedel | raps | are                      buffer: silly
'Shift      stack: MCRiedel | raps | are | silly              buffer:
'Reduce     stack: MCRiedel | raps | are | ('ADJ silly)       buffer:
'Reduce     stack: MCRiedel | raps | ('VP_p are ('ADJ silly)) buffer:
'Backtrack  stack: MCRiedel | raps | are | ('ADJ silly)       buffer:
'Backtrack  stack: MCRiedel | raps | are | silly              buffer:

Dynamic Programming for Parsing

A problem with the bottom-up parser is the fact that, after backtracking, it may redo many steps it had already performed before (can you find examples of this behaviour in the transitions above?). This suggests remembering steps and re-using them when needed. This is the general idea behind dynamic programming: caching and reusing computation whenever possible.

Chomsky Normal Form

In the case of CFG parsing there exists a very effective dynamic program, the so-called Cocke–Younger–Kasami (CYK) algorithm. However, before we can apply this algorithm we need to normalize the grammar. In particular, we need to make sure that each rule has one of the following forms:

  • $\alpha \rightarrow \beta\ \gamma$ where $\beta, \gamma \in N \setminus \{S\}$.
  • $\alpha \rightarrow t$ where $t \in \Sigma$.

In words: each rule is either binary and expands into two non-terminal non-Start symbols, or unary and expands into a word. This form is called Chomsky Normal Form (CNF).

Fortunately we can convert every CFG into an equivalent CFG in CNF, in the sense that any derivation or parse of a sentence in one grammar can be losslessly converted to a derivation in the other grammar. We present this conversion in Scala below, but omit cases not relevant to our grammar (Exercise: add these cases).
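The sketch below is an assumed reconstruction; the fresh non-terminal names it invents will not exactly match the output shown next, and unary non-terminal rules are among the omitted cases:

```scala
// Convert a CFG to Chomsky Normal Form. Omitted cases (e.g. unary
// non-terminal rules, empty right-hand sides) raise a MatchError.
def toCNF(cfg: CFG): CFG = {
  var counter = -1
  def freshNT(base: String): NonTerminal = {
    counter += 1
    NonTerminal(scala.Symbol(base + counter))
  }
  def convert(rule: Rule): Set[Rule] = rule.rhs match {
    // already CNF: a single terminal
    case Seq(_: Terminal) => Set(rule)
    // already CNF: exactly two non-terminals
    case Seq(_: NonTerminal, _: NonTerminal) => Set(rule)
    // replace each terminal in a longer rhs by a fresh pre-terminal
    case rhs if rhs.exists(_.isInstanceOf[Terminal]) =>
      val (newRhs, extra) = rhs.map {
        case t: Terminal     => val nt = freshNT(t.word); (nt, Set(Rule(nt, Seq(t))))
        case nt: NonTerminal => (nt, Set.empty[Rule])
      }.unzip
      convert(Rule(rule.lhs, newRhs)) ++ extra.flatten
    // binarize: X -> Y1 Y2 ... Yn becomes X -> Y1 X' plus X' -> Y2 ... Yn
    case Seq(y1: NonTerminal, rest @ _*) if rest.length >= 2 =>
      val nt = freshNT("'" + rule.lhs.name.name)
      convert(Rule(nt, rest)) + Rule(rule.lhs, Seq(y1, nt))
  }
  CFG(cfg.rules.flatMap(convert), cfg.start)
}
```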

'S → 'NP_p 'VP_p
'S → 'NP_s 'VP_s
''VP_s3 → 'raps5 'in6
'ADJ → silly
'Matko0 → Matko
'NP_p → 'Matko0 'raps1
'NP_s → Matko
'StatNLP4 → StatNLP
'VP_p → 'are2 'ADJ
'VP_s → ''VP_s3 'StatNLP4
'are2 → are
'in6 → in
'raps1 → raps
'raps5 → raps

CYK Algorithm

The CYK algorithm caches, for each span in the sentence, all possible trees that can cover the span according to the CFG. Again we record all changes throughout the algorithm for later visualization.
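A sketch of the algorithm for a grammar in CNF (an assumed reconstruction, without the change recording):

```scala
// CYK: chart(w)(i) holds all trees covering words i until i+w.
// Assumes a CNF grammar and a non-empty sentence.
def cyk(cfg: CFG, words: IndexedSeq[String]): List[ParseTree] = {
  val n = words.length
  val chart = Array.fill(n + 1, n)(List.empty[ParseTree])
  // width 1: lexical rules X -> word
  for (i <- 0 until n; rule <- cfg.rules; if rule.rhs == Seq(Terminal(words(i))))
    chart(1)(i) = Node(rule.lhs, Seq(Leaf(Terminal(words(i))))) :: chart(1)(i)
  // wider spans: combine two sub-trees with a binary rule X -> Y Z
  for {
    width <- 2 to n
    start <- 0 to n - width
    split <- 1 until width
    left  <- chart(split)(start)
    right <- chart(width - split)(start + split)
    rule  <- cfg.rules
    if rule.rhs == Seq(rootLabel(left), rootLabel(right))
  } chart(width)(start) = Node(rule.lhs, Seq(left, right)) :: chart(width)(start)
  // complete parses: full-span trees rooted in the start symbol
  chart(n)(0).filter {
    case Node(label, _) => label == cfg.start
    case _              => false
  }
}

cyk(toCNF(cfg), "Matko raps in StatNLP".split(" "))
```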

Let us run the algorithm.

('S ('NP_s Matko) ('VP_s (''VP_s3 ('raps5 raps) ('in6 in)) ('StatNLP4 StatNLP)))
Exercise: manually perform the CYK algorithm for this data.

Ambiguity

For real world grammars many phrases have several legal parse trees. Which one is the correct one depends on the intent of the speaker. That said, in many cases it is quite obvious that one parse tree should be more likely, or have a higher probability. This suggests a probabilistic treatment of grammars and parsing. Before we introduce probabilistic CFGs, let us inspect a typical case of syntactic natural language ambiguity.

('S ('Subj He) ('VP ('Verb shot) ('Obj the elephant ('PP in his pyjamas))))

The example is an instance of prepositional phrase attachment ambiguity: the phrase "in his pyjamas" could either be part of the verb phrase, meaning that the shooting happened in pyjamas, or part of the noun phrase "the elephant", meaning that the elephant was in pyjamas. Both readings make syntactic sense, so there is nothing wrong with the grammar. However, we would like the machine to return the preferred reading when such a sentence is parsed. For this we need a way to assign different probabilities to different readings.

Probabilistic Context Free Grammars

Probabilistic Context Free Grammars (PCFGs) are Context Free Grammars in which rules have probabilities. More formally, a PCFG consists of

  • A Context Free Grammar $G = (N, \Sigma, R, S)$.
  • A parameter $q(\alpha \rightarrow \beta)$ for each rule $\alpha \rightarrow \beta \in R$. For each possible left-hand side $\alpha \in N$ we require $\sum_{\beta} q(\alpha \rightarrow \beta) = 1$.

A PCFG defines a probability distribution over parse trees as follows. Given a parse tree $t$ that contains the rules $\alpha_1 \rightarrow \beta_1, \ldots, \alpha_n \rightarrow \beta_n$, the probability of this tree under the PCFG is:

$$p(t) = \prod_{i=1}^{n} q(\alpha_i \rightarrow \beta_i)$$

Notice that we can develop and operate parsers using the structured prediction recipe: we have a model $p$, some parameters $\theta$ that need to be estimated on a training set, and the prediction/search problem of finding the most likely parse tree given a sentence. The next sections will cover these aspects.

Before we show examples, let us define a Scala data structure for PCFGs.
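A minimal sketch, assuming the CFG and tree classes from above (PCFG and its field names are assumptions):

```scala
// A PCFG is a CFG plus a parameter q(rule) for every rule.
case class PCFG(cfg: CFG, q: Map[Rule, Double]) {
  // p(t): the product of q over all rules used in the tree.
  def prob(tree: ParseTree): Double = tree match {
    case Node(label, children) =>
      val rule = Rule(label, children.map {
        case Node(l, _) => l
        case Leaf(t)    => t
      })
      q(rule) * children.map(prob).product
    case Leaf(_) => 1.0
  }
}
```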

Let us now create an example PCFG.
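A hypothetical reconstruction of the construction cell, matching the rules printed below:

```scala
val pcfg = {
  val S    = NonTerminal('S);   val Subj = NonTerminal('Subj)
  val VP   = NonTerminal('VP);  val Verb = NonTerminal('Verb)
  val Obj  = NonTerminal('Obj); val PP   = NonTerminal('PP)
  // split a string into a sequence of terminals
  def ws(s: String): Seq[Symbol] = s.split(" ").map(Terminal(_)).toSeq
  val q = Map(
    Rule(S,    Seq(Subj, VP))            -> 1.0,
    Rule(Obj,  ws("the elephant"))       -> 0.5,
    Rule(Obj,  ws("the elephant") :+ PP) -> 0.5,
    Rule(PP,   ws("in his pyjamas"))     -> 1.0,
    Rule(Subj, ws("He"))                 -> 1.0,
    Rule(VP,   Seq(Verb, Obj))           -> 0.7,
    Rule(VP,   Seq(Verb, Obj, PP))       -> 0.3,
    Rule(Verb, ws("shot"))               -> 1.0
  )
  PCFG(CFG(q.keySet, start = S), q)
}
```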

'S → 'Subj 'VP 1.0
'Obj → the elephant 0.5
'Obj → the elephant 'PP 0.5
'PP → in his pyjamas 1.0
'Subj → He 1.0
'VP → 'Verb 'Obj 0.7
'VP → 'Verb 'Obj 'PP 0.3
'Verb → shot 1.0
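With these parameters we can already compare the two readings of the pyjamas sentence from the previous section. The reading that attaches the prepositional phrase to the object uses the rules 'S → 'Subj 'VP, 'Subj → He, 'VP → 'Verb 'Obj, 'Verb → shot, 'Obj → the elephant 'PP and 'PP → in his pyjamas:

$$p(t_{\text{obj}}) = 1.0 \cdot 1.0 \cdot 0.7 \cdot 1.0 \cdot 0.5 \cdot 1.0 = 0.35$$

The reading that attaches it to the verb phrase uses 'VP → 'Verb 'Obj 'PP and 'Obj → the elephant instead:

$$p(t_{\text{vp}}) = 1.0 \cdot 1.0 \cdot 0.3 \cdot 1.0 \cdot 0.5 \cdot 1.0 = 0.15$$

This PCFG hence prefers the reading in which the elephant wears the pyjamas.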

Let us first focus on the prediction task: given a sentence, find the highest scoring parse tree. In a way, we have already solved a variant of this problem: we can consider a CFG as a deterministic distribution over trees, in which case finding a highest scoring parse is equivalent to finding any legal tree, and our bottom-up and dynamic-programming-based algorithms hence performed a structured prediction.

CYK for PCFGs

Here we present a probabilistic variant of the CYK algorithm. Notice that again we require normalization to CNF. Exercise: how to incorporate the rule probabilities during normalization?
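The sketch below is an assumed reconstruction: each chart cell keeps only the best-scoring tree per non-terminal (in the notebook this merging happens in the unify method mentioned below; here a hypothetical update helper plays that role):

```scala
// Probabilistic CYK (Viterbi): chart(w)(i) maps each non-terminal to
// the best (tree, probability) over the span i..i+w. Assumes a PCFG
// whose underlying grammar is in CNF.
def cykBest(pcfg: PCFG, words: IndexedSeq[String]): Option[ParseTree] = {
  val n = words.length
  if (n == 0) return None
  val chart = Array.fill(n + 1, n)(Map.empty[NonTerminal, (ParseTree, Double)])
  // keep only the best entry per cell and non-terminal
  def update(w: Int, i: Int, nt: NonTerminal, tree: ParseTree, p: Double): Unit =
    if (chart(w)(i).get(nt).forall(_._2 < p)) chart(w)(i) += nt -> (tree, p)
  // width 1: lexical rules X -> word
  for ((rule, q) <- pcfg.q; i <- 0 until n; if rule.rhs == Seq(Terminal(words(i))))
    update(1, i, rule.lhs, Node(rule.lhs, Seq(Leaf(Terminal(words(i))))), q)
  // wider spans: binary rules X -> Y Z over all split points
  for {
    width <- 2 to n
    start <- 0 to n - width
    split <- 1 until width
    (rule, q) <- pcfg.q
    nts = rule.rhs.collect { case nt: NonTerminal => nt }
    if rule.rhs.length == 2 && nts.length == 2
    (lt, lp) <- chart(split)(start).get(nts(0))
    (rt, rp) <- chart(width - split)(start + split).get(nts(1))
  } update(width, start, rule.lhs, Node(rule.lhs, Seq(lt, rt)), q * lp * rp)
  chart(n)(0).get(pcfg.cfg.start).map(_._1)
}
```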

The above algorithm now returns only a single parse for the given sentence: the one with the highest probability. You can adapt it to return all parses by changing the unify method to return its input as is.

('S ('Subj He) ('VP ('Verb shot) ('Obj the elephant ('PP in his pyjamas))))