Based on Mike Collins' lecture.
It is useful to understand the syntactic structure of a sentence: where are the verbs, what are the subject and object of the verbs, which phrases form coherent sub-structures?
In linguistics these questions are asked in the field of syntax, from the Greek syntaxis (arrangement). There are three core concepts:

Context Free Grammars define rules that expand non-terminal symbols (such as 'S for sentence) into sequences of non-terminals and terminal symbols (words).
More formally, a CFG is a 4-tuple $(N, \Sigma, R, S)$ where

* $N$ is a finite set of non-terminal symbols,
* $\Sigma$ is a finite set of terminal symbols (the words),
* $R$ is a finite set of rules $X \to Y_1 Y_2 \ldots Y_n$ with $X \in N$, $n \ge 1$ and $Y_i \in N \cup \Sigma$,
* $S \in N$ is a distinguished start symbol.
Before we show examples, let us define a Scala data structure for CFGs.
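A minimal sketch of what such a structure might look like; the type and field names here are our assumptions, not necessarily the exact API used in the course:

```scala
// Non-terminals vs. terminal words (names are illustrative assumptions).
sealed trait Sym
case class NT(name: String) extends Sym       // non-terminal, e.g. NT("S")
case class Word(token: String) extends Sym    // terminal, i.e. a word

// A rule expands a non-terminal into a sequence of symbols.
case class Rule(lhs: NT, rhs: List[Sym])

// A CFG is a start symbol plus a set of rules (N and Sigma are implicit in the rules).
case class CFG(start: NT, rules: List[Rule]) {
  def expansions(nt: NT): List[Rule] = rules.filter(_.lhs == nt)
}
```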
Let us now create an example CFG:

| LHS | RHS |
|-------|-------------------|
| 'S | 'NP_s 'VP_s |
| 'S | 'NP_p 'VP_p |
| 'ADJ | silly |
| 'NP_p | Matko raps |
| 'NP_s | Matko |
| 'VP_p | are 'ADJ |
| 'VP_s | raps in StatNLP |
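Using the sketched types from above, this grammar might be constructed as follows (a hypothetical encoding, mirroring the rule table):

```scala
val cfg = CFG(NT("S"), List(
  Rule(NT("S"),    List(NT("NP_s"), NT("VP_s"))),
  Rule(NT("S"),    List(NT("NP_p"), NT("VP_p"))),
  Rule(NT("ADJ"),  List(Word("silly"))),
  Rule(NT("NP_p"), List(Word("Matko"), Word("raps"))),
  Rule(NT("NP_s"), List(Word("Matko"))),
  Rule(NT("VP_p"), List(Word("are"), NT("ADJ"))),
  Rule(NT("VP_s"), List(Word("raps"), Word("in"), Word("StatNLP")))
))
```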
A left-most derivation given a CFG $(N, \Sigma, R, S)$ is a sequence of strings $s_1, \ldots, s_n$ such that

* $s_1 = S$, i.e. the first string consists only of the start symbol,
* $s_n \in \Sigma^*$, i.e. the last string consists only of terminals,
* each $s_i$ with $i > 1$ is generated from $s_{i-1}$ by replacing the left-most non-terminal $X$ in $s_{i-1}$ with $\beta$ for some rule $X \to \beta \in R$.
Let us write some code that puts this definition into action and generates random derivations based on a grammar.
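A sketch of such a generator, using the hypothetical types above; it repeatedly expands the left-most non-terminal with a randomly chosen rule:

```scala
import scala.util.Random

// Expand the left-most non-terminal until only words remain,
// recording every intermediate string of the derivation.
def randomDerivation(g: CFG, rng: Random): List[List[Sym]] = {
  def expand(s: List[Sym], history: List[List[Sym]]): List[List[Sym]] =
    s.indexWhere(_.isInstanceOf[NT]) match {
      case -1 => (s :: history).reverse            // all terminals: derivation complete
      case i =>
        val nt      = s(i).asInstanceOf[NT]
        val options = g.expansions(nt)             // assumes every NT has at least one rule
        val rule    = options(rng.nextInt(options.length))
        expand(s.take(i) ++ rule.rhs ++ s.drop(i + 1), s :: history)
    }
  expand(List(g.start), Nil)
}
```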
Let us generate an example derivation.
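For example, printing one derivation step per line (the result depends on the random seed):

```scala
randomDerivation(cfg, new Random(0)).foreach { step =>
  println(step.map {
    case NT(n)   => "'" + n        // render non-terminals with a leading quote
    case Word(w) => w
  }.mkString(" "))
}
```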
Derivations can be compactly presented as trees in which each non-leaf node corresponds to an expanded left-hand side and its children to the rule's right-hand side.
A Scala data structure to represent trees:
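A minimal sketch (again, the names are our assumptions):

```scala
sealed trait Tree { def label: String }
case class Leaf(label: String) extends Tree                        // a word
case class Node(label: String, children: List[Tree]) extends Tree // an expanded non-terminal
```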
We can construct trees through the usual case class constructors, and render them graphically.
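For instance, the tree ('S ('NP_s Matko) ('VP_s raps in StatNLP)) can be built directly (we omit the graphical rendering, which depends on the notebook's display utilities):

```scala
val example = Node("S", List(
  Node("NP_s", List(Leaf("Matko"))),
  Node("VP_s", List(Leaf("raps"), Leaf("in"), Leaf("StatNLP")))
))
```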
In the same way we generated derivations before, we can now generate parse trees from a CFG.
Now let us generate a tree, starting from a non-terminal in the CFG.
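A sketch of this generator; it mirrors randomDerivation but builds the tree recursively:

```scala
def randomTree(g: CFG, sym: Sym, rng: Random): Tree = sym match {
  case Word(w) => Leaf(w)                        // terminals become leaves
  case nt: NT =>                                 // pick a random rule and recurse
    val options = g.expansions(nt)
    val rule    = options(rng.nextInt(options.length))
    Node(nt.name, rule.rhs.map(randomTree(g, _, rng)))
}

// Generate a tree starting from the grammar's start symbol.
val sampledTree = randomTree(cfg, cfg.start, new Random(1))
```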
Parsing: find a legal parse tree given a sentence and a grammar.
The parser uses three operations: reduce trees on top of a stack using grammar rules, shift tokens from the sentence onto the stack, and backtrack when we reach the end of the sentence but more than one tree remains on the stack.
A Scala bottom-up parser:
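A sketch of such a parser under the types assumed above; it shifts eagerly and tries reductions (then acceptance) when backtracking. Note that a grammar with unary non-terminal cycles would make this search loop forever:

```scala
// Map a tree to the symbol it represents on the stack.
def label(t: Tree): Sym = t match {
  case Leaf(w)    => Word(w)
  case Node(n, _) => NT(n)
}

// Depth-first search over parser states (stack of partial trees, remaining words).
def parse(g: CFG, words: List[String]): Option[Tree] = {
  def step(stack: List[Tree], rest: List[String]): Option[Tree] = {
    // 1. Shift: move the next word onto the stack.
    val shifted = rest match {
      case w :: ws => step(stack :+ Leaf(w), ws)
      case Nil     => None
    }
    // 2. Reduce: replace a stack suffix that matches some rule's RHS by a new
    //    tree labelled with the rule's LHS (tried lazily, i.e. on backtracking).
    lazy val reduced = g.rules.iterator.flatMap { rule =>
      val k = rule.rhs.length
      if (stack.length >= k && stack.takeRight(k).map(label) == rule.rhs)
        step(stack.dropRight(k) :+ Node(rule.lhs.name, stack.takeRight(k)), rest)
      else None
    }.nextOption()
    // 3. Accept: input consumed and a single tree rooted in the start symbol remains.
    def accept = (stack, rest) match {
      case (List(t), Nil) if label(t) == g.start => Some(t)
      case _                                     => None
    }
    shifted.orElse(reduced).orElse(accept)
  }
  step(Nil, words)
}
```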
Let us run an example parse.
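Running it on the sentence from the trace below; since "MCRiedel" does not appear in the example grammar, the search backtracks and ultimately finds no parse:

```scala
parse(cfg, List("MCRiedel", "raps", "are", "silly"))  // None: no legal tree exists
```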
| Action | Stack | Buffer |
|--------|-------|--------|
| 'Init | | MCRiedel raps are silly |
| 'Shift | MCRiedel | raps are silly |
| 'Shift | MCRiedel raps | are silly |
| 'Shift | MCRiedel raps are | silly |
| 'Shift | MCRiedel raps are silly | |
| 'Reduce | MCRiedel raps are ('ADJ silly) | |
| 'Reduce | MCRiedel raps ('VP_p are ('ADJ silly)) | |
| 'Backtrack | MCRiedel raps are ('ADJ silly) | |
| 'Backtrack | MCRiedel raps are silly | |
A problem with the bottom-up parser is that, after backtracking, it may redo many steps it has already performed (can you find examples of this behaviour in the transitions above?). This suggests remembering steps and re-using them when needed. This is the general idea behind dynamic programming: caching and reusing computation whenever possible.
In the case of CFG parsing there exists a very effective dynamic program, the so-called Cocke–Younger–Kasami (CYK) algorithm. However, before we can apply this algorithm we need to normalize the grammar. In particular, we need to make sure that each rule has one of the following forms:

* $X \to Y\ Z$ where $Y, Z \in N \setminus \{S\}$ are non-terminals other than the start symbol,
* $X \to w$ where $w \in \Sigma$ is a word.
In words: each rule is either binary and expands into two non-terminal non-Start symbols, or unary and expands into a word. This form is called Chomsky Normal Form (CNF).
Fortunately we can convert every CFG into an equivalent CFG in CNF, in the sense that any derivation or parse of a sentence in one grammar can be losslessly converted to a derivation in the other grammar. We present this conversion in Scala below, but omit cases not relevant to our grammar (Exercise: add these cases).
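A sketch of the two transformations our grammar needs; the fresh-symbol numbering may differ from the table below, and unary non-terminal chains and other general cases are omitted:

```scala
// Convert a CFG to CNF, covering only two cases:
// (1) words inside longer rules get their own fresh non-terminal and unary rule;
// (2) rules with more than two RHS symbols are binarized by grouping
//     all but the last symbol under a fresh non-terminal.
def toCNF(g: CFG): CFG = {
  var counter = 0
  val result  = scala.collection.mutable.ListBuffer.empty[Rule]
  def fresh(base: String): NT = { val nt = NT(base + counter); counter += 1; nt }
  def lexicalized(rhs: List[Sym]): List[Sym] = rhs.map {
    case Word(w) => val nt = fresh(w); result += Rule(nt, List(Word(w))); nt
    case nt      => nt
  }
  def binarize(lhs: NT, rhs: List[Sym]): Unit =
    if (rhs.length <= 2) result += Rule(lhs, rhs)
    else {
      val head = fresh("'" + lhs.name)
      binarize(head, rhs.init)
      result += Rule(lhs, List(head, rhs.last))
    }
  for (rule <- g.rules) rule.rhs match {
    case List(Word(_)) => result += rule               // already of the form X -> w
    case rhs           => binarize(rule.lhs, lexicalized(rhs))
  }
  CFG(g.start, result.toList)
}
```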
| LHS | RHS |
|-----------|---------------------|
| 'S | 'NP_p 'VP_p |
| 'S | 'NP_s 'VP_s |
| ''VP_s3 | 'raps5 'in6 |
| 'ADJ | silly |
| 'Matko0 | Matko |
| 'NP_p | 'Matko0 'raps1 |
| 'NP_s | Matko |
| 'StatNLP4 | StatNLP |
| 'VP_p | 'are2 'ADJ |
| 'VP_s | ''VP_s3 'StatNLP4 |
| 'are2 | are |
| 'in6 | in |
| 'raps1 | raps |
| 'raps5 | raps |
The CYK algorithm caches, for each span in the sentence, all possible trees that can cover the span according to the CFG. Again we record all changes throughout the algorithm for later visualization.
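A sketch of the core chart construction (without the change-recording used for visualization; names are assumptions):

```scala
// CYK for a grammar in CNF: chart(i)(j) holds all trees covering words i until j.
def cyk(g: CFG, words: Vector[String]): List[Tree] = {
  val n     = words.length
  val chart = Array.fill(n + 1, n + 1)(List.empty[Tree])
  // Width-1 spans: unary rules X -> w.
  for (i <- 0 until n)
    chart(i)(i + 1) = g.rules.collect {
      case Rule(lhs, List(Word(w))) if w == words(i) => Node(lhs.name, List(Leaf(w)))
    }
  // Wider spans: binary rules X -> Y Z over every split point k.
  for {
    width <- 2 to n
    i     <- 0 to n - width
    j      = i + width
    k     <- i + 1 until j
    left  <- chart(i)(k)
    right <- chart(k)(j)
    rule  <- g.rules
    if rule.rhs == List(NT(left.label), NT(right.label))
  } chart(i)(j) :+= Node(rule.lhs.name, List(left, right))
  chart(0)(n).filter(_.label == g.start.name)   // complete parses rooted in 'S
}
```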
Let us run the algorithm.
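For example, on the normalized grammar and a sentence the grammar covers:

```scala
val cnf    = toCNF(cfg)
val parses = cyk(cnf, Vector("Matko", "raps", "are", "silly"))
```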
Can we assign the better parse a higher probability?
Probabilistic Context Free Grammars (PCFGs) are Context Free Grammars in which rules have probabilities. More formally, a PCFG consists of

* a CFG $(N, \Sigma, R, S)$, and
* a parameter $q(\alpha \to \beta)$ for each rule $\alpha \to \beta \in R$, such that for each non-terminal $\alpha \in N$ the parameters of all rules with left-hand side $\alpha$ form a probability distribution: $\sum_{\alpha \to \beta \in R} q(\alpha \to \beta) = 1$.
A PCFG defines a probability distribution over parse trees as follows. Given a parse tree $t$ that contains the rules $\alpha_1 \to \beta_1, \ldots, \alpha_n \to \beta_n$, the probability of this tree under the PCFG is:

$$p(t) = \prod_{i=1}^{n} q(\alpha_i \to \beta_i)$$
Notice that we can develop and operate parsers with the structured prediction recipe: we have a model (the PCFG with its distribution $p$), parameters (the rule probabilities $q$) that need to be estimated on a training set, and the prediction/search problem of finding the most likely parse tree given a sentence. The next sections will cover these aspects.
Before we show examples, let us define a Scala data structure for PCFGs.
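A minimal extension of the CFG sketch, pairing each rule with its probability (assumed names again):

```scala
// A PCFG: rules paired with probabilities; for each LHS they should sum to 1.
case class PCFG(start: NT, rules: List[(Rule, Double)]) {
  def expansions(nt: NT): List[(Rule, Double)] = rules.filter(_._1.lhs == nt)
}
```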
Let us now create an example PCFG.
| LHS | RHS | Prob |
|-------|-------------------|------|
| 'S | 'Subj 'VP | 1.0 |
| 'Obj | the elephant | 0.5 |
| 'Obj | the elephant 'PP | 0.5 |
| 'PP | in his pyjamas | 1.0 |
| 'Subj | He | 1.0 |
| 'VP | 'Verb 'Obj | 0.7 |
| 'VP | 'Verb 'Obj 'PP | 0.3 |
| 'Verb | shot | 1.0 |
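In code, a hypothetical construction mirroring the table:

```scala
val pcfg = PCFG(NT("S"), List(
  Rule(NT("S"),    List(NT("Subj"), NT("VP")))                     -> 1.0,
  Rule(NT("Obj"),  List(Word("the"), Word("elephant")))            -> 0.5,
  Rule(NT("Obj"),  List(Word("the"), Word("elephant"), NT("PP")))  -> 0.5,
  Rule(NT("PP"),   List(Word("in"), Word("his"), Word("pyjamas"))) -> 1.0,
  Rule(NT("Subj"), List(Word("He")))                               -> 1.0,
  Rule(NT("VP"),   List(NT("Verb"), NT("Obj")))                    -> 0.7,
  Rule(NT("VP"),   List(NT("Verb"), NT("Obj"), NT("PP")))          -> 0.3,
  Rule(NT("Verb"), List(Word("shot")))                             -> 1.0
))
```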
Let us first focus on the prediction task: given a sentence, find the highest scoring parse tree. In a way, we have already solved a variant of this problem: we can consider a CFG as a deterministic distribution over trees, and finding a highest scoring parse is then equivalent to finding any legal tree. Our bottom-up and dynamic-program-based algorithms hence performed structured prediction.
Here we present a probabilistic variant of the CYK algorithm. Notice that again we require normalization to CNF. Exercise: how can the rule probabilities be incorporated during normalization?
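A sketch of a Viterbi-style probabilistic CYK, assuming the PCFG has already been normalized to CNF (so running it on the example grammar above first requires the normalization exercise). The unify helper here is our assumed version: it keeps, per span and non-terminal, only the most probable tree:

```scala
// Probabilistic CYK: each chart cell maps a non-terminal label to the
// highest-probability tree (and its probability) covering that span.
def pcyk(g: PCFG, words: Vector[String]): Option[(Tree, Double)] = {
  val n     = words.length
  val chart = Array.fill(n + 1, n + 1)(Map.empty[String, (Tree, Double)])
  // Keep a candidate only if it beats the cell's current tree for that label.
  def unify(cell: Map[String, (Tree, Double)], cand: (Tree, Double)) =
    if (cell.get(cand._1.label).forall(_._2 < cand._2)) cell + (cand._1.label -> cand)
    else cell
  // Width-1 spans: unary rules X -> w.
  for (i <- 0 until n; (Rule(lhs, List(Word(w))), q) <- g.rules if w == words(i))
    chart(i)(i + 1) = unify(chart(i)(i + 1), (Node(lhs.name, List(Leaf(w))), q))
  // Wider spans: binary rules X -> Y Z, probabilities multiply along the tree.
  for {
    width       <- 2 to n
    i           <- 0 to n - width
    j            = i + width
    k           <- i + 1 until j
    (left, pL)  <- chart(i)(k).values
    (right, pR) <- chart(k)(j).values
    (rule, q)   <- g.rules
    if rule.rhs == List(NT(left.label), NT(right.label))
  } chart(i)(j) = unify(chart(i)(j), (Node(rule.lhs.name, List(left, right)), q * pL * pR))
  chart(0)(n).get(g.start.name)   // best parse rooted in the start symbol, if any
}
```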
The above algorithm returns only a single parse for the given sentence: the one with the highest probability. You can adapt it to return all parses by changing the unify method to return its input as is.