Parsing

Based on Mike Collins' lecture.

It is useful to understand the syntactic structure of a sentence: where are the verbs, what are the subject and object of the verbs, which phrases form coherent sub-structures?

In linguistics these questions are asked in the field of syntax, from the Greek syntaxis (arrangement). There are three core concepts:

  • Constituency: groups of words act as single units.
  • Grammatical Relations: subject, direct object, etc.
  • Subcategorization: restrictions on the type of phrases that go with certain words.

Context Free Grammars

Context Free Grammars (CFGs) define rules that expand non-terminal symbols into sequences of terminal and non-terminal symbols.

More formally, a CFG is a 4-tuple G = (N, Σ, R, S) where

  • N is a set of non-terminal symbols.
  • Σ is a set of terminal symbols.
  • R is a finite set of rules X → Y1 Y2 … Yn where X ∈ N and each Yi ∈ N ∪ Σ.
  • S ∈ N is a start symbol.

Before we show examples, let us define a Scala data structure for CFGs.


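A minimal sketch of such a data structure is given below. The NonTerminal and Terminal case classes mirror the parser output shown later in these notes; the GrammarSymbol trait and the Rule and CFG names are our own choices.

sealed trait GrammarSymbol
case class NonTerminal(name: Symbol) extends GrammarSymbol
case class Terminal(word: String) extends GrammarSymbol

// a rule expands a non-terminal into a sequence of terminals and non-terminals
case class Rule(lhs: NonTerminal, rhs: List[GrammarSymbol])

case class CFG(rules: Set[Rule], start: NonTerminal) {
  // all rules that can expand the given non-terminal
  def expansions(nt: NonTerminal): Set[Rule] = rules.filter(_.lhs == nt)
}
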
Let us now create an example CFG:

'S → 'NP_s 'VP_s
'S → 'NP_p 'VP_p
'ADJ → silly
'NP_p → Matko raps
'NP_s → Matko
'VP_p → are 'ADJ
'VP_s → raps in StatNLP

(Left-most) Derivation

A left-most derivation given a CFG G is a sequence of strings s1 … sn such that

  • s1 = S, that is, the first string consists only of the start symbol.
  • sn ∈ Σ*, that is, the last string consists only of terminals.
  • Each si for i > 1 is generated by replacing the left-most non-terminal α in si−1 with the right-hand side of any rule that has α as its left-hand side.

Let us write some code that puts this definition into action and generates random derivations based on a grammar.

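One possible implementation is sketched below, on top of the CFG structure above. It assumes every non-terminal has at least one expansion, and it may not terminate on grammars with unbounded recursion.

import scala.util.Random

// returns the sequence of strings s1 … sn of a random left-most derivation
def leftMostDerivation(cfg: CFG, random: Random = new Random): List[List[GrammarSymbol]] = {
  def expand(current: List[GrammarSymbol]): List[List[GrammarSymbol]] =
    current.indexWhere(_.isInstanceOf[NonTerminal]) match {
      case -1 => List(current) // only terminals left: the derivation is complete
      case i =>
        val nt = current(i).asInstanceOf[NonTerminal]
        val options = cfg.expansions(nt).toIndexedSeq
        val rule = options(random.nextInt(options.size)) // pick an applicable rule at random
        current :: expand(current.take(i) ++ rule.rhs ++ current.drop(i + 1))
    }
  expand(List(cfg.start))
}
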
Let us generate an example derivation.

'S
'NP_s 'VP_s
Matko 'VP_s
Matko raps in StatNLP

Parse Trees

Derivations can be compactly represented as trees where each non-leaf node corresponds to an expanded left-hand side and its children to the rule's right-hand side.

A Scala data structure to represent trees:

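A sketch matching the Node/Leaf output shown further below; the render helper prints trees in the bracketed notation used throughout these notes.

sealed trait ParseTree
case class Node(label: NonTerminal, children: List[ParseTree]) extends ParseTree
case class Leaf(terminal: Terminal) extends ParseTree

// bracketed rendering, e.g. ('VP_p are ('ADJ silly))
def render(tree: ParseTree): String = tree match {
  case Leaf(Terminal(word)) => word
  case Node(NonTerminal(sym), children) =>
    (("'" + sym.name) +: children.map(render)).mkString("(", " ", ")")
}
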
We can construct trees through the usual case class constructors, and render them graphically.

('S ('NP_p Matko raps) ('VP_p are ('ADJ silly)))

In the same way we generated derivations before, we can now generate parse trees from a CFG.

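A sketch, analogous to the derivation generator above (the same caveat about recursive grammars applies):

// grow a random tree downwards from the given symbol
def generateTree(cfg: CFG, symbol: GrammarSymbol, random: Random = new Random): ParseTree =
  symbol match {
    case t: Terminal => Leaf(t)
    case nt: NonTerminal =>
      val options = cfg.expansions(nt).toIndexedSeq
      val rule = options(random.nextInt(options.size)) // pick an expansion at random
      Node(nt, rule.rhs.map(generateTree(cfg, _, random)))
  }
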
Now let us generate a tree, starting from a non-terminal in the CFG.

('S ('NP_s Matko) ('VP_s raps in StatNLP))

Finding Parse Trees

Parsing: find a legal parse tree given a sentence and a grammar.

  • Top-Down: Start with the start symbol and generate trees; backtrack if they do not match observed sentence.
  • Bottom-Up: Start with the sentence, and find rules that generate parts of it; backtrack if no start symbol can be induced.
  • Dynamic Programming: Explore several trees in parallel and re-use computations.

Bottom-Up Parsing with Backtracking

Reduce a stack of trees using rules, shift tokens onto the stack, and backtrack when we reach the end of the sentence but have more than one tree on the stack.

A Scala bottom-up parser:

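Below is a sketch of such a parser, built on the structures above. Unlike the code that produced the trace below, it does not record its transitions, and it tries reductions before shifts; it can loop on grammars with cyclic unary rules, which ours does not have.

def bottomUpParse(cfg: CFG, sentence: List[String]): Option[ParseTree] = {
  def label(tree: ParseTree): GrammarSymbol = tree match {
    case Node(nt, _) => nt
    case Leaf(term)  => term
  }
  def search(stack: List[ParseTree], input: List[String]): Option[ParseTree] = {
    // all possible reductions: rules whose right-hand side matches the top of the stack
    val reduced = cfg.rules.toList.flatMap { rule =>
      val n = rule.rhs.length
      if (stack.length >= n && stack.takeRight(n).map(label) == rule.rhs)
        List(stack.dropRight(n) :+ Node(rule.lhs, stack.takeRight(n)))
      else Nil
    }
    // try each reduction; a None result means backtracking to the next alternative
    reduced.view.flatMap(s => search(s, input)).headOption.orElse {
      input match {
        case word :: rest => search(stack :+ Leaf(Terminal(word)), rest) // shift
        case Nil => stack match {
          case List(t @ Node(nt, _)) if nt == cfg.start => Some(t) // complete parse
          case _ => None // dead end
        }
      }
    }
  }
  search(Nil, sentence)
}
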
res22: ParseTree = Node(NonTerminal('S),List(Node(NonTerminal('NP_s),List(Leaf(Terminal(Matko)))), Node(NonTerminal('VP_s),List(Leaf(Terminal(raps)), Leaf(Terminal(in)), Leaf(Terminal(StatNLP))))))

Let us run an example parse.

'Init MCRiedel raps are silly
'Shift MCRiedel raps are silly
'Shift MCRiedel | raps are silly
'Shift MCRiedel | raps | are silly
'Shift MCRiedel | raps | are | silly
'Reduce MCRiedel | raps | are | ('ADJ silly)
'Reduce MCRiedel | raps | ('VP_p are ('ADJ silly))
'Backtrack MCRiedel | raps | are | ('ADJ silly)
'Backtrack MCRiedel | raps | are | silly

Dynamic Programming for Parsing

A problem with the bottom-up parser is the fact that, after backtracking, it may redo many steps it has already performed (can you find examples of this behaviour in the transitions above?). This suggests remembering steps and reusing them when needed. This is the general idea behind dynamic programming: caching and reusing computation whenever possible.

Chomsky Normal Form

In the case of CFG parsing there exists a very effective dynamic program, the so-called Cocke–Younger–Kasami (CYK) algorithm. However, before we can apply this algorithm we need to normalize the grammar. In particular, we need to make sure that each rule has one of the following forms:

  • α → β γ where β, γ ∈ N ∖ {S}.
  • α → t where t ∈ Σ.

In words: each rule is either binary and expands into two non-terminal non-Start symbols, or unary and expands into a word. This form is called Chomsky Normal Form (CNF).

Fortunately, we can convert every CFG into an equivalent CFG in CNF, in the sense that any derivation or parse of a sentence in one grammar can be losslessly converted to a derivation in the other grammar. We present this conversion in Scala below, but omit cases not relevant to our grammar (Exercise: add these cases).

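A sketch of this conversion is given below. It covers only the two cases our grammar needs: terminals inside longer right-hand sides are replaced by fresh pre-terminals (like 'Matko0 in the output), and right-hand sides with more than two symbols are binarized from the left (like ''VP_s3).

def toCNF(cfg: CFG): CFG = {
  var count = -1
  def fresh(base: String): NonTerminal = { count += 1; NonTerminal(Symbol(base + count)) }
  val converted = cfg.rules.toList.flatMap { rule =>
    if (rule.rhs.length == 1) List(rule) // assumed to be a word rule already
    else {
      // (1) give every terminal in a longer right-hand side its own pre-terminal rule
      val (lifted, preRules) = rule.rhs.map {
        case Terminal(w) => val nt = fresh(w); (nt, List(Rule(nt, List(Terminal(w)))))
        case s           => (s, Nil)
      }.unzip
      // (2) binarize from the left: X → Y1 Y2 Y3 becomes 'X → Y1 Y2 and X → 'X Y3
      def binarize(lhs: NonTerminal, rhs: List[GrammarSymbol]): List[Rule] = rhs match {
        case y1 :: y2 :: rest if rest.nonEmpty =>
          val nt = fresh("'" + lhs.name.name)
          Rule(nt, List(y1, y2)) :: binarize(lhs, nt :: rest)
        case _ => List(Rule(lhs, rhs))
      }
      binarize(rule.lhs, lifted) ++ preRules.flatten
    }
  }
  CFG(converted.toSet, cfg.start)
}
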
'S → 'NP_p 'VP_p
'S → 'NP_s 'VP_s
''VP_s3 → 'raps5 'in6
'ADJ → silly
'Matko0 → Matko
'NP_p → 'Matko0 'raps1
'NP_s → Matko
'StatNLP4 → StatNLP
'VP_p → 'are2 'ADJ
'VP_s → ''VP_s3 'StatNLP4
'are2 → are
'in6 → in
'raps1 → raps
'raps5 → raps

CYK algorithm

The CYK algorithm caches, for each span in the sentence, all possible trees that can cover the span according to the CFG. Again we record all changes throughout the algorithm for later visualization.

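A sketch of the algorithm for a grammar in CNF, using the structures defined above (the recording of chart updates is omitted here):

def cyk(cfg: CFG, words: IndexedSeq[String]): List[ParseTree] = {
  val n = words.length
  // chart(i)(j) holds all trees whose root derives words i to j-1
  val chart = Array.fill(n + 1, n + 1)(List.empty[ParseTree])
  def root(tree: ParseTree): GrammarSymbol = tree match {
    case Node(nt, _) => nt
    case Leaf(term)  => term
  }
  // width-1 spans are filled by the unary word rules
  for (i <- 0 until n)
    chart(i)(i + 1) = cfg.rules.toList.collect {
      case Rule(lhs, List(Terminal(w))) if w == words(i) => Node(lhs, List(Leaf(Terminal(w))))
    }
  // wider spans combine two adjacent smaller spans via the binary rules
  for (width <- 2 to n; i <- 0 to n - width; j = i + width; k <- i + 1 until j)
    chart(i)(j) = chart(i)(j) ++ (for {
      Rule(lhs, List(b, c)) <- cfg.rules.toList
      left  <- chart(i)(k) if root(left) == b
      right <- chart(k)(j) if root(right) == c
    } yield Node(lhs, List(left, right)))
  chart(0)(n).filter(tree => root(tree) == cfg.start)
}
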
Let us run the algorithm.

('S ('NP_s Matko) ('VP_s (''VP_s3 ('raps5 raps) ('in6 in)) ('StatNLP4 StatNLP)))

Exercise: Manually perform CYK for this data.

Ambiguity

Sentences can have several legal parse trees.

('S ('Subj He) ('VP ('Verb shot) ('Obj the elephant ('PP in his pyjamas))))

Can we assign the better parse a higher probability?

Probabilistic Context Free Grammars

Probabilistic Context Free Grammars (PCFGs) are Context Free Grammars in which rules have probabilities. More formally, a PCFG consists of

  • A Context Free Grammar G = (N, Σ, R, S).
  • A parameter q(α → β) for each rule α → β ∈ R. For each possible left-hand side α ∈ N we require ∑β q(α → β) = 1.

A PCFG defines a probability distribution over parse trees as follows. Given a parse tree t that contains the rules α1 → β1, …, αn → βn, the probability of this tree under the PCFG is:

p(t) = ∏i=1…n q(αi → βi)

Notice that we can develop and operate parsers with the structured prediction recipe: we have a model p, some parameters θ that need to be estimated on a training set, and the prediction/search problem of finding the most likely parse tree given a sentence. The next sections will cover these aspects.

Before we show examples, let us define a Scala data structure for PCFGs.

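A minimal sketch: we simply pair every rule with its parameter q(α → β).

case class PCFG(rules: Map[Rule, Double], start: NonTerminal) {
  def q(rule: Rule): Double = rules.getOrElse(rule, 0.0)
}
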
Let us now create an example PCFG.

'S → 'Subj 'VP 1.0
'Obj → the elephant 0.5
'Obj → the elephant 'PP 0.5
'PP → in his pyjamas 1.0
'Subj → He 1.0
'VP → 'Verb 'Obj 0.7
'VP → 'Verb 'Obj 'PP 0.3
'Verb → shot 1.0

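As a worked example of the tree probability formula above: under this PCFG, the parse that attaches 'PP to 'Obj has probability q('S → 'Subj 'VP) · q('Subj → He) · q('VP → 'Verb 'Obj) · q('Verb → shot) · q('Obj → the elephant 'PP) · q('PP → in his pyjamas) = 1.0 · 1.0 · 0.7 · 1.0 · 0.5 · 1.0 = 0.35, while the parse that attaches 'PP to 'VP instead uses q('VP → 'Verb 'Obj 'PP) = 0.3 and q('Obj → the elephant) = 0.5, giving 0.15.
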
Let us first focus on the prediction task: given a sentence, find the highest scoring parse tree. In a way, we have already solved a variant of this problem. We can consider a CFG as a deterministic distribution over trees, and finding a highest scoring parse is then equivalent to finding any legal tree; our bottom-up and dynamic programming based algorithms hence already performed structured prediction.

CYK for PCFGs

Here we present a probabilistic variant of the CYK algorithm. Notice that we again require normalization to CNF. Exercise: how would you incorporate the rule probabilities during normalization?

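A sketch of this variant, assuming the structures above: instead of storing all trees per span, the chart keeps, for each non-terminal, only the highest-probability tree covering the span.

def pcyk(pcfg: PCFG, words: IndexedSeq[String]): Option[ParseTree] = {
  val n = words.length
  // chart(i)(j) maps each non-terminal to its best score and tree for words i to j-1
  val chart = Array.fill(n + 1, n + 1)(Map.empty[NonTerminal, (Double, ParseTree)])
  def bestSoFar(i: Int, j: Int, nt: NonTerminal): Double =
    chart(i)(j).get(nt).map(_._1).getOrElse(0.0)
  // width-1 spans: unary word rules
  for {
    i <- 0 until n
    (Rule(lhs, List(Terminal(w))), q) <- pcfg.rules
    if w == words(i) && q > bestSoFar(i, i + 1, lhs)
  } chart(i)(i + 1) += lhs -> (q, Node(lhs, List(Leaf(Terminal(w)))))
  // wider spans: binary rules, keeping the maximum probability per non-terminal
  for {
    width <- 2 to n; i <- 0 to n - width; j = i + width; k <- i + 1 until j
    (Rule(lhs, List(b: NonTerminal, c: NonTerminal)), q) <- pcfg.rules
    (pb, tb) <- chart(i)(k).get(b)
    (pc, tc) <- chart(k)(j).get(c)
    p = q * pb * pc
    if p > bestSoFar(i, j, lhs)
  } chart(i)(j) += lhs -> (p, Node(lhs, List(tb, tc)))
  chart(0)(n).get(pcfg.start).map(_._2)
}
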
The above algorithm now returns only a single parse for the given sentence: the one with the highest probability. You can adapt it to return all parses by changing the unify method to return its input as is.

('S ('Subj He) ('VP ('Verb shot) ('Obj the elephant ('PP in his pyjamas))))