Parsing

Note that this chapter is heavily influenced by the structure and content of Mike Collins' PCFG lecture.

In many NLP applications it is useful to understand the syntactic structure of a sentence: where are the verbs, what are the subject and object of the verbs, which phrases form coherent sub-structures of the sentence? Understanding this enables the machine to more effectively translate from Japanese to English, or to understand the query "who is the president of the united states" and execute it against a database.

In linguistics these questions are asked in the field of syntax, from the Greek syntaxis (arrangement). There are three core concepts:

  • Constituency: groups of words act as single units.
  • Grammatical Relations: object, subject, direct object etc.
  • Subcategorization: restrictions on the type of phrases that go with certain words.

Context Free Grammars

A common approach to capture constituency, grammatical relations and subcategorization is based on Context Free Grammars (CFGs). At a high level, these grammars assume that legal sentences can be derived by repeatedly and independently expanding abstract symbols (such as "NounPhrase" or "Adjective") into more concrete sequences of symbols (such as "Adjective Noun" or "green") until each symbol is a concrete word.

More formally, a CFG is a 4-tuple $G=(N,\Sigma,R,S)$ where

  • $N$ is a set of non-terminal symbols.
  • $\Sigma$ is a set of terminal symbols.
  • $R$ is a finite set of rules $X \rightarrow Y_1 Y_2 \ldots Y_n$ where $X \in N$ and $Y_i \in N \cup \Sigma$.
  • $S \in N$ is a start symbol.

Before we show examples, let us define a Scala data structure for CFGs.
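The definitions below are a minimal sketch, assuming names consistent with the output printed later in this chapter (NonTerminal and Terminal appear there verbatim; Rule, CFG and the field names are our assumptions):

```scala
// A grammar symbol is either a non-terminal or a terminal word.
sealed trait Symbol
case class NonTerminal(name: scala.Symbol) extends Symbol
case class Terminal(word: String) extends Symbol

// A rule expands a non-terminal into a sequence of symbols.
case class Rule(lhs: NonTerminal, rhs: Seq[Symbol])

// A grammar is a set of rules together with a start symbol.
case class CFG(rules: Set[Rule], start: NonTerminal)
```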


Let us now create an example CFG. Notice that we implement implicit conversion methods which allow us to construct terminals, non-terminals and rules more succinctly. We omit these methods here for brevity.
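Written out without those helpers, the construction might look as follows (a hypothetical reconstruction of the omitted cell):

```scala
// Explicit construction of the example grammar; the omitted implicit
// conversions would make this considerably shorter.
val S    = NonTerminal('S)
val NP_s = NonTerminal('NP_s); val NP_p = NonTerminal('NP_p)
val VP_s = NonTerminal('VP_s); val VP_p = NonTerminal('VP_p)
val ADJ  = NonTerminal('ADJ)

val cfg = CFG(Set(
  Rule(S,    Seq(NP_s, VP_s)),
  Rule(S,    Seq(NP_p, VP_p)),
  Rule(ADJ,  Seq(Terminal("silly"))),
  Rule(NP_p, Seq(Terminal("Matko"), Terminal("raps"))),
  Rule(NP_s, Seq(Terminal("Matko"))),
  Rule(VP_p, Seq(Terminal("are"), ADJ)),
  Rule(VP_s, Seq(Terminal("raps"), Terminal("in"), Terminal("StatNLP")))
), start = S)
```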

'S → 'NP_s 'VP_s
'S → 'NP_p 'VP_p
'ADJ → silly
'NP_p → Matko raps
'NP_s → Matko
'VP_p → are 'ADJ
'VP_s → raps in StatNLP

(Left-most) Derivation

A left-most derivation given a CFG $G$ is a sequence of strings $s_1 \ldots s_n$ such that

  • $s_1 = S$, that is, the first string consists only of the start symbol.
  • $s_n \in \Sigma^*$, that is, the last string consists of only terminals.
  • Each $s_i$ for $i > 1$ is generated by replacing the left-most non-terminal $\alpha$ in $s_{i-1}$ with the right-hand side of any rule that has $\alpha$ as left-hand side.

Let us write some code that puts this definition into action and generates random derivations based on a grammar.
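A sketch of such a generator, using the CFG classes above (the name leftMostDerivation is ours, not the notebook's):

```scala
import scala.util.Random

// Randomly generate a left-most derivation: repeatedly replace the
// left-most non-terminal by the right-hand side of a random matching
// rule. Assumes every non-terminal has at least one rule and that
// random expansion terminates.
def leftMostDerivation(cfg: CFG, random: Random = new Random): List[List[Symbol]] = {
  def expand(current: List[Symbol]): List[List[Symbol]] =
    current.indexWhere(_.isInstanceOf[NonTerminal]) match {
      case -1 => Nil // only terminals left: the derivation is complete
      case i =>
        val nt = current(i).asInstanceOf[NonTerminal]
        val rules = cfg.rules.filter(_.lhs == nt).toIndexedSeq
        val rule = rules(random.nextInt(rules.size))
        val next = current.take(i) ++ rule.rhs ++ current.drop(i + 1)
        next :: expand(next)
    }
  val start: List[Symbol] = List(cfg.start)
  start :: expand(start)
}
```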

Let us generate an example derivation.

'S
'NP_s 'VP_s
Matko 'VP_s
Matko raps in StatNLP

Parse Trees

Derivations can be compactly represented as trees where each non-leaf node corresponds to an expanded left-hand side and its children to the rule's right-hand side.

A Scala data structure to represent trees:
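A minimal sketch (Node and Leaf match the output printed further below; the field names are assumptions):

```scala
// A parse tree is either an inner node labelled with a non-terminal
// and carrying child trees, or a leaf holding a terminal word.
sealed trait ParseTree
case class Node(label: NonTerminal, children: Seq[ParseTree]) extends ParseTree
case class Leaf(terminal: Terminal) extends ParseTree
```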

We can construct trees through the usual case class constructors, and render them graphically.

('S ('NP_p Matko raps) ('VP_p are silly))

In the same way we could generate derivations before, we can now generate parse trees from a CFG.
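A sketch, analogous to the derivation generator above (generateTree is an assumed helper name):

```scala
// Recursively expand a symbol into a random parse tree.
def generateTree(cfg: CFG, symbol: Symbol, random: Random = new Random): ParseTree =
  symbol match {
    case t: Terminal => Leaf(t)
    case nt: NonTerminal =>
      val rules = cfg.rules.filter(_.lhs == nt).toIndexedSeq
      val rule = rules(random.nextInt(rules.size))
      Node(nt, rule.rhs.map(generateTree(cfg, _, random)))
  }
```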

Now let us generate a tree, starting from a non-terminal in the CFG.

('S ('NP_s Matko) ('VP_s raps in StatNLP))

Finding Parse Trees

As you can see, parse trees uncover how a sentence is structured with respect to a grammar. In fact, they give us insights about both constituency and grammatical relations: in the above case the subtrees headed by "NP" nodes represent noun phrase constituents; likewise, the "NP" and "VP" subtrees under "S" represent the subject and object of a sentence. It is interesting to generate trees, and hence sentences, given a starting symbol. However, in practice it is often more useful to find a parse tree given a sentence, if one exists. This process is generally referred to as parsing.

There are several approaches to finding a legal parse tree given a sentence and a grammar:

  • Top-Down: Start with the start symbol and generate trees; backtrack if they do not match observed sentence.
  • Bottom-Up: Start with the sentence, and find rules that generate parts of it; backtrack if no start symbol can be induced.
  • Dynamic Programming: Explore several trees in parallel and re-use computations.

Bottom-Up Parsing with Backtracking

A bottom-up parser with backtracking maintains a state consisting of a stack of processed trees, a buffer of remaining words, and a history of previous states we can backtrack to if we cannot make more progress. Roughly speaking, the algorithm proceeds by iteratively reducing the stack via rules to new trees, moving forward in the sentence, and backtracking to previous states when we reach the end of the sentence but cannot create a single tree.

In Scala code the parsing algorithm can be formulated as follows. Notice the use of immutable data structures, which makes storing and recalling previous states very efficient. Also notice that we record a history of transitions. This is not needed for the algorithm to work, but allows us to inspect its behaviour afterwards.
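The following is a sketch of such a parser under the data structures assumed above; unlike the notebook's version it does not record the transition history, and backtracking is implicit in the depth-first recursion:

```scala
// Root label of a tree: the non-terminal of a node, or the terminal of a leaf.
def rootLabel(tree: ParseTree): Symbol = tree match {
  case Node(nt, _) => nt
  case Leaf(t)     => t
}

// Backtracking shift-reduce parser. The stack grows at the head of the
// list. Caution: unary non-terminal rules or empty right-hand sides
// would make the search loop; our example grammar has neither.
def parseBottomUp(cfg: CFG, words: Seq[String]): Option[ParseTree] = {
  def search(stack: List[ParseTree], buffer: List[String]): Option[ParseTree] =
    (stack, buffer) match {
      // success: all words consumed, a single tree rooted in the start symbol
      case (List(tree @ Node(label, _)), Nil) if label == cfg.start => Some(tree)
      case _ =>
        // shift the next word onto the stack, if any
        val shifted = buffer match {
          case w :: rest => search(Leaf(Terminal(w)) :: stack, rest)
          case Nil       => None
        }
        // otherwise reduce: any rule whose rhs matches the top of the stack
        def reduced = cfg.rules.view.flatMap { rule =>
          val n   = rule.rhs.length
          val top = stack.take(n).reverse
          if (top.length == n && top.map(rootLabel) == rule.rhs.toList)
            search(Node(rule.lhs, top) :: stack.drop(n), buffer)
          else None
        }.headOption
        shifted.orElse(reduced)
    }
  search(Nil, words.toList)
}

parseBottomUp(cfg, Seq("Matko", "raps", "in", "StatNLP")).get
```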

res22: ParseTree = Node(NonTerminal('S),List(Node(NonTerminal('NP_s),List(Leaf(Terminal(Matko)))), Node(NonTerminal('VP_s),List(Leaf(Terminal(raps)), Leaf(Terminal(in)), Leaf(Terminal(StatNLP))))))

Let us run an example parse.

'Init       stack:                                            buffer: MCRiedel raps are silly
'Shift      stack: MCRiedel                                   buffer: raps are silly
'Shift      stack: MCRiedel | raps                            buffer: are silly
'Shift      stack: MCRiedel | raps | are                      buffer: silly
'Shift      stack: MCRiedel | raps | are | silly              buffer:
'Reduce     stack: MCRiedel | raps | are | ('ADJ silly)       buffer:
'Reduce     stack: MCRiedel | raps | ('VP_p are ('ADJ silly)) buffer:
'Backtrack  stack: MCRiedel | raps | are | ('ADJ silly)       buffer:
'Backtrack  stack: MCRiedel | raps | are | silly              buffer:

Dynamic Programming for Parsing

A problem with the bottom-up parser is the fact that, after backtracking, it may redo many steps it had already performed before (can you find examples of this behaviour in the transitions above?). This suggests remembering steps and re-using them when needed. This is the general idea behind dynamic programming: caching and reusing computation whenever possible.

Chomsky Normal Form

In the case of CFG parsing there exists a very effective dynamic program, the so-called Cocke–Younger–Kasami (CYK) algorithm. However, before we can apply this algorithm we need to normalize the grammar. In particular, we need to make sure that each rule has one of the following forms:

  • $\alpha \rightarrow \beta\ \gamma$ where $\beta, \gamma \in N \setminus \{S\}$.
  • $\alpha \rightarrow t$ where $t \in \Sigma$.

In words: each rule is either binary and expands into two non-terminal non-Start symbols, or unary and expands into a word. This form is called Chomsky Normal Form (CNF).

Fortunately we can convert every CFG into an equivalent CFG in CNF, in the sense that any derivation or parse of a sentence in one grammar can be losslessly converted to a derivation in the other grammar. We present this conversion in Scala below, but omit cases not relevant to our grammar (Exercise: add these cases).
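The sketch below is an assumed reconstruction; the fresh non-terminal names it invents will not exactly match the output shown next, and unary non-terminal rules are among the omitted cases:

```scala
// Convert a CFG to Chomsky Normal Form. Omitted cases (e.g. unary
// non-terminal rules, empty right-hand sides) raise a MatchError.
def toCNF(cfg: CFG): CFG = {
  var counter = -1
  def freshNT(base: String): NonTerminal = {
    counter += 1
    NonTerminal(scala.Symbol(base + counter))
  }
  def convert(rule: Rule): Set[Rule] = rule.rhs match {
    // already CNF: a single terminal
    case Seq(_: Terminal) => Set(rule)
    // already CNF: exactly two non-terminals
    case Seq(_: NonTerminal, _: NonTerminal) => Set(rule)
    // replace each terminal in a longer rhs by a fresh pre-terminal
    case rhs if rhs.exists(_.isInstanceOf[Terminal]) =>
      val (newRhs, extra) = rhs.map {
        case t: Terminal     => val nt = freshNT(t.word); (nt, Set(Rule(nt, Seq(t))))
        case nt: NonTerminal => (nt, Set.empty[Rule])
      }.unzip
      convert(Rule(rule.lhs, newRhs)) ++ extra.flatten
    // binarize: X -> Y1 Y2 ... Yn becomes X -> Y1 X' plus X' -> Y2 ... Yn
    case Seq(y1: NonTerminal, rest @ _*) if rest.length >= 2 =>
      val nt = freshNT("'" + rule.lhs.name.name)
      convert(Rule(nt, rest)) + Rule(rule.lhs, Seq(y1, nt))
  }
  CFG(cfg.rules.flatMap(convert), cfg.start)
}
```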

'S → 'NP_p 'VP_p
'S → 'NP_s 'VP_s
''VP_s3 → 'raps5 'in6
'ADJ → silly
'Matko0 → Matko
'NP_p → 'Matko0 'raps1
'NP_s → Matko
'StatNLP4 → StatNLP
'VP_p → 'are2 'ADJ
'VP_s → ''VP_s3 'StatNLP4
'are2 → are
'in6 → in
'raps1 → raps
'raps5 → raps

CYK Algorithm

The CYK algorithm caches, for each span in the sentence, all possible trees that can cover the span according to the CFG. Again we record all changes throughout the algorithm for later visualization.
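A sketch of the algorithm for a grammar in CNF (an assumed reconstruction, without the change recording):

```scala
// CYK: chart(w)(i) holds all trees covering words i until i+w.
// Assumes a CNF grammar and a non-empty sentence.
def cyk(cfg: CFG, words: IndexedSeq[String]): List[ParseTree] = {
  val n = words.length
  val chart = Array.fill(n + 1, n)(List.empty[ParseTree])
  // width 1: lexical rules X -> word
  for (i <- 0 until n; rule <- cfg.rules; if rule.rhs == Seq(Terminal(words(i))))
    chart(1)(i) = Node(rule.lhs, Seq(Leaf(Terminal(words(i))))) :: chart(1)(i)
  // wider spans: combine two sub-trees with a binary rule X -> Y Z
  for {
    width <- 2 to n
    start <- 0 to n - width
    split <- 1 until width
    left  <- chart(split)(start)
    right <- chart(width - split)(start + split)
    rule  <- cfg.rules
    if rule.rhs == Seq(rootLabel(left), rootLabel(right))
  } chart(width)(start) = Node(rule.lhs, Seq(left, right)) :: chart(width)(start)
  // complete parses: full-span trees rooted in the start symbol
  chart(n)(0).filter {
    case Node(label, _) => label == cfg.start
    case _              => false
  }
}

cyk(toCNF(cfg), "Matko raps in StatNLP".split(" "))
```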

Let us run the algorithm.

('S ('NP_s Matko) ('VP_s (''VP_s3 ('raps5 raps) ('in6 in)) ('StatNLP4 StatNLP)))
Exercise: manually perform the CYK algorithm for this data.

Ambiguity

For real world grammars many phrases have several legal parse trees. Which one is the correct one depends on the intent of the speaker. That said, in many cases it is quite obvious that one parse tree should be more likely, or have a higher probability. This suggests a probabilistic treatment of grammars and parsing. Before we introduce probabilistic CFGs, let us inspect a typical case of syntactic natural language ambiguity.

('S ('Subj He) ('VP ('Verb shot) ('Obj the elephant ('PP in his pyjamas))))

The example is an instance of prepositional phrase attachment ambiguity: the phrase "in his pyjamas" could either be part of the verb phrase, meaning that the shooting happened in pyjamas, or part of the noun phrase "the elephant", meaning that the elephant was in pyjamas. Both readings make syntactic sense, so there is nothing wrong with the grammar. However, we would like the machine to return the preferred reading when such a sentence is parsed. For this we need a way to assign different probabilities to different readings.

Probabilistic Context Free Grammars

Probabilistic Context Free Grammars (PCFGs) are Context Free Grammars in which rules have probabilities. More formally, a PCFG consists of

  • A Context Free Grammar $G = (N, \Sigma, R, S)$.
  • A parameter $q(\alpha \rightarrow \beta)$ for each rule $\alpha \rightarrow \beta \in R$. For each possible left-hand side $\alpha \in N$ we require $\sum_{\beta} q(\alpha \rightarrow \beta) = 1$.

A PCFG defines a probability distribution over parse trees as follows. Given a parse tree $t$ that contains the rules $\alpha_1 \rightarrow \beta_1, \ldots, \alpha_n \rightarrow \beta_n$, the probability of this tree under the PCFG is:

$$p(t) = \prod_{i=1}^{n} q(\alpha_i \rightarrow \beta_i)$$

Notice that we can develop and operate parsers using the structured prediction recipe: we have a model $p$, some parameters $\theta$ that need to be estimated on a training set, and the prediction/search problem of finding the most likely parse tree given a sentence. The next sections will cover these aspects.

Before we show examples, let us define a Scala data structure for PCFGs.
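A minimal sketch, assuming the CFG and tree classes from above (PCFG and its field names are assumptions):

```scala
// A PCFG is a CFG plus a parameter q(rule) for every rule.
case class PCFG(cfg: CFG, q: Map[Rule, Double]) {
  // p(t): the product of q over all rules used in the tree.
  def prob(tree: ParseTree): Double = tree match {
    case Node(label, children) =>
      val rule = Rule(label, children.map {
        case Node(l, _) => l
        case Leaf(t)    => t
      })
      q(rule) * children.map(prob).product
    case Leaf(_) => 1.0
  }
}
```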

Let us now create an example PCFG.
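A hypothetical reconstruction of the construction cell, matching the rules printed below:

```scala
val pcfg = {
  val S    = NonTerminal('S);   val Subj = NonTerminal('Subj)
  val VP   = NonTerminal('VP);  val Verb = NonTerminal('Verb)
  val Obj  = NonTerminal('Obj); val PP   = NonTerminal('PP)
  // split a string into a sequence of terminals
  def ws(s: String): Seq[Symbol] = s.split(" ").map(Terminal(_)).toSeq
  val q = Map(
    Rule(S,    Seq(Subj, VP))            -> 1.0,
    Rule(Obj,  ws("the elephant"))       -> 0.5,
    Rule(Obj,  ws("the elephant") :+ PP) -> 0.5,
    Rule(PP,   ws("in his pyjamas"))     -> 1.0,
    Rule(Subj, ws("He"))                 -> 1.0,
    Rule(VP,   Seq(Verb, Obj))           -> 0.7,
    Rule(VP,   Seq(Verb, Obj, PP))       -> 0.3,
    Rule(Verb, ws("shot"))               -> 1.0
  )
  PCFG(CFG(q.keySet, start = S), q)
}
```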

'S → 'Subj 'VP 1.0
'Obj → the elephant 0.5
'Obj → the elephant 'PP 0.5
'PP → in his pyjamas 1.0
'Subj → He 1.0
'VP → 'Verb 'Obj 0.7
'VP → 'Verb 'Obj 'PP 0.3
'Verb → shot 1.0
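With these parameters we can already compare the two readings of the pyjamas sentence from the previous section. The reading that attaches the prepositional phrase to the object uses the rules 'S → 'Subj 'VP, 'Subj → He, 'VP → 'Verb 'Obj, 'Verb → shot, 'Obj → the elephant 'PP and 'PP → in his pyjamas:

$$p(t_{\text{obj}}) = 1.0 \cdot 1.0 \cdot 0.7 \cdot 1.0 \cdot 0.5 \cdot 1.0 = 0.35$$

The reading that attaches it to the verb phrase uses 'VP → 'Verb 'Obj 'PP and 'Obj → the elephant instead:

$$p(t_{\text{vp}}) = 1.0 \cdot 1.0 \cdot 0.3 \cdot 1.0 \cdot 0.5 \cdot 1.0 = 0.15$$

This PCFG hence prefers the reading in which the elephant wears the pyjamas.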

Let us first focus on the prediction task: given a sentence, find the highest scoring parse tree. In a way, we have already solved a variant of this problem: we can consider a CFG as a deterministic distribution over trees, in which case finding a highest scoring parse is equivalent to finding any legal tree, and our bottom-up and dynamic-programming-based algorithms hence performed a structured prediction.

CYK for PCFGs

Here we present a probabilistic variant of the CYK algorithm. Notice that again we require normalization to CNF. Exercise: how to incorporate the rule probabilities during normalization?
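The sketch below is an assumed reconstruction: each chart cell keeps only the best-scoring tree per non-terminal (in the notebook this merging happens in the unify method mentioned below; here a hypothetical update helper plays that role):

```scala
// Probabilistic CYK (Viterbi): chart(w)(i) maps each non-terminal to
// the best (tree, probability) over the span i..i+w. Assumes a PCFG
// whose underlying grammar is in CNF.
def cykBest(pcfg: PCFG, words: IndexedSeq[String]): Option[ParseTree] = {
  val n = words.length
  if (n == 0) return None
  val chart = Array.fill(n + 1, n)(Map.empty[NonTerminal, (ParseTree, Double)])
  // keep only the best entry per cell and non-terminal
  def update(w: Int, i: Int, nt: NonTerminal, tree: ParseTree, p: Double): Unit =
    if (chart(w)(i).get(nt).forall(_._2 < p)) chart(w)(i) += nt -> (tree, p)
  // width 1: lexical rules X -> word
  for ((rule, q) <- pcfg.q; i <- 0 until n; if rule.rhs == Seq(Terminal(words(i))))
    update(1, i, rule.lhs, Node(rule.lhs, Seq(Leaf(Terminal(words(i))))), q)
  // wider spans: binary rules X -> Y Z over all split points
  for {
    width <- 2 to n
    start <- 0 to n - width
    split <- 1 until width
    (rule, q) <- pcfg.q
    nts = rule.rhs.collect { case nt: NonTerminal => nt }
    if rule.rhs.length == 2 && nts.length == 2
    (lt, lp) <- chart(split)(start).get(nts(0))
    (rt, rp) <- chart(width - split)(start + split).get(nts(1))
  } update(width, start, rule.lhs, Node(rule.lhs, Seq(lt, rt)), q * lp * rp)
  chart(n)(0).get(pcfg.cfg.start).map(_._1)
}
```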

The above algorithm now returns only a single parse for the given sentence: the one with the highest probability. You can adapt it to return all parses by changing the unify method to return its input as is.

('S ('Subj He) ('VP ('Verb shot) ('Obj the elephant ('PP in his pyjamas))))