Based on Mike Collins' lecture.
It is useful to understand the syntactic structure of a sentence: where are the verbs, what are the subject and object of the verbs, which phrases form coherent sub-structures?
In linguistics these questions are asked in the field of syntax, from the Greek syntaxis (arrangement). There are three core concepts:

Context Free Grammars define rules that expand non-terminal symbols (such as 'S for sentence) into sequences of non-terminals and terminal symbols (words).
More formally, a CFG is a 4-tuple $(N, \Sigma, R, S)$ where

* $N$ is a finite set of non-terminal symbols,
* $\Sigma$ is a finite set of terminal symbols (the words),
* $R$ is a finite set of rules $X \to Y_1 Y_2 \ldots Y_n$ with $X \in N$, $n \ge 1$ and $Y_i \in N \cup \Sigma$,
* $S \in N$ is a distinguished start symbol.
Before we show examples, let us define a Scala data structure for CFGs.
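A minimal sketch of what such a structure might look like; the type and field names here are our assumptions, not necessarily the exact API used in the course:

```scala
// Non-terminals vs. terminal words (names are illustrative assumptions).
sealed trait Sym
case class NT(name: String) extends Sym       // non-terminal, e.g. NT("S")
case class Word(token: String) extends Sym    // terminal, i.e. a word

// A rule expands a non-terminal into a sequence of symbols.
case class Rule(lhs: NT, rhs: List[Sym])

// A CFG is a start symbol plus a set of rules (N and Sigma are implicit in the rules).
case class CFG(start: NT, rules: List[Rule]) {
  def expansions(nt: NT): List[Rule] = rules.filter(_.lhs == nt)
}
```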
Let us now create an example CFG:

| LHS | RHS |
|-------|-------------------|
| 'S | 'NP_s 'VP_s |
| 'S | 'NP_p 'VP_p |
| 'ADJ | silly |
| 'NP_p | Matko raps |
| 'NP_s | Matko |
| 'VP_p | are 'ADJ |
| 'VP_s | raps in StatNLP |
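Using the sketched types from above, this grammar might be constructed as follows (a hypothetical encoding, mirroring the rule table):

```scala
val cfg = CFG(NT("S"), List(
  Rule(NT("S"),    List(NT("NP_s"), NT("VP_s"))),
  Rule(NT("S"),    List(NT("NP_p"), NT("VP_p"))),
  Rule(NT("ADJ"),  List(Word("silly"))),
  Rule(NT("NP_p"), List(Word("Matko"), Word("raps"))),
  Rule(NT("NP_s"), List(Word("Matko"))),
  Rule(NT("VP_p"), List(Word("are"), NT("ADJ"))),
  Rule(NT("VP_s"), List(Word("raps"), Word("in"), Word("StatNLP")))
))
```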
A left-most derivation given a CFG $(N, \Sigma, R, S)$ is a sequence of strings $s_1, \ldots, s_n$ such that

* $s_1 = S$, i.e. the first string consists only of the start symbol,
* $s_n \in \Sigma^*$, i.e. the last string consists only of terminals,
* each $s_i$ with $i > 1$ is generated from $s_{i-1}$ by replacing the left-most non-terminal $X$ in $s_{i-1}$ with $\beta$ for some rule $X \to \beta \in R$.
Let us write some code that puts this definition into action and generates random derivations based on a grammar.
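A sketch of such a generator, using the hypothetical types above; it repeatedly expands the left-most non-terminal with a randomly chosen rule:

```scala
import scala.util.Random

// Expand the left-most non-terminal until only words remain,
// recording every intermediate string of the derivation.
def randomDerivation(g: CFG, rng: Random): List[List[Sym]] = {
  def expand(s: List[Sym], history: List[List[Sym]]): List[List[Sym]] =
    s.indexWhere(_.isInstanceOf[NT]) match {
      case -1 => (s :: history).reverse            // all terminals: derivation complete
      case i =>
        val nt      = s(i).asInstanceOf[NT]
        val options = g.expansions(nt)             // assumes every NT has at least one rule
        val rule    = options(rng.nextInt(options.length))
        expand(s.take(i) ++ rule.rhs ++ s.drop(i + 1), s :: history)
    }
  expand(List(g.start), Nil)
}
```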
Let us generate an example derivation.
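For example, printing one derivation step per line (the result depends on the random seed):

```scala
randomDerivation(cfg, new Random(0)).foreach { step =>
  println(step.map {
    case NT(n)   => "'" + n        // render non-terminals with a leading quote
    case Word(w) => w
  }.mkString(" "))
}
```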
Derivations can be compactly presented as trees in which each non-leaf node corresponds to an expanded left-hand side and its children to the rule's right-hand side.
A Scala data structure to represent trees:
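A minimal sketch (again, the names are our assumptions):

```scala
sealed trait Tree { def label: String }
case class Leaf(label: String) extends Tree                        // a word
case class Node(label: String, children: List[Tree]) extends Tree // an expanded non-terminal
```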
We can construct trees through the usual case class constructors, and render them graphically.
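For instance, the tree ('S ('NP_s Matko) ('VP_s raps in StatNLP)) can be built directly (we omit the graphical rendering, which depends on the notebook's display utilities):

```scala
val example = Node("S", List(
  Node("NP_s", List(Leaf("Matko"))),
  Node("VP_s", List(Leaf("raps"), Leaf("in"), Leaf("StatNLP")))
))
```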
In the same way we generated derivations before, we can now generate parse trees from a CFG.
Now let us generate a tree, starting from a non-terminal in the CFG.
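A sketch of this generator; it mirrors randomDerivation but builds the tree recursively:

```scala
def randomTree(g: CFG, sym: Sym, rng: Random): Tree = sym match {
  case Word(w) => Leaf(w)                        // terminals become leaves
  case nt: NT =>                                 // pick a random rule and recurse
    val options = g.expansions(nt)
    val rule    = options(rng.nextInt(options.length))
    Node(nt.name, rule.rhs.map(randomTree(g, _, rng)))
}

// Generate a tree starting from the grammar's start symbol.
val sampledTree = randomTree(cfg, cfg.start, new Random(1))
```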
Parsing: find a legal parse tree given a sentence and a grammar.
The parser uses three operations: reduce trees on top of a stack using grammar rules, shift tokens from the sentence onto the stack, and backtrack when we reach the end of the sentence but more than one tree remains on the stack.
A Scala bottom-up parser:
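A sketch of such a parser under the types assumed above; it shifts eagerly and tries reductions (then acceptance) when backtracking. Note that a grammar with unary non-terminal cycles would make this search loop forever:

```scala
// Map a tree to the symbol it represents on the stack.
def label(t: Tree): Sym = t match {
  case Leaf(w)    => Word(w)
  case Node(n, _) => NT(n)
}

// Depth-first search over parser states (stack of partial trees, remaining words).
def parse(g: CFG, words: List[String]): Option[Tree] = {
  def step(stack: List[Tree], rest: List[String]): Option[Tree] = {
    // 1. Shift: move the next word onto the stack.
    val shifted = rest match {
      case w :: ws => step(stack :+ Leaf(w), ws)
      case Nil     => None
    }
    // 2. Reduce: replace a stack suffix that matches some rule's RHS by a new
    //    tree labelled with the rule's LHS (tried lazily, i.e. on backtracking).
    lazy val reduced = g.rules.iterator.flatMap { rule =>
      val k = rule.rhs.length
      if (stack.length >= k && stack.takeRight(k).map(label) == rule.rhs)
        step(stack.dropRight(k) :+ Node(rule.lhs.name, stack.takeRight(k)), rest)
      else None
    }.nextOption()
    // 3. Accept: input consumed and a single tree rooted in the start symbol remains.
    def accept = (stack, rest) match {
      case (List(t), Nil) if label(t) == g.start => Some(t)
      case _                                     => None
    }
    shifted.orElse(reduced).orElse(accept)
  }
  step(Nil, words)
}
```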
Let us run an example parse.
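Running it on the sentence from the trace below; since "MCRiedel" does not appear in the example grammar, the search backtracks and ultimately finds no parse:

```scala
parse(cfg, List("MCRiedel", "raps", "are", "silly"))  // None: no legal tree exists
```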
| Action | Stack | Buffer |
|--------|-------|--------|
| 'Init | | MCRiedel raps are silly |
| 'Shift | MCRiedel | raps are silly |
| 'Shift | MCRiedel raps | are silly |
| 'Shift | MCRiedel raps are | silly |
| 'Shift | MCRiedel raps are silly | |
| 'Reduce | MCRiedel raps are ('ADJ silly) | |
| 'Reduce | MCRiedel raps ('VP_p are ('ADJ silly)) | |
| 'Backtrack | MCRiedel raps are ('ADJ silly) | |
| 'Backtrack | MCRiedel raps are silly | |
A problem with the bottom-up parser is that, after backtracking, it may redo many steps it has already performed (can you find examples of this behaviour in the transitions above?). This suggests remembering steps and re-using them when needed. This is the general idea behind dynamic programming: caching and reusing computation whenever possible.
In the case of CFG parsing there exists a very effective dynamic program, the so-called Cocke–Younger–Kasami (CYK) algorithm. However, before we can apply this algorithm we need to normalize the grammar. In particular, we need to make sure that each rule has one of the following forms:

* $X \to Y\ Z$ where $Y, Z \in N \setminus \{S\}$ are non-terminals other than the start symbol,
* $X \to w$ where $w \in \Sigma$ is a word.
In words: each rule is either binary and expands into two non-terminal non-Start symbols, or unary and expands into a word. This form is called Chomsky Normal Form (CNF).
Fortunately we can convert every CFG into an equivalent CFG in CNF, in the sense that any derivation or parse of a sentence in one grammar can be losslessly converted to a derivation in the other grammar. We present this conversion in Scala below, but omit cases not relevant to our grammar (Exercise: add these cases).
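A sketch of the two transformations our grammar needs; the fresh-symbol numbering may differ from the table below, and unary non-terminal chains and other general cases are omitted:

```scala
// Convert a CFG to CNF, covering only two cases:
// (1) words inside longer rules get their own fresh non-terminal and unary rule;
// (2) rules with more than two RHS symbols are binarized by grouping
//     all but the last symbol under a fresh non-terminal.
def toCNF(g: CFG): CFG = {
  var counter = 0
  val result  = scala.collection.mutable.ListBuffer.empty[Rule]
  def fresh(base: String): NT = { val nt = NT(base + counter); counter += 1; nt }
  def lexicalized(rhs: List[Sym]): List[Sym] = rhs.map {
    case Word(w) => val nt = fresh(w); result += Rule(nt, List(Word(w))); nt
    case nt      => nt
  }
  def binarize(lhs: NT, rhs: List[Sym]): Unit =
    if (rhs.length <= 2) result += Rule(lhs, rhs)
    else {
      val head = fresh("'" + lhs.name)
      binarize(head, rhs.init)
      result += Rule(lhs, List(head, rhs.last))
    }
  for (rule <- g.rules) rule.rhs match {
    case List(Word(_)) => result += rule               // already of the form X -> w
    case rhs           => binarize(rule.lhs, lexicalized(rhs))
  }
  CFG(g.start, result.toList)
}
```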
| LHS | RHS |
|-----------|---------------------|
| 'S | 'NP_p 'VP_p |
| 'S | 'NP_s 'VP_s |
| ''VP_s3 | 'raps5 'in6 |
| 'ADJ | silly |
| 'Matko0 | Matko |
| 'NP_p | 'Matko0 'raps1 |
| 'NP_s | Matko |
| 'StatNLP4 | StatNLP |
| 'VP_p | 'are2 'ADJ |
| 'VP_s | ''VP_s3 'StatNLP4 |
| 'are2 | are |
| 'in6 | in |
| 'raps1 | raps |
| 'raps5 | raps |
The CYK algorithm caches, for each span in the sentence, all possible trees that can cover the span according to the CFG. Again we record all changes throughout the algorithm for later visualization.
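A sketch of the core chart construction (without the change-recording used for visualization; names are assumptions):

```scala
// CYK for a grammar in CNF: chart(i)(j) holds all trees covering words i until j.
def cyk(g: CFG, words: Vector[String]): List[Tree] = {
  val n     = words.length
  val chart = Array.fill(n + 1, n + 1)(List.empty[Tree])
  // Width-1 spans: unary rules X -> w.
  for (i <- 0 until n)
    chart(i)(i + 1) = g.rules.collect {
      case Rule(lhs, List(Word(w))) if w == words(i) => Node(lhs.name, List(Leaf(w)))
    }
  // Wider spans: binary rules X -> Y Z over every split point k.
  for {
    width <- 2 to n
    i     <- 0 to n - width
    j      = i + width
    k     <- i + 1 until j
    left  <- chart(i)(k)
    right <- chart(k)(j)
    rule  <- g.rules
    if rule.rhs == List(NT(left.label), NT(right.label))
  } chart(i)(j) :+= Node(rule.lhs.name, List(left, right))
  chart(0)(n).filter(_.label == g.start.name)   // complete parses rooted in 'S
}
```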
Let us run the algorithm.
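For example, on the normalized grammar and a sentence the grammar covers:

```scala
val cnf    = toCNF(cfg)
val parses = cyk(cnf, Vector("Matko", "raps", "are", "silly"))
```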
Can we assign the better parse a higher probability?
Probabilistic Context Free Grammars (PCFGs) are Context Free Grammars in which rules have probabilities. More formally, a PCFG consists of

* a CFG $(N, \Sigma, R, S)$, and
* a parameter $q(\alpha \to \beta)$ for each rule $\alpha \to \beta \in R$, such that for each non-terminal $\alpha \in N$ the parameters of all rules with left-hand side $\alpha$ form a probability distribution: $\sum_{\alpha \to \beta \in R} q(\alpha \to \beta) = 1$.
A PCFG defines a probability distribution over parse trees as follows. Given a parse tree $t$ that contains the rules $\alpha_1 \to \beta_1, \ldots, \alpha_n \to \beta_n$, the probability of this tree under the PCFG is:

$$p(t) = \prod_{i=1}^{n} q(\alpha_i \to \beta_i)$$
Notice that we can develop and operate parsers with the structured prediction recipe: we have a model (the PCFG with its distribution $p$), parameters (the rule probabilities $q$) that need to be estimated on a training set, and the prediction/search problem of finding the most likely parse tree given a sentence. The next sections will cover these aspects.
Before we show examples, let us define a Scala data structure for PCFGs.
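A minimal extension of the CFG sketch, pairing each rule with its probability (assumed names again):

```scala
// A PCFG: rules paired with probabilities; for each LHS they should sum to 1.
case class PCFG(start: NT, rules: List[(Rule, Double)]) {
  def expansions(nt: NT): List[(Rule, Double)] = rules.filter(_._1.lhs == nt)
}
```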
Let us now create an example PCFG.
| LHS | RHS | Prob |
|-------|-------------------|------|
| 'S | 'Subj 'VP | 1.0 |
| 'Obj | the elephant | 0.5 |
| 'Obj | the elephant 'PP | 0.5 |
| 'PP | in his pyjamas | 1.0 |
| 'Subj | He | 1.0 |
| 'VP | 'Verb 'Obj | 0.7 |
| 'VP | 'Verb 'Obj 'PP | 0.3 |
| 'Verb | shot | 1.0 |
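In code, a hypothetical construction mirroring the table:

```scala
val pcfg = PCFG(NT("S"), List(
  Rule(NT("S"),    List(NT("Subj"), NT("VP")))                     -> 1.0,
  Rule(NT("Obj"),  List(Word("the"), Word("elephant")))            -> 0.5,
  Rule(NT("Obj"),  List(Word("the"), Word("elephant"), NT("PP")))  -> 0.5,
  Rule(NT("PP"),   List(Word("in"), Word("his"), Word("pyjamas"))) -> 1.0,
  Rule(NT("Subj"), List(Word("He")))                               -> 1.0,
  Rule(NT("VP"),   List(NT("Verb"), NT("Obj")))                    -> 0.7,
  Rule(NT("VP"),   List(NT("Verb"), NT("Obj"), NT("PP")))          -> 0.3,
  Rule(NT("Verb"), List(Word("shot")))                             -> 1.0
))
```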
Let us first focus on the prediction task: given a sentence, find the highest scoring parse tree. In a way, we have already solved a variant of this problem: we can consider a CFG as a deterministic distribution over trees, and finding a highest scoring parse is then equivalent to finding any legal tree. Our bottom-up and dynamic-program-based algorithms hence performed structured prediction.
Here we present a probabilistic variant of the CYK algorithm. Notice that again we require normalization to CNF. Exercise: how can the rule probabilities be incorporated during normalization?
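A sketch of a Viterbi-style probabilistic CYK, assuming the PCFG has already been normalized to CNF (so running it on the example grammar above first requires the normalization exercise). The unify helper here is our assumed version: it keeps, per span and non-terminal, only the most probable tree:

```scala
// Probabilistic CYK: each chart cell maps a non-terminal label to the
// highest-probability tree (and its probability) covering that span.
def pcyk(g: PCFG, words: Vector[String]): Option[(Tree, Double)] = {
  val n     = words.length
  val chart = Array.fill(n + 1, n + 1)(Map.empty[String, (Tree, Double)])
  // Keep a candidate only if it beats the cell's current tree for that label.
  def unify(cell: Map[String, (Tree, Double)], cand: (Tree, Double)) =
    if (cell.get(cand._1.label).forall(_._2 < cand._2)) cell + (cand._1.label -> cand)
    else cell
  // Width-1 spans: unary rules X -> w.
  for (i <- 0 until n; (Rule(lhs, List(Word(w))), q) <- g.rules if w == words(i))
    chart(i)(i + 1) = unify(chart(i)(i + 1), (Node(lhs.name, List(Leaf(w))), q))
  // Wider spans: binary rules X -> Y Z, probabilities multiply along the tree.
  for {
    width       <- 2 to n
    i           <- 0 to n - width
    j            = i + width
    k           <- i + 1 until j
    (left, pL)  <- chart(i)(k).values
    (right, pR) <- chart(k)(j).values
    (rule, q)   <- g.rules
    if rule.rhs == List(NT(left.label), NT(right.label))
  } chart(i)(j) = unify(chart(i)(j), (Node(rule.lhs.name, List(left, right)), q * pL * pR))
  chart(0)(n).get(g.start.name)   // best parse rooted in the start symbol, if any
}
```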
The above algorithm returns only a single parse for the given sentence: the one with the highest probability. You can adapt it to return all parses by changing the unify method to return its input as is.