Natural Language Processing: Introduction to Syntactic Parsing
Barbara Plank
DISI, University of Trento
barbara.plank@disi.unitn.it
NLP+IR course, spring 2012
Note: Parts of the material in these slides are adapted from slides by Jim H. Martin, Dan Jurafsky, and Christopher Manning.
Today
Moving from words to bigger units
• Syntax and Grammars
• Why should you care?
• Grammars (and parsing) are key components in many NLP applications, e.g.
– Information extraction
– Opinion Mining
– Machine translation
– Question answering
Overview
• Key notions that we’ll cover
– Constituency
– Dependency
• Approaches and Resources
– Empirical/data‐driven parsing, Treebank
• Ambiguity / The exponential problem
• Probabilistic Context‐Free Grammars
– CFG and PCFG
– CKY algorithm, CNF
• Evaluating parser performance
• Dependency parsing
Two views of linguistic structure: 1. Constituency (phrase structure)
• The basic idea here is that groups of words within utterances can be shown to act as single units
• For example, it makes sense to say that the following are all noun phrases in English...
• Why? One piece of evidence is that they can all precede verbs.
Two views of linguistic structure: 1. Constituency (phrase structure)
• Phrase structure organizes words into nested constituents.
• How do we know what is a constituent? (Not that linguists don’t argue about some cases.)
– Distribution: a constituent behaves as a unit that can appear in different places:
• John talked [to the children] [about drugs].
• John talked [about drugs] [to the children].
• *John talked drugs to the children about
– Substitution/expansion/pro‐forms:
• I sat [on the box/right of the box/there].
(Slide shows a phrase-structure tree for “Fed raises interest rates”, tagged N V N N.)
Headed phrase structure
To model constituency structure:
• VP → … VB* …
• NP → … NN* …
• ADJP → … JJ* …
• ADVP → … RB* …
• PP → … IN* …
• Bracket notation of a tree (Lisp S‐expression):
(S (NP (N Fed)) (VP (V raises) (NP (N interest) (N rates))))
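Bracket notation like the above is easy to manipulate programmatically. As an illustrative sketch (the helper name `parse_sexpr` is my own, not from the slides), a few lines of Python turn such a string into nested lists:

```python
def parse_sexpr(s):
    """Parse a bracketed tree like '(S (NP (N Fed)) ...)' into nested lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume the closing ")"
            return node
        return tok  # a bare symbol: category label or word

    return read()

tree = parse_sexpr("(S (NP (N Fed)) (VP (V raises) (NP (N interest) (N rates))))")
print(tree)
# ['S', ['NP', ['N', 'Fed']], ['VP', ['V', 'raises'],
#  ['NP', ['N', 'interest'], ['N', 'rates']]]]
```

The nested-list form mirrors the tree structure directly: the first element of each list is the node label, the rest are its children.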
Two views of linguistic structure: 2. Dependency structure
• In CFG‐style phrase‐structure grammars the main focus is on constituents.
• But it turns out you can get a lot done with binary relations among the lexical items (words) in an utterance.
• In a dependency grammar framework, a parse is a tree where
– the nodes stand for the words in an utterance
– the links between the words represent dependency relations between pairs of words.
• Relations may be typed (labeled), or not.
• Terminology: dependent (also called modifier) vs. head (also called governor). Arcs are sometimes drawn in the opposite direction.
Example sentence (from the slide): ROOT The boy put the tortoise on the rug
Two views of linguistic structure: 2. Dependency structure
• Alternative notations (e.g. rooted tree) for: The boy put the tortoise on the rug
put
├── boy
│   └── The
├── tortoise
│   └── the
└── on
    └── rug
        └── the
Dependency Labels
Argument dependencies:
• subject (subj), object (obj), indirect object (iobj), …
Modifier dependencies:
• determiner (det), noun modifier (nmod), verbal modifier (vmod), etc.
Example (from the slide): A boy paints the wall, with arcs root(ROOT → paints), subj(paints → boy), obj(paints → wall), det(boy → A), det(wall → the).
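A common machine-readable encoding of such a dependency analysis (a sketch of my own, loosely CoNLL-style, not a format from the slides) is a list of (word, head index, label) triples, with 0 standing for ROOT. A quick well-formedness check confirms it is a tree:

```python
# "A boy paints the wall" from the slide; words are 1-indexed, head 0 = ROOT.
deps = [
    ("A",      2, "det"),   # A      -> boy
    ("boy",    3, "subj"),  # boy    -> paints
    ("paints", 0, "root"),  # paints -> ROOT
    ("the",    5, "det"),   # the    -> wall
    ("wall",   3, "obj"),   # wall   -> paints
]

def is_tree(deps):
    """A well-formed dependency analysis is a tree: exactly one word depends
    on ROOT, and following head links from any word reaches ROOT (no cycles)."""
    if sum(1 for _, head, _ in deps if head == 0) != 1:
        return False
    for i in range(1, len(deps) + 1):
        seen, cur = set(), i
        while cur != 0:
            if cur in seen:
                return False  # cycle detected
            seen.add(cur)
            cur = deps[cur - 1][1]  # move to this word's head
    return True

print(is_tree(deps))  # True
```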
Quiz question
• In the following sentence, which word is “nice” a dependent of?
There is a nice warm breeze out in the balcony.
1. warm
2. in
3. breeze
4. balcony
Comparison
• Dependency structures explicitly represent
– head‐dependent relations (directed arcs),
– functional categories (arc labels).
• Phrase structures explicitly represent
– phrases (nonterminal nodes),
– structural categories (nonterminal labels),
– possibly some functional categories (grammatical functions, e.g. PP‐LOC).
• (There exist also hybrid approaches, e.g. the Dutch Alpino grammar.)
Statistical Natural Language Parsing
The rise of data and statistics
The rise of data and statistics: Pre-1990 (“Classical”) NLP Parsing
• Wrote symbolic grammar (CFG or often richer) and lexicon, e.g.:
Grammar:        Lexicon:
S → NP VP       NN → interest
NP → (DT) NN    NNS → rates
NP → NN NNS     NNS → raises
NP → NNS        VBP → interest
NP → NNP        VBZ → rates
VP → V NP
• Used grammar/proof systems to prove parses from words
• This scaled very badly and didn’t give coverage.
Classical NLP Parsing: The problem and its solution
• Categorical constraints can be added to grammars to limit unlikely/weird parses for sentences
– But this attempt makes the grammars not robust
• In traditional systems, commonly 30% of sentences in even an edited text would have no parse.
• A less constrained grammar can parse more sentences
– But simple sentences end up with ever more parses, with no way to choose between them
• We need mechanisms that allow us to find the most likely parse(s) for a sentence
– Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences but still quickly find the best parse(s)
The rise of annotated data: The Penn Treebank
[Marcus et al. 1993, Computational Linguistics]
( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP (NP (DT a) (NN round))
        (PP (IN of)
          (NP (NP (JJ similar) (NNS increases))
            (PP (IN by)
              (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in)
              (NP (DT that) (NN market)))))))
    (. .)))
The most well-known part is the Wall Street Journal section of the Penn Treebank: 1M words from the 1987-1989 Wall Street Journal newspaper.
The rise of annotated data
• Starting off, building a treebank seems a lot slower and less useful than building a grammar
• But a treebank gives us many things
– Reusability of the labor
• Many parsers, POS taggers, etc.
• Valuable resource for linguistics
– Broad coverage
– Statistics to build parsers
– A way to evaluate systems
Statistical Natural Language Parsing An exponential number of attachments
Attachment ambiguities
• A key parsing decision is how we ‘attach’ various constituents
Attachment ambiguities
• How many distinct parses does the following sentence have due to PP attachment ambiguities?
John wrote the book with a pen in the room.
John wrote [the book] [with a pen] [in the room].
John wrote [[the book] [with a pen]] [in the room].
John wrote [the book] [[with a pen] [in the room]].
John wrote [[the book] [[with a pen] [in the room]]].
John wrote [[[the book] [with a pen]] [in the room]].
Catalan numbers: C_n = (2n)! / [(n+1)! n!], an exponentially growing series:
n:   1  2  3   4   5    6    7     8
C_n: 1  2  5  14  42  132  429  1430
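The Catalan-number formula above is easy to check numerically (a quick sketch using Python's `math.comb`):

```python
from math import comb

def catalan(n):
    # C_n = (2n)! / ((n+1)! * n!), equivalently C(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

# Reproduce the table from the slide for n = 1..8
print([catalan(n) for n in range(1, 9)])
# [1, 2, 5, 14, 42, 132, 429, 1430]
```

The exponential growth is why a parser cannot afford to enumerate all parses: it needs a chart to share subtrees and a scoring model to pick among them.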
Two problems to solve: 1. Avoid repeated work…
Two problems to solve: 2. Ambiguity: choosing the correct parse
Grammar:        Lexicon:
S → NP VP       NP → Papa
NP → Det N      N → caviar
NP → NP PP      N → spoon
VP → V NP       V → spoon
VP → VP PP      V → ate
PP → P NP       P → with
                Det → the
                Det → a
(Slide shows one parse tree for “Papa ate the caviar with a spoon”; the ambiguity is whether [with a spoon] attaches to the NP [the caviar] or to the VP.)
Two problems to solve: 2. Ambiguity: choosing the correct parse
(Same grammar and sentence as the previous slide; this slide shows the alternative PP attachment for “Papa ate the caviar with a spoon”.)
We need an efficient algorithm: CKY
Syntax and Grammars
CFGs and PCFGs
A phrase structure grammar
Grammar rules:                 Lexicon:
S → NP VP                      N → people
VP → V NP                      N → fish
VP → V NP PP   (n‐ary, n=3)    N → tanks
NP → NP NP     (binary)        N → rods
NP → NP PP                     V → people
NP → N         (unary)         V → fish
PP → P NP                      V → tanks
                               P → with
Example sentences: people fish tanks / people fish with rods
Phrase structure grammars = Context‐free grammars (CFGs)
• G = (T, N, S, R)
– T is a set of terminal symbols
– N is a set of nonterminal symbols
– S is the start symbol (S ∈ N)
– R is a set of rules/productions of the form X → γ, where X ∈ N and γ ∈ (N ∪ T)*
• A grammar G generates a language L.
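To make the CFG definition concrete, and to preview the CKY algorithm listed in the overview, here is a minimal CKY recognizer in Python. The grammar is a hand-made CNF fragment inspired by the “people fish tanks” grammar above (my simplification for illustration, not the course's exact grammar):

```python
from itertools import product

# A toy grammar in Chomsky Normal Form (CNF): binary rules A -> B C
# plus a lexicon mapping each word to its possible preterminals/phrases.
binary_rules = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
}
lexicon = {
    "people": {"NP", "N", "V"},
    "fish":   {"NP", "N", "V"},
    "tanks":  {"NP", "N", "V"},
}

def cky_recognize(words, start="S"):
    """Return True iff the grammar derives `words` from `start`."""
    n = len(words)
    # chart[i][j] holds the nonterminals that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):          # widths, bottom-up
        for i in range(n - span + 1):     # start of span
            j = i + span
            for k in range(i + 1, j):     # split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary_rules.get((B, C), set())
    return start in chart[0][n]

print(cky_recognize("people fish tanks".split()))  # True
```

The chart ensures each subspan is analyzed once, solving the repeated-work problem from the earlier slides; the probabilistic version (PCFG CKY) additionally keeps the best score per nonterminal per cell.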