Natural Language Processing (CSE 517): Dependency Structure


  1. Natural Language Processing (CSE 517): Dependency Structure Noah Smith © 2016 University of Washington nasmith@cs.washington.edu February 24, 2016 1 / 45

  2. Why might you want to use a generative classifier, such as Naive Bayes, as opposed to a discriminative classifier, and vice versa?
     How can one deal with out-of-vocabulary words at test time when one is applying an HMM for POS tagging or a PCFG for parsing?
     What is marginal inference, and how can it be carried out on a factor graph?
     What are the advantages and disadvantages of using a context-free grammar in Chomsky normal form?
     2 / 45

  3. Starting Point: Phrase Structure
     (S (NP (DT The) (NN luxury) (NN auto) (NN maker)) (NP (JJ last) (NN year)) (VP (VBD sold) (NP (CD 1,214) (NN cars)) (PP (IN in) (NP (DT the) (NNP U.S.)))))
     "The luxury auto maker last year sold 1,214 cars in the U.S."
     3 / 45

  4. Parent Annotation (Johnson, 1998)
     [The same tree with each nonterminal annotated with its parent's category: S^ROOT, NP^S, VP^S, NP^VP, PP^VP, and likewise for the preterminals (e.g. JJ^NP, VBD^VP).]
     Increases the "vertical" Markov order: p(children | parent, grandparent)
     4 / 45
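
As a concrete sketch (not from the slides): parent annotation is a one-pass tree rewrite. The nested-tuple tree encoding, the function name, and the choice to leave preterminals unannotated are assumptions made here for illustration.

    def parent_annotate(tree, parent="ROOT"):
        """Annotate each phrasal node's label with its parent's label, e.g. NP -> NP^S.
        A tree is (label, [subtrees]) for internal nodes and (tag, word) for preterminals."""
        label, rest = tree
        if isinstance(rest, str):          # preterminal: left unannotated in this sketch
            return (label, rest)
        return (label + "^" + parent, [parent_annotate(sub, label) for sub in rest])

    tree = ("S", [("NP", [("Pronoun", "we")]),
                  ("VP", [("Verb", "wash"),
                          ("NP", [("Determiner", "our"), ("Noun", "cats")])])])
    print(parent_annotate(tree))
    # ('S^ROOT', [('NP^S', [('Pronoun', 'we')]),
    #             ('VP^S', [('Verb', 'wash'),
    #                       ('NP^VP', [('Determiner', 'our'), ('Noun', 'cats')])])])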

  5. Headedness
     [The same tree with the head child of each node marked (the VP is the head child of S, and the VBD "sold" is the head child of the VP).]
     Suggests "horizontal" markovization:
     p(children | parent) = p(head | parent) · ∏_i p(i-th sibling | head, parent)
     5 / 45
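
For example, for the VP in the tree above, whose head child is the VBD ("sold") with siblings NP and PP, the factorization instantiates as p(VBD NP PP | VP) = p(VBD | VP) · p(NP | VBD, VP) · p(PP | VBD, VP).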

  6. Lexicalization
     [The same tree with each node carrying its lexical head: S_sold, NP_maker, NP_year, VP_sold, NP_cars, PP_in, NP_U.S.]
     Each node shares a lexical head with its head child.
     6 / 45

  7. Transformations on Trees Starting around 1998, many different ideas, both linguistic and statistical, emerged about how to transform treebank trees. All of these make the grammar larger, and therefore rule frequencies sparser, so there was a lot of research on smoothing the rule probabilities. Examples include parent annotation, headedness, markovization, and lexicalization, as well as category refinement by linguistic rules (Klein and Manning, 2003).
     ◮ These are reflected in some versions of the popular Stanford and Berkeley parsers.
     7 / 45

  8. Tree Decorations (Klein and Manning, 2003)
     ◮ Mark nodes with only 1 child as UNARY
     ◮ Mark DTs (determiners), RBs (adverbs) when they are only children
     ◮ Annotate POS tags with their parents
     ◮ Split IN (prepositions; 6 ways), AUX, CC, %
     ◮ NPs: temporal, possessive, base
     ◮ VPs annotated with head tag (finite vs. others)
     ◮ DOMINATES-V
     ◮ RIGHT-RECURSIVE NP
     8 / 45

  9. Machine Learning and Parsing 9 / 45

  10. Machine Learning and Parsing
     ◮ Define arbitrary features on trees, based on linguistic knowledge; to parse, use a PCFG to generate a k-best list of parses, then train a log-linear model to rerank (Charniak and Johnson, 2005).
     ◮ K-best parsing: Huang and Chiang (2005)
     10 / 45

  11. Machine Learning and Parsing
     ◮ Define arbitrary features on trees, based on linguistic knowledge; to parse, use a PCFG to generate a k-best list of parses, then train a log-linear model to rerank (Charniak and Johnson, 2005).
     ◮ K-best parsing: Huang and Chiang (2005)
     ◮ Define rule-local features on trees (and any part of the input sentence); minimize hinge or log loss.
     ◮ These exploit dynamic programming algorithms for training (CKY for arbitrary scores, and the sum-product version).
     11 / 45

  12. Structured Perceptron Collins (2002) perceptron algorithm for parsing:
     ◮ For t ∈ {1, ..., T}:
       ◮ Pick i_t uniformly at random from {1, ..., n}.
       ◮ t̂_{i_t} ← argmax_{t ∈ T_{x_{i_t}}} w · Φ(x_{i_t}, t)
       ◮ w ← w − α ( Φ(x_{i_t}, t̂_{i_t}) − Φ(x_{i_t}, t_{i_t}) )
     This can be viewed as stochastic subgradient descent on the structured hinge loss:
     ∑_{i=1}^{n} [ max_{t ∈ T_{x_i}} w · Φ(x_i, t) − w · Φ(x_i, t_i) ]
     where the max term is the "fear" (the highest-scoring competing tree) and w · Φ(x_i, t_i) is the "hope" (the score of the gold tree).
     12 / 45
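
A minimal sketch of this training loop. Here decode(x, w), an argmax parser (e.g. CKY with arbitrary scores, as discussed above), and phi(x, t), a sparse feature map, are assumed helpers, not given by the slides.

    import random
    from collections import defaultdict

    def structured_perceptron(examples, decode, phi, T=1000, alpha=1.0):
        """Collins-style structured perceptron (sketch).

        examples: list of (x, gold_tree) pairs
        decode(x, w): returns the highest-scoring tree for x under weights w
        phi(x, t):    returns a sparse feature dict for tree t of sentence x
        """
        w = defaultdict(float)
        for _ in range(T):
            x, gold = random.choice(examples)       # pick i_t uniformly at random
            pred = decode(x, w)                     # argmax_t w . Phi(x, t)
            if pred != gold:                        # update is a no-op when they match
                for f, v in phi(x, pred).items():   # w <- w - alpha * (Phi(pred) - Phi(gold))
                    w[f] -= alpha * v
                for f, v in phi(x, gold).items():
                    w[f] += alpha * v
        return w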

  13. Beyond Structured Perceptron (I) Structured support vector machine (also known as max-margin parsing; Taskar et al., 2004):
     ∑_{i=1}^{n} [ max_{t ∈ T_{x_i}} ( w · Φ(x_i, t) + cost(t_i, t) ) − w · Φ(x_i, t_i) ]
     where cost(t_i, t) is the number of local errors (either constituent errors or "rule" errors); the cost-augmented max is the "fear" term and the gold-tree score is the "hope" term.
     13 / 45
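
Relative to the perceptron update above, the only change is that the argmax becomes cost-augmented. A sketch of one subgradient step, reusing the w/phi conventions of the previous sketch; decode_with_cost is a hypothetical cost-augmented decoder, and a full structured SVM would also add a regularizer.

    def hinge_update(x, gold, w, decode_with_cost, phi, alpha=0.1):
        """One stochastic subgradient step on the structured hinge loss above.
        decode_with_cost(x, gold, w) is assumed to return
        argmax_t [ w . Phi(x, t) + cost(gold, t) ], i.e. the "fear" tree."""
        fear = decode_with_cost(x, gold, w)
        for f, v in phi(x, fear).items():       # push down the feared tree
            w[f] -= alpha * v
        for f, v in phi(x, gold).items():       # push up the gold ("hope") tree
            w[f] += alpha * v
        return w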

  14. Beyond Structured Perceptron (II) Log loss, which gives parsing models analogous to conditional random fields (Miyao and Tsujii, 2002; Finkel et al., 2008):
     ∑_{i=1}^{n} [ log ∑_{t ∈ T_{x_i}} exp( w · Φ(x_i, t) ) − w · Φ(x_i, t_i) ]
     where the log-sum-exp over all trees is the "fear" term and w · Φ(x_i, t_i) is the "hope" term.
     14 / 45
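
The gradient of this loss for example i has a familiar form: E_{t ∼ p_w(·|x_i)}[Φ(x_i, t)] − Φ(x_i, t_i), that is, the expected feature vector under the model minus the gold-tree features. Computing that expectation is exactly what the sum-product version of CKY mentioned earlier is for.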

  15. Machine Learning and Parsing
     ◮ Define arbitrary features on trees, based on linguistic knowledge; to parse, use a PCFG to generate a k-best list of parses, then train a log-linear model to rerank (Charniak and Johnson, 2005).
     ◮ K-best parsing: Huang and Chiang (2005)
     ◮ Define rule-local features on trees (and any part of the input sentence); minimize hinge or log loss.
     ◮ These exploit dynamic programming algorithms for training (CKY for arbitrary scores, and the sum-product version).
     ◮ Learn refinements on the constituents, as latent variables (Petrov et al., 2006).
     15 / 45

  16. Machine Learning and Parsing
     ◮ Define arbitrary features on trees, based on linguistic knowledge; to parse, use a PCFG to generate a k-best list of parses, then train a log-linear model to rerank (Charniak and Johnson, 2005).
     ◮ K-best parsing: Huang and Chiang (2005)
     ◮ Define rule-local features on trees (and any part of the input sentence); minimize hinge or log loss.
     ◮ These exploit dynamic programming algorithms for training (CKY for arbitrary scores, and the sum-product version).
     ◮ Learn refinements on the constituents, as latent variables (Petrov et al., 2006).
     ◮ Neural, too:
       ◮ Socher et al. (2013) define compositional vector grammars that associate each phrase with a vector, calculated as a function of its subphrases' vectors. Used essentially to rerank.
       ◮ Dyer et al. (2016): recurrent neural network grammars, generative models like PCFGs that encode arbitrary previous derivation steps in a vector. Parsing requires some tricks.
     16 / 45

  17. Dependencies Informally, you can think of dependency structures as a transformation of phrase-structures that
     ◮ maintains the word-to-word relationships induced by lexicalization,
     ◮ adds labels to them, and
     ◮ eliminates the phrase categories.
     There are also linguistic theories built on dependencies (Tesnière, 1959; Mel'čuk, 1987), as well as treebanks corresponding to those.
     ◮ Free(r)-word-order languages (e.g., Czech)
     17 / 45

  18. Dependency Tree: Definition Let x = ⟨x_1, ..., x_n⟩ be a sentence. Add a special root symbol as "x_0." A dependency tree consists of a set of tuples ⟨p, c, ℓ⟩, where
     ◮ p ∈ {0, ..., n} is the index of a parent
     ◮ c ∈ {1, ..., n} is the index of a child
     ◮ ℓ ∈ L is a label
     Different annotation schemes define different label sets L, and different constraints on the set of tuples. Most commonly:
     ◮ The tuple is represented as a directed edge from x_p to x_c with label ℓ.
     ◮ The directed edges form an arborescence (directed tree) with x_0 as the root.
     18 / 45
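
A minimal sketch of this definition as a data-structure check. The function name, the arc encoding, and the Stanford-style labels in the example are illustrative, not prescribed by the slides.

    def is_arborescence(n, arcs):
        """Check that arcs = {(p, c, label)} form a dependency tree over words 1..n
        rooted at the artificial token 0: every word has exactly one parent, and
        following parents from any word reaches the root without a cycle."""
        head = {}
        for p, c, _ in arcs:
            if c in head:                       # a word with two parents
                return False
            head[c] = p
        if set(head) != set(range(1, n + 1)):   # every word needs exactly one parent
            return False
        for c in range(1, n + 1):               # walk up to the root, watching for cycles
            seen = set()
            while c != 0:
                if c in seen or c not in head:
                    return False
                seen.add(c)
                c = head[c]
        return True

    # "we wash our cats": wash attaches to the root; we and cats to wash; our to cats
    arcs = {(0, 2, "root"), (2, 1, "nsubj"), (2, 4, "dobj"), (4, 3, "poss")}
    print(is_arborescence(4, arcs))   # True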

  19. Example (S (NP (Pronoun we)) (VP (Verb wash) (NP (Determiner our) (Noun cats)))) Phrase-structure tree. 19 / 45

  20. Example [The same tree with the head child of each node marked: the VP heads S, the Verb heads the VP, and the Pronoun and Noun head their NPs.] Phrase-structure tree with heads. 20 / 45

  21. Example (S_wash (NP_we (Pronoun we)) (VP_wash (Verb wash) (NP_cats (Determiner our) (Noun cats)))) Phrase-structure tree with heads, lexicalized. 21 / 45

  22. Example [Unlabeled dependency tree for "we wash our cats": wash attaches to the root; we and cats depend on wash; our depends on cats.] "Bare bones" dependency tree. 22 / 45

  23. Example we wash our cats who stink 23 / 45

  24. Example we vigorously wash our cats who stink 24 / 45

  25. Example we vigorously wash our cats and dogs who stink The bugbear of dependency syntax: coordination structures. 25 / 45

  26. Example we vigorously wash our cats and dogs who stink Make the first conjunct the head? 26 / 45

  27. Example we vigorously wash our cats and dogs who stink Make the coordinating conjunction the head? 27 / 45

  28. Example we vigorously wash our cats and dogs who stink Make the second conjunct the head? 28 / 45

  29. Dependency Schemes
     ◮ Transform the treebank: define "head rules" that can select the head child of any node in a phrase-structure tree and label the dependencies.
     ◮ More powerful, less local rule sets, possibly collapsing some words into arc labels.
       ◮ Stanford dependencies are a popular example (de Marneffe et al., 2006).
     ◮ Direct annotation.
     29 / 45
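
A toy illustration of the "transform the treebank" idea, tying together head rules, lexicalization, and the dependency transformation from the earlier slides. The head-rule table below is tiny and invented (real tables, e.g. the Collins head rules, cover every treebank category and search direction), and the tree encoding matches the earlier sketches.

    # Toy head rules: for each parent category, child categories to prefer as head,
    # searched in the given priority order.
    HEAD_RULES = {
        "S":  ["VP", "NP"],
        "VP": ["Verb", "VP"],
        "NP": ["Noun", "Pronoun", "NP"],
    }

    def head_child(label, children):
        """Pick the index of the head child of a node (first matching category wins)."""
        child_labels = [c[0] for c in children]
        for cat in HEAD_RULES.get(label, []):
            if cat in child_labels:
                return child_labels.index(cat)
        return 0   # fallback: leftmost child

    def dependencies(tree):
        """Return (lexical_head, arcs) for a tree encoded as (label, [subtrees]) internal
        nodes and (tag, word) leaves; arcs are unlabeled (head_word, dependent_word) pairs."""
        label, rest = tree
        if isinstance(rest, str):                 # leaf: its own head, no arcs
            return rest, []
        heads, arcs = [], []
        for sub in rest:
            h, a = dependencies(sub)
            heads.append(h)
            arcs.extend(a)
        h_idx = head_child(label, rest)
        head = heads[h_idx]
        for i, dep in enumerate(heads):           # non-head children depend on the head
            if i != h_idx:
                arcs.append((head, dep))
        return head, arcs

    tree = ("S", [("NP", [("Pronoun", "we")]),
                  ("VP", [("Verb", "wash"),
                          ("NP", [("Determiner", "our"), ("Noun", "cats")])])])
    print(dependencies(tree))
    # ('wash', [('cats', 'our'), ('wash', 'cats'), ('wash', 'we')])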

  30. Dependencies and Grammar Context-free grammars can be used to encode dependency structures. For every head word and constellation of dependent children:
     N_head → N_leftmost-sibling ... N_head ... N_rightmost-sibling
     And for every head word:
     N_head → head
     A bilexical dependency grammar binarizes the dependents, generating only one per rule, usually "outward" from the head. Such a grammar can produce only projective trees, which are (informally) trees in which the arcs don't cross.
     30 / 45
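
A sketch of the projectivity condition as a direct check over arcs; the function name and arc encoding are illustrative, and the input is assumed to already be a valid tree (e.g. as checked by the sketch after slide 18).

    def is_projective(n, arcs):
        """Check projectivity: for every arc (p, c), every word strictly between
        p and c must be a descendant of p (equivalently, no two arcs cross)."""
        head = {c: p for p, c, *_ in arcs}        # labels, if present, are ignored

        def dominates(p, d):
            while d in head:
                d = head[d]
                if d == p:
                    return True
            return False

        for p, c, *_ in arcs:
            lo, hi = sorted((p, c))
            for k in range(lo + 1, hi):
                if not dominates(p, k):
                    return False
        return True

    # "we wash our cats": projective
    print(is_projective(4, [(0, 2), (2, 1), (2, 4), (4, 3)]))   # True
    # a crossing structure: arc (1, 3) crosses arc (2, 4)
    print(is_projective(4, [(0, 2), (2, 4), (1, 3), (2, 1)]))   # False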

  31. Quick Reminder: CKY
     ◮ A preterminal item (N, i, i) is built from word x_i with weight p(x_i | N).
     ◮ An item (N, i, k) is built from adjacent items (L, i, j) and (R, j + 1, k) with weight p(L R | N).
     ◮ Goal item: (S, 1, n).
     Each "triangle" item corresponds to a buildable phrase.
     31 / 45
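
A minimal Viterbi-CKY sketch for a PCFG in Chomsky normal form. The toy grammar mirrors the example on the next slide, except that "we" is tagged directly as NP so the grammar stays strictly binary; all names and probabilities here are illustrative.

    from collections import defaultdict

    def cky(words, lex_rules, bin_rules, start="S"):
        """Viterbi CKY for a PCFG in CNF (sketch).

        lex_rules: dict tag -> {word: prob}, i.e. p(word | tag)
        bin_rules: dict parent -> {(L, R): prob}, i.e. p(L R | parent)
        Returns the best score of a start-rooted parse (0.0 if no parse exists).
        """
        n = len(words)
        chart = defaultdict(float)                   # (label, i, k) -> best score over span [i, k)
        for i, w in enumerate(words):                # width-1 spans: preterminal items
            for tag, dist in lex_rules.items():
                if w in dist:
                    chart[tag, i, i + 1] = dist[w]
        for width in range(2, n + 1):                # wider spans, smallest first
            for i in range(0, n - width + 1):
                k = i + width
                for j in range(i + 1, k):            # split point
                    for parent, dist in bin_rules.items():
                        for (L, R), p in dist.items():
                            score = p * chart[L, i, j] * chart[R, j, k]
                            if score > chart[parent, i, k]:
                                chart[parent, i, k] = score
        return chart[start, 0, n]

    lex = {"NP": {"we": 1.0}, "Verb": {"wash": 1.0},
           "Poss": {"our": 1.0}, "Noun": {"cats": 1.0}}
    binr = {"S": {("NP", "VP"): 1.0},
            "VP": {("Verb", "NP"): 1.0},
            "NP": {("Poss", "Noun"): 1.0}}
    print(cky("we wash our cats".split(), lex, binr))   # 1.0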

  32. CKY Example [CKY chart for "we wash our cats", showing preterminal items (Pronoun, Verb, Poss., Noun) over the words, an NP, a VP, an S spanning the sentence, and the goal item.] 32 / 45
