Sparsity in Dependency Grammar Induction

Jennifer Gillenwater¹, Kuzman Ganchev¹, João Graça², Ben Taskar¹, Fernando Pereira³

¹ Computer & Information Science, University of Pennsylvania
² L2F INESC-ID, Lisboa, Portugal
³ Google, Inc.

July 12, 2010
Outline

A generative dependency parsing model
The ambiguity problem this model faces
Previous attempts to reduce ambiguity
How posteriors provide a good measure of ambiguity
Applying posterior regularization to the likelihood objective
Success with respect to EM and parameter prior baselines
Dependency model with valence (Klein and Manning, ACL 2004)

Example: the sentence x = "Regularization creates sparse grammars", tagged N V ADJ N, with its dependency tree y rooted at the verb.

p_θ(x, y) = θ_root(V)
          · θ_stop(nostop | V, right, false) · θ_child(N | V, right)
          · θ_stop(stop | V, right, true)
          · θ_stop(nostop | V, left, false) · θ_child(N | V, left)
          · ...
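To make the factorization concrete, here is a minimal Python sketch that multiplies out these factors for the example tree. The parameter values, the dictionary layout, and the stop-probability convention (storing P(stop) and using 1 − P(stop) for nostop) are my own illustrative choices, not the learned DMV parameters.

```python
import math

# Hypothetical DMV parameters; all values are made up for illustration.
theta_root = {"V": 0.7, "N": 0.2, "ADJ": 0.1}
# theta_child[(head_tag, direction)] is a distribution over child tags.
theta_child = {("V", "right"): {"N": 0.8, "ADJ": 0.2},
               ("V", "left"):  {"N": 0.9, "ADJ": 0.1},
               ("N", "left"):  {"ADJ": 0.6, "N": 0.4}}
# theta_stop[(head_tag, direction, adjacent)] = P(stop); adjacent=False here
# means the head has not yet taken a child in that direction.
theta_stop = {("V", "right", False): 0.3, ("V", "right", True): 0.8,
              ("V", "left", False): 0.2, ("V", "left", True): 0.9,
              ("N", "left", False): 0.4, ("N", "left", True): 0.7,
              ("N", "right", False): 0.8, ("ADJ", "left", False): 0.9,
              ("ADJ", "right", False): 0.9}

def joint_prob():
    """p_theta(x, y) for 'Regularization/N creates/V sparse/ADJ grammars/N'
    with 'creates' as root, heading both nouns, and 'grammars' heading 'sparse'."""
    p = theta_root["V"]
    # 'creates' takes 'grammars' (N) to the right, then stops.
    p *= (1 - theta_stop[("V", "right", False)]) * theta_child[("V", "right")]["N"]
    p *= theta_stop[("V", "right", True)]
    # 'creates' takes 'Regularization' (N) to the left, then stops.
    p *= (1 - theta_stop[("V", "left", False)]) * theta_child[("V", "left")]["N"]
    p *= theta_stop[("V", "left", True)]
    # 'grammars' takes 'sparse' (ADJ) to the left, then stops; no right children.
    p *= (1 - theta_stop[("N", "left", False)]) * theta_child[("N", "left")]["ADJ"]
    p *= theta_stop[("N", "left", True)] * theta_stop[("N", "right", False)]
    # 'Regularization' and 'sparse' take no children at all.
    p *= theta_stop[("N", "left", False)] * theta_stop[("N", "right", False)]
    p *= theta_stop[("ADJ", "left", False)] * theta_stop[("ADJ", "right", False)]
    return p

p = joint_prob()
print(p, math.log(p))
```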
Traditional objective optimization

Traditional objective: marginal log-likelihood

  max_θ L(θ) = E_X[log p_θ(x)] = E_X[log Σ_y p_θ(x, y)]

Optimization method: expectation maximization (EM)

Problem: EM may learn a very ambiguous grammar
  Too many non-zero probabilities
  Ex: V → N should have non-zero probability, but V → DET, V → JJ, V → PRP$, etc. should be 0
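As a toy illustration of why maximizing this objective can leave ambiguity in place, the sketch below enumerates two candidate trees for one sentence, computes the marginal likelihood, and shows the E-step posteriors EM would assign. The joint probabilities are invented; a real implementation sums over all projective trees with an inside-outside dynamic program rather than by enumeration.

```python
import math

# Two hypothetical candidate trees y for x = "Sparsity/N is/V working/V",
# with invented joint probabilities p_theta(x, y) under some current theta.
candidates = {
    "is heads Sparsity and working (edges V->N, V->V)": 0.0012,
    "Sparsity heads is, is heads working (spurious N->V edge)": 0.0009,
}

# Marginal likelihood: p_theta(x) = sum_y p_theta(x, y).
marginal = sum(candidates.values())
print("log p(x) =", math.log(marginal))

# E-step posteriors q(y | x) = p_theta(x, y) / p_theta(x); the M-step would
# then re-estimate theta from expected counts under q.
for tree, joint in candidates.items():
    print(tree, "->", joint / marginal)
# Both analyses keep substantial posterior mass, so EM can happily maintain a
# grammar in which spurious pairs like N -> V keep non-zero probability.
```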
Previous approaches to improving performance

Structural annealing [1]
L(θ'): model extension [2]
L(θ) + log p(θ): parameter regularization [3]

These tend to reduce the number of unique children per parent, rather than directly reducing the number of unique parent→child pairs:
  θ_child(Y | X, dir) ≠ posterior(X → Y)

[1] Smith and Eisner, ACL 2006
[2] Headden et al., NAACL 2009
[3] Liang et al., EMNLP 2007; Johnson et al., NIPS 2007; Cohen et al., NIPS 2008, NAACL 2009
Ambiguity measure using posteriors: L1/∞

Intuition: the true number of unique parent tags for each child tag is small.

Example on gold trees: build a table with one column per parent→child tag pair (over ADJ, N, V) and one row per dependency edge in the corpus
  Sparsity/N is/V working/V
  Use/V good/ADJ grammars/N
Each row puts a 1 in the column of the tag pair its edge uses. Taking the maximum of each column and summing the maxima gives 3: the gold trees use only three distinct parent→child tag pairs.
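A small sketch of this computation on gold trees: one row per dependency edge, one column per parent→child tag pair, column-wise max, then sum. The edge list matches the two example sentences above; the function name is mine.

```python
from collections import defaultdict

# Gold dependency edges for the two example sentences, as (parent_tag, child_tag).
# "Sparsity/N is/V working/V": is -> Sparsity, is -> working.
# "Use/V good/ADJ grammars/N": Use -> grammars, grammars -> good.
gold_edges = [("V", "N"), ("V", "V"), ("V", "N"), ("N", "ADJ")]

def l1_linf(rows):
    """Sum over tag pairs of the max over rows (edge instances) of each value."""
    col_max = defaultdict(float)
    for row in rows:
        for pair, value in row.items():
            col_max[pair] = max(col_max[pair], value)
    return sum(col_max.values())

# On gold trees each row is a 0/1 indicator of the tag pair its edge uses.
rows = [{edge: 1.0} for edge in gold_edges]
print(l1_linf(rows))  # 3.0: only three distinct parent->child tag pairs are used
```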
Measuring ambiguity on distributions over trees

For a distribution p_θ(y | x) instead of gold trees, each cell holds the posterior probability of the corresponding edge rather than a 0/1 indicator. With the same two sentences, each edge position now spreads its mass over several columns (e.g., 0.4 / 0.6 or 0.7 / 0.3 splits between competing heads). Taking the maximum of each column and summing the maxima gives 3.3, a soft count of how many distinct parent→child tag pairs the model relies on.
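The same column-max-then-sum computation applies when rows hold posterior edge probabilities instead of indicators. The probabilities below echo the 0.4/0.6 and 0.7/0.3 splits on the slide, but which tag pair receives which probability is my guess, chosen so the total comes out to 3.3.

```python
from collections import defaultdict

# One row per head decision, holding posterior probabilities over tag pairs.
rows = [
    {("V", "N"): 1.0},                        # head of "Sparsity": certainly "is"
    {("V", "V"): 0.6, ("N", "V"): 0.4},       # head of "working": "is" vs "Sparsity"
    {("V", "N"): 0.7, ("ADJ", "N"): 0.3},     # head of "grammars": "Use" vs "good"
    {("N", "ADJ"): 0.6, ("V", "ADJ"): 0.4},   # head of "good": "grammars" vs "Use"
]

col_max = defaultdict(float)
for row in rows:
    for pair, prob in row.items():
        col_max[pair] = max(col_max[pair], prob)
print(sum(col_max.values()))  # 3.3: soft count of parent->child tag pairs in use
```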
Minimizing ambiguity through posterior regularization

Standard E-step:
  q_t(y | x) = argmin_{q(y|x)} KL(q ‖ p_θ_t)

Apply an E-step penalty, the L1/∞ term on the posteriors q(y | x), to induce sparsity (Graça et al., NIPS 2007 & 2009):
  q_t(y | x) = argmin_{q(y|x)} KL(q ‖ p_θ_t) + σ L1/∞(q(y | x))

[Figure: for a sentence tagged D N V N, the parent × child table of edge posteriors q(root → x_i) and q(x_i → x_j), shown without and with the penalty.]
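The sketch below is a minimal numerical illustration of this penalized E-step, not the dual projected-gradient procedure used in practice: two toy sentences each have two candidate parses with invented probabilities, q is found by brute-force grid search, and the L1/∞ penalty is simplified to a per-sentence (rather than per-edge) max over expected tag-pair counts. Increasing σ pulls q away from p_θ and toward the parses that reuse the same parent→child tag pair.

```python
import numpy as np

# Two toy sentences, each with two candidate parses. A parse is a dict mapping
# (parent_tag, child_tag) -> number of edges with that tag pair, paired with
# its probability under the current model p_theta. All values are invented.
sent1 = [({("V", "N"): 2}, 0.5),                    # parse A1: verb heads both nouns
         ({("V", "N"): 1, ("N", "N"): 1}, 0.5)]     # parse B1: adds a spurious N->N edge
sent2 = [({("V", "N"): 2}, 0.4),                    # parse A2: verb heads both nouns
         ({("V", "N"): 1, ("N", "ADJ"): 1}, 0.6)]   # parse B2: adds an N->ADJ edge
corpus = [sent1, sent2]
pairs = {pair for sent in corpus for parse, _ in sent for pair in parse}

def kl(q, p):
    """KL(q || p) between two-point distributions parameterized by P(first parse)."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def objective(qs, sigma):
    """Per-sentence KL(q || p_theta) plus sigma times an L1/inf-style penalty:
    for each tag pair, the max over sentences of its expected count under q."""
    total = sum(kl(q, sent[0][1]) for q, sent in zip(qs, corpus))
    for pair in pairs:
        total += sigma * max(
            q * sent[0][0].get(pair, 0) + (1 - q) * sent[1][0].get(pair, 0)
            for q, sent in zip(qs, corpus))
    return total

def solve(sigma, grid=np.linspace(0.001, 0.999, 200)):
    """Brute-force grid search over q = (q1, q2); fine for a two-sentence toy."""
    return min(((objective((q1, q2), sigma), (q1, q2))
                for q1 in grid for q2 in grid), key=lambda t: t[0])[1]

print("sigma = 0:", solve(0.0))  # recovers q ~= p_theta (the standard E-step)
print("sigma = 2:", solve(2.0))  # mass shifts toward the parses that reuse only V->N
```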
Experimental results

English from Penn Treebank: state-of-the-art accuracy

  Learning Method        | Accuracy ≤ 10 | ≤ 20 | all
  PR (σ = 140)           | 62.1          | 53.8 | 49.1
  LN families            | 59.3          | 45.1 | 39.0
  SLN TieV & N           | 61.3          | 47.4 | 41.4
  PR (σ = 140, λ = 1/3)  | 64.4          | 55.2 | 50.5
  DD (α = 1, λ learned)  | 65.0 (±5.7)   |      |

11 other languages from CoNLL-X:
  Dirichlet prior baseline: 1.5% average gain over EM
  Posterior regularization: 6.5% average gain over EM

Come see the poster for more details.