Sparsity in Dependency Grammar Induction

Jennifer Gillenwater¹, Kuzman Ganchev¹, João Graça², Ben Taskar¹, Fernando Pereira³

¹ Computer & Information Science, University of Pennsylvania
² L2F INESC-ID, Lisboa, Portugal
³ Google, Inc.

July 12, 2010
Outline

A generative dependency parsing model
The ambiguity problem this model faces
Previous attempts to reduce ambiguity
How posteriors provide a good measure of ambiguity
Applying posterior regularization to the likelihood objective
Success with respect to EM and parameter prior baselines
Dependency model with valence (Klein and Manning, ACL 2004)

Example: the sentence x = "Regularization creates sparse grammars", tagged N V ADJ N, with its dependency tree y rooted at the verb.

p_θ(x, y) = θ_root(V)
          · θ_stop(nostop | V, right, false) · θ_child(N | V, right)
          · θ_stop(stop | V, right, true)
          · θ_stop(nostop | V, left, false) · θ_child(N | V, left)
          · ...
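To make the factorization concrete, here is a minimal Python sketch that multiplies out these factors for the example tree. The parameter values, the dictionary layout, and the stop-probability convention (storing P(stop) and using 1 − P(stop) for nostop) are my own illustrative choices, not the learned DMV parameters.

```python
import math

# Hypothetical DMV parameters; all values are made up for illustration.
theta_root = {"V": 0.7, "N": 0.2, "ADJ": 0.1}
# theta_child[(head_tag, direction)] is a distribution over child tags.
theta_child = {("V", "right"): {"N": 0.8, "ADJ": 0.2},
               ("V", "left"):  {"N": 0.9, "ADJ": 0.1},
               ("N", "left"):  {"ADJ": 0.6, "N": 0.4}}
# theta_stop[(head_tag, direction, adjacent)] = P(stop); adjacent=False here
# means the head has not yet taken a child in that direction.
theta_stop = {("V", "right", False): 0.3, ("V", "right", True): 0.8,
              ("V", "left", False): 0.2, ("V", "left", True): 0.9,
              ("N", "left", False): 0.4, ("N", "left", True): 0.7,
              ("N", "right", False): 0.8, ("ADJ", "left", False): 0.9,
              ("ADJ", "right", False): 0.9}

def joint_prob():
    """p_theta(x, y) for 'Regularization/N creates/V sparse/ADJ grammars/N'
    with 'creates' as root, heading both nouns, and 'grammars' heading 'sparse'."""
    p = theta_root["V"]
    # 'creates' takes 'grammars' (N) to the right, then stops.
    p *= (1 - theta_stop[("V", "right", False)]) * theta_child[("V", "right")]["N"]
    p *= theta_stop[("V", "right", True)]
    # 'creates' takes 'Regularization' (N) to the left, then stops.
    p *= (1 - theta_stop[("V", "left", False)]) * theta_child[("V", "left")]["N"]
    p *= theta_stop[("V", "left", True)]
    # 'grammars' takes 'sparse' (ADJ) to the left, then stops; no right children.
    p *= (1 - theta_stop[("N", "left", False)]) * theta_child[("N", "left")]["ADJ"]
    p *= theta_stop[("N", "left", True)] * theta_stop[("N", "right", False)]
    # 'Regularization' and 'sparse' take no children at all.
    p *= theta_stop[("N", "left", False)] * theta_stop[("N", "right", False)]
    p *= theta_stop[("ADJ", "left", False)] * theta_stop[("ADJ", "right", False)]
    return p

p = joint_prob()
print(p, math.log(p))
```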
Traditional objective optimization

Traditional objective: marginal log-likelihood

  max_θ L(θ) = E_X[log p_θ(x)] = E_X[log Σ_y p_θ(x, y)]

Optimization method: expectation maximization (EM)

Problem: EM may learn a very ambiguous grammar
  Too many non-zero probabilities
  Ex: V → N should have non-zero probability, but V → DET, V → JJ, V → PRP$, etc. should be 0
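As a toy illustration of why maximizing this objective can leave ambiguity in place, the sketch below enumerates two candidate trees for one sentence, computes the marginal likelihood, and shows the E-step posteriors EM would assign. The joint probabilities are invented; a real implementation sums over all projective trees with an inside-outside dynamic program rather than by enumeration.

```python
import math

# Two hypothetical candidate trees y for x = "Sparsity/N is/V working/V",
# with invented joint probabilities p_theta(x, y) under some current theta.
candidates = {
    "is heads Sparsity and working (edges V->N, V->V)": 0.0012,
    "Sparsity heads is, is heads working (spurious N->V edge)": 0.0009,
}

# Marginal likelihood: p_theta(x) = sum_y p_theta(x, y).
marginal = sum(candidates.values())
print("log p(x) =", math.log(marginal))

# E-step posteriors q(y | x) = p_theta(x, y) / p_theta(x); the M-step would
# then re-estimate theta from expected counts under q.
for tree, joint in candidates.items():
    print(tree, "->", joint / marginal)
# Both analyses keep substantial posterior mass, so EM can happily maintain a
# grammar in which spurious pairs like N -> V keep non-zero probability.
```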
Previous approaches to improving performance

Structural annealing [1]
L(θ'): model extension [2]
L(θ) + log p(θ): parameter regularization [3]

These tend to reduce the number of unique children per parent, rather than directly reducing the number of unique parent→child pairs:
  θ_child(Y | X, dir) ≠ posterior(X → Y)

[1] Smith and Eisner, ACL 2006
[2] Headden et al., NAACL 2009
[3] Liang et al., EMNLP 2007; Johnson et al., NIPS 2007; Cohen et al., NIPS 2008, NAACL 2009
Ambiguity measure using posteriors: L1/∞

Intuition: the true number of unique parent tags for each child tag is small.

Example on gold trees: build a table with one column per parent→child tag pair (over ADJ, N, V) and one row per dependency edge in the corpus
  Sparsity/N is/V working/V
  Use/V good/ADJ grammars/N
Each row puts a 1 in the column of the tag pair its edge uses. Taking the maximum of each column and summing the maxima gives 3: the gold trees use only three distinct parent→child tag pairs.
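A small sketch of this computation on gold trees: one row per dependency edge, one column per parent→child tag pair, column-wise max, then sum. The edge list matches the two example sentences above; the function name is mine.

```python
from collections import defaultdict

# Gold dependency edges for the two example sentences, as (parent_tag, child_tag).
# "Sparsity/N is/V working/V": is -> Sparsity, is -> working.
# "Use/V good/ADJ grammars/N": Use -> grammars, grammars -> good.
gold_edges = [("V", "N"), ("V", "V"), ("V", "N"), ("N", "ADJ")]

def l1_linf(rows):
    """Sum over tag pairs of the max over rows (edge instances) of each value."""
    col_max = defaultdict(float)
    for row in rows:
        for pair, value in row.items():
            col_max[pair] = max(col_max[pair], value)
    return sum(col_max.values())

# On gold trees each row is a 0/1 indicator of the tag pair its edge uses.
rows = [{edge: 1.0} for edge in gold_edges]
print(l1_linf(rows))  # 3.0: only three distinct parent->child tag pairs are used
```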
Measuring ambiguity on distributions over trees

For a distribution p_θ(y | x) instead of gold trees, each cell holds the posterior probability of the corresponding edge rather than a 0/1 indicator. With the same two sentences, each edge position now spreads its mass over several columns (e.g., 0.4 / 0.6 or 0.7 / 0.3 splits between competing heads). Taking the maximum of each column and summing the maxima gives 3.3, a soft count of how many distinct parent→child tag pairs the model relies on.
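The same column-max-then-sum computation applies when rows hold posterior edge probabilities instead of indicators. The probabilities below echo the 0.4/0.6 and 0.7/0.3 splits on the slide, but which tag pair receives which probability is my guess, chosen so the total comes out to 3.3.

```python
from collections import defaultdict

# One row per head decision, holding posterior probabilities over tag pairs.
rows = [
    {("V", "N"): 1.0},                        # head of "Sparsity": certainly "is"
    {("V", "V"): 0.6, ("N", "V"): 0.4},       # head of "working": "is" vs "Sparsity"
    {("V", "N"): 0.7, ("ADJ", "N"): 0.3},     # head of "grammars": "Use" vs "good"
    {("N", "ADJ"): 0.6, ("V", "ADJ"): 0.4},   # head of "good": "grammars" vs "Use"
]

col_max = defaultdict(float)
for row in rows:
    for pair, prob in row.items():
        col_max[pair] = max(col_max[pair], prob)
print(sum(col_max.values()))  # 3.3: soft count of parent->child tag pairs in use
```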
Minimizing ambiguity through posterior regularization

Standard E-step:
  q_t(y | x) = argmin_{q(y|x)} KL(q ‖ p_θ_t)

Apply an E-step penalty, the L1/∞ term on the posteriors q(y | x), to induce sparsity (Graça et al., NIPS 2007 & 2009):
  q_t(y | x) = argmin_{q(y|x)} KL(q ‖ p_θ_t) + σ L1/∞(q(y | x))

[Figure: for a sentence tagged D N V N, the parent × child table of edge posteriors q(root → x_i) and q(x_i → x_j), shown without and with the penalty.]
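The sketch below is a minimal numerical illustration of this penalized E-step, not the dual projected-gradient procedure used in practice: two toy sentences each have two candidate parses with invented probabilities, q is found by brute-force grid search, and the L1/∞ penalty is simplified to a per-sentence (rather than per-edge) max over expected tag-pair counts. Increasing σ pulls q away from p_θ and toward the parses that reuse the same parent→child tag pair.

```python
import numpy as np

# Two toy sentences, each with two candidate parses. A parse is a dict mapping
# (parent_tag, child_tag) -> number of edges with that tag pair, paired with
# its probability under the current model p_theta. All values are invented.
sent1 = [({("V", "N"): 2}, 0.5),                    # parse A1: verb heads both nouns
         ({("V", "N"): 1, ("N", "N"): 1}, 0.5)]     # parse B1: adds a spurious N->N edge
sent2 = [({("V", "N"): 2}, 0.4),                    # parse A2: verb heads both nouns
         ({("V", "N"): 1, ("N", "ADJ"): 1}, 0.6)]   # parse B2: adds an N->ADJ edge
corpus = [sent1, sent2]
pairs = {pair for sent in corpus for parse, _ in sent for pair in parse}

def kl(q, p):
    """KL(q || p) between two-point distributions parameterized by P(first parse)."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def objective(qs, sigma):
    """Per-sentence KL(q || p_theta) plus sigma times an L1/inf-style penalty:
    for each tag pair, the max over sentences of its expected count under q."""
    total = sum(kl(q, sent[0][1]) for q, sent in zip(qs, corpus))
    for pair in pairs:
        total += sigma * max(
            q * sent[0][0].get(pair, 0) + (1 - q) * sent[1][0].get(pair, 0)
            for q, sent in zip(qs, corpus))
    return total

def solve(sigma, grid=np.linspace(0.001, 0.999, 200)):
    """Brute-force grid search over q = (q1, q2); fine for a two-sentence toy."""
    return min(((objective((q1, q2), sigma), (q1, q2))
                for q1 in grid for q2 in grid), key=lambda t: t[0])[1]

print("sigma = 0:", solve(0.0))  # recovers q ~= p_theta (the standard E-step)
print("sigma = 2:", solve(2.0))  # mass shifts toward the parses that reuse only V->N
```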
Experimental results

English from Penn Treebank: state-of-the-art accuracy

  Learning Method        | Accuracy ≤ 10 | ≤ 20 | all
  PR (σ = 140)           | 62.1          | 53.8 | 49.1
  LN families            | 59.3          | 45.1 | 39.0
  SLN TieV & N           | 61.3          | 47.4 | 41.4
  PR (σ = 140, λ = 1/3)  | 64.4          | 55.2 | 50.5
  DD (α = 1, λ learned)  | 65.0 (±5.7)   |      |

11 other languages from CoNLL-X:
  Dirichlet prior baseline: 1.5% average gain over EM
  Posterior regularization: 6.5% average gain over EM

Come see the poster for more details.