
Subdomain Sensitive Statistical Parsing using Raw Corpora, Barbara Plank & Khalil Sima’an



  1. Subdomain Sensitive Statistical Parsing using Raw Corpora
     Barbara Plank (1) and Khalil Sima’an (2)
     (1) Alfa Informatica, Faculty of Arts, University of Groningen, The Netherlands, b.plank@rug.nl
     (2) Language and Computation, Faculty of Science, University of Amsterdam, The Netherlands, simaan@science.uva.nl
     LREC 2008, Marrakech, Morocco

  2. Outline
     1. Introduction and Motivation
     2. Subdomain Sensitive Statistical Parsing
        - Subdomain Sensitive Parsers
        - Parser Combination Techniques
     3. Experiments and Results
     4. Conclusions and Future Work

  3. Statistical parsing
     - Problem: ambiguity of natural language sentences
     - Common approach: train a parser/model on a treebank, then apply it to new input
     - Variations: phrase/dependency structure, formal grammar, statistical model and estimator

  4. Motivation
     Is there more in a treebank that we might exploit?
     - We view a treebank as a mixture of subdomains, each addressing certain concepts more than others: "politics, stock market, financial news etc. can be found in the WSJ" (Kneser and Peters, 1997)
     - The parsing statistics gathered from the treebank are averages over different subdomains
     - Averages smooth out the differences between subdomains and weaken the biases
     Questions:
     1. Do subdomains matter?
     2. How to incorporate subdomain sensitivity into an existing state-of-the-art parser?

  5. Motivation - Our Approach
     Subdomains $\{c_i\}$ as hidden features:
     $$P(s, t) = \sum_i P(s, c_i)\, P(t \mid s, c_i) \qquad (1)$$
     This work: approximate it by creating an ensemble of parsers.
     Assumptions:
     - We know a set of subdomains $\{c_1, \ldots, c_k\}$
     - Approximate $\sum_i$ by combining the predictions of the subdomain parsers
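To make equation (1) concrete, here is a minimal sketch of the exact mixture that the ensemble is meant to approximate. The function name `mixture_joint`, the tree identifiers and all numbers are hypothetical and chosen only for illustration; the slides do not give an implementation.

```python
from typing import Dict, List


def mixture_joint(p_s_c: List[float],
                  p_t_given_s_c: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute P(s, t) = sum_i P(s, c_i) * P(t | s, c_i) for each candidate tree t."""
    joint: Dict[str, float] = {}
    for weight, tree_dist in zip(p_s_c, p_t_given_s_c):
        for tree, prob in tree_dist.items():
            joint[tree] = joint.get(tree, 0.0) + weight * prob
    return joint


# Toy numbers (assumed, not from the slides): two subdomains, two candidate trees.
p_s_c = [0.02, 0.01]                      # P(s, c_1), P(s, c_2)
p_t_given_s_c = [
    {"t1": 0.75, "t2": 0.25},             # P(t | s, c_1)
    {"t1": 0.5, "t2": 0.5},               # P(t | s, c_2)
]
joint = mixture_joint(p_s_c, p_t_given_s_c)
best_tree = max(joint, key=joint.get)     # tree with the highest joint probability
```

In this toy case the first subdomain dominates the sum, so its preferred tree wins; the parser combination techniques on the later slides replace this sum with a selection among subdomain parsers.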

  6. Overview and Problem Statement
     [Figure: overview diagram, not recoverable from the extracted text]

  7. Creating subdomain-specific parsers
     Weight the trees in treebank TB with subdomain statistics:
     - Use a domain-dependent raw corpus $C$ (flat sentences)
     - Induce a statistical Language Model (LM) $\theta$ from $C$
     - Assign a count $f$ to every tree $\pi_i \in TB$ such that $f$ = average per-word "count" of the yield $y[\pi_i]$ under LM $\theta$
     - Retrain the parser on the subdomain-weighted treebank $TB_\theta$
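The weighting steps above can be sketched as follows. This is one possible reading, not the authors' implementation: the slide does not define the per-word "count", so the sketch assumes an unsmoothed unigram LM and takes the average unigram relative frequency of the words in a tree's yield as its weight. The helper names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train_unigram_lm(raw_sentences: List[List[str]]) -> Dict[str, float]:
    """Relative-frequency unigram LM induced from a raw, domain-dependent corpus C."""
    counts = Counter(word for sent in raw_sentences for word in sent)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def tree_weight(tree_yield: List[str], lm: Dict[str, float]) -> float:
    """Assumed reading of f: mean unigram probability of the words in the yield."""
    if not tree_yield:
        return 0.0
    return sum(lm.get(word, 0.0) for word in tree_yield) / len(tree_yield)


def weight_treebank(treebank: List[Tuple[str, List[str]]],
                    lm: Dict[str, float]) -> List[Tuple[str, float]]:
    """Attach a subdomain weight to every (tree_id, yield) pair in the treebank."""
    return [(tree_id, tree_weight(y, lm)) for tree_id, y in treebank]


# Toy financial corpus and two treebank yields (assumed data).
financial_corpus = [["stocks", "fell"], ["stocks", "rose"]]
lm = train_unigram_lm(financial_corpus)
treebank = [("tb1", ["stocks", "rose"]), ("tb2", ["team", "won"])]
weights = weight_treebank(treebank, lm)
```

A real setup would more plausibly use a smoothed n-gram LM so that trees with unseen words do not receive a weight of zero; the weighted treebank would then be passed to the parser's training procedure.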

  8. Overview of our approach - Details
     [Figure: detailed diagram of the approach, not recoverable from the extracted text]

  9. Parser Combination Techniques
     How to combine them?

  10. Parser Combination Techniques
      How to combine them?
      - Parser pre-selection: selecting a parser up-front (given: $s$)
      - Parser post-selection: selecting a parser after parsing (given: $s$, $t$)

  11. Pre-selection: Divergence Model (DVM)
      We measure for every word how well it discriminates between the subdomains using the notion of divergence.
      The divergence of a word $w$ in a subdomain $i \in [1 \ldots k]$ from all other $(k - 1)$ subdomains ($j \in [1 \ldots k]$, $j \neq i$):
      $$\mathrm{divergence}_i(w) = 1 + \frac{\sum_{j \neq i} \left| \log \frac{p_{\theta_i}(w)}{p_{\theta_j}(w)} \right|}{k - 1} \qquad (2)$$
      $$\mathrm{divergence\_sent}_i(w_1^n) = \frac{\sum_{x=1}^{n} \mathrm{divergence}_i(w_x)}{n} \qquad (3)$$
      Boundary issues: if $p_{\theta_i}(w) = 0$ then $\mathrm{divergence}_i(w) = 1$, and if $p_{\theta_j}(w) = 0$ then $p_{\theta_j}(w)$ is set to the constant $10^{-15}$.
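Equations (2) and (3), including the two boundary cases, can be sketched as below. Several points are assumptions not stated on the slide: each subdomain LM is represented as a plain word-to-probability dictionary, the logarithm is the natural log, and pre-selection picks the subdomain with the highest sentence divergence. The function names and toy models are hypothetical.

```python
import math
from typing import Dict, List

FLOOR = 1e-15  # constant substituted when p_theta_j(w) = 0, as on the slide


def divergence(word: str, i: int, models: List[Dict[str, float]]) -> float:
    """Equation (2): divergence of `word` in subdomain i from the other k - 1 subdomains."""
    p_i = models[i].get(word, 0.0)
    if p_i == 0.0:
        return 1.0  # boundary case: word unseen in subdomain i
    k = len(models)  # assumes k >= 2
    total = 0.0
    for j, model in enumerate(models):
        if j == i:
            continue
        p_j = model.get(word, 0.0) or FLOOR  # boundary case: floor zero probabilities
        total += abs(math.log(p_i / p_j))    # natural log is an assumption
    return 1.0 + total / (k - 1)


def sentence_divergence(words: List[str], i: int,
                        models: List[Dict[str, float]]) -> float:
    """Equation (3): average word divergence over a non-empty sentence."""
    return sum(divergence(w, i, models) for w in words) / len(words)


def preselect(words: List[str], models: List[Dict[str, float]]) -> int:
    """Assumed decision rule: index of the subdomain with the highest sentence divergence."""
    return max(range(len(models)),
               key=lambda i: sentence_divergence(words, i, models))


# Toy unigram models for two subdomains (assumed data).
models = [{"stocks": 0.4, "the": 0.1}, {"goal": 0.4, "the": 0.1}]
chosen = preselect(["the", "stocks"], models)
```

A word that is equally likely in every subdomain, such as "the" here, scores exactly 1, while a word that one subdomain has never seen pushes the score of the other subdomain up sharply because of the $10^{-15}$ floor.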

  12. Pre-selection: Divergence Model (DVM) - Example
      For example, 'multi-million-dollar' (score in the financial domain: 5.5) and 'equal' (scores in all domains between 1.6 and 1.9).
      [Figure: per-word divergence scores (vertical axis from 1 to 7) for the Politics, Financial, Sports and WSJ models; the word labels on the horizontal axis are not recoverable]

  13. Post-Selection: Node Weighting + DVM (NW-DVM)
      For parse tree $\pi_i$ with $1 \leq i \leq k$ and sentence $w_1^n$:
      $$\mathrm{score}(c) = \frac{1}{k} \sum_{i=1}^{k} \delta[c, \pi_i] \qquad (4)$$
      $$\mathrm{score}(\pi_i) = (1 - \lambda) \left( \frac{1}{|\pi_i|} \sum_{c \in \pi_i} \mathrm{score}(c) \right) + \lambda \cdot \mathrm{divergence\_sent}_i(w_1^n) \qquad (5)$$
      where $|\pi_i|$ is the size of the constituent set, and $0 < \lambda < 1$ is an interpolation factor.
      - First term: how well does the parse tree $\pi_i$ fit the domain?
      - Second term: how well does $w_1^n$ fit the domain?
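Equations (4) and (5) can be sketched as follows. The slide does not spell out $\delta[c, \pi_i]$; the sketch assumes it is 1 when constituent $c$ occurs in parse $\pi_i$ and 0 otherwise, and represents each parse as a non-empty set of (label, start, end) constituents. The sentence divergences are passed in precomputed. All names and toy values are hypothetical.

```python
from typing import List, Set, Tuple

Constituent = Tuple[str, int, int]  # (label, start, end), an assumed representation


def constituent_score(c: Constituent, parses: List[Set[Constituent]]) -> float:
    """Equation (4): fraction of the k parses that contain constituent c."""
    return sum(1 for parse in parses if c in parse) / len(parses)


def tree_score(i: int, parses: List[Set[Constituent]],
               sent_divergences: List[float], lam: float) -> float:
    """Equation (5): interpolate the node-weighting term with the sentence divergence."""
    tree = parses[i]
    node_term = sum(constituent_score(c, parses) for c in tree) / len(tree)
    return (1 - lam) * node_term + lam * sent_divergences[i]


def postselect(parses: List[Set[Constituent]],
               sent_divergences: List[float], lam: float = 0.5) -> int:
    """Index of the parse with the highest NW-DVM score."""
    return max(range(len(parses)),
               key=lambda i: tree_score(i, parses, sent_divergences, lam))


# Toy output of k = 3 subdomain parsers for one three-word sentence (assumed data).
parses = [
    {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)},
    {("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)},
    {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)},
]
sent_divergences = [1.2, 1.5, 1.0]
choice_low_lambda = postselect(parses, sent_divergences, lam=0.1)
choice_mid_lambda = postselect(parses, sent_divergences, lam=0.5)
```

With a small $\lambda$ the agreement among parsers dominates and the majority analysis is chosen; with a larger $\lambda$ the divergence term can override it, which is why $\lambda$ would need tuning on held-out data.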

  14. First Experiment: Variance among Parsers
      Are subdomain parsers complementary?
      Optimal decision procedure - an oracle:
      $$\pi_{\mathrm{best}}^{\mathrm{oracle}} = \arg\max_{\pi_i} f_{\text{F-score}}(\pi_i) \qquad (6)$$

  15. First Experiment: Variance among Parsers
      Are subdomain parsers complementary?
      Optimal decision procedure - an oracle:
      $$\pi_{\mathrm{best}}^{\mathrm{oracle}} = \arg\max_{\pi_i} f_{\text{F-score}}(\pi_i) \qquad (6)$$
      Results on sentences of length ≤ 40:

      | Parser | LR | LP | F-score |
      |---|---|---|---|
      | Section 00 (development set) | | | |
      | Baseline | 89.44 | 89.63 | 89.53 |
      | Sports | 88.95 | 88.83 | 88.89 |
      | Financial | 89.01 | 88.84 | 88.92 |
      | Politics | 88.86 | 88.70 | 88.78 |
      | Oracle combination | 90.59 | 90.66 | 90.62 |
      | Improvement over baseline | +1.15 | +1.03 | +1.09 |
      | Section 23 (test set) | | | |
      | Baseline | 88.77 | 88.87 | 88.82 |
      | Oracle combination | 90.11 | 90.11 | 90.11 |
      | Improvement over baseline | +1.34 | +1.24 | +1.29 |
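The oracle of equation (6) can be sketched as a per-sentence selection of the candidate parse with the highest F-score against the gold tree. This is only an illustration: it treats each tree as a set of labeled (label, start, end) brackets, which ignores the finer conventions of standard bracket-scoring tools, and all names and toy data are hypothetical.

```python
from typing import List, Set, Tuple

Bracket = Tuple[str, int, int]  # (label, start, end), an assumed representation


def f_score(candidate: Set[Bracket], gold: Set[Bracket]) -> float:
    """Harmonic mean of labeled precision and recall over bracket sets."""
    matched = len(candidate & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(candidate)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


def oracle_select(parses: List[Set[Bracket]], gold: Set[Bracket]) -> int:
    """Equation (6): index of the subdomain parse with the highest F-score."""
    return max(range(len(parses)), key=lambda i: f_score(parses[i], gold))


# Toy gold tree and two candidate parses (assumed data).
gold = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)}
candidates = [
    {("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)},
    {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)},
]
best_index = oracle_select(candidates, gold)
```

Because the oracle needs the gold tree, it is an upper bound on what any selection method over the same parsers could reach, not a usable combination technique.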
