A Supertag-Context Model for Weakly-Supervised CCG Parser Learning


  1. A Supertag-Context Model for Weakly-Supervised CCG Parser Learning. Dan Garrette (U. Washington), Chris Dyer (CMU), Jason Baldridge (UT-Austin), Noah A. Smith (CMU)

  2. Contributions 1. A new generative model for learning CCG parsers from weak supervision 2. A way to select Bayesian priors that capture properties of CCG 3. A Bayesian inference procedure to learn the parameters of our model

  3. Type-Level Supervision • Unannotated text • Incomplete tag dictionary: word → {tags}

  4-6. Type-Level Supervision [running example: the sentence "the lazy dogs wander", with candidate supertags from the tag dictionary (np/n, n/n, n, np, s\np, (s\np)/np, …) listed under each word; which category is correct for each word, and hence the parse, is unknown]
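
The supervision above amounts to raw text plus a partial word-to-categories mapping. As a concrete illustration, here is one way that input might look; the exact word-to-category assignment is only illustrative, loosely following the slides' example, and this is not the authors' data format:

```python
# Weak, type-level supervision: raw sentences plus an incomplete tag dictionary.
# The specific entries below are illustrative only.

unannotated_corpus = [
    "the lazy dogs wander",
    # ... more raw sentences: no trees, no supertags
]

tag_dictionary = {
    "the":    {"np/n"},
    "lazy":   {"n/n"},
    "dogs":   {"n", "np"},
    "wander": {"s\\np", "(s\\np)/np"},
    # incomplete: many words are missing entirely, and listed words
    # may take categories that the dictionary does not include
}
```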

  7-13. PCFG: Local Decisions [animated build-up of a binary tree A → B C, with B → D E and C → F G; each expansion is scored by a purely local rule probability such as P(D E | B) and P(F G | C), independent of the surrounding context]

  14-16. A New Generative Model [the same tree, but each constituent also generates its context supertags: the subtree rooted at B is scored as P(D E | B) × P(F | B, R) × P(<S> | B, L), where F is the supertag immediately to B's right and <S> is the start-of-sentence marker to its left]

  17. A New Generative Model [the full tree with sentence-boundary context markers <S> and <E>] (This makes inference tricky… we’ll come back to that.)

  18. Why CCG? • The grammar formalism itself can be used to guide learning: given any two categories, we always know whether they are combinable. • We can extract a priori context preferences before we even look at the data: adjacent categories tend to be combinable.
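
Combinability is a purely formal property of the two categories, so it can be checked before any data is seen. The sketch below uses a hypothetical category representation and covers only forward/backward application and composition; it is not the authors' code, and the rule set actually used in the paper may be larger:

```python
# A minimal combinability check for adjacent CCG categories.
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Atom:
    label: str                 # e.g. "np", "n", "s"

@dataclass(frozen=True)
class Slash:
    result: "Cat"
    direction: str             # "/" (argument to the right) or "\" (to the left)
    arg: "Cat"

Cat = Union[Atom, Slash]

def combinable(left: Cat, right: Cat) -> bool:
    """True if the adjacent pair `left right` can combine by application or composition."""
    # Forward application:  X/Y  Y  =>  X
    if isinstance(left, Slash) and left.direction == "/" and left.arg == right:
        return True
    # Backward application:  Y  X\Y  =>  X
    if isinstance(right, Slash) and right.direction == "\\" and right.arg == left:
        return True
    # Forward composition:  X/Y  Y/Z  =>  X/Z
    if (isinstance(left, Slash) and left.direction == "/"
            and isinstance(right, Slash) and right.direction == "/"
            and left.arg == right.result):
        return True
    # Backward composition:  Y\Z  X\Y  =>  X\Z
    if (isinstance(right, Slash) and right.direction == "\\"
            and isinstance(left, Slash) and left.direction == "\\"
            and right.arg == left.result):
        return True
    return False

# np/n combines with n (forward application); two bare atoms like n and np do not combine.
np_, n = Atom("np"), Atom("n")
assert combinable(Slash(np_, "/", n), n)
assert not combinable(n, np_)
```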

  19. Why CCG? [side-by-side parses of "buy the book": with CFG labels (S, NP, VP, VB, DT, NN), all relationships between labels must be learned; with CCG categories (s, np, s/np, np/n, n), the relationships are universal, intrinsic properties of the grammar]

  20-21. CCG Parsing [derivations of "the lazy dog sleeps" from the supertags np/n, n/n, n, s\np, using forward application (FA) and forward composition (FC) to build np/n, np, and finally s]

  22-26. Supertag Context [the same derivation of "the lazy dog sleeps", highlighting, for each constituent, the supertags immediately to its left and right as its context]

  27. Constituent Context • Klein & Manning showed the value of modeling context with the Constituent Context Model (CCM) [example sentence: "the lazy dog sleeps"] [Klein & Manning 2002]

  28-33. Constituent Context ["substitutability": in the context DT ( __ ) VBZ, the spans (JJ NN) "lazy dog", (NN) "dog", and (JJ JJ NN) "big lazy dog" are interchangeable, so the context itself signals a noun-like (~Noun) constituent] [Klein & Manning 2002]

  34-35. Supertag Context [the CCG version of the same idea: a bracketed span of "the lazy dog sleeps" is characterized by the supertags adjacent to it] • Unlike CCM, we know the constituent's label (its CCG category). • We know whether a context is a fitting one, even before looking at the data.

  36. This Paper 1. A new generative model for learning CCG parsers from weak supervision 2. A way to select Bayesian priors that capture properties of CCG 3. A Bayesian inference procedure to learn the parameters of our model

  37-38. Supertag-Context Parsing [CKY chart over w_1 … w_4 with supertags t_1 … t_4: a standard PCFG scores a tree with P(A_root) and P(A → A_left A_right) or P(A → w_i); with context, each chart item A additionally generates the supertags adjacent to its span, P(A → t_left) and P(A → t_right), using <s> and <e> at the sentence boundaries]
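
To make the extra factors concrete, here is a minimal sketch of scoring a single derivation under this kind of model; the parameter dictionaries and tree encoding are hypothetical, not the authors' code, and word-emission probabilities P(A → w_i) are left out for brevity:

```python
def score_tree(node, supertags, rule_logp, lctx_logp, rctx_logp):
    """Log probability of a derivation under the supertag-context model sketch.

    Trees are nested tuples: ("A", i) for a supertag leaf over word i, or
    ("A", left_subtree, right_subtree) for a binary constituent.

    rule_logp[(parent, (left_label, right_label))] = log P(left right | parent)
    lctx_logp[(label, tag)] = log P(t_left  = tag | label)   # "<s>" at the left edge
    rctx_logp[(label, tag)] = log P(t_right = tag | label)   # "<e>" at the right edge
    """
    def span(n):                              # (start, end) word positions covered by n
        if len(n) == 2:
            return n[1], n[1] + 1
        return span(n[1])[0], span(n[2])[1]

    def rec(n):
        label = n[0]
        i, j = span(n)
        left_tag = supertags[i - 1] if i > 0 else "<s>"
        right_tag = supertags[j] if j < len(supertags) else "<e>"
        # every constituent generates the supertags adjacent to its span
        logp = lctx_logp[(label, left_tag)] + rctx_logp[(label, right_tag)]
        if len(n) == 3:                       # internal node: add the rule cost, recurse
            logp += rule_logp[(label, (n[1][0], n[2][0]))]
            logp += rec(n[1]) + rec(n[2])
        return logp

    return rec(node)
```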

  39. Prior on Categories [two derivations of "the lazy dog": one with simple categories (np/n, n/n, n) and one with complex categories such as np\(np/n) and (np\(np/n))/n, illustrating the prior over categories from Garrette, Dyer, Baldridge, and Smith (2015)]

  40. Supertag-Context Prior P_L-prior(t_left | A) ∝ 10^5 if t_left can combine with A, 1 otherwise [figure: a constituent A in "the lazy dog sleeps" with its left and right context supertags t_left and t_right]

  41. Supertag-Context Prior P_R-prior(t_right | A) ∝ 10^5 if A can combine with t_right, 1 otherwise
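
Putting the two definitions together, a prior over an adjacent supertag can be computed directly from the combinability check sketched earlier; the constant 10^5 is the one on the slides, while the normalization over a fixed candidate set is an assumption of this sketch:

```python
def context_prior(categories, constituent, side):
    """Prior over the supertag appearing on `side` ("L" or "R") of a constituent
    with category `constituent`, normalized over the candidate set `categories`.
    Uses `combinable` from the earlier sketch."""
    weights = {}
    for t in categories:
        if side == "L":
            fits = combinable(t, constituent)   # t_left followed by A
        else:
            fits = combinable(constituent, t)   # A followed by t_right
        weights[t] = 1e5 if fits else 1.0
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}
```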

  42. This Paper 1. A new generative model for learning CCG parsers from weak supervision 2. A way to select Bayesian priors that capture properties of CCG 3. A Bayesian inference procedure to learn the parameters of our model

  43. Type-Level Supervision [recap of the running example: "the lazy dogs wander" with its incomplete tag dictionary and unknown parse]

  44. Type-Supervised Learning [diagram: the learner's inputs are an unlabeled corpus, a tag dictionary, and universal properties of the CCG formalism]

  45. Posterior Inference • A Bayesian inference procedure will make use of our linguistically-informed priors. • But we cannot sample as we would for a PCFG: with the context factors, we cannot compute the inside chart, even with dynamic programming.

  46. Sampling via Metropolis-Hastings Idea: • Sample a tree from an efficient proposal distribution (the PCFG parameters) (Johnson et al. 2007). • Accept or reject according to the full distribution (the context parameters), as in the sketch below.
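
Because the proposal is the PCFG part of the model, the PCFG factors cancel in the Metropolis-Hastings ratio and the acceptance test needs only the context factors (this is the acceptance rule spelled out on slide 65). A minimal sketch, with hypothetical scoring functions rather than the authors' implementation:

```python
import math
import random

def mh_step(current_tree, sentence, sample_pcfg_tree, context_logp):
    """One Metropolis-Hastings update for one sentence.

    sample_pcfg_tree(sentence) -> a proposed tree drawn from the PCFG proposal
                                  (e.g. by sampling from its inside chart)
    context_logp(tree)         -> log of the context factors of the full model,
                                  i.e. log [ P_full(tree) / P_pcfg(tree) ]
    """
    proposal = sample_pcfg_tree(sentence)
    # Acceptance ratio: [P_full(y')/P_pcfg(y')] / [P_full(y)/P_pcfg(y)],
    # which reduces to the ratio of context scores of the new and current trees.
    log_ratio = context_logp(proposal) - context_logp(current_tree)
    accept = log_ratio >= 0 or random.random() < math.exp(log_ratio)
    return proposal if accept else current_tree
```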

  47-51. Posterior Inference [animation over "the lazy dogs wander": the priors (which prefer combinable contexts) are combined with the current model parameters, an inside chart is computed from the PCFG part, and a candidate tree is sampled from it]

  52-57. Metropolis-Hastings / Posterior Inference [animation: the sampled new tree is compared against the existing tree under the priors and the model, and is either accepted or rejected before inference continues]

  58. Metropolis-Hastings • Sample a tree based only on the PCFG parameters. • Accept based only on the context parameters. • If the new tree is worse than the old one, it is less likely to be accepted.

  59. Experimental Results

  60. Experimental Question • When supervision is incomplete, does modeling context, and biasing toward combining contexts, help learn better parsing models?

  61. English Results [bar chart: parsing accuracy for "no context" vs. "+context combinability", as the size of the corpus from which the tag dictionary is drawn varies from 250k down to 25k tokens]

  62. Experimental Results [bar chart: parsing accuracy for "no context" vs. "+context combinability" on English, Italian, and Chinese, each with a 25k-token tag-dictionary corpus]

  63. Conclusion Under weak supervision, we can use universal grammatical knowledge about context to find trees with a better global structure.

  64. Deficiency • The generative story has a “throw-away” step if the context-generated nonterminals don’t match the tree. • We sample only over the space of valid trees (i.e., we condition on well-formed structures). • This is a benefit of the Bayesian formulation; see Smith (2011).

  65. Metropolis-Hastings With y the current tree and y′ the new (proposed) tree: P_context(y) = P_full(y) / P_pcfg(y) and P_context(y′) = P_full(y′) / P_pcfg(y′). Draw z ∼ uniform(0, 1) and accept y′ if z ≤ [P_full(y′) / P_pcfg(y′)] / [P_full(y) / P_pcfg(y)] = P_context(y′) / P_context(y).
