Recurrent Neural Network Grammars

Recurrent neural network grammars. Slide credits: Chris Dyer, Adhiguna Kuncoro. Widespread phenomenon: polarity items can only appear in certain contexts. Example: anybody is a polarity item that tends to appear only in specific contexts.


  1-4. Syntactic Composition. We need a single vector representation for a completed constituent such as (NP The hungry cat ): the open nonterminal NP and the words The hungry cat are composed into one stack symbol.

  5-7. Recursion. The composition must also work recursively. To represent (NP The (ADJP very hungry ) cat ), the inner constituent (ADJP very hungry ) is first composed into a vector v, and that vector then takes part in the NP composition alongside the word vectors for The and cat.
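As a rough illustration of what such a composition function could look like, here is a minimal sketch assuming PyTorch and a bidirectional LSTM that reads the nonterminal embedding followed by the child vectors (one common choice for RNNGs); the class name, dimensions, and the tanh projection are illustrative, not prescribed by the slides.

```python
import torch
import torch.nn as nn

class Composer(nn.Module):
    """Composes a nonterminal label and its children into one vector."""
    def __init__(self, dim):
        super().__init__()
        # Bidirectional LSTM reads [NT embedding, child_1, ..., child_n].
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, nt_emb, children):
        # nt_emb: (dim,); children: list of (dim,) vectors, which may be word
        # embeddings or previously composed constituents (the recursive case).
        seq = torch.stack([nt_emb] + children).unsqueeze(0)   # (1, n+1, dim)
        _, (h, _) = self.bilstm(seq)
        fwd_bwd = torch.cat([h[0, 0], h[1, 0]], dim=-1)        # final fwd/bwd states
        return torch.tanh(self.proj(fwd_bwd))                  # (dim,) constituent vector
```

Because children can themselves be previously composed constituents, the same module covers the recursive case on slides 5-7: the ADJP vector v is computed first and then passed in as one of the NP's children.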

  8-11. Stack symbols composed recursively mirror the corresponding tree structure. For "The hungry cat meows .", the composed NP, VP, and S symbols that appear on the stack correspond exactly to the constituents of the parse tree. Effect: the stack encodes top-down syntactic recency, rather than left-to-right string recency.

  12. Implementing RNNGs: Stack RNNs
  • Augment a sequential RNN with a stack pointer.
  • Two constant-time operations:
    • push: read an input, add it to the top of the stack, and connect it to the current location of the stack pointer.
    • pop: move the stack pointer to its parent.
  • A summary of the stack contents is obtained by accessing the output of the RNN at the location of the stack pointer.
  • Note: push and pop are discrete actions here (cf. Grefenstette et al., 2015).
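A minimal sketch of this stack RNN in Python (PyTorch is assumed; the slides do not prescribe an implementation). States live in a tree connected by parent pointers, so push and pop are both constant-time and nothing is ever recomputed or erased; the class and method names are illustrative.

```python
import torch
import torch.nn as nn

class StackRNN:
    """Sequential RNN augmented with a stack pointer (push/pop in O(1))."""
    def __init__(self, cell: nn.RNNCell, init_state: torch.Tensor):
        self.cell = cell
        # Each entry is (state, parent_index); entry 0 is the empty-stack state.
        self.states = [(init_state, None)]
        self.ptr = 0  # the stack pointer

    def push(self, x: torch.Tensor):
        # Read input x, advance the RNN from the state under the pointer,
        # and make the new state a child of the current pointer position.
        prev = self.states[self.ptr][0]
        new_state = self.cell(x.unsqueeze(0), prev.unsqueeze(0)).squeeze(0)
        self.states.append((new_state, self.ptr))
        self.ptr = len(self.states) - 1

    def pop(self):
        # Move the stack pointer back to its parent; no state is recomputed.
        self.ptr = self.states[self.ptr][1]

    def summary(self) -> torch.Tensor:
        # Summary of the stack contents = RNN output at the pointer.
        return self.states[self.ptr][0]
```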

  13-20. Implementing RNNGs: Stack RNNs (animation frames). Starting from the empty-stack state y0 (∅), the frames step through PUSH and POP operations over inputs x1, x2, x3: each PUSH reads an input and adds a new state (y1, y2, y3) as a child of the current pointer position, and each POP moves the pointer back to the parent state; earlier states are kept rather than erased, and the stack summary at any time is the RNN output at the pointer.
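Continuing the StackRNN sketch above, one possible reading of the push/pop sequence in these frames as a usage example (the inputs and sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Uses the StackRNN class sketched after slide 12.
cell = nn.RNNCell(input_size=8, hidden_size=8)
stack = StackRNN(cell, init_state=torch.zeros(8))
x1, x2, x3 = torch.randn(8), torch.randn(8), torch.randn(8)

stack.push(x1)   # pointer moves to y1
stack.pop()      # pointer back at y0; y1 is kept, not erased
stack.push(x2)   # y2 becomes a child of y0
stack.pop()      # pointer back at y0 again
stack.push(x3)   # y3 becomes another child of y0
print(stack.summary().shape)  # torch.Size([8]) -- the output at the pointer
```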

  21-32. The evolution of the stack LSTM over time mirrors the tree structure. For "The hungry cat meows .", the action sequence S( NP( The hungry cat ) VP( meows ) . ) is played out step by step: opening a bracket pushes a nonterminal symbol, terminals are pushed as they are generated, and a closing bracket pops the completed span and pushes its composed representation (first NP, then VP, then the full S), so the stack top always summarizes the partially built tree.

  33. Each word is conditioned on the history, represented by a trio of RNNs. At the point in the action sequence S( NP( The hungry cat ) VP( where meows is about to be generated, the model computes p( meows | history ) from the summaries of the three RNNs (the stack, the output buffer, and the action history; see slides 37-38).

  34. Train with backpropagation through structure. This network is dynamic: in training we backpropagate through these three RNNs, and recursively through the composed tree structure for S( NP( The hungry cat ) VP( meows ) . ). Don't derive gradients by hand (that's error prone); use automatic differentiation instead.

  35-36. Complete model. The sequence of actions completely defines the sentence x and the tree y, so the joint probability decomposes over actions:

      p(x, y) = ∏_t p(a_t | a_<t)

  Each factor is a softmax over the actions allowable at this step, scored from the embedding u_t of the tree/action history up to time t, an action embedding r_a, and a per-action bias b_a:

      p(a_t | a_<t) ∝ exp( r_{a_t} · u_t + b_{a_t} )

  The model is dynamic: there is a variable number of context-dependent allowable actions at each step.

  37-38. Complete model (architecture diagram): the action distribution at each step is computed from three encoders of the parser state: the stack, the output (buffer), and the action history.
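A minimal sketch of how the three encoder summaries might produce the action distribution from slides 35-36 (PyTorch assumed; the combination layer, names, and dimensions are illustrative, but the scoring follows r_a · u_t + b_a with a softmax restricted to the allowable actions):

```python
import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    """p(a_t | history) from stack, output-buffer, and action-history summaries."""
    def __init__(self, dim, num_actions):
        super().__init__()
        self.combine = nn.Linear(3 * dim, dim)                          # produces u_t
        self.action_emb = nn.Parameter(torch.randn(num_actions, dim))   # r_a
        self.bias = nn.Parameter(torch.zeros(num_actions))              # b_a

    def forward(self, stack_sum, buffer_sum, history_sum, allowed_mask):
        # allowed_mask: bool tensor of shape (num_actions,) marking legal actions.
        u_t = torch.tanh(self.combine(
            torch.cat([stack_sum, buffer_sum, history_sum], dim=-1)))   # (dim,)
        logits = self.action_emb @ u_t + self.bias                      # r_a . u_t + b_a
        # Softmax only over the actions allowable at this step (variable per step).
        logits = logits.masked_fill(~allowed_mask, float('-inf'))
        return torch.log_softmax(logits, dim=-1)
```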

  39. Implementing RNNGs: Parameter Estimation
  • RNNGs jointly model sequences of words together with a tree structure, p_θ(x, y).
  • Any parse tree can be converted to a sequence of actions via a depth-first traversal, and vice versa (subject to well-formedness constraints); see the sketch after this list.
  • We use trees from the Penn Treebank.
  • We could treat the non-generation actions as latent variables or learn them with RL, effectively making this a problem of grammar induction. Future work…
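A minimal sketch of the depth-first tree-to-actions conversion mentioned above, assuming the standard generative RNNG action inventory NT(X), GEN(w), and REDUCE; the tree encoding and function name are illustrative:

```python
def tree_to_actions(tree):
    """Depth-first traversal of a parse tree into an RNNG action sequence.

    A tree is either a terminal string or a tuple (label, [children]).
    """
    actions = []
    def visit(node):
        if isinstance(node, str):            # terminal word
            actions.append(("GEN", node))
        else:
            label, children = node
            actions.append(("NT", label))    # open a nonterminal: NT(X)
            for child in children:
                visit(child)
            actions.append(("REDUCE",))      # close the completed constituent
    visit(tree)
    return actions

# Example: (S (NP The hungry cat) (VP meows) .)
tree = ("S", [("NP", ["The", "hungry", "cat"]), ("VP", ["meows"]), "."])
print(tree_to_actions(tree))
# [('NT', 'S'), ('NT', 'NP'), ('GEN', 'The'), ('GEN', 'hungry'), ('GEN', 'cat'),
#  ('REDUCE',), ('NT', 'VP'), ('GEN', 'meows'), ('REDUCE',), ('GEN', '.'), ('REDUCE',)]
```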

  40. Implementing RNNGs: Inference
  • An RNNG is a joint distribution p(x, y) over strings (x) and parse trees (y).
  • We are interested in two inference questions:
    • What is p(x) for a given x? [language modeling]
    • What is argmax_y p(y | x) for a given x? [parsing]
  • Unfortunately, the dynamic programming algorithms we often rely on are of no help here.
  • We can use importance sampling to do both, by sampling from a discriminatively trained model.

  41. English PTB (Parsing)

      Model                                Type   F1
      Petrov and Klein (2007)              G      90.1
      Shindo et al. (2012), single model   G      91.1
      Shindo et al. (2012), ensemble       ~G     92.4
      Vinyals et al. (2015), PTB only      D      90.5
      Vinyals et al. (2015), ensemble      S      92.8
      Discriminative                       D      89.8
      Generative (IS)                      G      92.4

  42-47. Importance Sampling. Assume we have a conditional proposal distribution q(y | x) such that (i) p(x, y) > 0 ⟹ q(y | x) > 0, (ii) sampling y ∼ q(y | x) is tractable, and (iii) evaluating q(y | x) is tractable. Let the importance weights be

      w(x, y) = p(x, y) / q(y | x)

  Then

      p(x) = Σ_{y ∈ Y(x)} p(x, y) = Σ_{y ∈ Y(x)} w(x, y) q(y | x) = E_{y ∼ q(y|x)} [ w(x, y) ]

  Replace this expectation with its Monte Carlo estimate: draw y^(i) ∼ q(y | x) for i ∈ {1, 2, …, N}, so that

      E_{q(y|x)} [ w(x, y) ] ≈ (1/N) Σ_{i=1}^{N} w(x, y^(i))
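A minimal sketch of this Monte Carlo estimator in Python; the proposal and joint_model interfaces are hypothetical stand-ins for the discriminative proposal q(y | x) and the generative RNNG p(x, y), and the computation is done in log space for numerical stability:

```python
import math

def estimate_log_px(x, proposal, joint_model, num_samples=100):
    """Importance-sampling estimate of log p(x).

    proposal(x)       -> (y, log q(y | x)) for a sampled tree y   (assumed interface)
    joint_model(x, y) -> log p(x, y) under the generative model   (assumed interface)
    """
    log_weights = []
    for _ in range(num_samples):
        y, log_q = proposal(x)                 # y^(i) ~ q(y | x)
        log_p_xy = joint_model(x, y)
        log_weights.append(log_p_xy - log_q)   # log w(x, y^(i))
    # log( (1/N) * sum_i w_i ), computed stably via the log-sum-exp trick
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights)) - math.log(num_samples)
```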

  48. Language modeling perplexities

      English PTB (LM)     Perplexity
      5-gram IKN           169.3
      LSTM + Dropout       113.4
      Generative (IS)      102.4

      Chinese CTB (LM)     Perplexity
      5-gram IKN           255.2
      LSTM + Dropout       207.3
      Generative (IS)      171.9

  49. Do we need a stack? (Kuncoro et al., Oct 2017) • Both the stack and the action history encode the same information, but expose it to the classifier in different ways. Leaving out the stack is harmful; using the stack on its own works slightly better than the complete model!

  50. RNNG as a mini-linguist • Replace the composition function with one that computes attention over the objects in the composed sequence, using the embedding of the NT for similarity. • What does this learn?

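A minimal sketch of such an attention-based composition (PyTorch assumed; using the NT embedding as the query for dot-product attention over the composed sequence follows the description above, while the linear query map and names are illustrative):

```python
import torch
import torch.nn as nn

class AttentionComposer(nn.Module):
    """Compose a constituent as an attention-weighted sum of its children."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # maps the NT embedding to a query

    def forward(self, nt_emb, children):
        # children: (n, dim) child vectors; nt_emb: (dim,) nonterminal embedding
        q = self.query(nt_emb)                    # similarity is NT-dependent
        scores = children @ q                     # (n,) dot-product similarities
        weights = torch.softmax(scores, dim=0)    # attention over the children
        composed = weights @ children             # (dim,) weighted sum
        # The attention weights are what the "mini-linguist" analysis inspects.
        return composed, weights
```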
