Syntactic Composition

Need a representation for the completed constituent (NP The hungry cat): a single vector computed from the nonterminal NP and its children The, hungry, cat.
Recursion Need representation for: (NP The hungry cat ) (NP The (ADJP very hungry ) cat ) The hungry cat ( NP ) NP
Recursion Need representation for: (NP The hungry cat ) (NP The (ADJP very hungry ) cat ) | {z } v v The cat ( NP ) NP
Recursion Need representation for: (NP The hungry cat ) (NP The (ADJP very hungry ) cat ) | {z } v v The cat ( NP ) NP
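A minimal sketch of such a composition function, assuming a bidirectional LSTM read over the nonterminal embedding followed by the child vectors (the class name, dimensions, and summary choice are illustrative, not the exact configuration from the paper):

```python
import torch
import torch.nn as nn

class Composer(nn.Module):
    """Composes a nonterminal and its children into a single vector."""
    def __init__(self, dim):
        super().__init__()
        # Bidirectional LSTM read of [NT, child_1, ..., child_n]
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, nt_vec, child_vecs):
        # seq: (1, n+1, dim): the nonterminal embedding, then the children
        seq = torch.stack([nt_vec] + child_vecs).unsqueeze(0)
        states, _ = self.bilstm(seq)
        half = states.size(-1) // 2
        # Summarize with the final forward and backward states, then project
        summary = torch.cat([states[0, -1, :half], states[0, 0, half:]])
        return torch.tanh(self.proj(summary))

dim = 8
compose = Composer(dim)
emb = {w: torch.randn(dim) for w in ["The", "hungry", "cat", "very", "NP", "ADJP"]}

# (NP The hungry cat): one composition step
np_simple = compose(emb["NP"], [emb["The"], emb["hungry"], emb["cat"]])

# (NP The (ADJP very hungry) cat): the inner ADJP is composed first,
# and its vector v takes the place of the subtree in the outer composition
v = compose(emb["ADJP"], [emb["very"], emb["hungry"]])
np_nested = compose(emb["NP"], [emb["The"], v, emb["cat"]])
```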
Stack symbols composed recursively mirror the corresponding tree structure.

Example tree: (S (NP The hungry cat) (VP meows) .). As constituents are completed, the stack holds their composed symbols (NP, VP, and eventually S) rather than the individual words.

Effect: the stack encodes top-down syntactic recency, rather than left-to-right string recency.
Implementing RNNGs: Stack RNNs

• Augment a sequential RNN with a stack pointer
• Two constant-time operations:
  • push: read input, add to top of stack, connect to the current location of the stack pointer
  • pop: move the stack pointer to its parent
• A summary of stack contents is obtained by accessing the output of the RNN at the location of the stack pointer
• Note: push and pop are discrete actions here (cf. Grefenstette et al., 2015)
Implementing RNNGs: Stack RNNs

Example of the stack RNN evolving over a sequence of operations: starting from the empty state y0, PUSH x1 computes y1 from the state at the pointer (y0); a POP moves the pointer back to y0 without discarding or recomputing y1; later PUSHes (x2, x3) again branch off whatever state the pointer currently indicates. The operations therefore trace out a tree of RNN states rather than a single chain, and the stack summary is always the output at the pointer (see the sketch below).
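A minimal sketch of the stack RNN idea. The class and its methods are hypothetical, and a plain RNNCell stands in for whatever recurrent unit is actually used; the point is that PUSH computes a new state from the state at the pointer, while POP only moves the pointer back to that state's parent:

```python
import torch
import torch.nn as nn

class StackRNN:
    """Sequential RNN augmented with a stack pointer (push/pop in O(1))."""
    def __init__(self, cell, empty_state):
        self.cell = cell
        # Each entry: (state, index of parent entry)
        self.history = [(empty_state, None)]
        self.pointer = 0                  # index of the stack top in history

    def push(self, x):
        parent_state, _ = self.history[self.pointer]
        new_state = self.cell(x.unsqueeze(0), parent_state)
        self.history.append((new_state, self.pointer))
        self.pointer = len(self.history) - 1

    def pop(self):
        # Move the pointer to the parent; the popped state is kept for backprop
        _, parent = self.history[self.pointer]
        self.pointer = parent

    def summary(self):
        # The stack summary is the RNN output at the pointer's position
        return self.history[self.pointer][0]

dim = 8
stack = StackRNN(nn.RNNCell(dim, dim), empty_state=torch.zeros(1, dim))
x1, x2, x3 = (torch.randn(dim) for _ in range(3))

stack.push(x1)   # y1 computed from (y0, x1)
stack.pop()      # pointer back at y0; y1 is retained, not recomputed
stack.push(x2)   # y2 computed from (y0, x2): a new branch off y0
stack.push(x3)   # y3 computed from (y2, x3)
print(stack.summary().shape)   # torch.Size([1, 8])
```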
The evolution of the stack LSTM over time mirrors the tree structure.

Tree: (S (NP The hungry cat) (VP meows) .)
Linearized derivation: S( NP( The hungry cat ) VP( meows ) . )

As the derivation proceeds, open nonterminals and generated words are pushed onto the stack; when a constituent is closed, its elements are popped and replaced by a single composed symbol (first NP, then VP, then S), so the stack top always summarizes the structure built so far.
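To make the picture concrete, here is a purely symbolic sketch of how a derivation like S( NP( The hungry cat ) VP( meows ) . ) could drive the stack. The action names NT, GEN, and REDUCE follow the usual RNNG terminology (an assumption; the slides depict them as "X(" ... ")"), and the neural updates to the stack LSTM are omitted:

```python
# Hypothetical action sequence for "(S (NP The hungry cat) (VP meows) .)":
actions = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "hungry"),
           ("GEN", "cat"), ("REDUCE", None), ("NT", "VP"), ("GEN", "meows"),
           ("REDUCE", None), ("GEN", "."), ("REDUCE", None)]

stack = []          # holds ("open", label) markers and completed items
for act, arg in actions:
    if act == "NT":                      # open a constituent: push "X("
        stack.append(("open", arg))
    elif act == "GEN":                   # generate a terminal onto the stack
        stack.append(("word", arg))
    elif act == "REDUCE":                # pop children back to the open NT,
        children = []                    # then push one composed symbol
        while stack[-1][0] != "open":
            children.append(stack.pop())
        _, label = stack.pop()
        stack.append(("tree", (label, list(reversed(children)))))

print(stack)   # a single ("tree", ...) item: the composed S
```

In the full model, each of these pushes adds an embedding to the stack LSTM, and each REDUCE pushes the composed constituent vector in place of its elements.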
Each word is conditioned on the history, represented by a trio of RNNs (the stack, the output buffer, and the action history). For example, when generating meows in S( NP( The hungry cat ) VP( meows ) . ), the model computes p(meows | history).
Train with backpropagation through structure.

This network is dynamic. In training, backpropagate through the three RNNs and recursively through the composed tree structure. Don't derive gradients by hand, which is error prone; use automatic differentiation instead.
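A hedged illustration of the point: because the composition structure differs per sentence, the computation graph is built on the fly, and reverse-mode automatic differentiation (here PyTorch autograd; the original work used a dynamic-graph toolkit) delivers gradients through the recursive structure without hand-derived backprop rules. The toy composition below is not the model's actual composition function:

```python
import torch

W = torch.randn(8, 16, requires_grad=True)   # a shared composition parameter

def compose(a, b):
    # Toy composition of two child vectors into one parent vector
    return torch.tanh(W @ torch.cat([a, b]))

# The same parameter is reused at every (dynamically chosen) tree node:
the, hungry, cat = (torch.randn(8) for _ in range(3))
np_vec = compose(the, compose(hungry, cat))   # structure depends on the tree
loss = np_vec.sum()
loss.backward()                               # gradients flow through the
print(W.grad.shape)                           # whole recursive structure
```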
Complete model

The sequence of actions completely defines x and y, so p(x, y) is the product over time steps t of the probability of the action at step t given the actions up to time t. At each step, the allowable actions are scored from the action embedding, the tree/history embedding, and a per-action bias, and normalized with a softmax.

The model is dynamic: a variable number of context-dependent actions at each step.
Complete model

The history is summarized by three components: the stack, the output (buffer), and the action history.
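A minimal sketch of how the three summaries could be combined into the per-step action distribution. The class, layer names, and dimensions are illustrative assumptions; the masking step restricts the softmax to the allowable actions at this step:

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Scores the next action from the stack, buffer, and action-history RNNs."""
    def __init__(self, dim, n_actions):
        super().__init__()
        self.combine = nn.Linear(3 * dim, dim)
        self.action_emb = nn.Parameter(torch.randn(n_actions, dim))
        self.bias = nn.Parameter(torch.zeros(n_actions))

    def forward(self, stack_vec, buffer_vec, history_vec, allowed_mask):
        u = torch.tanh(self.combine(torch.cat([stack_vec, buffer_vec, history_vec])))
        scores = self.action_emb @ u + self.bias
        # Disallow actions that are invalid in the current configuration
        scores = scores.masked_fill(~allowed_mask, float("-inf"))
        return torch.log_softmax(scores, dim=-1)

dim, n_actions = 8, 5
clf = ActionClassifier(dim, n_actions)
mask = torch.tensor([True, True, False, True, False])
log_p = clf(torch.randn(dim), torch.randn(dim), torch.randn(dim), mask)
print(log_p)   # log-probabilities; masked actions get -inf
```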
Implementing RNNGs: Parameter Estimation

• RNNGs jointly model sequences of words together with a tree structure: p_θ(x, y)
• Any parse tree can be converted to a sequence of actions (depth-first traversal) and vice versa (subject to well-formedness constraints); a sketch of the conversion follows below
• We use trees from the Penn Treebank
• We could treat the non-generation actions as latent variables or learn them with RL, effectively making this a problem of grammar induction. Future work…
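A hedged sketch of the tree-to-actions conversion via depth-first traversal; the nested-tuple tree encoding is just for illustration, and the inverse direction (actions back to a tree) follows the stack discipline sketched earlier:

```python
def tree_to_actions(tree):
    """Depth-first traversal of (label, children) / terminal-string nodes."""
    if isinstance(tree, str):              # terminal: generate the word
        return [("GEN", tree)]
    label, children = tree
    actions = [("NT", label)]              # open the nonterminal
    for child in children:
        actions += tree_to_actions(child)
    actions.append(("REDUCE", None))       # close the constituent
    return actions

tree = ("S", [("NP", ["The", "hungry", "cat"]), ("VP", ["meows"]), "."])
print(tree_to_actions(tree))
# [('NT', 'S'), ('NT', 'NP'), ('GEN', 'The'), ..., ('REDUCE', None)]
```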
Implementing RNNGs: Inference

• An RNNG is a joint distribution p(x, y) over strings (x) and parse trees (y)
• We are interested in two inference questions:
  • What is p(x) for a given x? [language modeling]
  • What is max_y p(y | x) for a given x? [parsing]
• Unfortunately, the dynamic programming algorithms we often rely on are of no help here
• We can use importance sampling to do both, by sampling from a discriminatively trained model
English PTB (Parsing)

Model                                 Type   F1
Petrov and Klein (2007)               G      90.1
Shindo et al. (2012), single model    G      91.1
Shindo et al. (2012), ensemble        ~G     92.4
Vinyals et al. (2015), PTB only       D      90.5
Vinyals et al. (2015), ensemble       S      92.8
Discriminative                        D      89.8
Generative (IS)                       G      92.4
Importance Sampling

Assume we have a conditional proposal distribution q(y | x) such that:
  (i) p(x, y) > 0 ⟹ q(y | x) > 0,
  (ii) sampling y ∼ q(y | x) is tractable, and
  (iii) evaluating q(y | x) is tractable.

Let the importance weights be w(x, y) = p(x, y) / q(y | x). Then

  p(x) = Σ_{y ∈ Y(x)} p(x, y) = Σ_{y ∈ Y(x)} w(x, y) q(y | x) = E_{y ∼ q(y|x)} [ w(x, y) ]
Importance Sampling

Replace this expectation with its Monte Carlo estimate, using samples y^(i) ∼ q(y | x) for i ∈ {1, 2, …, N}:

  E_{q(y|x)} [ w(x, y) ] ≈ (1/N) Σ_{i=1}^{N} w(x, y^(i))
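A hedged sketch of the estimator in log space. The model and proposal are stand-in callables, not the actual RNNG or discriminative parser; the log-sum-exp keeps the average of the importance weights numerically stable:

```python
import math
import random

def estimate_log_px(x, sample_from_q, log_q, log_p_joint, n_samples=100):
    """Importance-sampling estimate of log p(x) using proposal q(y|x)."""
    log_weights = []
    for _ in range(n_samples):
        y = sample_from_q(x)                                    # y ~ q(y|x)
        log_weights.append(log_p_joint(x, y) - log_q(y, x))     # log w(x,y)
    # log( (1/N) * sum_i exp(log_w_i) ), computed stably
    m = max(log_weights)
    return (m + math.log(sum(math.exp(lw - m) for lw in log_weights))
            - math.log(n_samples))

# Toy check: p(x,y) uniform over 4 trees, q(y|x) also uniform, so p(x) = 1
print(math.exp(estimate_log_px(
    "x",
    sample_from_q=lambda x: random.randrange(4),
    log_q=lambda y, x: math.log(0.25),
    log_p_joint=lambda x, y: math.log(0.25),
    n_samples=10)))   # ~1.0
```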
English PTB (LM)      Perplexity
5-gram IKN            169.3
LSTM + Dropout        113.4
Generative (IS)       102.4

Chinese CTB (LM)      Perplexity
5-gram IKN            255.2
LSTM + Dropout        207.3
Generative (IS)       171.9
Do we need a stack? (Kuncoro et al., Oct 2017)

• Both the stack and the action history encode the same information, but expose it to the classifier in different ways. Leaving out the stack is harmful; using the stack on its own works slightly better than the complete model!
RNNG as a mini-linguist

• Replace the composition function with one that computes attention over the elements of the composed sequence, using the embedding of the nonterminal as the query for similarity (a sketch follows below).
• What does this learn?
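A hedged sketch of the modified composition: the nonterminal embedding acts as the query, attention weights are computed over the elements of the composed sequence, and the phrase representation is their weighted sum. This is a simplification (the class name, scaling, and lack of gating are assumptions), but it shows why the learned weights are interesting: they can be inspected to see which children the constituent representation leans on:

```python
import torch
import torch.nn as nn

class AttentionComposer(nn.Module):
    """Composes a constituent by attending over its children with the NT as query."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # maps the NT embedding to a query

    def forward(self, nt_vec, child_vecs):
        children = torch.stack(child_vecs)                 # (n, dim)
        q = self.query(nt_vec)                             # (dim,)
        scores = children @ q / children.size(-1) ** 0.5
        weights = torch.softmax(scores, dim=0)             # (n,): inspectable
        return weights @ children, weights

dim = 8
composer = AttentionComposer(dim)
nt = torch.randn(dim)
phrase, attn = composer(nt, [torch.randn(dim) for _ in ["The", "hungry", "cat"]])
print(attn)   # e.g. which child the NP representation attends to most
```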