Recurrent Neural Networks (RNNs)
• Generic RNNs: h_t = f(x_t, h_{t−1})
• Vanilla RNNs: h_t = tanh(U x_t + W h_{t−1} + b)
• GRUs (Gated Recurrent Units):
  z_t = σ(U^(z) x_t + W^(z) h_{t−1} + b^(z))   (update gate)
  r_t = σ(U^(r) x_t + W^(r) h_{t−1} + b^(r))   (reset gate)
  h̃_t = tanh(U^(h) x_t + W^(h) (r_t ⊙ h_{t−1}) + b^(h))
  h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
• Fewer parameters than LSTMs. Easier to train for comparable performance!
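To make the gating concrete, here is a minimal single-step GRU cell in NumPy following the equations above; the parameter names, shapes, and toy dimensions are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: returns h_t given input x_t and previous hidden state h_prev."""
    U_z, W_z, b_z, U_r, W_r, b_r, U_h, W_h, b_h = params
    z = sigmoid(x_t @ U_z + h_prev @ W_z + b_z)               # update gate
    r = sigmoid(x_t @ U_r + h_prev @ W_r + b_r)               # reset gate
    h_tilde = np.tanh(x_t @ U_h + (r * h_prev) @ W_h + b_h)   # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # interpolate old and new

# Toy dimensions: input size 4, hidden size 3
rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
params = [rng.standard_normal(s) * 0.1
          for s in [(d_in, d_hid), (d_hid, d_hid), (d_hid,)] * 3]
h = np.zeros(d_hid)
for x in rng.standard_normal((5, d_in)):   # run over a length-5 sequence
    h = gru_step(x, h, params)
```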
Gates
• Gates contextually control information flow
• They open/close via a sigmoid
• In LSTMs and GRUs, they are used to (contextually) maintain longer-term history
28
Bi-directional RNNs • Can incorporate context from both directions • Generally improves over uni-directional RNNs 29
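A minimal sketch of the idea, assuming a vanilla tanh RNN cell: run one RNN left-to-right, another right-to-left, and concatenate the two hidden states at each position (the cell and parameter names here are illustrative).

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    return np.tanh(x_t @ U + h_prev @ W + b)

def birnn(xs, fwd_params, bwd_params, d_hid):
    """Return [h_fwd_t ; h_bwd_t] for every position t of the input sequence xs."""
    h_f, h_b = np.zeros(d_hid), np.zeros(d_hid)
    fwd, bwd = [], []
    for x in xs:                      # left-to-right pass
        h_f = rnn_step(x, h_f, *fwd_params)
        fwd.append(h_f)
    for x in reversed(xs):            # right-to-left pass
        h_b = rnn_step(x, h_b, *bwd_params)
        bwd.append(h_b)
    bwd.reverse()                     # align backward states with positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
make = lambda: (rng.standard_normal((d_in, d_hid)) * 0.1,
                rng.standard_normal((d_hid, d_hid)) * 0.1,
                np.zeros(d_hid))
states = birnn(list(rng.standard_normal((5, d_in))), make(), make(), d_hid)
```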
Google NMT (Oct 2016)
Recursive Neural Networks
• Sometimes, inference over a tree structure makes more sense than over a sequential structure
• An example of compositionality in ideological bias detection (red → conservative, blue → liberal, gray → neutral), in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree
Example from Iyyer et al., 2014
Recursive Neural Networks
• NNs connected as a tree
• The tree structure is fixed a priori
• Parameters are shared, similarly to RNNs
Example from Iyyer et al., 2014
Tree LSTMs
• Are tree LSTMs more expressive than sequence LSTMs?
• I.e., recursive vs recurrent
• When Are Tree Structures Necessary for Deep Learning of Representations? Jiwei Li, Minh-Thang Luong, Dan Jurafsky and Eduard Hovy. EMNLP 2015.
33
Neural Probabilistic Language Model (Bengio 2003) 34
Neural Probabilistic Language Model (Bengio 2003)
• Each word prediction is a separate feedforward neural network
• The feedforward NNLM is a Markovian language model
• Dashed lines show optional direct connections
  NN_DMLP1(x) = [tanh(x W_1 + b_1), x] W_2 + b_2
  – W_1 ∈ R^(d_in × d_hid), b_1 ∈ R^(1 × d_hid): first affine transformation
  – W_2 ∈ R^((d_hid + d_in) × d_out), b_2 ∈ R^(1 × d_out): second affine transformation
35
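A minimal sketch of this "MLP with a direct connection from the input" form, assuming a fixed context of previous word embeddings concatenated into x; names and toy dimensions are illustrative, not the original implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def nnlm_predict(context_embs, W1, b1, W2, b2):
    """P(next word | context): concatenated context embeddings -> hidden -> vocab."""
    x = np.concatenate(context_embs)          # (n-1) * d_emb input vector
    h = np.tanh(x @ W1 + b1)                  # hidden layer
    features = np.concatenate([h, x])         # optional direct connection: [hidden, input]
    return softmax(features @ W2 + b2)        # distribution over the vocabulary

# Toy sizes: embedding 5, context of 2 words, hidden 8, vocabulary 20
rng = np.random.default_rng(1)
d_emb, n_ctx, d_hid, V = 5, 2, 8, 20
W1 = rng.standard_normal((n_ctx * d_emb, d_hid)) * 0.1
b1 = np.zeros(d_hid)
W2 = rng.standard_normal((d_hid + n_ctx * d_emb, V)) * 0.1
b2 = np.zeros(V)
probs = nnlm_predict([rng.standard_normal(d_emb) for _ in range(n_ctx)], W1, b1, W2, b2)
```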
LEARNING: BACKPROPAGATION
Error Backpropagation
(Next 10 slides on backpropagation are adapted from Andrew Rosenberg)
• Model parameters: θ = { w^(1)_ij, w^(2)_jk, w^(3)_kl }; for brevity: θ = { w_ij, w_jk, w_kl }
[Diagram: feedforward network with inputs x_0 … x_P, weight layers w^(1)_ij, w^(2)_jk, w^(3)_kl, and output f(x, θ)]
Learning: Gradient Descent
w_ij^(t+1) = w_ij^(t) − η ∂R/∂w_ij
w_jk^(t+1) = w_jk^(t) − η ∂R/∂w_jk
w_kl^(t+1) = w_kl^(t) − η ∂R/∂w_kl
[Diagram: the same network, with pre-activations a and activations z marked at each layer]
Backpropagation
• Starts with a forward sweep to compute all the intermediate function values z_i
• Through backprop, computes the partial derivatives ∂R/∂w_ij recursively (via the error terms δ_j)
• A form of dynamic programming
  – Instead of considering the exponentially many paths between a weight w_ij and the final loss (risk), store and reuse intermediate results
• A type of automatic differentiation (there are other variants, e.g., forward-mode differentiation, which propagates derivatives only through the forward pass)
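As a concrete illustration, here is a minimal hand-written backward pass for a one-hidden-layer network with a squared-error loss; the reverse sweep reuses the stored forward-pass activations instead of re-deriving them per weight. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                  # single input
y = 1.0                                     # target
W1, b1 = rng.standard_normal((4, 3)) * 0.1, np.zeros(3)
W2, b2 = rng.standard_normal((3, 1)) * 0.1, np.zeros(1)

# Forward sweep: store intermediate values for reuse
a1 = x @ W1 + b1
z1 = np.tanh(a1)
f = (z1 @ W2 + b2)[0]
loss = 0.5 * (y - f) ** 2

# Backward sweep: chain rule, reusing stored activations (dynamic programming)
d_f = -(y - f)                              # dL/df
d_W2 = np.outer(z1, d_f)                    # dL/dW2
d_b2 = np.array([d_f])
d_z1 = (W2 * d_f).ravel()                   # propagate back through W2
d_a1 = d_z1 * (1 - z1 ** 2)                 # through tanh
d_W1 = np.outer(x, d_a1)
d_b1 = d_a1

# One gradient-descent step
eta = 0.1
W1 -= eta * d_W1; b1 -= eta * d_b1
W2 -= eta * d_W2; b2 -= eta * d_b2
```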
Backpropagation
Toolkits and their primary interface language:
• TensorFlow (https://www.tensorflow.org/): Python
• Torch (http://torch.ch/): Lua
• Theano (http://deeplearning.net/software/theano/): Python
• CNTK (https://github.com/Microsoft/CNTK): C++
• cnn (https://github.com/clab/cnn): C++
• Caffe (http://caffe.berkeleyvision.org/): C++
Cross Entropy Loss (aka log loss, logistic loss)
• Cross entropy (p = true distribution, q = predicted distribution):
  H(p, q) = − Σ_y p(y) log q(y)
• Related quantities
  – Entropy: H(p) = − Σ_y p(y) log p(y)
  – KL divergence (the distance between two distributions p and q):
    D_KL(p || q) = Σ_y p(y) log( p(y) / q(y) )
  – H(p, q) = E_p[− log q] = H(p) + D_KL(p || q)
• Use cross entropy for models that should have a more probabilistic flavor (e.g., language models)
• Use mean squared error loss for models that focus on correct/incorrect predictions:
  MSE = ½ (y − f(x))²
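A minimal numerical check of these quantities in NumPy, assuming p and q are discrete distributions over the same support (the example values are made up for illustration):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])      # "true" distribution
q = np.array([0.5, 0.3, 0.2])      # predicted distribution

entropy = -np.sum(p * np.log(p))               # H(p)
cross_entropy = -np.sum(p * np.log(q))         # H(p, q)
kl = np.sum(p * np.log(p / q))                 # D_KL(p || q)

# Identity from the slide: H(p, q) = H(p) + D_KL(p || q)
assert np.isclose(cross_entropy, entropy + kl)
```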
RNN Learning: Backprop Through Time (BPTT)
• Similar to backprop with non-recurrent NNs
• But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters
• Backprop gradients to the parameters of each unit as if they were different parameters
• When updating the parameters using the gradients, use the average of the gradients across the entire chain of units
LEARNING: TRAINING DEEP NETWORKS
Vanishing / exploding Gradients • Deep networks are hard to train • Gradients go through multiple layers • The multiplicative effect tends to lead to exploding or vanishing gradients • Practical solutions w.r.t. – network architecture – numerical operations 44
Vanishing / exploding Gradients • Practical solutions w.r.t. network architecture – Add skip connections to reduce distance • Residual networks, highway networks, … – Add gates (and memory cells) to allow longer term memory • LSTMs, GRUs, memory networks, … 45
Gradients of deep networks
NN_layer(x) = ReLU(x W_1 + b_1), stacked: x → h_1 → h_2 → … → h_{n−1} → h_n
• Can have similar issues with vanishing gradients:
  ∂L/∂h_{n−1, j_{n−1}} = Σ_{j_n} 1(h_{n, j_n} > 0) W_{j_{n−1}, j_n} ∂L/∂h_{n, j_n}
46 Diagram borrowed from Alex Rush
Effects of Skip Connections on Gradients
• Thought Experiment: Additive Skip-Connections
  NN_sl1(x) = ½ ReLU(x W_1 + b_1) + ½ x
  ∂L/∂h_{n−1, j_{n−1}} = ½ ( Σ_{j_n} 1(h_{n, j_n} > 0) W_{j_{n−1}, j_n} ∂L/∂h_{n, j_n} ) + ½ ∂L/∂h_{n, j_{n−1}}
47 Diagram borrowed from Alex Rush
Effects of Skip Connections on Gradients
• Thought Experiment: Dynamic Skip-Connections
  NN_sl2(x) = (1 − t) ReLU(x W_1 + b_1) + t x, where t = σ(x W_t + b_t)
  W_1 ∈ R^(d_hid × d_hid), W_t ∈ R^(d_hid × 1)
48 Diagram borrowed from Alex Rush
Highway Network (Srivastava et al., 2015)
• A plain feedforward neural network: y = H(x, W_H)
  – H is a typical affine transformation followed by a non-linear activation
• Highway network: y = H(x, W_H) · T(x, W_T) + x · C(x, W_C)
  – T is a "transform gate"
  – C is a "carry gate"
  – Often C = 1 − T for simplicity
49
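A minimal sketch of a single highway layer with the C = 1 − T simplification; the parameter names, tanh transform, and negative gate bias are illustrative choices, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x)) with a tanh transform and a sigmoid gate."""
    H = np.tanh(x @ W_H + b_H)        # transform H(x, W_H)
    T = sigmoid(x @ W_T + b_T)        # transform gate; carry gate is 1 - T
    return H * T + x * (1.0 - T)

d = 6
rng = np.random.default_rng(2)
x = rng.standard_normal(d)
y = highway_layer(x,
                  rng.standard_normal((d, d)) * 0.1, np.zeros(d),
                  rng.standard_normal((d, d)) * 0.1, np.full(d, -1.0))  # bias gate toward carrying x
```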
Residual Networks
• Plain net vs. residual net
  [Diagram: a plain net stacks two weight layers with ReLU; a residual net adds an identity skip connection around any two stacked layers, so the block outputs F(x) + x before the final ReLU]
• ResNet (He et al. 2015): first very deep (152 layers) network successfully trained for object recognition
50
Residual Networks
[Diagram: residual block computing F(x) + x via an identity skip connection]
• F(x) is a residual mapping with respect to the identity
• The direct input connection (+x) leads to a nice property w.r.t. backpropagation: more direct influence from the final loss on any deep layer
• In contrast, LSTMs & highway networks allow long-distance input connections only through "gates"
51
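A minimal residual-block sketch in the same spirit: two affine-plus-ReLU "weight layers" forming F(x), plus the identity shortcut. The shapes are illustrative, and batch normalization is omitted here, unlike the full ResNet.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, b1, W2, b2):
    """y = ReLU(F(x) + x), where F is two stacked weight layers."""
    F = relu(x @ W1 + b1) @ W2 + b2   # residual mapping F(x)
    return relu(F + x)                # identity shortcut, then nonlinearity

d = 8
rng = np.random.default_rng(3)
x = rng.standard_normal(d)
y = residual_block(x,
                   rng.standard_normal((d, d)) * 0.1, np.zeros(d),
                   rng.standard_normal((d, d)) * 0.1, np.zeros(d))
```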
Residual Networks
Revolution of Depth: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014)
[Diagram: layer-by-layer architectures of AlexNet, VGG, and GoogleNet]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
52
Residual Networks
Revolution of Depth: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015)
[Diagram: the full 152-layer ResNet architecture shown next to AlexNet and VGG]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
53
Residual Networks
Revolution of Depth: ImageNet classification top-5 error (%)
• ILSVRC'10 (shallow): 28.2
• ILSVRC'11 (shallow): 25.8
• ILSVRC'12, AlexNet (8 layers): 16.4
• ILSVRC'13 (8 layers): 11.7
• ILSVRC'14, VGG (19 layers): 7.3
• ILSVRC'14, GoogleNet (22 layers): 6.7
• ILSVRC'15, ResNet (152 layers): 3.57
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
54
Highway Network (Srivastava et al., 2015)
• A plain feedforward neural network: y = H(x, W_H)
  – H is a typical affine transformation followed by a non-linear activation
• Highway network: y = H(x, W_H) · T(x, W_T) + x · C(x, W_C)
  – T is a "transform gate"
  – C is a "carry gate"
  – Often C = 1 − T for simplicity
55
@Schmidhubered 56
Vanishing / exploding Gradients • Practical solutions w.r.t. numerical operations – Gradient Clipping: bound gradients by a max value – Gradient Normalization: renormalize gradients when they are above a fixed norm – Careful initialization, smaller learning rates – Avoid saturating nonlinearities (like tanh, sigmoid) • ReLU or hard-tanh instead 57
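A minimal sketch of gradient clipping and norm-based rescaling, assuming the gradients are plain NumPy arrays; the threshold values are illustrative.

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    """Element-wise clipping: bound each gradient entry to [-max_value, max_value]."""
    return [np.clip(g, -max_value, max_value) for g in grads]

def normalize_gradients(grads, max_norm=1.0):
    """Norm rescaling: if the global gradient norm exceeds max_norm, scale it down."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```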
Sigmoid
σ(x) = 1 / (1 + e^(−x)),  σ′(x) = σ(x)(1 − σ(x))
• Often used for gates
• Pro: neuron-like, differentiable
• Con: gradients saturate to zero almost everywhere except for x near zero => vanishing gradients
• Batch normalization helps
58
Tanh
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)),  tanh′(x) = 1 − tanh²(x),  tanh(x) = 2σ(2x) − 1
• Often used for hidden states & cells in RNNs, LSTMs
• Pro: differentiable, often converges faster than sigmoid
• Con: gradients easily saturate to zero => vanishing gradients
59
Hard Tanh
hardtanh(t) = −1 if t < −1;  t if −1 ≤ t ≤ 1;  1 if t > 1
• Pro: computationally cheaper
• Con: saturates to zero easily; not differentiable at −1 and 1
60
ReLU
ReLU(x) = max(0, x)
d/dx ReLU(x) = 1 if x > 0;  0 if x < 0;  1 or 0 (subgradient) at x = 0
• Pro: doesn't saturate for x > 0, computationally cheaper, induces sparse NNs
• Con: non-differentiable at 0; we informally use subgradients
• Used widely in deep NNs, but not as much in RNNs
61
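For reference, minimal NumPy versions of the activations above, with the ReLU derivative using the common subgradient convention of 0 at x = 0 (a sketch, not tied to any library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def hardtanh(x):
    return np.clip(x, -1.0, 1.0)

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient convention: derivative taken as 0 at x == 0
    return (x > 0).astype(float)

x = np.linspace(-3, 3, 7)
# Identity from the tanh slide: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)
```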
Vanishing / exploding Gradients • Practical solutions w.r.t. numerical operations – Gradient Clipping: bound gradients by a max value – Gradient Normalization: renormalize gradients when they are above a fixed norm – Careful initialization, smaller learning rates – Avoid saturating nonlinearities (like tanh, sigmoid) • ReLU or hard-tanh instead – Batch Normalization: add intermediate input normalization layers 62
Batch Normalization 63
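A minimal sketch of what a batch-normalization layer computes at training time; the learnable scale/shift (γ, β), the ε constant, and the omission of running statistics for test time are simplifications made here for illustration.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift.

    X: (batch_size, num_features); gamma, beta: (num_features,)
    """
    mu = X.mean(axis=0)                     # per-feature batch mean
    var = X.var(axis=0)                     # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * X_hat + beta             # learnable scale and shift

X = np.random.default_rng(4).standard_normal((32, 10)) * 3 + 1
out = batch_norm(X, gamma=np.ones(10), beta=np.zeros(10))
```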
Regularization
• Regularization by an objective term:
  L(θ) = Σ_{i=1}^{n} max{0, 1 − (ŷ_c − ŷ_{c′})} + λ ||θ||²
  – Modify the loss with L1 or L2 norms
• Less depth, smaller hidden states, early stopping
• Dropout
  – Randomly delete parts of the network during training
  – Each node (and its corresponding incoming and outgoing edges) is dropped with a probability p
  – p is higher for internal nodes, lower for input nodes
  – The full network is used for testing
  – Faster training, better results
  – Vs. bagging
64
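A minimal sketch of dropout on a layer's activations, using the common "inverted dropout" convention of scaling survivors by 1/(1 − p) during training so that the full network needs no rescaling at test time; the drop probability here is illustrative.

```python
import numpy as np

def dropout(h, p_drop, train=True, rng=np.random.default_rng()):
    """Inverted dropout: zero out units with probability p_drop during training,
    scale the survivors by 1/(1 - p_drop) so test time needs no change."""
    if not train or p_drop == 0.0:
        return h                                  # full network used at test time
    mask = rng.random(h.shape) >= p_drop          # 1 = keep, 0 = drop
    return h * mask / (1.0 - p_drop)

h = np.ones(10)
h_train = dropout(h, p_drop=0.5, train=True)      # roughly half the units zeroed
h_test = dropout(h, p_drop=0.5, train=False)      # unchanged
```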
Convergence of backprop
• Without non-linearity or hidden layers, learning is convex optimization
  – Gradient descent reaches the global minimum
• Multilayer neural nets (with nonlinearity) are not convex
  – Gradient descent gets stuck in local minima
  – Selecting the number of hidden units and layers is a fuzzy process
  – NNs have made a HUGE comeback in the last few years
• Neural nets are back with a new name
  – Deep belief networks
  – Huge error reduction when trained with lots of data on GPUs
RECAP
Vanishing / exploding Gradients • Deep networks are hard to train • Gradients go through multiple layers • The multiplicative effect tends to lead to exploding or vanishing gradients • Practical solutions w.r.t. – network architecture – numerical operations 67
Vanishing / exploding Gradients • Practical solutions w.r.t. network architecture – Add skip connections to reduce distance • Residual networks, highway networks, … – Add gates (and memory cells) to allow longer term memory • LSTMs, GRUs, memory networks, … 68
seq2seq (aka "encoder-decoder")
h_t = f(x_t, h_{t−1})
y_t = softmax(V h_t)
Google NMT (Oct 2016)
ATTENTION!
Seq-to-Seq with Attention Diagram from http://distill.pub/2016/augmented-rnns/ 72
Seq-to-Seq with Attention Diagram from http://distill.pub/2016/augmented-rnns/ 73
Trial: Hard Attention
• At each step i of generating the target word, with decoder state s^t_i
• Compute the best alignment to a source word (with encoder state s^s_j)
• And incorporate that source word to generate the target word:
  w^t_{i+1} = argmax_w O(w, s^t_{i+1}, s^s_j)
• Contextual hard alignment. How?
  z_j = tanh([s^t_i, s^s_j] W + b),  j = argmax_j z_j
• Problem? (the hard argmax alignment is not differentiable, and it commits to a single source word)
74
Encoder-Decoder Architecture (Sequence-to-Sequence)
[Diagram: encoder states s^s_1, s^s_2, s^s_3 over the input "the red dog"; decoder states s^t_1, s^t_2, s^t_3 producing ŷ_1, ŷ_2, ŷ_3 starting from <s>]
Diagram borrowed from Alex Rush
75
Attention: Soft Alignments
• At each step i of generating the target word, with decoder state s^t_i
• Compute the attention over the source sequence states s^s, giving a context vector c
• And incorporate the attention to generate the target word:
  w^t_{i+1} = argmax_w O(w, s^t_{i+1}, c)
• Contextual attention as soft alignment. How?
  – Step 1: compute the attention weights
    z_j = tanh([s^t_i, s^s_j] W + b),  α = softmax(z)
  – Step 2: compute the attention (context) vector as an interpolation
    c = Σ_j α_j s^s_j
76
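A minimal sketch of the two steps with the feedforward scoring function from the slide; the dimensions are illustrative, and any of the scorers on the next slide (dot product, cosine, bilinear) could replace the score computation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_attention(s_t, S_src, W, b):
    """Step 1: score each source state against the decoder state; Step 2: interpolate."""
    scores = np.array([np.tanh(np.concatenate([s_t, s_j]) @ W + b)
                       for s_j in S_src])        # z_j = tanh([s_t ; s_j] W + b), one score per j
    alpha = softmax(scores)                      # attention weights
    context = alpha @ S_src                      # c = sum_j alpha_j * s^s_j
    return context, alpha

d = 4
rng = np.random.default_rng(5)
S_src = rng.standard_normal((6, d))              # 6 encoder states
s_t = rng.standard_normal(d)                     # current decoder state
W = rng.standard_normal(2 * d) * 0.1             # maps [s_t ; s_j] to a scalar score
b = 0.0
context, alpha = soft_attention(s_t, S_src, W, b)
```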
Attention function parameterization
• Feedforward NNs: z_j = tanh([s^t_i ; s^s_j] W + b), or z_j = tanh([s^t_i ; s^s_j ; s^t_i ⊙ s^s_j] W + b)
• Dot product: z_j = s^t_i · s^s_j
• Cosine similarity: z_j = (s^t_i · s^s_j) / (||s^t_i|| ||s^s_j||)
• Bi-linear models: z_j = (s^t_i)^T W s^s_j
77
Learned Attention! Diagram borrowed from Alex Rush 79
Qualitative results
Figure 2. Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image; "soft" (top row) vs "hard" (bottom row) attention. (Note that both models generated the same captions in this example.)
Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word).
80 (slide adapted from M. Malinowski)
POINTER NETWORKS
Convex Hull, Delaunay Triangulation, Traveling Salesman
Can we model these problems using seq-to-seq?
82
Pointer Networks! (Vinyals et al. 2015) • NNs with attention: content-based attention to input • Pointer networks: location-based attention to input 83
Pointer Networks
[Diagram: (a) Sequence-to-Sequence vs (b) Ptr-Net]
84
Pointer Networks
Attention Mechanism vs Pointer Networks
[Diagram: attention mechanism vs Ptr-Net, side by side]
• In the Ptr-Net, the softmax normalizes the score vector e_ij to be an output distribution over the dictionary of inputs
Diagram borrowed from Keon Kim
85
CopyNet (Gu et al. 2016) • Conversation – I: Hello Jack, my name is Chandralekha – R: Nice to meet you, Chandralekha – I: This new guy doesn’t perform exactly as expected. – R: what do you mean by “doesn’t perform exactly as expected?” • Translation 86
CopyNet (Gu et al. 2016)
[Diagram: (a) attention-based encoder-decoder (RNNSearch) over the source "hello , my name is Tony Jebara ."; (b) generate-mode & copy-mode, where Prob("Jebara") = Prob("Jebara", g) + Prob("Jebara", c), with a softmax over the vocabulary and the source; (c) state update with attentive and selective reads for "Tony"]
87
CopyNet (Gu et al. 2016)
• Key idea: interpolation between a generation model & a copy model
  p(y_t | s_t, y_{t−1}, c_t, M) = p(y_t, g | s_t, y_{t−1}, c_t, M) + p(y_t, c | s_t, y_{t−1}, c_t, M)   (4)
• Generate-mode:
  p(y_t, g | ·) = (1/Z) e^{ψ_g(y_t)} if y_t ∈ V;  0 if y_t ∈ X ∩ V̄;  (1/Z) e^{ψ_g(UNK)} if y_t ∉ V ∪ X   (5)
  The same scoring function as in the generic RNN encoder-decoder (Bahdanau et al., 2014) is used, i.e.
  ψ_g(y_t = v_i) = v_i^T W_o s_t,  v_i ∈ V ∪ {UNK}   (7)
  where W_o ∈ R^{(N+1)×d_s} and v_i is the one-hot indicator vector for v_i
• Copy-mode:
  p(y_t, c | ·) = (1/Z) Σ_{j: x_j = y_t} e^{ψ_c(x_j)} if y_t ∈ X;  0 otherwise   (6)
  The score for "copying" the word x_j is calculated as
  ψ_c(y_t = x_j) = σ(h_j^T W_c) s_t,  x_j ∈ X   (8)
88
BiDAF 89
NEURAL CHECKLIST
Neural Checklist Models (Kiddon et al., 2016) • What can we do with gating & attention? 91
Encoder-Decoder Architecture
[Diagram: encoder over the title "garlic tomato salsa"; decoder generating "Chop the tomatoes . Add …" from <s>]
• Want to update ingredient information as ingredients are used
• Doesn't address changing ingredients
Encode title - decode recipe
Title: "sausage sandwiches"
Generated recipe: "Cut each sandwich in halves. Sandwiches with sandwiches. Sandwiches, sandwiches, Sandwiches, sandwiches, sandwiches sandwiches, sandwiches, sandwiches, sandwiches, sandwiches, sandwiches, or sandwiches or triangles, a griddle, each sandwich. Top each with a slice of cheese, tomato, and cheese. Top with remaining cheese mixture. Top with remaining cheese. Broil until tops are bubbly and cheese is melted, about 5 minutes."
Recipe generation vs machine translation
[Diagram: decode the recipe token by token from <s>, conditioning on two input sources: the recipe title and the ingredient list (ingredient 1, 2, 3, 4)]
• Only ~6-10% of words align between input and output. The rest must be generated from context (and implicit knowledge about cooking)
• Contextual switch between two different input sources
Encoder-Decoder with Attention
[Diagram: the same encoder-decoder generating "Chop the tomatoes . Add …", now with attention over the title "garlic tomato salsa"]
• Want to update ingredient information as ingredients are used
• Doesn't address changing ingredients
Neural checklist model
Let's make salsa!
Garlic tomato salsa
☐ tomatoes  ☐ onions  ☐ garlic  ☐ salt
Neural checklist model
[Diagram: from <s> and the title "garlic tomato salsa", a hidden-state classifier chooses among non-ingredient (LM), new ingredient, and used ingredient; the new hidden state tracks which ingredients are still available; the model emits "Chop"]
Neural checklist model
[Diagram: generating "Chop the tomatoes ."; at the "tomatoes" step the classifier places probabilities (0.85, 0.10, 0.04, 0.01) over the non-ingredient / new-ingredient choices, and "tomatoes" is checked off the ingredient list]
Neural checklist model
[Diagram: generating "Dice the onions ."; the classifier probabilities are (0.00, 0.94, 0.03, 0.01), "onions" is checked off, and three ingredients are now checked]