
CMP784 DEEP LEARNING Lecture #08: Attention and Memory, Aykut Erdem



  1. CMP784 DEEP LEARNING Lecture #08 – Attention and Memory. Aykut Erdem // Hacettepe University // Spring 2018 (Title image: Sherlock Holmes' mind palace, BBC/Masterpiece's Sherlock)

  2. Breaking news! • Practical 2 is due April 6, 23:59 • Midterm exam in class next week (April 12) − Check the midterm guide for details • Practical 3 will be out tomorrow! − Language modeling with RNNs − Due Sunday, April 22, 23:59

  3. Previously on CMP784 • Sequence modeling • Recurrent Neural Networks (RNNs) • The Vanilla RNN unit • How to train RNNs • The Long Short-Term Memory (LSTM) unit and its variants • Gated Recurrent Unit (GRU) (Images: Using RNNs to generate Super Mario Maker levels, Adam Geitgey; Oleg Soroko)

  4. Lecture overview • Attention Mechanism for Deep Learning • Attention for Image Captioning • Memory Networks • End-to-end Memory Networks • Dynamic Memory Networks Disclaimer: Much of the material and slides for this lecture were borrowed from — Mateusz Malinowski’s lecture on Attention-based Networks — Graham Neubig’s CMU CS11-747 Neural Networks for NLP class — Chris Dyer’s Oxford Deep NLP class — Yoshua Bengio’s talk on From Attention to Memory and towards Longer-Term Dependencies — Sumit Chopra’s lecture on Reasoning, Attention and Memory — Jason Weston’s tutorial on Memory Networks for Language Understanding — Richard Socher’s talk on Dynamic Memory Networks

  5. Deep Learning for Vision (Figure credit: Xiaogang Wang)

  6. Deep Learning for Speech (Figure credit: NVIDIA)

  7. Deep Learning for Text (Figure: a feed-forward network with weight matrices W1, W2, W3 mapping the input words x1…x5 through two hidden layers to the prediction ŷ = positive for the sentence “The movie was not bad at all. I had fun.”)

  8. Deep Models (Figure: the generic deep-model pipeline. The input representation, e.g. “The movie was not bad at all. I had fun.”, is fed to a feature extractor / encoder F_W1 (a fully connected, convolutional, or recurrent network, which can be seen as a prior on the type of transformation you want), then to a classifier/regressor / decoder G_W2 (typically a linear projection with some non-linearity, log-softmax), and finally to a loss function.)

  9. Deep Models • A learnable parametric function • Inputs: generally considered i.i.d. • Outputs: classification or regression (Figure: the same pipeline as on the previous slide: input representation → feature extractor / encoder → classifier/regressor / decoder → loss function.)
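To make the pipeline on these two slides concrete, here is a minimal numpy sketch; all layer sizes and parameter names are invented for illustration. It composes an encoder acting as feature extractor, a linear decoder with a log-softmax, and a negative log-likelihood loss.

import numpy as np

def encoder(x, W1):
    # Feature extractor (encoder): here a single fully connected layer with a
    # non-linearity; in practice a convolutional or recurrent network.
    return np.tanh(x @ W1)

def decoder(features, W2):
    # Classifier/regressor (decoder): linear projection followed by log-softmax.
    scores = features @ W2
    return scores - np.log(np.sum(np.exp(scores), axis=-1, keepdims=True))

def nll_loss(log_probs, labels):
    # Loss function: negative log-likelihood of the correct classes.
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Toy run: 4 examples with 10-dimensional input representations, 2 classes
# (e.g. positive / negative sentiment).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))
W1 = rng.normal(size=(10, 16))
W2 = rng.normal(size=(16, 2))
labels = np.array([1, 0, 1, 1])
loss = nll_loss(decoder(encoder(x, W1), W2), labels)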

  10. Encoder-Decoder Framework • Intermediate representation of meaning = ‘universal representation’ • Encoder: from word sequence to sentence representation • Decoder: from representation to word sequence distribution (Figure: the encoder reads x_1…x_T into a representation c, from which the decoder generates y_1…y_T'; for unilingual data an English sentence is encoded and decoded back to English, for bitext data a French sentence is encoded and decoded into English.)
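A rough numpy sketch of this framework, under made-up parameter names (Wx_enc, Wh_enc, ...) and with the input word vectors assumed to be given: the encoder compresses x_1…x_T into a single vector c, and the decoder is conditioned on c at every step.

import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    # One vanilla RNN step.
    return np.tanh(x @ Wx + h @ Wh + b)

def encode(src_vectors, p):
    # Encoder: read the source word vectors x_1..x_T and compress them into a
    # single sentence representation c (the final hidden state).
    h = np.zeros(p["Wh_enc"].shape[0])
    for x in src_vectors:
        h = rnn_step(x, h, p["Wx_enc"], p["Wh_enc"], p["b_enc"])
    return h

def decode(c, p, Wout, steps):
    # Decoder: condition a second RNN on c and produce a distribution over the
    # target vocabulary at every step (greedy readout shown here).
    h, out = c, []
    for _ in range(steps):
        h = rnn_step(c, h, p["Wx_dec"], p["Wh_dec"], p["b_dec"])
        scores = h @ Wout
        probs = np.exp(scores - scores.max())
        out.append(int(np.argmax(probs / probs.sum())))
    return out

In an attention model (next slides), the decoder would instead look back at all encoder states rather than only the single vector c.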

  11. Sentence Representations • But what if we could use multiple vectors, based on the length of the sentence? (Figure: the sentence “this is an example” represented once as a single vector and once as one vector per word.)

  12. Attention

  13. Basic Idea • Encode each word in the sentence into a vector • When decoding, perform a linear combination of these vectors, weighted by “attention weights” (where to look) • Use this combination in picking the next item

  14. Calculating Attention • Use a query vector (decoder state) and key vectors (all encoder states) • For each query-key pair, calculate a weight • Normalize the weights to add to one using a softmax (Figure: translating “kono eiga ga kirai” into “I hate …”; the key vectors are the encoder states and the query vector is the decoder state; the raw scores a1=2.1, a2=-0.1, a3=0.3, a4=-1.0 are passed through a softmax to give α1=0.76, α2=0.08, α3=0.13, α4=0.03.)

  15. Calculating Attention • Combine together value vectors (usually the encoder states, like the key vectors) by taking the weighted sum • Use this combination in any part of the model you like (Figure: the value vectors for “kono eiga ga kirai” are weighted by α1=0.76, α2=0.08, α3=0.13, α4=0.03 and summed.)
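The three slides above can be summarized in a few lines of numpy. The sketch below uses dot-product scores and toy dimensions; as on the slides, the encoder states serve as both keys and values.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    # Score each key against the query, normalize the scores with a softmax,
    # and return the attention-weighted sum of the value vectors.
    scores = keys @ query            # a_1 .. a_L
    alphas = softmax(scores)         # attention weights, sum to one
    return alphas @ values, alphas   # context vector and weights

# Toy example mirroring the slide: 4 encoder states ("kono eiga ga kirai"),
# keys and values are both the encoder states, the query is the decoder state.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 8))
decoder_state = rng.normal(size=(8,))
context, alphas = attend(decoder_state, encoder_states, encoder_states)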

  16. A Graphical Example (Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015, Jean et al. 2015)

  17. End-to-End Machine Translation with Recurrent Nets and Attention Mechanism (Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015, Jean et al. 2015) (Figure: bar chart comparing phrase-based SMT, syntax-based SMT, and neural MT on translation quality over the years 2013–2016. Figure credit: Rico Sennrich)

  18. Attention Score Functions • q is the query and k is the key • Dot Product (Luong et al. 2015): a(q, k) = qᵀk − No parameters! But requires the sizes to be the same. • Multi-layer Perceptron (Bahdanau et al. 2015): a(q, k) = w₂ᵀ tanh(W₁[q; k]) − Flexible, often very good with large data • Scaled Dot Product (Vaswani et al. 2017): a(q, k) = qᵀk / √|k| − Problem: the scale of the dot product increases as the dimensions get larger − Fix: scale by the size of the vector • Bilinear (Luong et al. 2015): a(q, k) = qᵀW k
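The four score functions written out in plain numpy; this is only a sketch of the formulas above, and the parameter shapes (W, W1, w2) are whatever makes the products well-defined.

import numpy as np

def dot_product(q, k):
    # a(q, k) = q^T k : no parameters, but q and k must have the same size.
    return q @ k

def scaled_dot_product(q, k):
    # a(q, k) = q^T k / sqrt(|k|) : the scaling keeps the score magnitude
    # from growing with the vector dimension (Vaswani et al. 2017).
    return (q @ k) / np.sqrt(k.shape[-1])

def bilinear(q, k, W):
    # a(q, k) = q^T W k (Luong et al. 2015).
    return q @ W @ k

def mlp(q, k, W1, w2):
    # a(q, k) = w2^T tanh(W1 [q; k]) (Bahdanau et al. 2015).
    return w2 @ np.tanh(W1 @ np.concatenate([q, k]))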

  19. Case Study: Show, Attend and Tell. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio. ICML 2015

  20. Paying Attention to Selected Parts of the Image While Uttering Words

  21. (Figure: a recurrent decoder, conditioned on an encoded input, generating “Akiko likes Pimm’s </s>” one word at a time; at each step a softmax over the vocabulary is computed from the hidden state h_t and the sampled word is fed back in. Sutskever et al. (2014))

  22. (Figure: the same kind of recurrent decoder, now conditioned on an image, generating the caption “a man is rowing” word by word. Vinyals et al. (2014), Show and Tell: A Neural Image Caption Generator)

  23. Regions in ConvNets • Each point in a “higher” level of a convnet defines spatially localized feature vectors (/matrices). • Xu et al. call these “annotation vectors”, a_i, i ∈ {1, …, L}
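In code, going from a convolutional feature map to annotation vectors is just a reshape. The 14×14×512 shape below matches the lower convolutional layer used in Show, Attend and Tell, but any spatial feature map works the same way; the values here are random toy data.

import numpy as np

# A convolutional feature map with 14x14 spatial positions and D = 512 channels.
feature_map = np.random.default_rng(0).normal(size=(14, 14, 512))

# Flatten the spatial grid into L = 196 annotation vectors a_i of dimension D;
# each a_i describes one spatially localized region of the image.
L, D = 14 * 14, 512
annotations = feature_map.reshape(L, D)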

  24. Regions in ConvNets (Figure: one spatial location of the feature map highlighted as the annotation vector a_1.)

  25. Regions in ConvNets

  26. Regions in ConvNets

  27. Extension of LSTM via the context vector • Extract L D-dimensional annotations − From a lower convolutional layer, to have a correspondence between the feature vectors and portions of the 2-D image
(i_t; f_t; o_t; g_t) = (σ; σ; σ; tanh) T_{D+m+n, n} (E y_{t−1}; h_{t−1}; ẑ_t)   (1)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t   (2)
h_t = o_t ⊙ tanh(c_t)   (3)
where E is the embedding matrix, y are the captions, h is the previous hidden state, and ẑ is the context vector: a dynamic representation of the relevant part of the image input at time t.
e_{ti} = f_att(a_i, h_{t−1})   (an MLP conditioned on the previous hidden state)
α_{ti} = exp(e_{ti}) / Σ_{k=1}^{L} exp(e_{tk})
ẑ_t = φ({a_i}, {α_i})   (φ is the ‘attention’ / ‘focus’ function, either ‘soft’ or ‘hard’)
p(y_t | a, y_{t−1}) ∝ exp(L_o (E y_{t−1} + L_h h_t + L_z ẑ_t))
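A minimal numpy sketch of the soft-attention step defined above, with a made-up two-layer f_att (the exact layout of the attention MLP is a design choice): score each annotation vector against the previous hidden state, softmax the scores, and take the expectation to obtain the context vector ẑ_t that enters the LSTM input in Eq. (1) alongside E y_{t−1} and h_{t−1}.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def f_att(a_i, h_prev, Wa, Wh, v):
    # Hypothetical MLP scoring function conditioned on the previous hidden state.
    return v @ np.tanh(Wa @ a_i + Wh @ h_prev)

def soft_context(annotations, h_prev, Wa, Wh, v):
    # e_ti = f_att(a_i, h_{t-1}); alpha_t = softmax(e_t); z_t = sum_i alpha_ti a_i
    e = np.array([f_att(a, h_prev, Wa, Wh, v) for a in annotations])
    alpha = softmax(e)
    z_t = alpha @ annotations   # 'soft', deterministic context vector
    return z_t, alpha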

  28. Hard attention
e_{ti} = f_att(a_i, h_{t−1}),   α_{ti} = exp(e_{ti}) / Σ_{k=1}^{L} exp(e_{tk}),   ẑ_t = φ({a_i}, {α_i})
We have two sequences: ‘i’ runs over locations and ‘t’ runs over words. The stochastic decisions are discrete here, so their derivatives are zero.
p(s_{t,i} = 1 | s_{j<t}, a) = α_{t,i}
ẑ_t = Σ_i s_{t,i} a_i
The loss is a variational lower bound on the marginal log-likelihood:
L_s = Σ_s p(s | a) log p(y | s, a) ≤ log Σ_s p(s | a) p(y | s, a) = log p(y | a)
due to Jensen’s inequality, E[log X] ≤ log E[X].
∂L_s/∂W = Σ_s p(s | a) [ ∂ log p(y | s, a)/∂W + log p(y | s, a) ∂ log p(s | a)/∂W ]
Sampling s̃_t ∼ Multinoulli_L({α_i}) gives a Monte Carlo estimate of the gradient:
∂L_s/∂W ≈ (1/N) Σ_{n=1}^{N} [ ∂ log p(y | s̃ⁿ, a)/∂W + λ_r (log p(y | s̃ⁿ, a) − b) ∂ log p(s̃ⁿ | a)/∂W + λ_e ∂H[s̃ⁿ]/∂W ]
To reduce the estimator variance, the entropy term H[s] and the bias (baseline) b are added [1, 2].
[1] J. Ba et al., “Multiple object recognition with visual attention”
[2] A. Mnih et al., “Neural variational inference and learning in belief networks”
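The pieces of this hard-attention estimator can be sketched as follows; the gradient terms themselves would come from automatic differentiation in practice, λ_r, λ_e and the baseline b mirror the slide, and log_p_y_given_s is assumed to be computed by the captioning model.

import numpy as np

def sample_location(annotations, alpha, rng):
    # s_t ~ Multinoulli_L({alpha_i}): attend to exactly one location, z_t = a_{s_t}.
    i = rng.choice(len(alpha), p=alpha)
    return annotations[i], i

def reinforce_weight(log_p_y_given_s, baseline, lam_r=1.0):
    # Scalar multiplying d log p(s|a)/dW in the Monte Carlo estimate:
    # lambda_r * (log p(y | s~, a) - b); subtracting the moving-average
    # baseline b reduces the variance of the estimator.
    return lam_r * (log_p_y_given_s - baseline)

def attention_entropy(alpha, eps=1e-12):
    # Entropy H[s] whose gradient (scaled by lambda_e) is added to the estimate
    # to encourage exploration over locations.
    return -np.sum(alpha * np.log(alpha + eps))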
