Attention: The problem


  1. Attention

  2. The problem • For very long sentences, the machine translation score drops sharply after 30-40 words; with attention, performance does not degrade. Bahdanau et al. 2014. Neural machine translation by jointly learning to align and translate.

  3. Basic structure of an RNN • We want to have a notion of "time" or "sequence": the hidden state is computed from the previous hidden state and the current input. [Christopher Olah] Understanding LSTMs

  4. Basic structure of an RNN • We want to have a notion of "time" or "sequence": the hidden-state update has parameters to be learned.

  5. Basic structure of an RNN • We want to have a notion of "time" or "sequence": the output is computed from the hidden state, and the same parameters are used for each time step = generalization!

  6. Basic structure of an RNN • Unrolling RNNs: each unrolled step applies the same cell, passing the hidden state along. [Christopher Olah] Understanding LSTMs

  7. Basic structure of an RNN • Unrolling RNNs
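
The recurrence sketched on slides 3-7 is compact enough to write out. A minimal NumPy sketch (the dimensions, weight names, and tanh activation are illustrative assumptions, not from the slides), showing that the same parameters are reused at every unrolled time step:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim = 8, 4
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input  -> hidden
b_h = np.zeros(hidden_dim)

def rnn_step(h_prev, x_t):
    # New hidden state from the previous hidden state and current input.
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

xs = rng.normal(size=(5, input_dim))  # toy input sequence of length 5
h = np.zeros(hidden_dim)              # initial hidden state
for x_t in xs:
    h = rnn_step(h, x_t)              # same W_hh, W_xh at every step
```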

  8. Long-term dependencies • "I moved to Germany … so I speak German fluently"

  9. Attention: intuition • ATTENTION: Which hidden states are more important to predict my output? "I moved to Germany … so I speak German fluently"

  10. Attention: intuition • A context vector is built from attention weights α_{1,t+1}, …, α_{t,t+1}, α_{t+1,t+1} over the hidden states of "I moved to Germany … so I speak German fluently"

  29. Attention: architecture
      • A decoder D processes the information
      • Decoders take as input:
        – the previous decoder hidden state
        – the previous output
        – the attention (context)

  12. Attention
      • The attention α_{1,t+1} indicates how important the word in position 1 is for translating the word in position t+1
      • The context aggregates the attention: c_{t+1} = Σ_{k=1}^{t+1} α_{k,t+1} a_k
      • Soft attention: all attention weights α sum up to 1
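
The context sum is a one-liner. A minimal NumPy sketch (the encoder states and α values are toy numbers, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=(6, 8))                         # encoder states a_1..a_6
alpha = np.array([0.05, 0.05, 0.1, 0.1, 0.2, 0.5])  # attention weights
assert np.isclose(alpha.sum(), 1.0)                 # soft attention: sums to 1
c = (alpha[:, None] * a).sum(axis=0)                # c = sum_k alpha_k * a_k
```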

  13. Computing the attention mask
      • We can train a small neural network: f_{1,t+1} = NN(d_t, a_1), where d_t is the previous state of the decoder and a_1 is a hidden state of the encoder
      • Normalize: α_{1,t+1} = exp(f_{1,t+1}) / Σ_{k=1}^{t+1} exp(f_{k,t+1})
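
A sketch of that computation. The slide does not specify the small network's architecture, so the concat-tanh-project form below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
dec_dim, enc_dim, score_dim = 8, 8, 16
W = rng.normal(size=(score_dim, dec_dim + enc_dim)) * 0.1
v = rng.normal(size=score_dim) * 0.1

def score(d_t, a_k):
    # Assumed small NN: concatenate decoder/encoder states, tanh, project.
    return v @ np.tanh(W @ np.concatenate([d_t, a_k]))

d_t = rng.normal(size=dec_dim)                # previous decoder state
a = rng.normal(size=(6, enc_dim))             # encoder hidden states a_k
f = np.array([score(d_t, a_k) for a_k in a])  # scores f_{k,t+1}
alpha = np.exp(f - f.max())                   # (max subtracted for stability)
alpha /= alpha.sum()                          # softmax normalization
```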

  14. Attention for vision

  15. Why do we need attention? • We use the whole image to make the classification ("BIRD") • Are all pixels equally important?

  16. Why do we need attention? • Wouldn't it be easier and computationally more efficient to just run our classification network on the patch?

  17. Soft attention for captioning

  18. Image captioning • Xu et al. 2015. Show, attend and tell: neural image caption generation with visual attention.

  19. Image captioning
      • Input: an image
      • Output: a sentence describing the image
      • Encoder: a classification CNN (VGGNet, AlexNet) that computes a feature map over the image
      • Decoder: an attention-based RNN
        – At each time step, the decoder computes an attention map over the entire image, effectively deciding which regions to focus on
        – It receives a context vector, which is the weighted average of the conv-net features
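
A toy, runnable NumPy sketch of this decode loop. Every component here is a simplified stand-in (random "conv" features, a single-layer tanh recurrence, integer word ids, a bilinear attention score), not the actual Xu et al. 2015 model:

```python
import numpy as np

rng = np.random.default_rng(3)
n_regions, feat_dim, hid_dim, vocab = 14 * 14, 32, 16, 100

feats = rng.normal(size=(n_regions, feat_dim))       # stand-in conv features (no FC)
W_fh = rng.normal(size=(feat_dim, hid_dim)) * 0.1    # scores regions against state
W_h = rng.normal(size=(hid_dim, hid_dim + feat_dim + 1)) * 0.1
W_out = rng.normal(size=(vocab, hid_dim)) * 0.1

h, word, caption = np.zeros(hid_dim), 0, []
for _ in range(10):                               # decode word by word
    scores = feats @ (W_fh @ h)                   # attention map over regions
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                          # one weight per region
    context = alpha @ feats                       # weighted average of features
    h = np.tanh(W_h @ np.concatenate([h, context, [word]]))
    word = int(np.argmax(W_out @ h))              # next "word" (an integer id)
    caption.append(word)
```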

  20. Conventional captioning • The LSTM only sees the image once! (Encoder → Decoder) Image from: https://blog.heuritech.com/2016/01/20/attention-mechanism/

  21. Attention mechanism • "A girl is throwing a frisbee in the park"

  22. Attention mechanism • "A girl is throwing a frisbee in the park"

  23. Attention mechanism • "A girl is throwing a frisbee in the park"

  24. Attention mechanism • "A girl is throwing a frisbee in the park"

  25. Attention mechanism
      • y_i: the outputs of the encoder are the image features, which still retain spatial information (no FC layer!)
      • z_t: the output of the attention model
      • h_t: the hidden state of the LSTM

  26. Attention mechanism • What does the attention model look like?

  27. Attention model • Attention architecture: the inputs are the visual features and any past hidden state; the output is the attention. Image: https://blog.heuritech.com/2016/01/20/attention-mechanism/

  28. Attention model • Inputs = a feature descriptor for each image patch

  29. Attention model • Inputs = a feature descriptor for each image patch, still related to the spatial location in the image

  30. Attention model • We want a bounded output: e_i = tanh(W_h h + W_y y_i)

  31. Attention model • Softmax to create the attention values between 0 and 1

  32. Attention model • Multiplied by the image features → a ranking by importance
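
Putting slides 30-32 together: a minimal NumPy sketch, assuming the reconstructed score e_i = tanh(W_h h + W_y y_i) and a learned projection w (the exact weight names and shapes on the slides are not recoverable, so these are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, feat_dim, hid_dim, k = 9, 12, 10, 8
y = rng.normal(size=(n, feat_dim))        # one feature per image patch
h = rng.normal(size=hid_dim)              # a past hidden state
W_h = rng.normal(size=(k, hid_dim)) * 0.1
W_y = rng.normal(size=(k, feat_dim)) * 0.1
w = rng.normal(size=k) * 0.1              # projects each score to a scalar

e = np.tanh(h @ W_h.T + y @ W_y.T)        # bounded outputs, shape (n, k)
f = e @ w                                 # one scalar per patch
s = np.exp(f - f.max())
s /= s.sum()                              # softmax: values between 0 and 1
z = (s[:, None] * y).sum(axis=0)          # weighted (soft) feature average
```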

  33. Hard attention model • Choose one of the features by sampling with probabilities s_i
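
The hard-attention counterpart of the soft sketch above, with toy features and a uniform s for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(size=(9, 12))       # patch features, as in the soft sketch
s = np.full(9, 1 / 9)              # attention probabilities s_i (sum to 1)
idx = rng.choice(len(s), p=s)      # sample ONE patch index stochastically
z = y[idx]                         # hard attention: a single "cropped" feature
# The sampling step is not differentiable, so the gradient has to be
# estimated (e.g. via Monte Carlo sampling), as the next slides note.
```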

  34. Types of attention
      • Soft attention: a deterministic process that can be backproped
      • Hard attention: a stochastic process; the gradient is estimated through Monte Carlo sampling
      • Soft attention is the most commonly used, since it can be incorporated into the optimization more easily

  35. Types of attention • Soft vs. hard attention

  36. Types of attention: soft (final context from the attention module)
      • Can be backproped
      • Uses all the image
      Image: Stanford CS231n lecture

  37. Types of attention: hard
      • You can view it as image cropping!
      • If we cannot use gradient descent, what alternative could we use to train this function? Reinforcement learning
      Image: Stanford CS231n lecture

  38. Image captioning with attention • Xu et al. 2015. Show, attend and tell: neural image caption generation with visual attention.

  39. Interesting works on attention
      • Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015
      • Chan et al., "Listen, Attend and Spell", arXiv 2015
      • Chorowski et al., "Attention-Based Models for Speech Recognition", NIPS 2015
      • Yao et al., "Describing Videos by Exploiting Temporal Structure", ICCV 2015
      • Xu and Saenko, "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering", arXiv 2015
      • Zhu et al., "Visual7W: Grounded Question Answering in Images", arXiv 2015
      • Chu et al., "Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism", ICCV 2017

  40. Conditioning

  41. When do we need conditioning? • Scene understanding from an image and an audio source: both need to be processed!

  42. When do we need conditioning? • Visual Question Answering: the sentence (question) needs to be understood, and the image is needed to create the answer.

  43. When do we need conditioning? • Visual Question Answering: the sentence (question) needs to be understood, and the image is needed to create the answer.

  44. When do we need conditioning?
      • We have two sources; can we process one in the context of the other?
      • Conditioning: the computation carried out by a model is conditioned, or modulated, by information extracted from an auxiliary input
      • Note: a similar thing can be obtained with attention (see slide 39)

  45. When do we need conditioning? • Generate images based on a word • Do we need to retrain a model for each word? Image: https://distill.pub/2018/feature-wise-transformations/
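
A minimal sketch of feature-wise conditioning in the spirit of the linked Distill article (a FiLM-style transformation; the shapes and weight names below are illustrative assumptions, not the slides' model):

```python
import numpy as np

rng = np.random.default_rng(6)
feat_dim, cond_dim = 16, 8
features = rng.normal(size=feat_dim)   # from the main network (e.g. image)
word_emb = rng.normal(size=cond_dim)   # auxiliary input (e.g. the word)

W_gamma = rng.normal(size=(feat_dim, cond_dim)) * 0.1
W_beta = rng.normal(size=(feat_dim, cond_dim)) * 0.1

gamma = W_gamma @ word_emb             # per-feature scale from the word
beta = W_beta @ word_emb               # per-feature shift from the word
conditioned = gamma * features + beta  # main computation, modulated by the word
```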
