Attention
The problem • For very long sentences, the machine translation score drops sharply after 30-40 words; with attention, this performance degradation is avoided. Bahdanau et al. 2014. Neural Machine Translation by Jointly Learning to Align and Translate.
Basic structure of an RNN • We want to have a notion of "time" or "sequence": the hidden state depends on the previous hidden state and the current input. [Christopher Olah, Understanding LSTMs]
Basic structure of an RNN • We want to have a notion of "time" or "sequence". The hidden state is computed from parameters to be learned.
Basic structure of an RNN • The output and hidden state use the same parameters at every time step = generalization! (see the code sketch after the unrolling slides)
Basic structure of an RNN • Unrolling RNNs: the hidden state (and its parameters) is the same at every step. [Christopher Olah, Understanding LSTMs]
Basic structure of an RNN • Unrolling RNNs.
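To make the recurrence concrete, here is a minimal sketch of a vanilla RNN step and its unrolling over a sequence; the weight names (W_xh, W_hh, W_hy) and the tanh nonlinearity are illustrative assumptions, not taken from the slides. Note how the same parameters are reused at every time step.

```python
import numpy as np

# One vanilla RNN step: the new hidden state depends on the previous hidden
# state and the current input; the same weights are used at every step.
def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # new hidden state
    y_t = W_hy @ h_t + b_y                           # output at this step
    return h_t, y_t

# Unrolling: apply the same step function over the whole input sequence.
def rnn_forward(xs, h0, params):
    h, outputs = h0, []
    for x_t in xs:
        h, y_t = rnn_step(x_t, h, *params)
        outputs.append(y_t)
    return outputs, h
```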
Long-term dependencies • "I moved to Germany … so I speak German fluently."
Attention: intuition • ATTENTION: which hidden states are more important to predict my output? "I moved to Germany … so I speak German fluently."
Attention: intuition • The context is built from the encoder hidden states, weighted by attention values α_{1,t+1}, …, α_{t,t+1}, α_{t+1,t+1}. "I moved to Germany … so I speak German fluently."
Attention: architecture • A decoder processes the information. • The decoder takes as input: – the previous decoder hidden state – the previous output – the attention context.
Attention • α_{k,t+1} indicates how important the word at position k is for translating the word at position t+1. • The context aggregates the attention: c_{t+1} = Σ_{k=1}^{t+1} α_{k,t+1} a_k. • Soft attention: all attention weights α sum up to 1.
Computing the attention mask • We can train a small neural network: f_{1,t+1} = NN(d_t, a_1), where d_t is the previous state of the decoder and a_1 is a hidden state of the encoder. • Normalize: α_{1,t+1} = exp(f_{1,t+1}) / Σ_{k=1}^{t+1} exp(f_{k,t+1}).
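A minimal sketch of this attention computation, assuming an additive scoring network in the spirit of Bahdanau et al.; the weight names (W_d, W_a, v) are illustrative. It scores every encoder hidden state against the previous decoder state, normalizes with a softmax, and returns the context vector.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# d_prev: previous decoder hidden state; A: encoder hidden states, one row per word.
def attention_context(d_prev, A, W_d, W_a, v):
    # A small neural network scores each encoder state against the decoder state.
    scores = np.array([v @ np.tanh(W_d @ d_prev + W_a @ a_k) for a_k in A])
    alphas = softmax(scores)   # attention weights, sum to 1 (soft attention)
    context = alphas @ A       # weighted average of the encoder hidden states
    return context, alphas
```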
Attention for vision
Why do we need attention? • We use the whole image to make the classification ("BIRD"). • Are all pixels equally important?
Why do we need attention? • Wouldn't it be easier and computationally more efficient to just run our classification network on the relevant patch?
Soft attention for captioning
Image captioning • Xu et al. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Image captioning • Input: an image. • Output: a sentence describing the image. • Encoder: a classification CNN (VGGNet, AlexNet) that computes a feature map over the image. • Decoder: an attention-based RNN – at each time step, the decoder computes an attention map over the entire image, effectively deciding which regions to focus on – it receives a context vector, the weighted average of the conv-net features.
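A minimal sketch of how such an attention-based decoder could look in PyTorch; the class, layer names, and the simple concatenation-based scoring are illustrative assumptions, not the exact model of Xu et al. The encoder's conv feature map is assumed to be reshaped to one feature vector per spatial region.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)     # scores one region
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, words, state):
        # feats: (num_regions, feat_dim); words: previous-word indices, each of shape (1,)
        h, c = state                                         # each (1, hidden_dim)
        logits = []
        for w in words:                                      # one time step per word
            hs = h.expand(feats.size(0), -1)
            scores = self.attn(torch.cat([feats, hs], dim=1))    # (num_regions, 1)
            alphas = torch.softmax(scores, dim=0)                # attention map over regions
            context = (alphas * feats).sum(dim=0, keepdim=True)  # weighted average of features
            h, c = self.lstm(torch.cat([self.embed(w), context], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits)
```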
Conventional captioning • The LSTM only sees the image once: encoder → decoder. Image from: https://blog.heuritech.com/2016/01/20/attention-mechanism/
Attention mechanism • "A girl is throwing a frisbee in the park."
Attention mechanism • y_i: output of the encoder, the image features, which still retain spatial information (no FC layer!) • z_t: output of the attention model • h_t: hidden state of the LSTM.
Attention mechanism • What does the attention model look like?
Attention model • Attention architecture: the inputs are the visual features and any past hidden state; the output is the attention. Image: https://blog.heuritech.com/2016/01/20/attention-mechanism/
Attention model • Inputs = a feature descriptor for each image patch.
Attention model • Inputs = a feature descriptor for each image patch, still related to a spatial location of the image.
Attention model • We want a bounded output: e_i = tanh(W_h h + W_y y_i).
Attention model • A softmax creates the attention values s_i, between 0 and 1 (and summing to 1).
Attention model • The attention values are multiplied by the image features → a ranking by importance: z = Σ_i s_i y_i.
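Putting the three steps together (bounded scores, softmax, weighted features), here is a minimal sketch of the soft attention module; the weight names W_h, W_y and the scoring vector w used to reduce each score to a scalar are assumptions, not from the slides.

```python
import numpy as np

# Y: image features, one row per spatial location; h: current LSTM hidden state.
def soft_attention(Y, h, W_h, W_y, w):
    e = np.tanh(Y @ W_y.T + W_h @ h)   # bounded scores (tanh), one row per location
    scores = e @ w                     # reduce each row to a single scalar
    s = np.exp(scores - scores.max())
    s = s / s.sum()                    # softmax: attention values in [0, 1], sum to 1
    z = s @ Y                          # weighted image features = context vector
    return z, s
```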
Hard attention model • Choose one of the features by sampling with probabilities s_i.
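For contrast, a minimal sketch of the hard attention choice: instead of a weighted average, a single location is sampled according to the probabilities s_i (a stochastic choice, so it cannot be backpropagated through directly).

```python
import numpy as np

# Y: image features, one row per spatial location; s: attention probabilities (sum to 1).
def hard_attention(Y, s, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(Y), p=s)   # sample ONE spatial location
    return Y[idx], idx              # context = the selected feature ("image crop")
```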
Types of attention • Soft attention: a deterministic process that can be backpropagated through. • Hard attention: a stochastic process; the gradient is estimated through Monte Carlo sampling. • Soft attention is the most commonly used, since it can be incorporated into the optimization more easily.
Types of attention • Soft vs. hard attention.
Types of attention: soft • The attention module uses all of the image and produces the final context. • Can be backpropagated through. Image: Stanford CS231n lecture
Types of attention: hard • You can view it as image cropping! • If we cannot use gradient descent, what alternative could we use to train this function? Reinforcement learning. Image: Stanford CS231n lecture
Image captioning with attention • Xu et al. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Interesting works on attention • Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015 • Chan et al., "Listen, Attend and Spell", arXiv 2015 • Chorowski et al., "Attention-Based Models for Speech Recognition", NIPS 2015 • Yao et al., "Describing Videos by Exploiting Temporal Structure", ICCV 2015 • Xu and Saenko, "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering", arXiv 2015 • Zhu et al., "Visual7W: Grounded Question Answering in Images", arXiv 2015 • Chu et al., "Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism", ICCV 2017
Conditioning
When do we need conditioning? • Scene understanding from an image and an audio source: both need to be processed!
When do we need conditioning? • Visual Question Answering: the sentence (the question) needs to be understood, and the image is needed to create the answer.
When do we need conditioning? • We have two sources; can we process one in the context of the other? • Conditioning: the computation carried out by a model is conditioned or modulated by information extracted from an auxiliary input. • Note: a similar thing can be obtained with attention (see p. 39).
When do we need conditioning? • Generate images based on a word. • Do we need to retrain a model for each word? Image: https://distill.pub/2018/feature-wise-transformations/
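A minimal sketch of feature-wise conditioning in the spirit of the linked distill.pub article (FiLM-style); all names are illustrative assumptions. A small linear mapping turns the auxiliary input (e.g. a word embedding) into a per-channel scale and shift that modulate the main feature map, so one model can serve many words without retraining.

```python
import numpy as np

# features: main-network feature map of shape (C, H, W);
# cond: embedding of the auxiliary input (e.g. the word to generate).
def condition(features, cond, W_gamma, W_beta):
    gamma = W_gamma @ cond   # per-channel scale predicted from the auxiliary input
    beta = W_beta @ cond     # per-channel shift predicted from the auxiliary input
    return gamma[:, None, None] * features + beta[:, None, None]
```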