Sequence-to-Sequence with RNNs and Attention

Use a different context vector in each timestep of the decoder:
- The input sequence is not bottlenecked through a single vector
- At each timestep of the decoder, the context vector "looks at" a different part of the input sequence

[Diagram: encoder states h_1..h_4 over the input "we are eating bread"; decoder states s_0..s_4 consume context vectors c_1..c_4 and previous outputs y_0..y_3 to produce "estamos comiendo pan [STOP]".]

Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
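To make the per-timestep computation concrete, here is a minimal sketch of one decoder step with attention, assuming PyTorch; the alignment function `f_att` shown here (a dot product against a concatenation) is just an illustrative stand-in, not the exact function from the paper.

```python
import torch
import torch.nn.functional as F

def attention_context(s_prev, h, f_att):
    """Compute the context vector c_t for one decoder timestep.

    s_prev: previous decoder state, shape (D_s,)
    h:      encoder hidden states, shape (T, D_h)
    f_att:  alignment function mapping (s_prev, h_i) -> scalar score
    """
    # Alignment scores e_{t,i} = f_att(s_{t-1}, h_i), one per input position
    e = torch.stack([f_att(s_prev, h_i) for h_i in h])          # (T,)
    # Attention weights: normalize the scores over input positions
    a = F.softmax(e, dim=0)                                     # (T,)
    # Context vector: attention-weighted sum of encoder states
    c = (a.unsqueeze(1) * h).sum(dim=0)                         # (D_h,)
    return c, a

# Illustrative alignment function: a linear score on [s; h_i] (a stand-in choice)
D_s, D_h, T = 8, 8, 4
W = torch.randn(D_s + D_h)
f_att = lambda s, h_i: torch.dot(W, torch.cat([s, h_i]))

c1, a1 = attention_context(torch.randn(D_s), torch.randn(T, D_h), f_att)
```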
Visualizing the attention weights a_{t,i} for English-to-French translation:
Input: "The agreement on the European Economic Area was signed in August 1992."
Output: "L'accord sur la zone économique européenne a été signé en août 1992."
- Diagonal attention means the words correspond in order
- Off-diagonal blocks show that attention figures out different word orders
- Attention also spreads across several input words where verb conjugation differs between the languages

Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
The decoder doesn't use the fact that the h_i form an ordered sequence - it just treats them as an unordered set {h_i}. This means we can use a similar architecture given any set of input hidden vectors {h_i}!

Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
Image Captioning with RNNs and Attention

Use a CNN to compute a grid of features for an image: one feature vector h_{i,j} per spatial position.

[Diagram: cat image -> CNN -> 3x3 grid of feature vectors h_{1,1}..h_{3,3}; initial decoder state s_0.]

Cat image is free to use under the Pixabay License.
Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
At each timestep t, the decoder attends over the feature grid:
- Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
- Attention weights: a_{t,:,:} = softmax(e_{t,:,:})
- Context vector: c_t = sum_{i,j} a_{t,i,j} h_{i,j}
The new decoder state s_t is computed from the previous state s_{t-1}, the context vector c_t, and the previous word y_{t-1}, and is used to predict the next word (here s_1 outputs "cat", s_2 outputs "sitting", and so on). The same three steps repeat at every timestep, each time recomputing the alignment scores against the new decoder state.

Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
Each timestep of the decoder uses a different context vector that looks at different parts of the input image; the example caption comes out as "cat sitting outside [STOP]".

Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
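A minimal sketch of one captioning decoder step over a spatial feature grid, assuming PyTorch; the bilinear alignment weights `W_att` and the helper name `caption_step` are hypothetical, and the RNN state update itself is omitted.

```python
import torch
import torch.nn.functional as F

def caption_step(s_prev, feats, W_att):
    """One attention step over a CNN feature grid (sketch).

    s_prev: previous decoder state, shape (D,)
    feats:  CNN features h_{i,j}, shape (H, W, D)
    W_att:  hypothetical bilinear alignment weights, shape (D, D)
    """
    H, W_, D = feats.shape
    flat = feats.reshape(H * W_, D)                  # treat the grid as a set of vectors
    e = flat @ (W_att @ s_prev)                      # e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
    a = F.softmax(e, dim=0).reshape(H, W_)           # a_{t,:,:} = softmax(e_{t,:,:})
    c = (a.reshape(-1, 1) * flat).sum(dim=0)         # c_t = sum_{i,j} a_{t,i,j} h_{i,j}
    return c, a

c, a = caption_step(torch.randn(512), torch.randn(3, 3, 512), torch.randn(512, 512))
```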
[Figures: example images with generated captions and, for each generated word, a visualization of the attention weights over the image.]

Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
Human Vision: Fovea

Light enters the eye and the retina detects it, but the fovea is a tiny region of the retina that can see with high acuity.

[Acuity graph is licensed under CC A-SA 3.0 Unported (no changes made). Eye image is licensed under CC A-SA 3.0 Unported (added black arrow, green arc, and white circle).]

Human Vision: Saccades

Because only the fovea sees sharply, human eyes are constantly moving (saccades) - we just don't notice.

[Saccade video is licensed under CC A-SA 4.0 International (no changes made).]
Image Captioning with RNNs and Attention

The attention weights at each timestep are kind of like the saccades of the human eye: the model repeatedly shifts its "gaze" over the image as it generates each word.

Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
Saccade video is licensed under CC A-SA 4.0 International (no changes made).
X, Attend, and Y

- "Show, attend, and tell" (Xu et al, ICML 2015): Look at an image, attend to image regions, produce a caption
- "Ask, attend, and answer" (Xu and Saenko, ECCV 2016) and "Show, ask, attend, and answer" (Kazemi and Elqursh, 2017): Read the text of a question, attend to image regions, produce an answer
- "Listen, attend, and spell" (Chan et al, ICASSP 2016): Process raw audio, attend to audio regions while producing text
- "Listen, attend, and walk" (Mei et al, AAAI 2016): Process text, attend to text regions, output navigation commands
- "Show, attend, and interact" (Qureshi et al, ICRA 2017): Process an image, attend to image regions, output robot control commands
- "Show, attend, and read" (Li et al, AAAI 2019): Process an image, attend to image regions, output text
Attention Layer

We can abstract the attention mechanism from the captioning example into a generic attention layer.

Inputs:
- Query vector: q (shape: D_Q)
- Input vectors: X (shape: N_X x D_X)
- Similarity function: f_att

Computation:
- Similarities: e (shape: N_X), e_i = f_att(q, X_i)
- Attention weights: a = softmax(e) (shape: N_X)
- Output vector: y = sum_i a_i X_i (shape: D_X)
Changes: use dot product for similarity.

Inputs:
- Query vector: q (shape: D_Q)
- Input vectors: X (shape: N_X x D_Q)
- Similarity function: dot product

Computation:
- Similarities: e (shape: N_X), e_i = q · X_i
- Attention weights: a = softmax(e) (shape: N_X)
- Output vector: y = sum_i a_i X_i (shape: D_Q)
Changes: use scaled dot product for similarity.

Inputs:
- Query vector: q (shape: D_Q)
- Input vectors: X (shape: N_X x D_Q)
- Similarity function: scaled dot product

Computation:
- Similarities: e (shape: N_X), e_i = q · X_i / sqrt(D_Q)
- Attention weights: a = softmax(e) (shape: N_X)
- Output vector: y = sum_i a_i X_i (shape: D_Q)
Why scale the dot product? Large similarities will cause the softmax to saturate and give vanishing gradients. Recall a · b = |a||b| cos(angle). Suppose a and b are D-dimensional vectors whose elements are all equal to a constant a; then |a| = (sum_i a_i^2)^{1/2} = a * sqrt(D), so dot products grow with the dimension D. Dividing by sqrt(D_Q) keeps the scale of the similarities roughly constant as the dimension grows.

Changes: use scaled dot product for similarity: e_i = q · X_i / sqrt(D_Q).
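A quick numerical illustration of why scaling matters, assuming PyTorch; random vectors stand in for queries and inputs. The raw dot-product scores grow roughly like sqrt(D), so the softmax tends to put almost all of its mass on one element, while the scaled scores stay in a fixed range.

```python
import torch
import torch.nn.functional as F

# For random D-dimensional q and x, the dot product q . x has standard deviation
# about sqrt(D), so raw scores grow with dimension and the softmax saturates;
# dividing by sqrt(D) keeps the score scale roughly constant.
torch.manual_seed(0)
for D in [16, 256, 4096]:
    q = torch.randn(D)
    X = torch.randn(8, D)
    raw = X @ q
    scaled = raw / D ** 0.5
    print(D,
          raw.std().item(), scaled.std().item(),
          F.softmax(raw, dim=0).max().item(),      # typically near 1.0 for large D
          F.softmax(scaled, dim=0).max().item())
```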
Changes: use scaled dot product for similarity; multiple query vectors.

Inputs:
- Query vectors: Q (shape: N_Q x D_Q)
- Input vectors: X (shape: N_X x D_Q)

Computation:
- Similarities: E = QX^T / sqrt(D_Q) (shape: N_Q x N_X), E_{i,j} = (Q_i · X_j) / sqrt(D_Q)
- Attention weights: A = softmax(E, dim=1) (shape: N_Q x N_X)
- Output vectors: Y = AX (shape: N_Q x D_Q), Y_i = sum_j A_{i,j} X_j
Changes: use scaled dot product for similarity; multiple query vectors; separate key and value transformations.

Inputs:
- Query vectors: Q (shape: N_Q x D_Q)
- Input vectors: X (shape: N_X x D_X)
- Key matrix: W_K (shape: D_X x D_Q)
- Value matrix: W_V (shape: D_X x D_V)

Computation:
- Key vectors: K = X W_K (shape: N_X x D_Q)
- Value vectors: V = X W_V (shape: N_X x D_V)
- Similarities: E = QK^T / sqrt(D_Q) (shape: N_Q x N_X), E_{i,j} = (Q_i · K_j) / sqrt(D_Q)
- Attention weights: A = softmax(E, dim=1) (shape: N_Q x N_X)
- Output vectors: Y = AV (shape: N_Q x D_V), Y_i = sum_j A_{i,j} V_j
[Diagram: the attention layer as a computation graph. The input vectors X_1..X_3 are transformed into keys K_1..K_3 and values V_1..V_3; each query Q_1..Q_4 is compared against every key to give the similarity grid E_{i,j}; a softmax over each query's scores against all keys gives the attention weights A_{i,j}; each output Y_1..Y_4 is the attention-weighted sum of the values ("Product, then Sum").]
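A minimal sketch of this attention layer, assuming PyTorch; the function name `attention` and the example shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

def attention(Q, X, W_K, W_V):
    """Attention layer as defined above (sketch).

    Q:   query vectors,  shape (N_Q, D_Q)
    X:   input vectors,  shape (N_X, D_X)
    W_K: key matrix,     shape (D_X, D_Q)
    W_V: value matrix,   shape (D_X, D_V)
    """
    K = X @ W_K                                    # keys,   (N_X, D_Q)
    V = X @ W_V                                    # values, (N_X, D_V)
    E = Q @ K.T / Q.shape[1] ** 0.5                # scaled similarities, (N_Q, N_X)
    A = F.softmax(E, dim=1)                        # attention weights over inputs
    Y = A @ V                                      # outputs, (N_Q, D_V)
    return Y, A

Y, A = attention(torch.randn(4, 64), torch.randn(3, 128),
                 torch.randn(128, 64), torch.randn(128, 64))
```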
Self-Attention Layer

A special case: one query per input vector. Instead of taking queries as a separate input, the layer predicts them from the input vectors with a learned query matrix.

Inputs:
- Input vectors: X (shape: N_X x D_X)
- Key matrix: W_K (shape: D_X x D_Q)
- Value matrix: W_V (shape: D_X x D_V)
- Query matrix: W_Q (shape: D_X x D_Q)

Computation:
- Query vectors: Q = X W_Q (shape: N_X x D_Q)
- Key vectors: K = X W_K (shape: N_X x D_Q)
- Value vectors: V = X W_V (shape: N_X x D_V)
- Similarities: E = QK^T / sqrt(D_Q) (shape: N_X x N_X), E_{i,j} = (Q_i · K_j) / sqrt(D_Q)
- Attention weights: A = softmax(E, dim=1) (shape: N_X x N_X)
- Output vectors: Y = AV (shape: N_X x D_V), Y_i = sum_j A_{i,j} V_j

[Diagram: each input X_i produces a query Q_i, key K_i, and value V_i; all pairwise similarities E_{i,j} are computed, a softmax gives the attention weights A_{i,j}, and each output Y_i is the weighted sum of the values ("Product, then Sum").]
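A minimal module-style sketch of the self-attention layer, assuming PyTorch; the class name and dimensions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Self-attention layer as defined above: one query per input vector (sketch)."""

    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.W_Q = nn.Linear(d_x, d_q, bias=False)   # query matrix
        self.W_K = nn.Linear(d_x, d_q, bias=False)   # key matrix
        self.W_V = nn.Linear(d_x, d_v, bias=False)   # value matrix
        self.scale = d_q ** 0.5

    def forward(self, X):                            # X: (N_X, d_x)
        Q = self.W_Q(X)                              # (N_X, d_q)
        K = self.W_K(X)                              # (N_X, d_q)
        V = self.W_V(X)                              # (N_X, d_v)
        E = Q @ K.T / self.scale                     # (N_X, N_X)
        A = F.softmax(E, dim=1)                      # weights over inputs
        return A @ V                                 # (N_X, d_v)

Y = SelfAttention(64, 64, 64)(torch.randn(3, 64))
```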
Consider permuting the input vectors:
- The queries and keys will be the same, but permuted
- The similarities will be the same, but permuted
- The attention weights will be the same, but permuted
- The values will be the same, but permuted
- So the outputs will be the same, but permuted

The self-attention layer is permutation equivariant: f(s(x)) = s(f(x)) for any permutation s. In other words, a self-attention layer works on sets of vectors.
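A quick numerical check of the permutation-equivariance claim, assuming PyTorch; the projection matrices here are random stand-ins rather than trained weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 16
W_Q, W_K, W_V = (torch.randn(D, D) for _ in range(3))

def self_attn(X):
    # Same computation as the self-attention layer above
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = F.softmax(Q @ K.T / D ** 0.5, dim=1)
    return A @ V

X = torch.randn(5, D)
perm = torch.randperm(5)
# Permutation equivariance: f(s(X)) == s(f(X))
assert torch.allclose(self_attn(X[perm]), self_attn(X)[perm], atol=1e-5)
```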
Self-attention doesn't "know" the order of the vectors it is processing! To make processing position-aware, concatenate the input with a positional encoding E(i). E can be a learned lookup table, or a fixed function of the position.
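A sketch of the two options for the positional encoding E, assuming PyTorch; the sinusoidal form shown for the fixed function is one common choice rather than a detail from the slide, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

N, D_X, D_E = 3, 64, 16

# Option 1: learned lookup table of positional encodings E(1..N)
pos_table = nn.Embedding(N, D_E)
E_learned = pos_table(torch.arange(N))                     # (N, D_E)

# Option 2: fixed function of position (sinusoidal, a common choice)
pos = torch.arange(N, dtype=torch.float32).unsqueeze(1)    # (N, 1)
freq = 10000.0 ** (-torch.arange(0, D_E, 2) / D_E)         # (D_E/2,)
E_fixed = torch.zeros(N, D_E)
E_fixed[:, 0::2] = torch.sin(pos * freq)
E_fixed[:, 1::2] = torch.cos(pos * freq)

# Concatenate the encoding onto each input vector before self-attention
X = torch.randn(N, D_X)
X_pos = torch.cat([X, E_fixed], dim=1)                     # (N, D_X + D_E)
```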
Masked Self-Attention Layer

Don't let vectors "look ahead" in the sequence: set E_{i,j} = -infinity whenever position i would attend to a later position j, so the corresponding attention weights become 0 after the softmax. This is used for language modeling (predicting the next word): given "[START] Big cat", the model predicts "Big cat [END]".
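A minimal sketch of the masking trick, assuming PyTorch: scores for "future" positions are set to negative infinity before the softmax, so their attention weights come out as zero.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(X, W_Q, W_K, W_V):
    """Causal self-attention (sketch): position i may only attend to j <= i."""
    N, _ = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / Q.shape[1] ** 0.5                          # (N, N) similarities
    mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
    E = E.masked_fill(mask, float('-inf'))                   # E_{i,j} = -inf for j > i
    A = F.softmax(E, dim=1)                                  # masked entries become 0
    return A @ V

D = 32
Y = masked_self_attention(torch.randn(4, D), *(torch.randn(D, D) for _ in range(3)))
```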
Multihead Self-Attention Layer

Use H independent "attention heads" in parallel: split the input vectors into H chunks along the channel dimension, run a separate self-attention layer on each chunk, then concatenate the outputs.

Hyperparameters: query dimension D_Q, number of heads H.
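A minimal sketch of multihead self-attention, assuming PyTorch; the single stacked QKV projection and the final output projection are common implementation choices, not details taken from the slide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """H attention heads run in parallel on split channels, then concatenated (sketch)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # W_Q, W_K, W_V stacked
        self.proj = nn.Linear(d_model, d_model)                  # mixes the concatenated heads

    def forward(self, X):                                        # X: (N, d_model)
        N, _ = X.shape
        Q, K, V = self.qkv(X).chunk(3, dim=-1)
        # Split channels into heads: (H, N, d_head)
        Q, K, V = (t.reshape(N, self.h, self.d_head).transpose(0, 1) for t in (Q, K, V))
        A = F.softmax(Q @ K.transpose(1, 2) / self.d_head ** 0.5, dim=-1)  # (H, N, N)
        Y = (A @ V).transpose(0, 1).reshape(N, -1)               # concat heads -> (N, d_model)
        return self.proj(Y)

Y = MultiHeadSelfAttention(64, 8)(torch.randn(3, 64))
```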
Example: CNN with Self-Attention

Starting from an input image, a CNN computes a feature map of shape C x H x W. A self-attention module then processes it:
- Queries, keys, and values: three 1x1 convolutions map the features to C' x H x W each
- Attention weights: the queries and transposed keys are multiplied and passed through a softmax, giving a (H x W) x (H x W) matrix of attention weights
- The attention weights combine the values into a new C' x H x W feature map, and another 1x1 convolution maps it back to C x H x W
- A residual connection adds the result to the input features

Cat image is free to use under the Pixabay License.
Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018
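A minimal sketch of this self-attention module, assuming PyTorch; the class name and the reduced channel count `c_reduced` are arbitrary, and the sqrt scaling is omitted to mirror the diagram.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSelfAttention(nn.Module):
    """Self-attention over a C x H x W feature map with 1x1 convs and a
    residual connection, following the module described above (sketch)."""

    def __init__(self, c, c_reduced):
        super().__init__()
        self.query = nn.Conv2d(c, c_reduced, kernel_size=1)   # queries: C' x H x W
        self.key = nn.Conv2d(c, c_reduced, kernel_size=1)     # keys:    C' x H x W
        self.value = nn.Conv2d(c, c_reduced, kernel_size=1)   # values:  C' x H x W
        self.out = nn.Conv2d(c_reduced, c, kernel_size=1)     # back to C x H x W

    def forward(self, x):                                     # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.query(x).flatten(2)                          # (B, C', H*W)
        k = self.key(x).flatten(2)                            # (B, C', H*W)
        v = self.value(x).flatten(2)                          # (B, C', H*W)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)       # (B, H*W, H*W) attention weights
        y = (v @ attn.transpose(1, 2)).reshape(B, -1, H, W)   # weighted values, (B, C', H, W)
        return x + self.out(y)                                # residual connection

y = ConvSelfAttention(64, 8)(torch.randn(2, 64, 7, 7))
```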
Three Ways of Processing Sequences

Recurrent Neural Network: works on ordered sequences.
(+) Good at long sequences: after one RNN layer, h_T "sees" the whole sequence.
(-) Not parallelizable: the hidden states must be computed sequentially.

1D Convolution: works on multidimensional grids.
(-) Bad at long sequences: you need to stack many conv layers for outputs to "see" the whole sequence.
(+) Highly parallel: each output can be computed in parallel.

Self-Attention: works on sets of vectors.
(+) Good at long sequences: after one self-attention layer, each output "sees" all inputs!
(+) Highly parallel: each output can be computed in parallel.
(-) Very memory intensive.

"Attention is all you need" - Vaswani et al, NeurIPS 2017
The Transformer

Build a block from a set of input vectors x_1..x_4:
- Self-attention: all vectors interact with each other
- Residual connection around the self-attention
- Layer normalization
- MLP applied independently to each vector
- Residual connection around the MLP
- Layer normalization, producing outputs y_1..y_4

Recall Layer Normalization (Ba et al, 2016):
Given h_1, ..., h_N (shape: D), a learned scale gamma (shape: D), and a learned shift beta (shape: D):
mu_i = (sum_j h_{i,j}) / D (scalar)
sigma_i = ((sum_j (h_{i,j} - mu_i)^2) / D)^{1/2} (scalar)
z_i = (h_i - mu_i) / sigma_i
y_i = gamma * z_i + beta

Vaswani et al, "Attention is all you need", NeurIPS 2017
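A direct transcription of the layer normalization equations above into code, assuming PyTorch; the small epsilon added inside the square root is a standard numerical-stability tweak not shown on the slide.

```python
import torch

def layer_norm(h, gamma, beta, eps=1e-5):
    """Layer normalization as recalled above (Ba et al, 2016).

    h:     (N, D) batch of vectors; statistics are computed per vector, over D
    gamma: (D,) learned scale
    beta:  (D,) learned shift
    """
    mu = h.mean(dim=1, keepdim=True)                                     # mu_i
    sigma = h.var(dim=1, unbiased=False, keepdim=True).add(eps).sqrt()   # sigma_i
    z = (h - mu) / sigma
    return gamma * z + beta

D = 64
y = layer_norm(torch.randn(4, D), torch.ones(D), torch.zeros(D))
```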
The Transformer Block:
- Input: a set of vectors x; Output: a set of vectors y
- Self-attention is the only interaction between vectors!
- Layer norm and the MLP work independently per vector
- Highly scalable, highly parallelizable

Vaswani et al, "Attention is all you need", NeurIPS 2017
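A minimal sketch of one transformer block in the post-norm arrangement drawn above, assuming PyTorch (version with `batch_first`); `nn.MultiheadAttention` stands in for the self-attention layer, and the MLP sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Transformer block as drawn above (post-norm; sketch):
    self-attention -> residual -> LayerNorm -> per-vector MLP -> residual -> LayerNorm."""

    def __init__(self, d_model, num_heads, d_mlp):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.ReLU(),
                                 nn.Linear(d_mlp, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (B, N, d_model), a set of vectors per example
        a, _ = self.attn(x, x, x)              # self-attention: the only interaction across vectors
        x = self.norm1(x + a)                  # residual connection + layer norm
        x = self.norm2(x + self.mlp(x))        # MLP applied independently to each vector
        return x

y = TransformerBlock(512, 8, 2048)(torch.randn(2, 4, 512))
```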
A Transformer is a sequence of transformer blocks (Vaswani et al: 12 blocks, D_Q = 512, 6 heads).

Vaswani et al, "Attention is all you need", NeurIPS 2017
The Transformer: Transfer Learning

The "ImageNet moment for natural language processing":
- Pretraining: download a lot of text from the internet and train a giant Transformer model for language modeling
- Finetuning: fine-tune the Transformer on your own NLP task

Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018
Scaling up Transformers

Model             | Layers | Width | Heads | Params | Data   | Training
Transformer-Base  | 12     | 512   | 8     | 65M    |        | 8x P100 (12 hours)
Transformer-Large | 12     | 1024  | 16    | 213M   |        | 8x P100 (3.5 days)
BERT-Base         | 12     | 768   | 12    | 110M   | 13 GB  |
BERT-Large        | 24     | 1024  | 16    | 340M   | 13 GB  |
XLNet-Large       | 24     | 1024  | 16    | ~340M  | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa           | 24     | 1024  | 16    | 355M   | 160 GB | 1024x V100 GPU (1 day)
GPT-2             | 48     | 1600  | ?     | 1.5B   | 40 GB  |
Megatron-LM       | 72     | 3072  | 32    | 8.3B   | 174 GB | 512x V100 GPU (9 days)

Training Megatron-LM would cost roughly $430,000 on Amazon AWS!

Vaswani et al, "Attention is all you need", NeurIPS 2017
Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018
Yang et al, "XLNet: Generalized Autoregressive Pretraining for Language Understanding", 2019
Liu et al, "RoBERTa: A Robustly Optimized BERT Pretraining Approach", 2019
Radford et al, "Language models are unsupervised multitask learners", 2019
Shoeybi et al, "Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism", 2019