
Lecture 13: Attention. Justin Johnson, October 23, 2019.

Lecture 13: Attention (Justin Johnson, October 23, 2019)
Midterm: grades will be out in ~1 week. Please do not discuss midterm questions on Piazza. Someone left a water bottle in the exam room; post on Piazza if it is yours.


  1. Sequence-to-Sequence with RNNs and Attention
  Repeat: use s_1 to compute a new context vector c_2; use c_2 to compute s_2 and y_2.
  Intuition: the context vector attends to the relevant part of the input sequence. "comiendo" = "eating", so maybe a_{2,1} = a_{2,4} = 0.05, a_{2,2} = 0.1, a_{2,3} = 0.8.
  Diagram: encoder states h_1..h_4 over "we are eating bread"; decoder states s_0, s_1, s_2 with inputs [START], "estamos" and outputs "estamos", "comiendo"; alignment scores e_{2,1}..e_{2,4} go through a softmax to give a_{2,1}..a_{2,4}.
  Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015

  2. Sequence-to-Sequence with RNNs and Attention
  Use a different context vector in each timestep of the decoder:
  - Input sequence is not bottlenecked through a single vector
  - At each timestep of the decoder, the context vector "looks at" different parts of the input sequence
  Diagram: encoder states h_1..h_4 over "we are eating bread"; decoder states s_1..s_4 consume context vectors c_1..c_4 and inputs [START], "estamos", "comiendo", "pan" to produce "estamos comiendo pan [STOP]". (A sketch of this computation follows below.)
  Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
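As a rough PyTorch sketch (not the lecture's code; the RNN update that consumes c_t is omitted), one decoder timestep with attention might look like this. The alignment function f_att is passed in by the caller; the usage line uses a plain dot product, whereas Bahdanau et al. use a small MLP:

```python
import torch
import torch.nn.functional as F

def decoder_step_with_attention(s_prev, enc_h, f_att):
    """One decoder timestep of seq2seq-with-attention (minimal sketch).

    s_prev: (D_s,)    previous decoder hidden state s_{t-1}
    enc_h:  (T, D_h)  encoder hidden states h_1..h_T
    f_att:  callable giving a scalar alignment score e_{t,i} = f_att(s_prev, h_i)
    Returns the context vector c_t and the attention weights a_t.
    """
    # Alignment scores e_{t,i} = f_att(s_{t-1}, h_i), one per input position
    e = torch.stack([f_att(s_prev, h_i) for h_i in enc_h])   # (T,)
    a = F.softmax(e, dim=0)                                  # attention weights, sum to 1
    c = (a.unsqueeze(1) * enc_h).sum(dim=0)                  # c_t = sum_i a_{t,i} h_i
    return c, a

# Tiny usage example with a dot-product alignment (assumes D_s == D_h)
c, a = decoder_step_with_attention(torch.randn(8), torch.randn(4, 8),
                                   f_att=lambda s, h: s @ h)
print(a, c.shape)
```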

  3. Sequence-to-Sequence with RNNs and Attention
  Visualize the attention weights a_{t,i}.
  Example: English to French translation.
  Input: "The agreement on the European Economic Area was signed in August 1992."
  Output: "L'accord sur la zone économique européenne a été signé en août 1992."
  Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015

  4. Sequence-to-Sequence with RNNs and Attention
  Visualize the attention weights a_{t,i} for the English-to-French example above.
  Diagonal attention means words correspond in order.
  Bahdanau et al, ICLR 2015

  5. Sequence-to-Sequence with RNNs and Attention
  Visualize the attention weights a_{t,i} for the English-to-French example above.
  Diagonal attention means words correspond in order; attention figures out different word orders.
  Bahdanau et al, ICLR 2015

  6. Sequence-to-Sequence with RNNs and Attention
  Visualize the attention weights a_{t,i} for the English-to-French example above.
  Diagonal attention means words correspond in order; attention figures out different word orders; it also handles verb conjugation.
  Bahdanau et al, ICLR 2015

  7. Sequence-to-Sequence with RNNs and Attention
  The decoder doesn't use the fact that the h_i form an ordered sequence; it just treats them as an unordered set {h_i}.
  Can use a similar architecture given any set of input hidden vectors {h_i}!
  Bahdanau et al, ICLR 2015

  8. Image Captioning with RNNs and Attention
  Use a CNN to compute a grid of features for an image (diagram: the CNN produces a 3x3 grid of feature vectors h_{1,1}..h_{3,3}; the decoder starts from state s_0).
  Cat image is free to use under the Pixabay License.
  Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

  9. Image Captioning with RNNs and Attention
  Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j}), one score per grid position.
  Xu et al, ICML 2015

  10. Image Captioning with RNNs and Attention
  Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
  Attention weights: a_{t,:,:} = softmax(e_{t,:,:}) (softmax over all grid positions)
  Xu et al, ICML 2015

  11. Image Captioning with RNNs and Attention
  Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
  Attention weights: a_{t,:,:} = softmax(e_{t,:,:})
  Context vector: c_t = ∑_{i,j} a_{t,i,j} h_{i,j}
  Xu et al, ICML 2015
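The three equations above translate almost directly into code. A minimal sketch (illustrative shapes and a caller-supplied f_att, not the paper's exact model):

```python
import torch
import torch.nn.functional as F

def spatial_attention_step(s_prev, feat_grid, f_att):
    """One attention step over a CNN feature grid (minimal sketch).

    s_prev:    (D_s,)       previous decoder state s_{t-1}
    feat_grid: (H, W, D_h)  CNN features h_{i,j}
    f_att:     scoring function giving e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
    Returns context vector c_t (D_h,) and attention map a_t (H, W).
    """
    H, W, D = feat_grid.shape
    flat = feat_grid.reshape(H * W, D)
    e = torch.stack([f_att(s_prev, h) for h in flat])   # (H*W,) alignment scores
    a = F.softmax(e, dim=0).reshape(H, W)               # attention weights over the grid
    c = (a.reshape(H * W, 1) * flat).sum(dim=0)         # c_t = sum_{i,j} a_{t,i,j} h_{i,j}
    return c, a

# e.g. (dot-product scoring, assumes D_s == D_h):
# c, a = spatial_attention_step(torch.randn(512), torch.randn(3, 3, 512), lambda s, h: s @ h)
```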

  12. Image Captioning with RNNs and Attention
  The decoder uses the context vector c_1 and input y_0 = [START] to compute state s_1 and the first output word y_1 = "cat".
  (Alignment scores, attention weights, and context vector computed as above.)
  Xu et al, ICML 2015

  13. Image Captioning with RNNs and Attention
  Same computation: e_{t,i,j} = f_att(s_{t-1}, h_{i,j}), a_{t,:,:} = softmax(e_{t,:,:}), c_t = ∑_{i,j} a_{t,i,j} h_{i,j}.
  Diagram: the decoder has produced y_1 = "cat" from s_1, using c_1 and y_0 = [START].
  Xu et al, ICML 2015

  14. Image Captioning with RNNs and Attention
  At the second timestep, compute new alignment scores e_{2,i,j} = f_att(s_1, h_{i,j}) over the feature grid.
  Xu et al, ICML 2015

  15. Image Captioning with RNNs and Attention
  ...and new attention weights a_{2,:,:} = softmax(e_{2,:,:}).
  Xu et al, ICML 2015

  16. Image Captioning with RNNs and Attention
  ...giving a new context vector c_2 = ∑_{i,j} a_{2,i,j} h_{i,j}.
  Xu et al, ICML 2015

  17. Image Captioning with RNNs and Attention
  The decoder uses c_2 and y_1 = "cat" to compute s_2 and the second word y_2 = "sitting".
  Xu et al, ICML 2015

  18. Image Captioning with RNNs and Attention
  Each timestep of the decoder uses a different context vector that looks at different parts of the input image.
  e_{t,i,j} = f_att(s_{t-1}, h_{i,j}); a_{t,:,:} = softmax(e_{t,:,:}); c_t = ∑_{i,j} a_{t,i,j} h_{i,j}
  Diagram: the decoder produces "cat sitting outside [STOP]" using context vectors c_1..c_4.
  Xu et al, ICML 2015

  19. Image Captioning with RNNs and Attention
  (Example figure from the paper: generated captions with the attention maps used for each word.)
  Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

  20. Image Captioning with RNNs and Attention
  (Further example figures from the paper.)
  Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

  21. Human Vision: Fovea
  Light enters the eye; the retina detects light.
  Acuity graph is licensed under CC A-SA 3.0 Unported.

  22. Human Vision: Fovea
  The fovea is a tiny region of the retina that can see with high acuity.
  Light enters the eye; the retina detects light.
  Acuity graph is licensed under CC A-SA 3.0 Unported (no changes made). Eye image is licensed under CC A-SA 3.0 Unported (added black arrow, green arc, and white circle).

  23. Human Vision: Saccades
  The fovea is a tiny region of the retina that can see with high acuity. Human eyes are constantly moving (saccades) so we don't notice.
  Saccade video is licensed under CC A-SA 4.0 International (no changes made). Acuity graph is licensed under CC A-SA 3.0 Unported (no changes made).

  24. Image Captioning with RNNs and Attention
  The attention weights at each timestep are kind of like the saccades of the human eye.
  Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015. Saccade video is licensed under CC A-SA 4.0 International (no changes made).

  25. X, Attend, and Y
  "Show, attend, and tell" (Xu et al, ICML 2015): look at an image, attend to image regions, produce a caption.
  "Ask, attend, and answer" (Xu and Saenko, ECCV 2016) and "Show, ask, attend, and answer" (Kazemi and Elqursh, 2017): read the text of a question, attend to image regions, produce an answer.
  "Listen, attend, and spell" (Chan et al, ICASSP 2016): process raw audio, attend to audio regions while producing text.
  "Listen, attend, and walk" (Mei et al, AAAI 2016): process text, attend to text regions, output navigation commands.
  "Show, attend, and interact" (Qureshi et al, ICRA 2017): process an image, attend to image regions, output robot control commands.
  "Show, attend, and read" (Li et al, AAAI 2019): process an image, attend to image regions, output text.

  26. Attention Layer
  Inputs:
  - Query vector: q (shape: D_Q)
  - Input vectors: X (shape: N_X x D_X)
  - Similarity function: f_att
  Computation:
  - Similarities: e (shape: N_X), e_i = f_att(q, X_i)
  - Attention weights: a = softmax(e) (shape: N_X)
  - Output vector: y = ∑_i a_i X_i (shape: D_X)
  (The diagram recaps the image-captioning example: e_{t,i,j} = f_att(s_{t-1}, h_{i,j}), a_{t,:,:} = softmax(e_{t,:,:}), c_t = ∑_{i,j} a_{t,i,j} h_{i,j}.)

  27. Attention Layer
  Inputs:
  - Query vector: q (shape: D_Q)
  - Input vectors: X (shape: N_X x D_Q)
  - Similarity function: dot product
  Computation:
  - Similarities: e (shape: N_X), e_i = q · X_i
  - Attention weights: a = softmax(e) (shape: N_X)
  - Output vector: y = ∑_i a_i X_i (shape: D_X)
  Changes: use dot product for similarity.

  28. Attention Layer
  Inputs:
  - Query vector: q (shape: D_Q)
  - Input vectors: X (shape: N_X x D_Q)
  - Similarity function: scaled dot product
  Computation:
  - Similarities: e (shape: N_X), e_i = q · X_i / sqrt(D_Q)
  - Attention weights: a = softmax(e) (shape: N_X)
  - Output vector: y = ∑_i a_i X_i (shape: D_X)
  Changes: use scaled dot product for similarity.

  29. Attention Layer
  Why scale? Large similarities will cause the softmax to saturate and give vanishing gradients.
  Recall a · b = |a| |b| cos(angle). Suppose a and b are constant vectors of dimension D (every entry equal to a); then |a| = (∑_i a^2)^{1/2} = a sqrt(D), so dot products grow with the dimension. Dividing by sqrt(D_Q) keeps the scores in a reasonable range.
  Changes: use scaled dot product for similarity: e_i = q · X_i / sqrt(D_Q).
  (Inputs and the rest of the computation as on the previous slide.)
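A tiny numerical illustration (mine, not from the slides) of why this matters: with large-magnitude scores the softmax saturates to a one-hot vector and the gradient through it essentially vanishes, which is what the sqrt(D_Q) scaling guards against.

```python
import torch
import torch.nn.functional as F

# Two sets of alignment scores pointing the same way, one with much larger magnitude.
e_small = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
e_large = torch.tensor([100.0, 200.0, 300.0], requires_grad=True)

F.softmax(e_small, dim=0)[0].backward()
F.softmax(e_large, dim=0)[0].backward()

print(F.softmax(e_small, dim=0))   # ~[0.09, 0.24, 0.67] -- informative weights
print(F.softmax(e_large, dim=0))   # ~[0., 0., 1.]       -- saturated (one-hot)
print(e_small.grad)                # small but nonzero gradient
print(e_large.grad)                # ~zero gradient: the learning signal vanishes
```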

  30. Attention Layer
  Inputs:
  - Query vectors: Q (shape: N_Q x D_Q)
  - Input vectors: X (shape: N_X x D_Q)
  Computation:
  - Similarities: E = Q X^T (shape: N_Q x N_X), E_{i,j} = Q_i · X_j / sqrt(D_Q)
  - Attention weights: A = softmax(E, dim=1) (shape: N_Q x N_X)
  - Output vectors: Y = A X (shape: N_Q x D_X), Y_i = ∑_j A_{i,j} X_j
  Changes: use dot product for similarity; multiple query vectors.

  31. Attention Layer
  Inputs:
  - Query vectors: Q (shape: N_Q x D_Q)
  - Input vectors: X (shape: N_X x D_X)
  - Key matrix: W_K (shape: D_X x D_Q)
  - Value matrix: W_V (shape: D_X x D_V)
  Computation:
  - Key vectors: K = X W_K (shape: N_X x D_Q)
  - Value vectors: V = X W_V (shape: N_X x D_V)
  - Similarities: E = Q K^T (shape: N_Q x N_X), E_{i,j} = Q_i · K_j / sqrt(D_Q)
  - Attention weights: A = softmax(E, dim=1) (shape: N_Q x N_X)
  - Output vectors: Y = A V (shape: N_Q x D_V), Y_i = ∑_j A_{i,j} V_j
  Changes: use dot product for similarity; multiple query vectors; separate key and value.
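Written out, the full attention layer on this slide is just a few matrix multiplies. A minimal PyTorch sketch using the slide's shapes:

```python
import math
import torch

def attention_layer(Q, X, W_K, W_V):
    """General attention layer following the slide's shapes (minimal sketch).

    Q:   (N_Q, D_Q)  query vectors
    X:   (N_X, D_X)  input vectors
    W_K: (D_X, D_Q)  key matrix
    W_V: (D_X, D_V)  value matrix
    Returns Y of shape (N_Q, D_V).
    """
    K = X @ W_K                               # keys   (N_X, D_Q)
    V = X @ W_V                               # values (N_X, D_V)
    E = Q @ K.t() / math.sqrt(Q.shape[1])     # similarities (N_Q, N_X)
    A = torch.softmax(E, dim=1)               # attention weights; each row sums to 1
    return A @ V                              # outputs (N_Q, D_V)

# Example shapes matching the slide's notation
N_Q, N_X, D_Q, D_X, D_V = 4, 3, 8, 16, 32
Y = attention_layer(torch.randn(N_Q, D_Q), torch.randn(N_X, D_X),
                    torch.randn(D_X, D_Q), torch.randn(D_X, D_V))
print(Y.shape)  # torch.Size([4, 32])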

  32. Attention Layer (diagram walkthrough)
  Input vectors X_1, X_2, X_3 enter along one axis of the grid and query vectors Q_1..Q_4 along the other.
  (Inputs and computation as on the previous slide.)

  33. Attention Layer (diagram walkthrough)
  Each input X_i is transformed into a key K_i = X_i W_K.

  34. Attention Layer (diagram walkthrough)
  Each query/key pair gives a similarity E_{i,j} = Q_i · K_j / sqrt(D_Q), filling the N_Q x N_X grid of alignment scores.

  35. Attention Layer (diagram walkthrough)
  A softmax over the key dimension turns the scores into attention weights A_{i,j} (each query's weights sum to 1).

  36. Attention Layer (diagram walkthrough)
  Each input also produces a value V_i = X_i W_V.

  37. Attention Layer (diagram walkthrough)
  Product(→), Sum(↑): each output is a weighted sum of the values, Y_i = ∑_j A_{i,j} V_j, giving Y_1..Y_4.

  38. Self-Attention Layer
  One query per input vector.
  Inputs:
  - Input vectors: X (shape: N_X x D_X)
  - Key matrix: W_K (shape: D_X x D_Q)
  - Value matrix: W_V (shape: D_X x D_V)
  - Query matrix: W_Q (shape: D_X x D_Q)
  Computation:
  - Query vectors: Q = X W_Q
  - Key vectors: K = X W_K (shape: N_X x D_Q)
  - Value vectors: V = X W_V (shape: N_X x D_V)
  - Similarities: E = Q K^T (shape: N_X x N_X), E_{i,j} = Q_i · K_j / sqrt(D_Q)
  - Attention weights: A = softmax(E, dim=1) (shape: N_X x N_X)
  - Output vectors: Y = A V (shape: N_X x D_V), Y_i = ∑_j A_{i,j} V_j
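The same computation with queries also derived from X, wrapped as a module. This is a minimal single-head sketch (no masking, no output projection):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Self-attention layer as on the slide (minimal sketch; single head, no mask)."""

    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.W_Q = nn.Linear(d_x, d_q, bias=False)   # query matrix W_Q
        self.W_K = nn.Linear(d_x, d_q, bias=False)   # key matrix W_K
        self.W_V = nn.Linear(d_x, d_v, bias=False)   # value matrix W_V

    def forward(self, X):                            # X: (N_X, D_X)
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        E = Q @ K.t() / math.sqrt(Q.shape[-1])       # (N_X, N_X) similarities
        A = torch.softmax(E, dim=1)                  # attention weights per query
        return A @ V                                 # (N_X, D_V) outputs

# Usage
layer = SelfAttention(d_x=16, d_q=8, d_v=16)
print(layer(torch.randn(3, 16)).shape)               # torch.Size([3, 16])
```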

  39. Self-Attention Layer (diagram walkthrough)
  One query per input vector: each input X_i produces a query Q_i = X_i W_Q.
  (Inputs and computation as on the previous slide.)

  40. Self-Attention Layer (diagram walkthrough)
  Each input X_i also produces a key K_i = X_i W_K.

  41. Self-Attention Layer (diagram walkthrough)
  Queries and keys give an N_X x N_X grid of similarities E_{i,j} = Q_i · K_j / sqrt(D_Q).

  42. Self-Attention Layer (diagram walkthrough)
  Softmax(↑): a softmax over the key dimension gives the attention weights A.

  43. Self-Attention Layer (diagram walkthrough)
  Each input also produces a value V_i = X_i W_V.

  44. Self-Attention Layer (diagram walkthrough)
  Product(→), Sum(↑): outputs Y_1, Y_2, Y_3 are weighted sums of the values, Y_i = ∑_j A_{i,j} V_j.

  45. Self-Attention Layer
  Consider permuting the input vectors: the same layer now sees X_3, X_1, X_2.
  (Inputs and computation as above.)

  46. Self-Attention Layer
  Consider permuting the input vectors: queries and keys will be the same, but permuted.

  47. Self-Attention Layer
  Consider permuting the input vectors: similarities will be the same, but permuted.

  48. Self-Attention Layer
  Consider permuting the input vectors: attention weights will be the same, but permuted.

  49. Self-Attention Layer
  Consider permuting the input vectors: values will be the same, but permuted.

  50. Self-Attention Layer
  Consider permuting the input vectors: outputs will be the same, but permuted.

  51. Self-Attention Layer
  Consider permuting the input vectors: the outputs will be the same, but permuted.
  The self-attention layer is permutation equivariant: f(s(x)) = s(f(x)).
  The self-attention layer works on sets of vectors.
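Permutation equivariance is easy to check numerically. This small test reuses the SelfAttention sketch from the earlier block (so that definition is assumed to be in scope):

```python
import torch

# Check f(s(x)) = s(f(x)) for the SelfAttention sketch defined above.
torch.manual_seed(0)
layer = SelfAttention(d_x=16, d_q=8, d_v=16)
X = torch.randn(5, 16)
perm = torch.randperm(5)

out_then_perm = layer(X)[perm]    # s(f(x))
perm_then_out = layer(X[perm])    # f(s(x))
print(torch.allclose(out_then_perm, perm_then_out, atol=1e-6))  # True
```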

  52. Self-Attention Layer
  Self-attention doesn't "know" the order of the vectors it is processing!
  (Inputs and computation as above.)

  53. Self-Attention Layer
  Self-attention doesn't "know" the order of the vectors it is processing!
  In order to make processing position-aware, concatenate the input with a positional encoding E(i).
  E can be a learned lookup table, or a fixed function (a sketch follows below).
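The slide leaves the choice of encoding open; below is a sketch of one common fixed function (the sinusoidal encoding used by Vaswani et al.), concatenated onto the inputs as the slide suggests. A learned nn.Embedding lookup table would work just as well:

```python
import math
import torch

def sinusoidal_positional_encoding(num_positions, dim):
    """One common 'fixed function' positional encoding; row i encodes position i.
    Assumes dim is even. Returns a (num_positions, dim) tensor."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (P, 1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)                    # (dim/2,)
    freq = torch.exp(-math.log(10000.0) * idx / dim)                      # decreasing frequencies
    enc = torch.zeros(num_positions, dim)
    enc[:, 0::2] = torch.sin(pos * freq)
    enc[:, 1::2] = torch.cos(pos * freq)
    return enc

# Make the inputs position-aware by concatenating the encoding, as on the slide
X = torch.randn(3, 16)
E = sinusoidal_positional_encoding(num_positions=3, dim=16)
X_pos = torch.cat([X, E], dim=1)   # (3, 32): each row now carries its position
```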

  54. Masked Self-Attention Layer
  Don't let vectors "look ahead" in the sequence: set E_{i,j} = -∞ wherever position i would attend to a later position j, so the corresponding attention weights A_{i,j} become 0.
  (Inputs and computation otherwise as in the self-attention layer above.)
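A minimal sketch of the masking step: fill the upper triangle of E with -∞ before the softmax, so each position can only attend to itself and earlier positions:

```python
import math
import torch

def masked_self_attention(X, W_Q, W_K, W_V):
    """Masked (causal) self-attention sketch: position i may only attend to j <= i.

    X: (N, D_X); W_Q, W_K: (D_X, D_Q); W_V: (D_X, D_V)
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.t() / math.sqrt(Q.shape[-1])                  # (N, N) similarities
    N = X.shape[0]
    future = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
    E = E.masked_fill(future, float('-inf'))                # "don't look ahead": E_{i,j} = -inf for j > i
    A = torch.softmax(E, dim=1)                             # masked weights come out exactly 0
    return A @ V
```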

  55. Masked Self-Attention Layer
  Don't let vectors "look ahead" in the sequence. Used for language modeling (predict the next word): given the inputs [START] Big cat, the outputs are Big cat [END].

  56. Multihead Self-Attention Layer
  Use H independent "attention heads" in parallel: split the input, run H self-attention layers side by side, and concatenate their outputs.
  Hyperparameters: query dimension D_Q, number of heads H.
  (Inputs and computation per head as in the self-attention layer above.)
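One common way to realize the split/concatenate picture is to project X once and reshape the feature dimension into H heads. A minimal sketch (without the output projection that full implementations usually add):

```python
import math
import torch
import torch.nn as nn

class MultiheadSelfAttention(nn.Module):
    """Minimal multihead self-attention sketch: H heads run in parallel and
    their outputs are concatenated."""

    def __init__(self, d_x, num_heads):
        super().__init__()
        assert d_x % num_heads == 0
        self.h, self.d_head = num_heads, d_x // num_heads
        self.W_Q = nn.Linear(d_x, d_x, bias=False)
        self.W_K = nn.Linear(d_x, d_x, bias=False)
        self.W_V = nn.Linear(d_x, d_x, bias=False)

    def forward(self, X):                                    # X: (N, D_X)
        N = X.shape[0]
        # Project, then split the feature dimension into H heads: (H, N, d_head)
        Q = self.W_Q(X).reshape(N, self.h, self.d_head).transpose(0, 1)
        K = self.W_K(X).reshape(N, self.h, self.d_head).transpose(0, 1)
        V = self.W_V(X).reshape(N, self.h, self.d_head).transpose(0, 1)
        E = Q @ K.transpose(1, 2) / math.sqrt(self.d_head)   # (H, N, N) similarities
        A = torch.softmax(E, dim=-1)                         # per-head attention weights
        Y = A @ V                                            # (H, N, d_head)
        return Y.transpose(0, 1).reshape(N, -1)              # concatenate heads -> (N, D_X)

print(MultiheadSelfAttention(d_x=32, num_heads=4)(torch.randn(6, 32)).shape)  # (6, 32)
```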

  57. Example: CNN with Self-Attention
  Input image → CNN → features: C x H x W.
  Cat image is free to use under the Pixabay License.
  Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018

  58. Example: CNN with Self-Attention
  1x1 convolutions on the C x H x W features produce queries, keys, and values, each of shape C' x H x W.
  Zhang et al, ICML 2018

  59. Example: CNN with Self-Attention
  Transposed queries multiplied with keys, followed by a softmax, give attention weights of shape (H x W) x (H x W): every position attends over every other position.
  Zhang et al, ICML 2018

  60. Example: CNN with Self-Attention
  Multiplying the values by the attention weights gives an output of shape C' x H x W.
  Zhang et al, ICML 2018

  61. Example: CNN with Self-Attention
  Another 1x1 convolution maps the attended features back to C x H x W.
  Zhang et al, ICML 2018

  62. Example: CNN with Self-Attention
  Residual connection: add the module's output to the input features (C x H x W). Everything inside the box is the Self-Attention Module.
  Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018
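Putting the pieces of this diagram together, a sketch of such a self-attention module over a CNN feature map might look like the following; the hidden channel count C' and other details are illustrative rather than the exact configuration from Zhang et al.:

```python
import torch
import torch.nn as nn

class CNNSelfAttention(nn.Module):
    """Self-attention over a CNN feature map, in the spirit of the slide's diagram
    (a sketch, not the paper's exact module)."""

    def __init__(self, c, c_hidden):
        super().__init__()
        self.q = nn.Conv2d(c, c_hidden, kernel_size=1)    # queries: C' x H x W
        self.k = nn.Conv2d(c, c_hidden, kernel_size=1)    # keys:    C' x H x W
        self.v = nn.Conv2d(c, c_hidden, kernel_size=1)    # values:  C' x H x W
        self.out = nn.Conv2d(c_hidden, c, kernel_size=1)  # map back to C x H x W

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2)                           # (B, C', H*W)
        k = self.k(x).flatten(2)                           # (B, C', H*W)
        v = self.v(x).flatten(2)                           # (B, C', H*W)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)    # (B, H*W, H*W) attention weights
        y = (v @ attn.transpose(1, 2)).reshape(B, -1, H, W)    # attend over all positions
        return x + self.out(y)                             # residual connection

print(CNNSelfAttention(c=64, c_hidden=8)(torch.randn(2, 64, 16, 16)).shape)  # (2, 64, 16, 16)
```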

  63. Three Ways of Processing Sequences
  Recurrent Neural Network: works on ordered sequences.
  (+) Good at long sequences: after one RNN layer, h_T "sees" the whole sequence.
  (-) Not parallelizable: need to compute hidden states sequentially.

  64. Three Ways of Processing Sequences
  Recurrent Neural Network: works on ordered sequences. (+) Good at long sequences: after one RNN layer, h_T "sees" the whole sequence. (-) Not parallelizable: need to compute hidden states sequentially.
  1D Convolution: works on multidimensional grids. (-) Bad at long sequences: need to stack many conv layers for outputs to "see" the whole sequence. (+) Highly parallel: each output can be computed in parallel.

  65. Three Ways of Processing Sequences
  Recurrent Neural Network: works on ordered sequences. (+) Good at long sequences: after one RNN layer, h_T "sees" the whole sequence. (-) Not parallelizable: need to compute hidden states sequentially.
  1D Convolution: works on multidimensional grids. (-) Bad at long sequences: need to stack many conv layers for outputs to "see" the whole sequence. (+) Highly parallel: each output can be computed in parallel.
  Self-Attention: works on sets of vectors. (+) Good at long sequences: after one self-attention layer, each output "sees" all inputs! (+) Highly parallel: each output can be computed in parallel. (-) Very memory intensive.

  66. Three Ways of Processing Sequences
  (Same comparison as the previous slide.) "Attention Is All You Need", Vaswani et al, NeurIPS 2017.

  67. The Transformer
  Inputs: a set of vectors x_1, x_2, x_3, x_4.
  Vaswani et al, "Attention is all you need", NeurIPS 2017

  68. The Transformer
  Self-Attention: all vectors interact with each other.
  Vaswani et al, NeurIPS 2017

  69. The Transformer
  Residual connection around the self-attention.
  Vaswani et al, NeurIPS 2017

  70. The Transformer
  Layer Normalization is applied after the residual connection.
  Recall Layer Normalization (Ba et al, 2016): given h_1, ..., h_N (shape: D), scale γ (shape: D), shift β (shape: D):
  - μ_i = (1/D) ∑_j h_{i,j} (scalar)
  - σ_i = ((1/D) ∑_j (h_{i,j} - μ_i)^2)^{1/2} (scalar)
  - z_i = (h_i - μ_i) / σ_i
  - y_i = γ * z_i + β
  Vaswani et al, "Attention is all you need", NeurIPS 2017
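As a sketch, the layer normalization recalled above (with a small eps added for numerical stability, as real implementations do):

```python
import torch

def layer_norm(h, gamma, beta, eps=1e-5):
    """Layer normalization over the feature dimension (sketch).

    h:     (N, D)  one row per vector
    gamma: (D,)    learned scale
    beta:  (D,)    learned shift
    """
    mu = h.mean(dim=1, keepdim=True)                            # per-vector mean (N, 1)
    sigma = h.var(dim=1, unbiased=False, keepdim=True).sqrt()   # per-vector std  (N, 1)
    z = (h - mu) / (sigma + eps)
    return gamma * z + beta

h = torch.randn(4, 16)
y = layer_norm(h, gamma=torch.ones(16), beta=torch.zeros(16))
print(y.mean(dim=1), y.std(dim=1, unbiased=False))              # ~0 and ~1 per row
```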

  71. The Transformer
  After Layer Normalization, an MLP is applied independently on each vector.
  (Layer Normalization recap as on the previous slide.)
  Vaswani et al, NeurIPS 2017

  72. The Transformer
  Another residual connection, around the MLP.
  Vaswani et al, NeurIPS 2017

  73. The Transformer
  A final Layer Normalization produces the outputs y_1..y_4.
  Vaswani et al, NeurIPS 2017

  74. The Transformer
  Transformer Block:
  - Input: set of vectors x
  - Output: set of vectors y
  Self-attention is the only interaction between vectors!
  Layer norm and MLP work independently per vector.
  Highly scalable, highly parallelizable.
  (Block structure: Self-Attention → residual → Layer Normalization → per-vector MLP → residual → Layer Normalization; a sketch follows below.)
  Vaswani et al, "Attention is all you need", NeurIPS 2017
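A sketch of one such block, reusing the MultiheadSelfAttention sketch from earlier (so that definition is assumed to be in scope); stacking several of these gives a Transformer:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm Transformer block as drawn on the slide (sketch): self-attention
    is the only interaction between vectors; LayerNorm and the MLP act per vector."""

    def __init__(self, d_model, num_heads, d_hidden):
        super().__init__()
        self.attn = MultiheadSelfAttention(d_model, num_heads)  # from the earlier sketch
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (N, d_model) -- a set of vectors
        x = self.norm1(x + self.attn(x))     # self-attention + residual + layer norm
        x = self.norm2(x + self.mlp(x))      # per-vector MLP + residual + layer norm
        return x

# A Transformer is just a sequence of such blocks
model = nn.Sequential(*[TransformerBlock(d_model=32, num_heads=4, d_hidden=64) for _ in range(2)])
print(model(torch.randn(6, 32)).shape)       # torch.Size([6, 32])
```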

  75. The Transformer
  A Transformer is a sequence of Transformer blocks.
  Transformer Block: input is a set of vectors x, output is a set of vectors y; self-attention is the only interaction between vectors; layer norm and MLP work independently per vector; highly scalable, highly parallelizable.
  Vaswani et al: 12 blocks, D_Q = 512, 8 heads.
  Vaswani et al, "Attention is all you need", NeurIPS 2017

  76. The Transformer: Transfer Learning
  "ImageNet moment for Natural Language Processing"
  Pretraining: download a lot of text from the internet; train a giant Transformer model for language modeling.
  Finetuning: fine-tune the Transformer on your own NLP task.
  Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018

  77. Scaling up Transformers
  Model             | Layers | Width | Heads | Params | Data | Training
  Transformer-Base  | 12     | 512   | 8     | 65M    |      | 8x P100 (12 hours)
  Transformer-Large | 12     | 1024  | 16    | 213M   |      | 8x P100 (3.5 days)
  Vaswani et al, "Attention is all you need", NeurIPS 2017

  78. Scaling up Transformers
  Adds to the table above:
  Model      | Layers | Width | Heads | Params | Data  | Training
  BERT-Base  | 12     | 768   | 12    | 110M   | 13 GB |
  BERT-Large | 24     | 1024  | 16    | 340M   | 13 GB |
  Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018

  79. Scaling up Transformers
  Adds to the table above:
  Model       | Layers | Width | Heads | Params | Data   | Training
  XLNet-Large | 24     | 1024  | 16    | ~340M  | 126 GB | 512x TPU-v3 (2.5 days)
  RoBERTa     | 24     | 1024  | 16    | 355M   | 160 GB | 1024x V100 GPU (1 day)
  Yang et al, "XLNet: Generalized Autoregressive Pretraining for Language Understanding", 2019
  Liu et al, "RoBERTa: A Robustly Optimized BERT Pretraining Approach", 2019

  80. Scaling up Transformers
  Adds to the table above:
  Model | Layers | Width | Heads | Params | Data  | Training
  GPT-2 | 12     | 768   | ?     | 117M   | 40 GB |
  GPT-2 | 24     | 1024  | ?     | 345M   | 40 GB |
  GPT-2 | 36     | 1280  | ?     | 762M   | 40 GB |
  GPT-2 | 48     | 1600  | ?     | 1.5B   | 40 GB |
  Radford et al, "Language models are unsupervised multitask learners", 2019

  81. Scaling up Transformers
  Model             | Layers | Width | Heads | Params | Data   | Training
  Transformer-Base  | 12     | 512   | 8     | 65M    |        | 8x P100 (12 hours)
  Transformer-Large | 12     | 1024  | 16    | 213M   |        | 8x P100 (3.5 days)
  BERT-Base         | 12     | 768   | 12    | 110M   | 13 GB  |
  BERT-Large        | 24     | 1024  | 16    | 340M   | 13 GB  |
  XLNet-Large       | 24     | 1024  | 16    | ~340M  | 126 GB | 512x TPU-v3 (2.5 days)
  RoBERTa           | 24     | 1024  | 16    | 355M   | 160 GB | 1024x V100 GPU (1 day)
  GPT-2             | 12     | 768   | ?     | 117M   | 40 GB  |
  GPT-2             | 24     | 1024  | ?     | 345M   | 40 GB  |
  GPT-2             | 36     | 1280  | ?     | 762M   | 40 GB  |
  GPT-2             | 48     | 1600  | ?     | 1.5B   | 40 GB  |
  Megatron-LM       | 40     | 1536  | 16    | 1.2B   | 174 GB | 64x V100 GPU
  Megatron-LM       | 54     | 1920  | 20    | 2.5B   | 174 GB | 128x V100 GPU
  Megatron-LM       | 64     | 2304  | 24    | 4.2B   | 174 GB | 256x V100 GPU (10 days)
  Megatron-LM       | 72     | 3072  | 32    | 8.3B   | 174 GB | 512x V100 GPU (9 days)
  Shoeybi et al, "Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism", 2019
