Attention, Transformers, BERT, and ViLBERT
Arjun Majumdar, Georgia Tech
Slide Credits: Andrej Karpathy, Justin Johnson, Dhruv Batra
Recall: Recurrent Neural Networks (Image Credit: Andrej Karpathy)
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1})
[Diagram: encoder states h_1 ... h_4 computed over the inputs x_1 ... x_4, "we are eating bread"]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1})
From the final hidden state, predict the initial decoder state s_0 and a context vector c (often c = h_T).
[Diagram: encoder states h_1 ... h_4 over "we are eating bread", feeding s_0 and c]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1}); from the final hidden state, predict the initial decoder state s_0 and context vector c (often c = h_T).
Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c)
[Diagram: decoder starts from s_0, c, and y_0 = [START], computes s_1 and predicts y_1 = "estamos"]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1}); from the final hidden state, predict the initial decoder state s_0 and context vector c (often c = h_T).
Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c)
[Diagram: decoder computes s_1, s_2 from y_0 = [START], y_1 = "estamos", predicting y_1 = "estamos", y_2 = "comiendo"]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1}); from the final hidden state, predict the initial decoder state s_0 and context vector c (often c = h_T).
Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c)
[Diagram: decoder states s_1 ... s_4, fed y_0 ... y_3 = "[START] estamos comiendo pan", predict y_1 ... y_4 = "estamos comiendo pan [STOP]"]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
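As a concrete illustration of the two recurrences above, here is a minimal NumPy sketch of the encoder-decoder rollout. Everything in it (the tanh cells, the toy vocabulary sizes, setting s_0 = h_T, the greedy argmax readout) is an assumption for illustration, not the architecture from the paper:

import numpy as np

D, V_in, V_out = 8, 5, 6                 # toy hidden size and vocabulary sizes (assumptions)
rng = np.random.default_rng(0)
Wxh, Whh = rng.normal(size=(D, V_in)), rng.normal(size=(D, D))
Wys, Wss, Wcs = rng.normal(size=(D, V_out)), rng.normal(size=(D, D)), rng.normal(size=(D, D))
Wout = rng.normal(size=(V_out, D))       # maps decoder state s_t to output scores

def f_W(x, h_prev):                      # encoder cell: h_t = f_W(x_t, h_{t-1})
    return np.tanh(Wxh @ x + Whh @ h_prev)

def g_U(y_prev, s_prev, c):              # decoder cell: s_t = g_U(y_{t-1}, s_{t-1}, c)
    return np.tanh(Wys @ y_prev + Wss @ s_prev + Wcs @ c)

# Encoder: run over the input tokens x_1 ... x_4 ("we are eating bread"), kept as one-hots
X = np.eye(V_in)[[0, 1, 2, 3]]
h = np.zeros(D)
H = []                                   # keep every h_i (reused in the attention sketches later)
for x_t in X:
    h = f_W(x_t, h)
    H.append(h)
H = np.stack(H)

# Decoder: a single fixed context vector c = h_T is reused at every step
s, c = H[-1], H[-1]                      # s_0 (taken directly from h_T here) and context c
y = np.eye(V_out)[0]                     # y_0 = [START]
for _ in range(4):
    s = g_U(y, s, c)                     # s_t from y_{t-1}, s_{t-1}, and the same c
    y = np.eye(V_out)[np.argmax(Wout @ s)]  # greedy prediction of the next token y_t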
Sequence-to-Sequence with RNNs
Encoder: h_t = f_W(x_t, h_{t-1}). Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c).
Problem: the entire input sequence is bottlenecked through a single fixed-sized context vector c.
[Diagram: encoder over "we are eating bread", decoder producing "estamos comiendo pan [STOP]" from the one context vector c]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Encoder: h_t = f_W(x_t, h_{t-1}). Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c).
Problem: the entire input sequence is bottlenecked through a single fixed-sized context vector c.
Idea: use a new context vector at each step of the decoder!
[Diagram: encoder over "we are eating bread", decoder producing "estamos comiendo pan [STOP]" from the one context vector c]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1}); from the final hidden state, compute the initial decoder state s_0.
[Diagram: encoder states h_1 ... h_4 over "we are eating bread", feeding s_0]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
From the final hidden state, compute the initial decoder state s_0.
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
[Diagram: scores e_{1,1} ... e_{1,4} computed from s_0 and each encoder state h_1 ... h_4]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
[Diagram: softmax over e_{1,1} ... e_{1,4} produces a_{1,1} ... a_{1,4}]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
Compute the context vector as a linear combination of the encoder hidden states: c_t = ∑_i a_{t,i} h_i.
[Diagram: weights a_{1,1} ... a_{1,4} multiply h_1 ... h_4 and are summed to form c_1]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
Compute the context vector as a linear combination of the encoder hidden states: c_t = ∑_i a_{t,i} h_i.
Use the context vector in the decoder: s_t = g_U(y_{t-1}, s_{t-1}, c_t).
[Diagram: decoder uses c_1, y_0 = [START], and s_0 to compute s_1 and predict y_1 = "estamos"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
Compute the context vector as a linear combination of the encoder hidden states: c_t = ∑_i a_{t,i} h_i.
Use the context vector in the decoder: s_t = g_U(y_{t-1}, s_{t-1}, c_t).
This is all differentiable! Do not supervise the attention weights; backprop through everything.
[Diagram: decoder uses c_1, y_0 = [START], and s_0 to compute s_1 and predict y_1 = "estamos"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
Compute the context vector as a linear combination of the encoder hidden states: c_t = ∑_i a_{t,i} h_i.
Use the context vector in the decoder: s_t = g_U(y_{t-1}, s_{t-1}, c_t).
This is all differentiable! Do not supervise the attention weights; backprop through everything.
Intuition: the context vector attends to the relevant part of the input sequence. "estamos" = "we are", e.g. a_{1,1} = 0.45, a_{1,2} = 0.45, a_{1,3} = 0.05, a_{1,4} = 0.05.
[Diagram: decoder uses c_1, y_0 = [START], and s_0 to compute s_1 and predict y_1 = "estamos"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
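Continuing the toy NumPy sketch from earlier (reusing H, g_U, Wout, and V_out defined there), one attended decoding step might look like the following; the tiny two-layer f_att and the greedy readout are assumptions for illustration, not the exact network from the paper:

# Alignment scores e_{t,i} = f_att(s_{t-1}, h_i), here a tiny two-layer MLP on [s_{t-1}; h_i]
W1, w2 = rng.normal(size=(D, 2 * D)), rng.normal(size=(D,))

def f_att(s_prev, h_i):
    return w2 @ np.tanh(W1 @ np.concatenate([s_prev, h_i]))     # scalar alignment score

def attend(s_prev, H):
    e = np.array([f_att(s_prev, h_i) for h_i in H])             # e_{t,1} ... e_{t,T}
    a = np.exp(e - e.max()); a /= a.sum()                        # softmax -> attention weights a_{t,i}
    return a @ H, a                                              # c_t = sum_i a_{t,i} h_i, plus the weights

# One decoding step: from s_0 and y_0, compute c_1, s_1, and the first prediction
s0, y0 = H[-1], np.eye(V_out)[0]        # s_0 (from the final encoder state) and y_0 = [START]
c1, a1 = attend(s0, H)                  # context c_1 and attention weights a_{1,i}
s1 = g_U(y0, s0, c1)                    # s_1 = g_U(y_0, s_0, c_1)
y1 = np.eye(V_out)[np.argmax(Wout @ s1)]  # greedy prediction of the first output token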
Sequence-to-Sequence with RNNs and Attention
Repeat: use s_1 to compute new alignment scores e_{2,i}, attention weights a_{2,i}, and context vector c_2.
[Diagram: scores e_{2,1} ... e_{2,4} from s_1 and h_1 ... h_4, softmaxed to a_{2,1} ... a_{2,4} and summed into c_2]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Repeat: use s_1 to compute new alignment scores e_{2,i}, attention weights a_{2,i}, and context vector c_2.
Use c_2 to compute s_2 and y_2 ("comiendo").
[Diagram: decoder uses c_2, y_1 = "estamos", and s_1 to compute s_2 and predict y_2 = "comiendo"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Repeat: use s_1 to compute new alignment scores e_{2,i}, attention weights a_{2,i}, and context vector c_2.
Use c_2 to compute s_2 and y_2 ("comiendo").
Intuition: the context vector attends to the relevant part of the input sequence. "comiendo" = "eating".
[Diagram: decoder uses c_2, y_1 = "estamos", and s_1 to compute s_2 and predict y_2 = "comiendo"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
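Putting the pieces together, a full attended decoding loop simply recomputes a fresh context vector from the current decoder state at every step. This is a sketch under the same toy assumptions as before, reusing attend, g_U, Wout, V_out, and the encoder states H:

def decode_with_attention(H, max_steps=4):
    # Greedy decoding sketch: a new context vector c_t is computed at every step
    s, y = H[-1], np.eye(V_out)[0]              # s_0 and y_0 = [START]
    tokens, attn_maps = [], []
    for _ in range(max_steps):
        c, a = attend(s, H)                     # c_t from the current decoder state s_{t-1}
        s = g_U(y, s, c)                        # s_t = g_U(y_{t-1}, s_{t-1}, c_t)
        y = np.eye(V_out)[np.argmax(Wout @ s)]  # greedy next token y_t
        tokens.append(int(np.argmax(y)))
        attn_maps.append(a)                     # one attention distribution over h_1 ... h_T per step
    return tokens, attn_maps

tokens, attn_maps = decode_with_attention(H)    # token ids plus per-step attention weights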