Attention, Transformers, BERT, and ViLBERT
Arjun Majumdar, Georgia Tech
Slide Credits: Andrej Karpathy, Justin Johnson, Dhruv Batra
Recall: Recurrent Neural Networks (Image Credit: Andrej Karpathy)
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1})
[Diagram: encoder states h_1 ... h_4 computed over the inputs x_1 ... x_4, "we are eating bread"]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1})
From the final hidden state, predict the initial decoder state s_0 and a context vector c (often c = h_T).
[Diagram: encoder states h_1 ... h_4 over "we are eating bread", feeding s_0 and c]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1}); from the final hidden state, predict the initial decoder state s_0 and context vector c (often c = h_T).
Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c)
[Diagram: decoder starts from s_0, c, and y_0 = [START], computes s_1 and predicts y_1 = "estamos"]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1}); from the final hidden state, predict the initial decoder state s_0 and context vector c (often c = h_T).
Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c)
[Diagram: decoder computes s_1, s_2 from y_0 = [START], y_1 = "estamos", predicting y_1 = "estamos", y_2 = "comiendo"]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1}); from the final hidden state, predict the initial decoder state s_0 and context vector c (often c = h_T).
Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c)
[Diagram: decoder states s_1 ... s_4, fed y_0 ... y_3 = "[START] estamos comiendo pan", predict y_1 ... y_4 = "estamos comiendo pan [STOP]"]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
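As a concrete illustration of the two recurrences above, here is a minimal NumPy sketch of the encoder-decoder rollout. Everything in it (the tanh cells, the toy vocabulary sizes, setting s_0 = h_T, the greedy argmax readout) is an assumption for illustration, not the architecture from the paper:

import numpy as np

D, V_in, V_out = 8, 5, 6                 # toy hidden size and vocabulary sizes (assumptions)
rng = np.random.default_rng(0)
Wxh, Whh = rng.normal(size=(D, V_in)), rng.normal(size=(D, D))
Wys, Wss, Wcs = rng.normal(size=(D, V_out)), rng.normal(size=(D, D)), rng.normal(size=(D, D))
Wout = rng.normal(size=(V_out, D))       # maps decoder state s_t to output scores

def f_W(x, h_prev):                      # encoder cell: h_t = f_W(x_t, h_{t-1})
    return np.tanh(Wxh @ x + Whh @ h_prev)

def g_U(y_prev, s_prev, c):              # decoder cell: s_t = g_U(y_{t-1}, s_{t-1}, c)
    return np.tanh(Wys @ y_prev + Wss @ s_prev + Wcs @ c)

# Encoder: run over the input tokens x_1 ... x_4 ("we are eating bread"), kept as one-hots
X = np.eye(V_in)[[0, 1, 2, 3]]
h = np.zeros(D)
H = []                                   # keep every h_i (reused in the attention sketches later)
for x_t in X:
    h = f_W(x_t, h)
    H.append(h)
H = np.stack(H)

# Decoder: a single fixed context vector c = h_T is reused at every step
s, c = H[-1], H[-1]                      # s_0 (taken directly from h_T here) and context c
y = np.eye(V_out)[0]                     # y_0 = [START]
for _ in range(4):
    s = g_U(y, s, c)                     # s_t from y_{t-1}, s_{t-1}, and the same c
    y = np.eye(V_out)[np.argmax(Wout @ s)]  # greedy prediction of the next token y_t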
Sequence-to-Sequence with RNNs
Encoder: h_t = f_W(x_t, h_{t-1}). Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c).
Problem: the entire input sequence is bottlenecked through a single fixed-sized context vector c.
[Diagram: encoder over "we are eating bread", decoder producing "estamos comiendo pan [STOP]" from the one context vector c]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs
Encoder: h_t = f_W(x_t, h_{t-1}). Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c).
Problem: the entire input sequence is bottlenecked through a single fixed-sized context vector c.
Idea: use a new context vector at each step of the decoder!
[Diagram: encoder over "we are eating bread", decoder producing "estamos comiendo pan [STOP]" from the one context vector c]
Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Input: sequence x_1, ..., x_T. Output: sequence y_1, ..., y_T'.
Encoder: h_t = f_W(x_t, h_{t-1}); from the final hidden state, compute the initial decoder state s_0.
[Diagram: encoder states h_1 ... h_4 over "we are eating bread", feeding s_0]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
From the final hidden state, compute the initial decoder state s_0.
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
[Diagram: scores e_{1,1} ... e_{1,4} computed from s_0 and each encoder state h_1 ... h_4]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
[Diagram: softmax over e_{1,1} ... e_{1,4} produces a_{1,1} ... a_{1,4}]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
Compute the context vector as a linear combination of the encoder hidden states: c_t = ∑_i a_{t,i} h_i.
[Diagram: weights a_{1,1} ... a_{1,4} multiply h_1 ... h_4 and are summed to form c_1]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
Compute the context vector as a linear combination of the encoder hidden states: c_t = ∑_i a_{t,i} h_i.
Use the context vector in the decoder: s_t = g_U(y_{t-1}, s_{t-1}, c_t).
[Diagram: decoder uses c_1, y_0 = [START], and s_0 to compute s_1 and predict y_1 = "estamos"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
Compute the context vector as a linear combination of the encoder hidden states: c_t = ∑_i a_{t,i} h_i.
Use the context vector in the decoder: s_t = g_U(y_{t-1}, s_{t-1}, c_t).
This is all differentiable! Do not supervise the attention weights; backprop through everything.
[Diagram: decoder uses c_1, y_0 = [START], and s_0 to compute s_1 and predict y_1 = "estamos"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP.
Normalize the alignment scores with a softmax to get attention weights: 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1.
Compute the context vector as a linear combination of the encoder hidden states: c_t = ∑_i a_{t,i} h_i.
Use the context vector in the decoder: s_t = g_U(y_{t-1}, s_{t-1}, c_t).
This is all differentiable! Do not supervise the attention weights; backprop through everything.
Intuition: the context vector attends to the relevant part of the input sequence. "estamos" = "we are", e.g. a_{1,1} = 0.45, a_{1,2} = 0.45, a_{1,3} = 0.05, a_{1,4} = 0.05.
[Diagram: decoder uses c_1, y_0 = [START], and s_0 to compute s_1 and predict y_1 = "estamos"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
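Continuing the toy NumPy sketch from earlier (reusing H, g_U, Wout, and V_out defined there), one attended decoding step might look like the following; the tiny two-layer f_att and the greedy readout are assumptions for illustration, not the exact network from the paper:

# Alignment scores e_{t,i} = f_att(s_{t-1}, h_i), here a tiny two-layer MLP on [s_{t-1}; h_i]
W1, w2 = rng.normal(size=(D, 2 * D)), rng.normal(size=(D,))

def f_att(s_prev, h_i):
    return w2 @ np.tanh(W1 @ np.concatenate([s_prev, h_i]))     # scalar alignment score

def attend(s_prev, H):
    e = np.array([f_att(s_prev, h_i) for h_i in H])             # e_{t,1} ... e_{t,T}
    a = np.exp(e - e.max()); a /= a.sum()                        # softmax -> attention weights a_{t,i}
    return a @ H, a                                              # c_t = sum_i a_{t,i} h_i, plus the weights

# One decoding step: from s_0 and y_0, compute c_1, s_1, and the first prediction
s0, y0 = H[-1], np.eye(V_out)[0]        # s_0 (from the final encoder state) and y_0 = [START]
c1, a1 = attend(s0, H)                  # context c_1 and attention weights a_{1,i}
s1 = g_U(y0, s0, c1)                    # s_1 = g_U(y_0, s_0, c_1)
y1 = np.eye(V_out)[np.argmax(Wout @ s1)]  # greedy prediction of the first output token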
Sequence-to-Sequence with RNNs and Attention
Repeat: use s_1 to compute new alignment scores e_{2,i}, attention weights a_{2,i}, and context vector c_2.
[Diagram: scores e_{2,1} ... e_{2,4} from s_1 and h_1 ... h_4, softmaxed to a_{2,1} ... a_{2,4} and summed into c_2]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Repeat: use s_1 to compute new alignment scores e_{2,i}, attention weights a_{2,i}, and context vector c_2.
Use c_2 to compute s_2 and y_2 ("comiendo").
[Diagram: decoder uses c_2, y_1 = "estamos", and s_1 to compute s_2 and predict y_2 = "comiendo"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
Sequence-to-Sequence with RNNs and Attention
Repeat: use s_1 to compute new alignment scores e_{2,i}, attention weights a_{2,i}, and context vector c_2.
Use c_2 to compute s_2 and y_2 ("comiendo").
Intuition: the context vector attends to the relevant part of the input sequence. "comiendo" = "eating".
[Diagram: decoder uses c_2, y_1 = "estamos", and s_1 to compute s_2 and predict y_2 = "comiendo"]
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
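Putting the pieces together, a full attended decoding loop simply recomputes a fresh context vector from the current decoder state at every step. This is a sketch under the same toy assumptions as before, reusing attend, g_U, Wout, V_out, and the encoder states H:

def decode_with_attention(H, max_steps=4):
    # Greedy decoding sketch: a new context vector c_t is computed at every step
    s, y = H[-1], np.eye(V_out)[0]              # s_0 and y_0 = [START]
    tokens, attn_maps = [], []
    for _ in range(max_steps):
        c, a = attend(s, H)                     # c_t from the current decoder state s_{t-1}
        s = g_U(y, s, c)                        # s_t = g_U(y_{t-1}, s_{t-1}, c_t)
        y = np.eye(V_out)[np.argmax(Wout @ s)]  # greedy next token y_t
        tokens.append(int(np.argmax(y)))
        attn_maps.append(a)                     # one attention distribution over h_1 ... h_T per step
    return tokens, attn_maps

tokens, attn_maps = decode_with_attention(H)    # token ids plus per-step attention weights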