The Transformer: “Attention-only” models

[Figure: self-attention over inputs w_i, producing hidden states h_i and outputs y_i. Separate linear layers (Wᵀx) produce the keys, queries, and values, with one set of weights each for K, Q, and V. Attention scores are the dot products kᵀq divided by a scaling parameter σ; softmax weights α combine the values into each output y_i.]
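A minimal sketch of the mechanism above in PyTorch (names such as d_model, Wq, Wk, Wv are illustrative, not from the slides); here the scaling parameter is the usual square root of the key dimension:

import torch
import torch.nn.functional as F

d_model = 64
x = torch.randn(5, d_model)                          # 5 input positions w_i, one d_model vector each

Wq = torch.nn.Linear(d_model, d_model, bias=False)   # one linear layer (weight matrix) each
Wk = torch.nn.Linear(d_model, d_model, bias=False)   # for Q, K, and V
Wv = torch.nn.Linear(d_model, d_model, bias=False)

Q, K, V = Wq(x), Wk(x), Wv(x)
scores = Q @ K.T / d_model ** 0.5                    # dot products k^T q, scaled
alpha = F.softmax(scores, dim=-1)                    # attention weights alpha
y = alpha @ V                                        # each output y_i is a weighted sum of values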
The Transformer: “Attention-only” models

Why?
● Don’t need complexity of LSTM/GRU cells
● Constant num edges between words (or input steps)
● Enables “interactions” (i.e. adaptations) between words
● Easy to parallelize -- don’t need sequential processing
The Transformer

Limitation (thus far): can’t capture multiple types of dependencies between words.

Solution: multi-head attention
Multi-head Attention
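A minimal sketch of multi-head attention, assuming the usual formulation: several heads run in parallel, each with its own slice of the Q/K/V projections, and their outputs are concatenated and projected back (names are illustrative):

import torch
import torch.nn.functional as F

n_heads, d_head = 8, 8
d_model = n_heads * d_head
x = torch.randn(5, d_model)                          # 5 input positions

Wq = torch.nn.Linear(d_model, d_model, bias=False)
Wk = torch.nn.Linear(d_model, d_model, bias=False)
Wv = torch.nn.Linear(d_model, d_model, bias=False)
Wo = torch.nn.Linear(d_model, d_model, bias=False)   # output projection after concatenation

def split_heads(t):                                  # (seq, d_model) -> (n_heads, seq, d_head)
    return t.view(-1, n_heads, d_head).transpose(0, 1)

Q, K, V = split_heads(Wq(x)), split_heads(Wk(x)), split_heads(Wv(x))
scores = Q @ K.transpose(-2, -1) / d_head ** 0.5     # per-head scaled dot products
alpha = F.softmax(scores, dim=-1)
heads = alpha @ V                                    # (n_heads, seq, d_head)
y = Wo(heads.transpose(0, 1).reshape(-1, d_model))   # concatenate heads, project back

Each head can learn to attend to a different type of dependency between words, addressing the single-head limitation above.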
Transformer for Encoder-Decoder
Transformer for Encoder-Decoder

Positional encoding: based on the sequence index (t)
Transformer for Encoder-Decoder

Residualized connections: residuals enable positional information to be passed along
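A minimal sketch of the sinusoidal positional encoding from the original Transformer, added to the token embeddings so the residual connections can carry position information up through the layers (function and variable names are illustrative):

import torch

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()   # sequence index t for each position
    i = torch.arange(0, d_model, 2).float()            # even embedding dimensions
    angle = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                     # sin on even dims
    pe[:, 1::2] = torch.cos(angle)                     # cos on odd dims
    return pe

embeddings = torch.randn(5, 64)                        # token embeddings for 5 positions
x = embeddings + positional_encoding(5, 64)            # inject position before the first layer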
Transformer for Encoder-Decoder

The decoder is essentially a language model: it blocks out (masks) future inputs, and its predictions are conditioned on the encoder’s output.
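A minimal sketch of how the decoder blocks out future inputs: a causal mask sets the attention scores for positions j > i to -inf before the softmax, so position i can only attend to positions <= i (names are illustrative):

import torch
import torch.nn.functional as F

seq_len = 5
scores = torch.randn(seq_len, seq_len)                           # raw q.k attention scores
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))          # hide future positions
alpha = F.softmax(scores, dim=-1)                                # each row only weights past/current tokens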
Transformer (as of 2017)

[Table: BLEU scores on the “WMT-2014” data set]
Transformer

● Utilize self-attention
● Simple attention scoring function (dot product, scaled)
● Added linear layers for Q, K, and V
● Multi-head attention
● Added positional encoding
● Added residual connections
● Simulate decoding by masking
Transformer

Why?
● Don’t need complexity of LSTM/GRU cells
● Constant num edges between words (or input steps)
● Enables “interactions” (i.e. adaptations) between words
● Easy to parallelize -- don’t need sequential processing

Drawbacks:
● Only unidirectional by default
● Only a “single-hop” relationship per layer (multiple layers to capture multiple)
BERT: Bidirectional Encoder Representations from Transformers

Produces contextualized embeddings (or a pre-trained contextualized encoder)
● Bidirectional context by “masking” in the middle
● A lot of layers, hidden states, attention heads

Drawbacks of vanilla Transformers:
● Only unidirectional by default
● Only a “single-hop” relationship per layer (multiple layers to capture multiple)
BERT

Masked language modeling example:
She saw the man on the hill with the telescope.
She [mask] the man on the hill [mask] the telescope.

Mask 1 in 7 words:
● Too few: expensive, less robust
● Too many: not enough context
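A minimal sketch of this masking step (illustrative only; BERT’s actual recipe masks about 15% of tokens and sometimes substitutes a random or unchanged token rather than always [MASK]):

import random

tokens = "She saw the man on the hill with the telescope .".split()
mask_prob = 0.15                        # roughly 1 in 7 words
masked, targets = [], []
for tok in tokens:
    if random.random() < mask_prob:
        masked.append("[MASK]")
        targets.append(tok)             # the model is trained to recover this token
    else:
        masked.append(tok)
        targets.append(None)            # no loss at unmasked positions
print(masked)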
BERT

Model sizes:
● BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
● BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
● BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT (Devlin et al., 2019)

Differences from previous state of the art:
● Bidirectional transformer (through masking)
● Directions jointly trained at once
● Capture sentence-level relations
● Tokenize into “word pieces”
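A minimal sketch of “word piece” tokenization using the Hugging Face transformers library (the checkpoint name is the standard public one; exact splits depend on the vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
pieces = tokenizer.tokenize("She saw the man on the hill with the telescope.")
print(pieces)   # out-of-vocabulary words are split into sub-word pieces prefixed with "##"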
BERT Performance: e.g. Question Answering https://rajpurkar.github.io/SQuAD-explorer/
BERT: Attention by Layers https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8 (Vig, 2019)
BERT: Pre-training; Fine-tuning (12 or 24 layers)
BERT: Pre-training; Fine-tuning

● Fine-tuning adds a novel classifier on top (e.g. sentiment classifier, stance detector, etc.)
● The [CLS] vector at the start is supposed to capture the meaning of the whole sequence
● An average of the top layer (or second-to-top layer) is also often used
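A minimal sketch of this fine-tuning setup with the Hugging Face transformers library (the checkpoint and tokenizer names are the standard public ones; the classifier head and example task are illustrative):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoder = BertModel.from_pretrained("bert-base-cased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)   # e.g. 2 sentiment classes

inputs = tokenizer("The movie was great!", return_tensors="pt")
outputs = encoder(**inputs)
cls_vec = outputs.last_hidden_state[:, 0]       # the [CLS] vector at the start
# alternative: average the top-layer states
# cls_vec = outputs.last_hidden_state.mean(dim=1)
logits = classifier(cls_vec)                    # train encoder + head end-to-end on the new task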
Extra Material: