Machine Learning 2 DS 4420 - Spring 2020 Transformers Byron C. Wallace Material in this lecture derived from materials created by Jay Alammar (http://jalammar.github.io/illustrated-transformer/)
Some housekeeping • First, let’s talk midterm… • Mean: 70 (range: 30s to high 90s) • I miscalibrated Q2 (average: 56%) ★ I gave back 5 points to everyone (mean is now 75) ★ We are releasing an optional bonus assignment that covers the same content as Q2; you can use it to recover up to half (12.5 points) of the credit on that question. It will be released tonight; the due date is flexible.
HW 4 • HW 4 will be released soon; due 3/24 (Tuesday)
Projects! • THURSDAY 3/13 Project proposal is due! • TUESDAY 3/17 Project pitches in class!
A remote possibility • There is an (increasingly) non-zero chance that Northeastern will move to holding all classes remotely in the coming days/weeks • In this case: remote/recorded lectures; on-demand office hours, held remotely; project presentations (and pitches) will also have to be remote or recorded (we will figure this out!) • Keep an eye on Piazza for more updates
Today • We will introduce transformer networks, a type of neural network that has come to dominate NLP • To get there, we will first briefly review RNNs
RNNs • Review [on board]
Transformers • Hey, maybe we can get rid of recurrence!
Attention mechanisms
[Figure, built up over several slides: word embeddings for “This movie … so terrible” feed a BiLSTM, producing hidden states h_1, …, h_T; an attention layer assigns weights α_1, …, α_T to those states; the context vector $c = \sum_{i=1}^{T} \alpha_i h_i$ is passed to the output layer to produce ŷ.]
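To make the recap concrete, here is a minimal NumPy sketch of that attention-pooling step. The scoring vector `w`, the random inputs, and the toy dimensions are illustrative assumptions, not the course notebook's code.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, w):
    """Collapse BiLSTM hidden states H (T x d) into one context vector.

    Scores each hidden state with a learned vector w (d,), converts the
    scores to weights alpha via softmax, and returns the weighted sum
    c = sum_i alpha_i * h_i along with the weights themselves.
    """
    scores = H @ w              # (T,)
    alpha = softmax(scores)     # attention weights; they sum to 1
    c = alpha @ H               # (d,) context vector
    return c, alpha

# Toy usage: T = 5 time steps, hidden size d = 8
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))    # pretend these came from a BiLSTM
w = rng.normal(size=8)         # hypothetical learned scoring vector
c, alpha = attention_pool(H, w)
print(alpha.round(3), c.shape)
```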
Transformer block source: http://jalammar.github.io/illustrated-transformer/
First, embed source: http://jalammar.github.io/illustrated-transformer/
Then transform source: http://jalammar.github.io/illustrated-transformer/
What is “self-attention”? source: http://jalammar.github.io/illustrated-transformer/
This one weird trick source: http://jalammar.github.io/illustrated-transformer/
In matrices (the projection matrices W^Q, W^K, W^V are learned) source: http://jalammar.github.io/illustrated-transformer/
In matrices source: http://jalammar.github.io/illustrated-transformer/
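Putting the pictures above into code: a minimal NumPy sketch of single-head scaled dot-product self-attention, with queries, keys, and values coming from learned projections. The dimensions and random weights below are illustrative, not the notebook's.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (T, d_model) token embeddings; W_q/W_k/W_v: learned (d_model, d_k)
    projections. Returns a (T, d_k) matrix where each position's output is
    a weighted mix of all value vectors.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (T, T) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

# Toy usage: 4 tokens, d_model = 16, d_k = 8 (dimensions are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)            # (4, 8)
```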
Let’s implement… [notebook TODOs 1 & 2]
OK, but what is it used for?
Translation source: http://jalammar.github.io/illustrated-transformer/
Language modeling https://talktotransformer.com/
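talktotransformer.com wraps a GPT-2-style language model. If you want to poke at the same idea locally, one option (an assumption on my part, not the code behind that demo) is the Hugging Face `transformers` text-generation pipeline:

```python
# Rough local analogue of the demo; assumes the Hugging Face `transformers`
# package is installed. Generated text will vary from run to run.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The transformer architecture", max_length=40, num_return_sequences=1)
print(out[0]["generated_text"])
```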
BERT
BERT BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language { jacobdevlin,mingweichang,kentonl,kristout } @google.com
Pre-train (self-supervise) then fine-tune : A winning combo
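A hedged sketch of what “pre-train then fine-tune” looks like in practice, using the Hugging Face `transformers` package. This is one reasonable recipe, not the course notebook; the sentiment label scheme is made up for illustration.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load weights that were pre-trained with self-supervision (masked LM + NSP).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Fine-tune on a (tiny, hypothetical) labeled example for the downstream task.
batch = tokenizer(["This movie is so terrible"], return_tensors="pt", padding=True)
labels = torch.tensor([0])                 # 0 = negative (made-up label scheme)
loss = model(**batch, labels=labels).loss  # task loss on top of pre-trained encoder
loss.backward()                            # gradients flow through all BERT weights
```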
This is a thing now A Primer in BERTology: What we know about how BERT works Anna Rogers, Olga Kovaleva, Anna Rumshisky Department of Computer Science, University of Massachusetts Lowell Lowell, MA 01854 { arogers, okovalev, arum } @cs.uml.edu
[Figure from the BERT paper: the same Transformer encoder is pre-trained on unlabeled sentence pairs (A and B) with the masked-LM and next-sentence-prediction (NSP) objectives, then fine-tuned on downstream tasks such as MNLI, NER, and SQuAD (predicting answer start/end spans for question-paragraph pairs); special [CLS] and [SEP] tokens delimit the inputs.] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. Google AI Language. {jacobdevlin,mingweichang,kentonl,kristout}@google.com
Self-Supervise an Encoder BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language { jacobdevlin,mingweichang,kentonl,kristout } @google.com
Self-Supervise an Encoder The cat is very cute
Self-Supervise an Encoder: from “The cat is very cute” we build X = “The [MASK] is very cute” with target y = “cat”
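A tiny sketch of how such (X, y) pairs can be generated automatically from raw text. This is a simplification of BERT's actual masking scheme (which masks about 15% of tokens and sometimes keeps or randomizes them instead of using [MASK]):

```python
import random

def mask_one_token(tokens, mask_token="[MASK]"):
    """Turn a token list into a (masked input, target) training pair.

    Pick one position at random, replace it with [MASK], and ask the model
    to recover the original token from the surrounding context.
    """
    i = random.randrange(len(tokens))
    x = tokens.copy()
    y = x[i]
    x[i] = mask_token
    return x, y

random.seed(1)
print(mask_one_token("The cat is very cute".split()))
# e.g. (['The', '[MASK]', 'is', 'very', 'cute'], 'cat')
```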
Let’s implement … [notebook TODO 3]
BERT details we did not consider • BERT actually uses word-pieces rather than entire words • Also uses “positional” embeddings in the inputs to give a sense of “location” in the sequence (a sketch follows below) • Multiple self-attention “heads” • Deeper (12+ layers) • Residual connections + layer norm (prevent exploding activations/NaNs)
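For intuition on the “positional” piece: BERT itself learns its position embeddings as ordinary parameters, but the original Transformer paper uses fixed sinusoids, which are easy to sketch and show how “location” can be injected into the inputs.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """Fixed sinusoidal position encodings from the original Transformer.

    (BERT learns its position embeddings instead; the fixed sinusoids are
    shown here only for intuition.) Each position gets a d_model-dim vector
    of sines/cosines at geometrically spaced frequencies, which is added to
    the corresponding token embedding.
    """
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dims: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dims: cosine
    return pe

print(sinusoidal_positions(max_len=6, d_model=8).shape)   # (6, 8)
```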
For a more detailed implementation … • See Sasha Rush’s excellent “annotated transformer”: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Recommended
More recommended