Introduction to Machine Learning
Language Models and Transfer Learning
Yifeng Tao
School of Computer Science, Carnegie Mellon University
Slides adapted from various sources (see reference page)
What is a Language Model
o A statistical language model is a probability distribution over sequences of words.
o Given such a sequence, say of length m, it assigns a probability P(w_1, ..., w_m) to the whole sequence.
o Main problem: data sparsity (most long word sequences never appear in the training corpus).
[Slide from https://en.wikipedia.org/wiki/Language_model.]
Unigram Model: Bag of Words
o General probability distribution (chain rule): P(w_1, ..., w_m) = ∏_{i=1}^m P(w_i | w_1, ..., w_{i-1})
o Unigram model assumption: P(w_1, ..., w_m) ≈ ∏_{i=1}^m P(w_i)
o Essentially a bag-of-words model
o Estimation of unigram parameters: count word frequencies in the document (see the sketch below)
[Slide from https://en.wikipedia.org/wiki/Language_model.]
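A minimal sketch of unigram estimation on a toy corpus; the corpus and whitespace tokenization are illustrative assumptions, not from the slides:

from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()  # toy corpus (assumption)

counts = Counter(corpus)
total = sum(counts.values())

# Maximum-likelihood unigram probabilities: P(w) = count(w) / total tokens
unigram_p = {w: c / total for w, c in counts.items()}

# Probability of a sentence under the unigram (bag-of-words) assumption
def sentence_prob(words):
    p = 1.0
    for w in words:
        p *= unigram_p.get(w, 0.0)  # unseen words get probability 0: the data sparsity problem
    return p

print(sentence_prob("the cat sat".split()))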
n-gram Model
o n-gram assumption: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-(n-1)}, ..., w_{i-1})
o Estimation of n-gram parameters: P(w_i | w_{i-(n-1)}, ..., w_{i-1}) = count(w_{i-(n-1)}, ..., w_i) / count(w_{i-(n-1)}, ..., w_{i-1}) (see the sketch below)
[Slide from https://en.wikipedia.org/wiki/Language_model.]
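A minimal bigram (n = 2) version of the same counting estimator; the toy corpus is again an assumption made for illustration:

from collections import Counter

tokens = "the cat sat on the mat the cat slept".split()  # toy corpus (assumption)

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

# MLE bigram probability: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times, followed by "cat" twice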
Word2Vec
o Word2Vec: learns distributed representations of words
o Continuous bag-of-words (CBOW)
  o Predicts the current word from a window of surrounding context words
o Continuous skip-gram
  o Uses the current word to predict the surrounding window of context words (see the pair-generation sketch below)
  o Slower than CBOW, but works better for infrequent words
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
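A small sketch of how skip-gram training pairs can be generated from a sentence with a context window; the window size and the sentence are assumptions for illustration:

sentence = "the quick brown fox jumps over the lazy dog".split()  # example sentence (assumption)
window = 2  # context window size (assumption)

# Skip-gram training pairs: (center word, context word)
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]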
Skip-gram Word2Vec
o Vocabulary of all words: w_1, ..., w_V
o Parameters of the skip-gram word2vec model:
  o Word (input) embedding for each word: v_w
  o Context (output) embedding for each word: v'_w
o Assumption: P(context word c | center word w) = exp(v'_c · v_w) / Σ_{c'} exp(v'_{c'} · v_w) (see the sketch below)
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
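A numpy sketch of the softmax assumption above, with tiny random embedding tables standing in for trained parameters; the vocabulary size and dimension are assumptions:

import numpy as np

V, d = 10, 4  # vocabulary size and embedding dimension (assumptions)
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(V, d))     # v_w: one row per word
context_emb = rng.normal(size=(V, d))  # v'_w: separate context embeddings

def skipgram_probs(center_id):
    # Scores for every candidate context word: v'_c . v_center
    scores = context_emb @ word_emb[center_id]
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()        # softmax over the vocabulary

p = skipgram_probs(3)
print(p.sum())  # 1.0: a proper distribution over context words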
Distributed Representations of Words
o The trained word embeddings of the skip-gram word2vec model are the distributed representations.
o Semantic relations between words appear as geometric relations in the embedding space (see the sketch below).
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
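A tiny illustration of "semantics as geometry" using cosine similarity; the 2-D vectors below are hand-crafted stand-ins for trained embeddings, not the output of word2vec:

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy hand-crafted 2-D "embeddings" (assumption, not trained vectors),
# arranged so that related words point in similar directions.
emb = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "car": np.array([0.1, 0.9]),
}

print(cosine(emb["cat"], emb["dog"]))  # high: semantically related words
print(cosine(emb["cat"], emb["car"]))  # low: unrelated words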
Word Embeddings in Transfer Learning
o Transfer learning setting:
  o Labeled data for the target task are limited
  o Unlabeled text corpora are enormous
o Pretrained word embeddings can be transferred to other supervised tasks, e.g., POS tagging, NER, question answering, machine translation, sentiment classification (see the sketch below).
[Slide from Matt Gormley.]
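A minimal PyTorch sketch of transferring pretrained embeddings into a downstream classifier; the pretrained matrix, sizes, and task model here are illustrative assumptions:

import torch
import torch.nn as nn

V, d, num_classes = 10000, 300, 5  # vocab size, embedding dim, label count (assumptions)
pretrained = torch.randn(V, d)     # stand-in for word2vec/GloVe vectors loaded from disk

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize the embedding layer from pretrained vectors;
        # freeze=False lets the downstream task fine-tune them.
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.fc = nn.Linear(d, num_classes)

    def forward(self, token_ids):
        x = self.emb(token_ids).mean(dim=1)  # average the word vectors in each sentence
        return self.fc(x)

model = Classifier()
logits = model(torch.randint(0, V, (2, 7)))  # batch of 2 sentences, 7 tokens each
print(logits.shape)  # torch.Size([2, 5])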
SOTA Language Models: ELMo
o Embeddings from Language Models (ELMo)
o Fits the full conditional probability in the forward direction: p(t_1, ..., t_N) = ∏_{k=1}^N p(t_k | t_1, ..., t_{k-1})
o Fits the conditional probability in both directions using LSTMs; the backward direction models p(t_k | t_{k+1}, ..., t_N)
[Slide from https://arxiv.org/pdf/1802.05365.pdf and https://arxiv.org/abs/1810.04805.]
SOTA Language Models: OpenAI GPT & BERT
o Use the Transformer rather than the LSTM to model language
o OpenAI GPT: single direction (left-to-right)
o BERT: bi-directional (see the attention-mask sketch below)
[Slide from https://arxiv.org/abs/1810.04805.]
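A small numpy sketch contrasting the two directionality choices as attention masks; this is an illustration of the idea, not code from either paper, and the sequence length is an assumption:

import numpy as np

T = 5  # sequence length (assumption)

# GPT-style causal mask: position i may attend only to positions <= i (single direction)
causal_mask = np.tril(np.ones((T, T), dtype=bool))

# BERT-style mask: every position may attend to every other position (bi-directional)
bidirectional_mask = np.ones((T, T), dtype=bool)

print(causal_mask.astype(int))
print(bidirectional_mask.astype(int))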
SOTA Language Models: BERT
o Additional language modeling task (next sentence prediction): predict whether the second sentence actually follows the first in the source text (see the sketch below).
[Slide from https://arxiv.org/abs/1810.04805.]
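A rough sketch of how such sentence-pair training examples could be constructed; the two toy documents are assumptions, and this simplifies BERT's actual sampling procedure:

import random

# Toy "corpus": two documents, each a list of sentences (assumption)
doc_a = ["the cat sat on the mat", "it then fell asleep"]
doc_b = ["stock prices rose sharply", "analysts were surprised"]

def make_nsp_example():
    # Positive pair (IsNext): two consecutive sentences from the same document
    if random.random() < 0.5:
        return doc_a[0], doc_a[1], 1
    # Negative pair (NotNext): second sentence drawn from a different document
    return doc_a[0], random.choice(doc_b), 0

print(make_nsp_example())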
SOTA Language Models: BERT
o Instead of just extracting embeddings or hidden-layer outputs, BERT can be fine-tuned end-to-end on specific supervised learning tasks (see the sketch below).
[Slide from https://arxiv.org/abs/1810.04805.]
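One common way to do this today is the Hugging Face transformers library, which the slides do not mention; the following is a hedged sketch under that assumption, with a toy sentiment label:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pretrained BERT with a fresh classification head on top of the [CLS] token
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])  # toy sentiment label (assumption)

# The forward pass returns the classification loss; backpropagating it
# fine-tunes all BERT weights, not just the new head.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
print(outputs.logits.shape)  # torch.Size([1, 2])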
The Transformer and Attention Mechanism
o An encoder-decoder structure
o Our focus: the encoder and the attention mechanism
[Slide from https://jalammar.github.io/illustrated-transformer/.]
The Transformer and Attention Mechanism
o Self-attention
  o Ignores the positions of words and assigns weights globally across the sentence
  o Can be parallelized, in contrast to the LSTM
  o E.g., the attention weights related to the word "it_"
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Self-attention Mechanism
o Scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / √d_k) V, where the queries Q, keys K, and values V are linear projections of the input embeddings (see the sketch below).
[Slide from https://jalammar.github.io/illustrated-transformer/.]
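A numpy sketch of single-head scaled dot-product self-attention over a toy sequence; the projection matrices are random stand-ins for learned parameters, and the sizes are assumptions:

import numpy as np

T, d_model, d_k = 4, 8, 8  # sequence length, model dim, key dim (assumptions)
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_model))      # token embeddings for one sentence
W_q = rng.normal(size=(d_model, d_k))  # learned projections in a real model
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1: how much one token attends to every other token

output = weights @ V  # (T, d_k): every position is computed in parallel, unlike an LSTM
print(weights.round(2))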
Self-attention Mechanism
o More: https://jalammar.github.io/illustrated-transformer/
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Take-home Message
o Language models suffer from data sparsity
o Word2vec models language probability with distributed word-embedding parameters
o ELMo, OpenAI GPT, and BERT model language using deep neural networks
o Pre-trained language models, or their parameters, can be transferred to supervised learning problems in NLP
o Self-attention has the advantage over the LSTM that it can be parallelized and considers interactions across the whole sentence
References
o Wikipedia. Language model: https://en.wikipedia.org/wiki/Language_model
o TensorFlow. Vector Representations of Words: https://www.tensorflow.org/tutorials/representation/word2vec
o Matt Gormley. 10-601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
o Matthew E. Peters et al. Deep Contextualized Word Representations: https://arxiv.org/pdf/1802.05365.pdf
o Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/abs/1810.04805
o Jay Alammar. The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/