1 Efficient Contextual Representation Learning With Continuous Outputs
Kai-Wei Chang, Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh (UCLA)
2 Motivation: Efficient Contextual Representation Learning
- Energy implications of popular NLP models (Strubell et al., 2019).
3 Background: Language Model Pre-training
- Language model objectives: forward / backward / masked
- Softmax layer
- Sequence encoder: LSTM / Transformer
- Input layer: subwords / CNN
[Figure: an illustration of popular pre-trained language models, such as ELMo, GPT, and BERT.]
4 Background: Softmax Layer
Loss function with a softmax layer: loss = -log softmax(W c)[y], the cross entropy between the predicted distribution over the vocabulary and the next word y.
- c: context vector from the sequence encoder
- W: V x m matrix, with V being the vocabulary size
- V can become extremely large (800K for ELMo)
- W takes up 80% of the parameters of ELMo
- The softmax layer becomes the speed bottleneck! (see the sketch below)
[Figure: forward language modeling of ELMo; inputs "The quick ... dog", targets "quick brown ... <eos>".]
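To make the bottleneck concrete, here is a minimal PyTorch sketch of the softmax output layer; the sizes roughly match the slides but are illustrative assumptions, not ELMo's exact training configuration:

```python
import torch
import torch.nn.functional as F

V, m = 800_000, 512                               # vocabulary size and context dimension (rough ELMo scale)
W = torch.nn.Parameter(torch.randn(V, m) * 0.01)  # V x m softmax weight matrix, ~410M trainable parameters

def softmax_loss(c, target_ids):
    """Cross entropy over the full vocabulary: O(|V|) work per token.

    c:          (batch, m) context vectors from the sequence encoder
    target_ids: (batch,)   indices of the next words
    """
    logits = c @ W.t()                  # (batch, V): the expensive matrix product
    return F.cross_entropy(logits, target_ids)

loss = softmax_loss(torch.randn(32, m), torch.randint(0, V, (32,)))
```

Every training step touches all V rows of W: in the forward pass, in the dense softmax gradient, and in the optimizer update.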
5 Approach: Accelerating Language Model Training with Continuous Output
Loss function with a continuous output layer*: loss = d(c, w)
- c: context vector from the sequence encoder
- w: pre-trained word embedding of the target word w
- d: distance function such as cosine distance
Predicting the word embedding instead of the word! (see the sketch below)
[Figure: forward language modeling of ELMo; inputs "The quick ... dog", targets "quick brown ... <eos>".]
* Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs. Sachin Kumar and Yulia Tsvetkov, 2018.
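A matching sketch of the continuous output layer under the same assumed sizes; because the pre-trained embedding table is frozen, the output layer adds no trainable parameters:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes; in practice the embeddings come from a pre-trained
# open-vocabulary model (e.g., FastText) and are kept frozen.
V, m = 800_000, 512
pretrained_emb = torch.randn(V, m)  # frozen target embeddings, not a Parameter

def continuous_output_loss(c, target_ids):
    """Cosine-distance loss d(c, w): per-token cost depends on m, not on |V|.

    c:          (batch, m) context vectors from the sequence encoder
    target_ids: (batch,)   indices of the next words, used only to look up embeddings
    """
    w = pretrained_emb[target_ids]                        # (batch, m) target embeddings
    return (1.0 - F.cosine_similarity(c, w, dim=-1)).mean()

loss = continuous_output_loss(torch.randn(32, m), torch.randint(0, V, (32,)))
```

The only per-token work is an embedding lookup and a cosine distance in m dimensions, which is where the O(|vocabulary|) -> O(|embedding|) improvement on the next slides comes from.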
6 Approach: Computational Efficiency
- Time complexity: O(|vocabulary|) -> O(|embedding|), which is negligible
- Trainable parameter size: hundreds of millions -> 0 (an 80% parameter reduction for ELMo)
- Related work: sampling, adaptive softmax, subword vocabularies
- Significant efficiency improvement over these existing methods
7 Approach: Computational Efficiency
Efficiency improvement of the output layer:
- Time complexity: O(|vocabulary|) -> O(|embedding|), which is negligible
- Trainable parameter size: hundreds of millions -> 0, an 80% parameter reduction for ELMo (rough accounting below)
Efficiency improvement for the entire model:
- Reduced optimizer overhead, GPU memory consumption, and communication cost
- ELMo training: 14 days x 3 GPUs -> 2.5 days x 4 GPUs
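A back-of-the-envelope check of the 80% parameter-reduction figure; the 800K vocabulary and 512-dimensional projection come from the slides, while the encoder size is an assumption for illustration only:

```python
# Rough parameter accounting for ELMo with and without the softmax output layer.
V, m = 800_000, 512                 # vocabulary size and projection dimension (from the slides)
softmax_params = V * m              # ~410M trainable parameters in the softmax output layer
encoder_params = 100_000_000        # assumed rough size of the rest of the model

total_softmax = softmax_params + encoder_params
total_continuous = encoder_params   # continuous output layer: 0 trainable output parameters
print(f"output-layer share of trainable parameters: {softmax_params / total_softmax:.0%}")  # roughly 80%
```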
8 Approach: Open-vocabulary Training
- The loss function with a continuous output layer needs w, the pre-trained word embedding of the target word.
- What if w is not in the vocabulary of the pre-trained embedding?
- Use an open-vocabulary word embedding such as FastText or MIMICK (Pinter et al., 2017), as sketched below.
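A minimal FastText-style sketch of how a word outside the vocabulary can still receive an embedding, by averaging hashed character n-gram vectors; the bucket count, n-gram range, and use of Python's built-in hash() are illustrative assumptions rather than the exact FastText or ELMo-C setup (MIMICK instead predicts the embedding with a character-level model):

```python
import torch

NUM_BUCKETS, DIM = 100_000, 300                # illustrative; FastText uses millions of buckets
ngram_vectors = torch.randn(NUM_BUCKETS, DIM)  # loaded from a pre-trained model in practice

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word padded with boundary markers, as in FastText."""
    padded = f"<{word}>"
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def oov_embedding(word):
    """Embed any string as the mean of its hashed character n-gram vectors."""
    idxs = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]  # hash() stands in for FastText's hashing
    return ngram_vectors[torch.tensor(idxs)].mean(dim=0)

vec = oov_embedding("unfathomableword")  # a (DIM,) vector for an out-of-vocabulary word
```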
9 Experiment
- All models pre-trained on the One Billion Word Benchmark for 10 epochs.
- ELMo-C, ELMo-A, and ELMo-Sub trained with exactly the same hyper-parameters.
- ELMo-A achieves a perplexity of 35.8, lower than the 39.7 of the original ELMo.
10 Experiment
[Table: training time (days x GPUs), batch size (per GPU), and trainable parameters of four ELMo variants.]
- ELMo-C is 4.2x faster and 6x more memory efficient than ELMo.
11 Experiment
[Table: training time (days x GPUs), batch size (per GPU), and trainable parameters of four ELMo variants.]
- ELMo-A and ELMo-Sub are more efficient than ELMo.
- ELMo-C is still 1.6x - 2.3x faster.
12 Experiment
[Table: performance on five downstream tasks, following the settings of the original ELMo.]
- ELMo-C is comparable with ELMo on four of the five tasks; SRL is the exception.
13 Experiment
[Table: performance on five downstream tasks, following the settings of the original ELMo.]
- ELMo-C rivals or outperforms ELMo-A and ELMo-Sub.
14 Analysis: The Continuous Output Layer with Different Sequence Encoders
[Table: time needed to finish training on one million words using 4 GPUs.]
- Consistent efficiency improvement over the other variants (1.44x - 8.31x), even when the sequence encoder is very large.
15 Conclusion
- Predicting the word embedding instead of computing a softmax accelerates ELMo training.
- The resulting model, ELMo-C, retains performance comparable to ELMo.
- The computational efficiency gain persists when the approach is applied to large Transformers.
- Code: https://github.com/uclanlp/ELMO-C