Efficient Training of BERT by Progressively Stacking
Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, Tie-Yan Liu
Peking University & Microsoft Research Asia
ICML | 2019
BERT: Effective Model with Huge Costs
• Model: 110M (Base) / 330M (Large) parameters
• Training data: 3.4B words (English Wikipedia + BookCorpus)
• Training: 128K tokens per batch × 1M updates
• Cost: 4 days on 4 TPUs, or 23 days on 4 Tesla P40 GPUs
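The "110M parameters" figure can be sanity-checked from the published BERT-Base hyperparameters (12 layers, hidden size 768, feed-forward size 3072, 30,522-token WordPiece vocabulary, 512 positions). The sketch below is a back-of-envelope count with an illustrative breakdown; it is not code from the paper.

```python
# Back-of-envelope check of the "110M parameters" figure, assuming the published
# BERT-Base hyperparameters. Illustrative sketch, not code from the paper.
def bert_base_param_count(vocab=30522, hidden=768, ffn=3072, layers=12, max_pos=512):
    # Token + position + segment embeddings, plus the embedding LayerNorm.
    embeddings = (vocab + max_pos + 2) * hidden + 2 * hidden
    # Per layer: Q/K/V/output projections (+ biases) and the attention LayerNorm ...
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # ... plus the two feed-forward projections (+ biases) and the output LayerNorm.
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden + 2 * hidden
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

print(bert_base_param_count())  # 109,482,240 -> the ~110M on the slide
```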
Attention Distributions of BERT
• Attention concentrates on neighboring tokens and the [CLS] token
• The attention distributions of high-level layers are similar to those of low-level layers
(figure: attention distributions by layer)
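The kind of comparison this slide visualizes can be reproduced with a short script. The sketch below assumes the HuggingFace transformers package and SciPy; the choice of layers and the Jensen-Shannon measure are illustrative, not necessarily the paper's exact protocol.

```python
# Compare attention distributions of a low-level and a high-level BERT layer.
# Layer indices and the JS-divergence measure are illustrative choices.
import torch
from scipy.spatial.distance import jensenshannon
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Progressively stacking makes BERT training faster.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # one (1, heads, seq, seq) tensor per layer

low, high = attentions[1][0], attentions[10][0]    # a low-level vs. a high-level layer
# Mean Jensen-Shannon divergence over heads and query positions
# (each row of an attention map is a probability distribution; 0 = identical).
div = sum(
    jensenshannon(low[h, q].numpy(), high[h, q].numpy()) ** 2
    for h in range(low.shape[0]) for q in range(low.shape[1])
) / (low.shape[0] * low.shape[1])
print(f"mean JS divergence, layer 2 vs. layer 11 attention: {div:.4f}")
```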
Stacking
Progressively Stacking
(figure: the stacking operation applied repeatedly to grow a shallow model into the full-depth BERT)
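A minimal PyTorch sketch of the stacking step these slides describe: a trained shallow encoder is doubled in depth by copying its layers on top of themselves, and training then continues. The helper names, the use of nn.TransformerEncoderLayer, and the 3 → 6 → 12 schedule shown here are illustrative; the paper's released code is at https://github.com/gonglinyuan/StackingBERT.

```python
# Minimal sketch of progressive stacking: double the encoder depth by copying
# the trained bottom layers on top, then keep training. Names and schedule are
# illustrative, not the paper's released implementation.
import copy
import torch.nn as nn

def make_encoder(num_layers, d_model=768, nhead=12, ffn=3072):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=ffn)
    return nn.ModuleList([copy.deepcopy(layer) for _ in range(num_layers)])

def stack(layers: nn.ModuleList) -> nn.ModuleList:
    """Double the depth: the new top half is a weight copy of the trained bottom half."""
    return nn.ModuleList(list(layers) + [copy.deepcopy(l) for l in layers])

encoder = make_encoder(num_layers=3)   # 1) train this shallow model for a while ...
encoder = stack(encoder)               # 2) ... 3 -> 6 layers, continue training ...
encoder = stack(encoder)               # 3) ... 6 -> 12 layers, train to convergence
print(len(encoder))                    # 12
```

Because the copied upper layers start from attention patterns that already resemble those of the lower layers (previous slide), the deeper model begins from a useful initialization rather than from scratch.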
Result
• ~25% of training time saved at comparable performance (figure: training curves)
Result (GLUE benchmark)

Model       CoLA  SST-2  MRPC       STS-B      QQP        MNLI       QNLI  RTE   GLUE
BERT-Base   52.1  93.5   88.9/84.8  87.1/85.8  71.2/89.2  84.6/83.4  90.5  66.4  78.3
Stacking    56.2  93.9   88.2/83.9  84.2/82.5  70.4/88.7  84.4/84.2  90.1  67.0  78.4
Takeaways
• Progressively stacking makes BERT training efficient
• Code: https://github.com/gonglinyuan/StackingBERT
• Poster #50
• Towards a better understanding of the Transformer:
  • Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View, https://arxiv.org/pdf/1906.02762.pdf
  • Code and model checkpoints: https://github.com/zhuohan123/macaron-net