Improving Transformer Optimization Through Better Initialization
Xiao Shi Huang*, Felipe Perez*, Jimmy Ba, Maksims Volkovs
Agenda
- Transformer in Detail
- Removing Warmup: T-Fixup
- Experimental Results
- Summary
Transformer
- Encoder-decoder architecture
- Residual backbone
- Multi-headed attention in each residual block
- LayerNorm after every residual block (see the sketch below)
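For reference, a minimal PyTorch sketch of one such post-LN residual block (an illustrative implementation; the module names and the dimensions d_model=512, n_heads=8, d_ff=2048 are common defaults, not values from the talk):

```python
import torch
import torch.nn as nn

class PostLNEncoderLayer(nn.Module):
    """One encoder layer with LayerNorm applied after each residual block."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual block 1: multi-headed self-attention, then LayerNorm.
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        # Residual block 2: position-wise feedforward, then LayerNorm.
        x = self.ln2(x + self.ff(x))
        return x

# Usage: y = PostLNEncoderLayer()(torch.randn(2, 10, 512))
```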
Training
- Adam optimizer
- Inverse square root learning rate decay
- Learning rate warmup (see the sketch below)
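For concreteness, a sketch of that schedule, i.e. linear warmup followed by inverse square root decay as in the original Transformer recipe; the constants (d_model=512, warmup_steps=4000, betas=(0.9, 0.98), eps=1e-9) are the commonly used defaults, not values stated on the slide:

```python
import torch

def inverse_sqrt_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup for warmup_steps steps, then inverse square root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(512, 512)  # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# With base lr = 1.0, the lambda's return value is the actual learning rate.
sched = torch.optim.lr_scheduler.LambdaLR(opt, inverse_sqrt_lr)
```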
Necessity of Warmup
- Gradient histogram [figure]
Necessity of Warmup
LayerNorm in backpropagation [2]:

$$\left\| \frac{\partial\, \mathrm{LN}(x)}{\partial x} \right\| = O\!\left( \frac{\sqrt{d}}{\|x\|} \right)$$

- x: input to Layer Normalization
- d: dimension of x

The error signal decreases when the input is large (see the check below).
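A quick numerical illustration of that bound (an illustrative sketch, not code from the paper): scaling up the input to LayerNorm shrinks the norm of its Jacobian roughly in proportion to 1/||x||.

```python
import torch
import torch.nn.functional as F

d = 512
torch.manual_seed(0)
x = torch.randn(d)

# Frobenius norm of the LayerNorm Jacobian decreases as the input norm grows,
# consistent with the O(sqrt(d) / ||x||) bound above.
for scale in [1.0, 4.0, 16.0]:
    xs = scale * x
    jac = torch.autograd.functional.jacobian(lambda inp: F.layer_norm(inp, (d,)), xs)
    print(f"||x|| = {xs.norm().item():7.1f}   ||dLN(x)/dx|| = {jac.norm().item():.4f}")
```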
Removing Warmup
- Without LayerNorm: magnitude on the backbone grows with layer depth (illustrated in the sketch below)
- With LayerNorm: magnitude is reset to unit scale after every block
- Goal: parameter-controlled growth instead
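An illustrative toy simulation of this effect (random linear maps stand in for the residual blocks; this is not the actual transformer computation): without normalization the backbone norm grows exponentially with depth, while post-LN resets it to roughly sqrt(d) after every block.

```python
import torch

d, depth = 512, 24
torch.manual_seed(0)
x_plain = torch.randn(d)
x_ln = x_plain.clone()

def layer_norm(v):
    # Plain (parameter-free) layer normalization.
    return (v - v.mean()) / v.std()

for l in range(depth):
    W = torch.randn(d, d) / d ** 0.5          # random block with roughly unit gain
    x_plain = x_plain + W @ x_plain           # residual update, no normalization
    x_ln = layer_norm(x_ln + W @ x_ln)        # post-LN: renormalize after the residual add
    if (l + 1) % 6 == 0:
        print(f"layer {l + 1:2d}   no-LN ||x|| = {x_plain.norm().item():10.1f}"
              f"   post-LN ||x|| = {x_ln.norm().item():6.1f}")
```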
Removing Warmup
Goal: control the total change in the output of the transformer after a gradient update.
Control the output change in each residual block:
- Feedforward blocks: handled as in Fixup [3]
- Theorem: for attention blocks, the change is controlled when the initialization satisfies the norm bound derived in the paper, which yields the T-Fixup scaling on the next slide
Removing Warmup: T-Fixup Initialization
- Xavier initialization for all projection matrices
- Gaussian initialization for embedding layers
- Scale embedding layers and decoder parameters by (9N)^(-1/4)
- Scale encoder parameters by 0.67 N^(-1/4)
(N: number of transformer layers; a sketch follows below)
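A minimal PyTorch sketch of these steps. The helper name is hypothetical, exactly which weight matrices count as "encoder/decoder parameters" is an assumption here, and the embedding variance of d^(-1/2) is taken from the paper rather than the slide:

```python
import torch
import torch.nn as nn

def t_fixup_init(embed, decoder_mats, encoder_mats, num_layers):
    """Hypothetical helper applying the T-Fixup steps listed above.

    embed: nn.Embedding for the input tokens.
    decoder_mats / encoder_mats: lists of nn.Linear projections belonging to
    the decoder and encoder blocks (the exact set is an assumption; see the paper).
    """
    d = embed.embedding_dim
    # Gaussian init for embeddings; std = d^(-1/4), i.e. variance d^(-1/2) (assumption from the paper).
    nn.init.normal_(embed.weight, mean=0.0, std=d ** -0.25)
    # Scale embeddings and decoder parameters by (9N)^(-1/4).
    embed.weight.data.mul_((9 * num_layers) ** -0.25)
    for lin in decoder_mats:
        nn.init.xavier_uniform_(lin.weight)              # Xavier for projection matrices
        lin.weight.data.mul_((9 * num_layers) ** -0.25)
    # Scale encoder parameters by 0.67 * N^(-1/4).
    for lin in encoder_mats:
        nn.init.xavier_uniform_(lin.weight)
        lin.weight.data.mul_(0.67 * num_layers ** -0.25)
```

With this initialization, the point of the scheme is that warmup (and LayerNorm) can be dropped.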
Experimental Results
T-Fixup on Standard Transformer
- T-Fixup achieves consistently higher performance with less structure (no warmup, no LayerNorm)
T-Fixup on Standard Transformer: Gradients
- Gradient and Adam update magnitudes
- Vanilla transformer without warmup: vanishing gradient
- T-Fixup without warmup: stable error signal throughout training
T-Fixup on Deeper Transformer
- T-Fixup outperforms all competitive models with an equal or smaller number of layers
T-Fixup on Ultra-Deep Transformer
- IWSLT’14 De-En dataset; Transformer with embedding dim 64, MLP hidden dim 128, 2 attention heads
T-Fixup on Large Batch Training
- WMT’17 En-De dataset, WMT base Transformer
Summary
Summary
- The requirement for learning rate warmup arises from the combination of Adam and LayerNorm
- T-Fixup initialization removes the need for warmup
- Superior performance on NMT
- Enables ultra-deep Transformers
- Future work
Acknowledgements
Thank you! Questions?
Contact: Xiao Shi (Gary) Huang, gary@layer6.ai
References
[1] Liu, L. et al. On the variance of the adaptive learning rate and beyond. In ICLR, 2020.
[2] Xiong, R. et al. On layer normalization in the transformer architecture. In ICML, 2020.
[3] Zhang, H. et al. Fixup initialization: residual learning without normalization. In ICLR, 2019.
[4] Wang, Q. et al. Learning deep transformer models for machine translation. In ACL, 2019.
[5] Zhang, B. et al. Improving deep transformer with depth-scaled initialization and merged attention. In EMNLP, 2019.
[6] Xu, H. et al. Why deep transformers are difficult to converge? From computation order to Lipschitz restricted parameter initialization. arXiv preprint.