

  1. Hint-Based Training for Non-Autoregressive Translation
     Zhuohan Li, Zi Lin, Fei Tian, Tao Qin, Liwei Wang, Tie-Yan Liu, Di He
     EMNLP-IJCNLP 2019

  2. Autoregressive MT models
     [Diagram: a Transformer. Encoder: ×M layers of multi-head self-attention and FFN over the embeddings of x1, x2, x3. Decoder: ×N layers of masked multi-head self-attention, encoder-to-decoder attention over the encoder context, and FFN; it consumes the shifted targets <sos>, y1, y2, y3 and emits y1, y2, y3, y4 through per-position softmax layers.]
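     A minimal greedy-decoding sketch of why this is slow at inference time: the decoder must be re-run once per output token. The model.encode / model.decode interface and the token ids are assumed for illustration, not the paper's code.

        import torch

        def autoregressive_decode(model, src, sos_id, eos_id, max_len=64):
            # Encoder runs once; the decoder runs serially, once per token,
            # because token t is conditioned on tokens 1..t-1.
            memory = model.encode(src)                          # assumed interface
            ys = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
            for _ in range(max_len):
                logits = model.decode(ys, memory)               # (batch, t, vocab)
                next_tok = logits[:, -1].argmax(-1, keepdim=True)
                ys = torch.cat([ys, next_tok], dim=1)           # append and repeat
                if (next_tok == eos_id).all():
                    break
            return ys[:, 1:]                                    # strip <sos>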

  3. Non-autoregressive MT models
     [Diagram: the same ×M-layer encoder. Decoder: ×N layers of (unmasked) multi-head self-attention, positional attention, encoder-to-decoder attention, and FFN; its inputs are copied source embeddings of x1, x2, x3 rather than shifted targets, so y1, y2, y3, y4 are produced in parallel through the softmax.]
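     By contrast, the non-autoregressive decoder runs once for all positions. A sketch under the same assumed interface; copy_source_embeddings is a hypothetical stand-in for the "Copy" box in the diagram (source embeddings copied to the target length):

        import torch

        def non_autoregressive_decode(model, src, tgt_len):
            # No shifted targets and no causal mask: every output position
            # is predicted in a single parallel decoder pass.
            memory = model.encode(src)                           # assumed interface
            dec_in = model.copy_source_embeddings(src, tgt_len)  # hypothetical helper
            logits = model.decode(dec_in, memory)                # one pass: (batch, tgt_len, vocab)
            return logits.argmax(-1)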

  4. Previous works on non-autoregressive MT
     [Diagram: the non-autoregressive architecture annotated with prior fixes: fertilities predicted on the encoder side [Gu et al.], and the decoder applied iteratively ×R [Lee et al.].]

  5. Quality-speedup trade-off
     [Scatter plot: speedup (0x to 18x) vs. BLEU score (15 to 29) on WMT14 En-De. Points: several variants of Gu et al., Kaiser et al., Lee et al., and the autoregressive baseline.]

  6. Hidden states similarity
     [Heat maps, left: autoregressive model; right: non-autoregressive model. Hidden states cosine-similarity of a sampled sentence in IWSLT14 De-En.]
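     The heat map in this figure is the matrix of pairwise cosine similarities between one layer's hidden states. A small PyTorch sketch of how such a matrix can be computed, assuming the hidden states of one sentence are available as a (T, d) tensor:

        import torch
        import torch.nn.functional as F

        def pairwise_cosine(hidden):
            # hidden: (T, d) hidden states of one sentence at one layer.
            # Row-normalize, then one matmul gives all T x T cosine similarities.
            h = F.normalize(hidden, dim=-1)
            return h @ h.t()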

  7. Attention distribution
     [Heat maps, left: autoregressive model; right: non-autoregressive model. Encoder-to-decoder attention distribution of an informative head of a sampled sentence from IWSLT14 De-En.]

  8. Hint-based training from autoregressive teacher to non-autoregressive student
     [Diagram: the autoregressive model (left) and the non-autoregressive model (right) side by side; hints flow from the teacher's decoder into the training of the student's decoder.]

  9. Hint-based training from autoregressive teacher to non-autoregressive student
     • Hints on hidden states
     • Hints on word alignments

  10. Hints on hidden states
      • Direct regression on the teacher's hidden states fails because of the discrepancy between the two models.
      • Instead, we penalize student hidden states that are highly similar to each other:

        $\mathcal{L}_{\mathrm{hid}} = \dfrac{2}{M\,T_y(T_y-1)} \sum_{m=1}^{M} \sum_{t=1}^{T_y} \sum_{u=t+1}^{T_y} \phi\big(e^{\mathrm{stu}}_{m,t,u},\, e^{\mathrm{tea}}_{m,t,u}\big)$

      • $e^{\mathrm{stu}}_{m,t,u}$ and $e^{\mathrm{tea}}_{m,t,u}$ are the cosine similarities between the $t$-th and $u$-th hidden states at layer $m$ of the student and teacher models, respectively.
      • $\phi$ is a fixed function that penalizes only when the student's hidden states are similar while the teacher's are not.
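      One plausible implementation of this loss, as a sketch: phi is realized here as a thresholded hinge with assumed thresholds gamma_s and gamma_t (the paper's exact phi may differ), firing only on position pairs that the teacher keeps distinct but the student collapses together:

        import torch
        import torch.nn.functional as F

        def hidden_hint_loss(stu_states, tea_states, gamma_s=0.9, gamma_t=0.5):
            # stu_states / tea_states: lists of (T, d) per-layer decoder hidden states.
            total, pairs = 0.0, 0
            for hs, ht in zip(stu_states, tea_states):
                es = F.normalize(hs, dim=-1) @ F.normalize(hs, dim=-1).t()
                et = F.normalize(ht, dim=-1) @ F.normalize(ht, dim=-1).t()
                # Penalize pairs (t, u) where the teacher's states are dissimilar
                # (low e^tea) but the student's are similar (high e^stu).
                mask = (et <= gamma_t).float().triu(diagonal=1)
                total = total + (F.relu(es - gamma_s) * mask).sum()
                pairs += hs.size(0) * (hs.size(0) - 1) // 2
            return total / max(pairs, 1)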

  11. Hints on word alignments
      • We minimize the KL divergence between the per-head encoder-to-decoder attention distributions of the student and teacher models:

        $\mathcal{L}_{\mathrm{align}} = \dfrac{1}{M\,H\,T_y} \sum_{m=1}^{M} \sum_{t=1}^{T_y} \sum_{h=1}^{H} D_{\mathrm{KL}}\big(\alpha^{\mathrm{tea}}_{m,t,h} \,\|\, \alpha^{\mathrm{stu}}_{m,t,h}\big)$

      • Total loss: $\mathcal{L} = \mathcal{L}_{\mathrm{nll}} + \mu\,\mathcal{L}_{\mathrm{hid}} + \nu\,\mathcal{L}_{\mathrm{align}}$
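      A matching sketch for the alignment hint and the combined objective. It reuses hidden_hint_loss from the previous sketch; the KL direction (teacher as reference) and the weights mu, nu are assumptions, not the paper's tuned values:

        import torch

        def alignment_hint_loss(stu_attn, tea_attn, eps=1e-9):
            # stu_attn / tea_attn: (M, H, T_y, T_x) encoder-to-decoder attention
            # distributions (each row sums to 1). Per-head KL(teacher || student),
            # averaged over layers, heads, and target positions.
            kl = (tea_attn * ((tea_attn + eps).log() - (stu_attn + eps).log())).sum(-1)
            return kl.mean()

        def total_loss(nll, stu_states, tea_states, stu_attn, tea_attn, mu=0.5, nu=0.5):
            # L = L_nll + mu * L_hid + nu * L_align
            return (nll
                    + mu * hidden_hint_loss(stu_states, tea_states)
                    + nu * alignment_hint_loss(stu_attn, tea_attn))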

  12. Experimental settings
      • Datasets: WMT14 En-De, WMT14 De-En, IWSLT14 De-En
      • Models: Transformer-base, Transformer-small
      • Inference: non-autoregressive model; non-autoregressive model with teacher reranking

  13. Experimental results

  14. Quality-speedup trade-off
      [Scatter plot: speedup (0x to 35x) vs. BLEU score (15 to 29) on WMT14 En-De. Points: Ours and Ours (with reranking) alongside Gu et al., Kaiser et al., Lee et al., and the autoregressive baseline.]

  15. Hidden states similarity
      [Heat maps: autoregressive model; non-autoregressive model without hints; non-autoregressive model with hints. Hidden states cosine-similarity of a sampled sentence in IWSLT14 De-En.]

  16. Attention distribution
      [Heat maps: autoregressive model; non-autoregressive model without hints; non-autoregressive model with hints. Encoder-to-decoder attention distribution of an informative head of a sampled sentence from IWSLT14 De-En.]

  17. Ablation studies
      Ablation studies on IWSLT14 De-En. Results are BLEU scores without teacher rescoring.

  18. Summary
      Instead of adding new modules that slow the model down, we propose leveraging hints from a well-trained autoregressive teacher to guide the training of the non-autoregressive model.

  19. Thanks! Q&A
      Zhuohan Li (zhuohan@cs.berkeley.edu)
