Beyond sequential decoding, toward parallel decoding: in the context of neural sequence modelling


  1. Beyond sequential decoding, toward parallel decoding: in the context of neural sequence modelling.
  Kyunghyun Cho, New York University & Facebook AI Research. Joint work with Jason Lee and Elman Mansimov.

  2. Neural sequence modeling
  • An arbitrary input X: e.g., another sequence, an image, a video, …
  • A sequence output Y = (y_1, y_2, ..., y_T), e.g., a natural language sentence, with discrete tokens y_t ∈ V
  • Use a neural network to estimate a distribution over sequences: p_θ(Y | X)
  • Machine translation, automatic speech recognition, …

  3. Neural autoregressive sequence modeling
  • Unlike classification, there are complex, strong dependencies among the y_t's
  • "More than half of residents in Korea speak ______."
  • Among millions of possible tokens, only one word (Korean) is likely above.
  • Neural autoregressive sequence modelling: p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X)
  • It explicitly models these dependencies.

  4. Neural autoregressive sequence modeling
  • This has become a de facto standard in recent years, e.g., in neural machine translation
  • Neural autoregressive sequence modelling: p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X) (a scoring sketch follows below)
  • Various options: recurrent nets, convolutional nets, transformers
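
The factorization above can be made concrete in a few lines. This is a minimal sketch, assuming a hypothetical `step_logprobs` callable that returns log p(· | y_<t, X) over the vocabulary; it stands in for any recurrent, convolutional, or transformer decoder and is assumed to capture the conditioning on X internally:

```python
from typing import Callable, Mapping, Sequence

def sequence_logprob(
    step_logprobs: Callable[[Sequence[int]], Mapping[int, float]],
    y: Sequence[int],
) -> float:
    """Chain-rule scoring with teacher forcing:
    log p(Y | X) = sum_t log p(y_t | y_<t, X)."""
    total = 0.0
    for t in range(len(y)):
        # condition on the ground-truth prefix y_<t and add log p(y_t | y_<t, X)
        total += step_logprobs(y[:t])[y[t]]
    return total
```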

  5. Neural autoregressive sequence modeling
  • Decoding is problematic (a greedy-decoding sketch follows below):
  1. Exact decoding is intractable: arg max_Y ∏_{t=1}^{T} p(y_t | y_{<t}, X) = ?
  2. Decoding is inherently sequential: O(kT)
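
To make the second problem concrete, here is a minimal greedy-decoding sketch: each step must wait for the previous token, so generating T tokens costs T dependent decoder calls (O(kT) with a beam of size k). The `step_logprobs` interface and the `BOS`/`EOS` ids are assumptions for illustration, not part of the original slides:

```python
from typing import Callable, List, Mapping, Sequence

BOS, EOS = 0, 1  # assumed special token ids

def greedy_decode(
    step_logprobs: Callable[[Sequence[int]], Mapping[int, float]],
    max_len: int = 100,
) -> List[int]:
    """Sequential greedy decoding: token t cannot be produced before token t-1."""
    y = [BOS]
    for _ in range(max_len):
        dist = step_logprobs(y)        # p(y_t | y_<t, X): one decoder call per token
        y_t = max(dist, key=dist.get)  # greedy argmax over the vocabulary
        y.append(y_t)
        if y_t == EOS:
            break
    return y[1:]
```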

  6. Neural non-autoregressive sequence modeling
  • Conditional independence among the y_t's: p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X) becomes p(Y | X) = ∏_{t=1}^{T} p(y_t | X)
  • Exact decoding is tractable: ŷ_t = arg max_{y_t} p(y_t | X)
  • Decoding is highly parallelizable (sketch below)
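
Under the fully factorized model, decoding collapses to one argmax per position, and all positions can be handled in a single batched operation. A minimal sketch, assuming PyTorch and a hypothetical tensor of per-position logits produced by one decoder forward pass:

```python
import torch

def parallel_decode(position_logits: torch.Tensor) -> torch.Tensor:
    """With p(Y | X) = prod_t p(y_t | X), each position is decoded independently:
    one argmax over a (T, vocab_size) logit tensor yields the whole sequence."""
    return position_logits.argmax(dim=-1)

# e.g. parallel_decode(torch.randn(5, 32000)) returns 5 token ids at once
```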

  7. Neural non-autoregressive sequence modeling
  • Too good to be true: the dependencies must be modelled somehow
  • Introduce a set of latent variables [Gu et al., 2018 ICLR]: p(Y | X) = Σ_Z p(Z | X) ∏_{t=1}^{T} p(y_t | Z, X)

  8. Neural non-autoregressive sequence modeling
  • Repetition (fertility) as a latent variable [Gu et al., 2018 ICLR]
  • Each latent variable z_t: the number of repetitions of the input symbol x_t
  • |Z| = |X|; e.g., z_1 = 1 and z_2 = 3 expand (x_1, x_2) into (x_1, x_2, x_2, x_2) (sketch below)
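
The fertility interpretation is easy to state in code. A sketch of how the repetition counts expand the source into a decoder input of the target length; the token strings are placeholders:

```python
from typing import List, Sequence

def expand_by_fertility(x: Sequence[str], z: Sequence[int]) -> List[str]:
    """Each z_t gives the number of times source symbol x_t is copied into the
    decoder input, so |Z| = |X| and the output length is sum_t z_t."""
    assert len(x) == len(z)
    expanded: List[str] = []
    for token, fertility in zip(x, z):
        expanded.extend([token] * fertility)  # repeat x_t exactly z_t times
    return expanded

# Matching the slide's example, z_1 = 1 and z_2 = 3:
# expand_by_fertility(["x1", "x2"], [1, 3]) -> ["x1", "x2", "x2", "x2"]
```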

  9. Neural non-autoregressive sequence modeling
  • Repetition as a latent variable [Gu et al., 2018 ICLR]
  • Each latent variable z_t: the number of repetitions of the input symbol x_t
  • Monte Carlo approximation with rescoring (sketch below):
  1. Sample Z^m ∼ p(Z | X) and decode Y^m = arg max_Y p(Y | Z^m, X)
  2. Pick the Y^m with the highest score under another model.
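
A sketch of the sample-and-rescore procedure; the three callables are assumptions standing in for the fertility predictor, the parallel decoder, and the external scoring model:

```python
from typing import Callable, List, Sequence

def mc_rescoring_decode(
    sample_fertilities: Callable[[], Sequence[int]],       # draws Z^m ~ p(Z | X)
    decode_given_z: Callable[[Sequence[int]], List[str]],  # argmax_Y p(Y | Z^m, X), fully parallel
    rescore: Callable[[List[str]], float],                 # score from a separate (e.g. autoregressive) model
    num_samples: int = 10,
) -> List[str]:
    """Sample several fertility sequences, decode each candidate in parallel,
    and keep the candidate the rescoring model prefers."""
    scored = []
    for _ in range(num_samples):
        z_m = sample_fertilities()
        y_m = decode_given_z(z_m)
        scored.append((rescore(y_m), y_m))
    return max(scored, key=lambda pair: pair[0])[1]
```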

  10. Neural non-autoregressive sequence modeling
  • Repetition as a latent variable [Gu et al., 2018 ICLR]
  • Each latent variable z_t: the number of repetitions of the input symbol x_t
  • For training, use an auxiliary task to supervise p(Z | X), e.g., word alignment models

  11. Neural non-autoregressive sequence modeling
  • First convincing result! [Gu et al., 2018 ICLR]
  • IWSLT'16 En→De:
  Non-autoregressive?  Decoding               BLEU   Sentence latency (ms)
  No                   Greedy                 28.89  408
  No                   Beam search (4)        29.70  607
  Yes                  argmax                 25.20  39
  Yes                  MC + rescoring (10)    27.44  79
  Yes                  MC + rescoring (100)   28.16  257

  12. Non-autoregressive modeling by iterative refinement [Lee, Mansimov & Cho, 2018]
  • Any alternative interpretation of the latent variables in p(Y | X) = Σ_Z p(Z | X) ∏_{t=1}^{T} p(y_t | Z, X)?
  • Let the latent variables share the output semantics: they use the same vocabulary, z_t ∈ V and y_t ∈ V
  • Multiple layers of latent variables: p(Y | X) = Σ_{Z^1, ..., Z^L} (∏_{t=1}^{T} p(y_t | Z^L, X)) (∏_{t=1}^{T} p(z^L_t | Z^{L-1}, X)) ··· (∏_{t=1}^{T} p(z^1_t | X))
  • Shared conditional distributions across layers

  13. Non-autoregressive modeling by iterative refinement
  • Generative story: iterative refinement (a decoding sketch follows below)
  1. Refine*: generate an intermediate translation Y^l given the previous translation Y^{l-1} and the source sentence X
  2. Repeat step 1 for L iterations (or until convergence)
  * As the latent variables share their semantics with the output, we can use Z and Y interchangeably.
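
A minimal sketch of this generative story as a decoding loop; `initial_guess` and `refine` are hypothetical callables wrapping the parallel decoder, both conditioned on X internally:

```python
from typing import Callable, List

def iterative_refinement_decode(
    initial_guess: Callable[[], List[str]],    # Y^0 from the non-autoregressive decoder
    refine: Callable[[List[str]], List[str]],  # Y^l = refine(Y^{l-1})
    max_iters: int = 10,
) -> List[str]:
    """Start from a parallel first guess and repeatedly re-decode all positions given
    the previous translation, for `max_iters` steps or until the output stops changing."""
    y = initial_guess()
    for _ in range(max_iters):
        y_new = refine(y)
        if y_new == y:  # converged: a fixed point of the refinement step
            break
        y = y_new
    return y
```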

  14. Non-autoregressive modeling by iterative refinement
  Input X: (image)
  Y^1: a yellow bus parked on parked in of parking road .
  Y^2: a yellow and black on parked in a parking lot .
  Y^3: a yellow and black bus parked in a parking lot .
  Y^4: a yellow and black bus parked in a parking lot .

  15. Non-autoregressive modeling by iterative refinement
  • Training 1: lower-bound maximization
  • The output of every iteration is encouraged to match the correct answer: a cross-entropy (CE) loss against the reference tokens y*_1, ..., y*_T is applied at each refinement iteration (sketch below)
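
A sketch of this objective, assuming PyTorch and a hypothetical list of per-iteration logits; every refinement step gets its own cross-entropy term against the reference:

```python
import torch
import torch.nn.functional as F

def refinement_ce_loss(per_iteration_logits, target: torch.Tensor) -> torch.Tensor:
    """Sum of cross-entropy losses, one per refinement iteration.

    per_iteration_logits: list of (T, vocab_size) logit tensors, one per iteration.
    target: (T,) tensor of reference token ids y*_1, ..., y*_T.
    """
    return sum(F.cross_entropy(logits, target) for logits in per_iteration_logits)
```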

  16. Non-autoregressive modeling by iterative refinement
  • Training 2: conditional denoising autoencoder
  • A denoising autoencoder learns to hill-climb in log-probability [Alain & Bengio, 2013]
  • Corrupt the reference with a corruption function C, Ŷ = C(Y, X), and train the refinement step to reconstruct Y from Ŷ with CE losses against y*_1, ..., y*_T, so that each refinement step moves its input toward higher-probability outputs (a corruption sketch follows below)
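
The exact corruption function is not spelled out on the slide, so the sketch below uses a simple illustrative corruption (random token replacement and adjacent swaps); the reconstruction target is always the clean reference Y:

```python
import random
from typing import List

def corrupt(y: List[int], vocab_size: int, p: float = 0.3) -> List[int]:
    """Illustrative corruption C(Y): each position is, with small probability,
    replaced by a random vocabulary token or swapped with its right neighbour."""
    y = list(y)  # do not modify the caller's reference sequence
    for t in range(len(y)):
        r = random.random()
        if r < p / 2:
            y[t] = random.randrange(vocab_size)  # replace with a random token
        elif r < p and t + 1 < len(y):
            y[t], y[t + 1] = y[t + 1], y[t]      # swap adjacent tokens
    return y

# Training: feed corrupt(Y) into the refinement step and apply cross-entropy
# against the clean Y, i.e. a conditional denoising autoencoder.
```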

  17. Non-autoregressive modeling by iterative refinement
  • Lower-bound maximization & conditional denoising: a mixed training objective
  • Consider L + 1 iterations.
  • At each iteration, stochastically choose one of the two objectives (sketch below).
  • Joint training from scratch
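
One way to realize the mixed objective is sketched below; `model`, `model.initial_guess`, `corrupt`, and `ce_loss` are hypothetical stand-ins for the shared refinement network, its parallel first guess, the corruption function, and a per-iteration cross-entropy:

```python
import random

def mixed_training_step(y_ref, x, model, corrupt, ce_loss,
                        num_iters: int = 4, denoise_prob: float = 0.5):
    """Run the refinement chain for a few iterations; at each one, stochastically feed
    either the model's own previous output (lower-bound term) or a corrupted reference
    (denoising term). Both are scored against the clean reference."""
    prev = model.initial_guess(x)          # Y^0 from the parallel decoder (assumed API)
    total_loss = 0.0
    for _ in range(num_iters):
        if random.random() < denoise_prob:
            inp = corrupt(y_ref)           # denoising-autoencoder input
        else:
            inp = prev                     # own prediction: lower-bound maximization
        logits = model(inp, x)             # one refinement step, conditioned on X
        total_loss = total_loss + ce_loss(logits, y_ref)
        prev = logits.argmax(dim=-1)       # next iteration's input (no gradient through argmax)
    return total_loss
```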

  18. Experiments – Machine Translation
  • En↔Ro (WMT'16): low-resource machine translation
  Non-autoregressive?  Decoding   En→Ro (BLEU)  Ro→En (BLEU)  Speed CPU (toks/sec)  Speed GPU (toks/sec)
  No                   Greedy     31.93         31.55         15.7                  55.6
  No                   Beam (4)   32.40         32.06         7.3                   43.3
  Yes                  Iter 1     24.45         25.73         98.6                  694.2
  Yes                  Iter 2     27.10         28.15         62.8                  332.7
  Yes                  Iter 5     28.86         29.72         29.0                  194.4
  Yes                  Iter 10    29.32         30.19         14.8                  93.1
  Yes                  Adaptive   29.66         30.30         16.5                  226.6
  • 91.5% of the autoregressive translation quality with up to a 4x decoding speed-up (on GPU); a sketch of an adaptive stopping rule follows below
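
The "Adaptive" rows stop refining once consecutive outputs barely change, which is why their speed sits between the fixed-iteration settings. A sketch, with the similarity test (a token-set Jaccard distance and its threshold) chosen here purely for illustration:

```python
from typing import Callable, List

def jaccard_distance(a: List[str], b: List[str]) -> float:
    """Token-set Jaccard distance between two candidate translations."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def adaptive_refine(initial_guess: Callable[[], List[str]],
                    refine: Callable[[List[str]], List[str]],
                    max_iters: int = 10, threshold: float = 0.05) -> List[str]:
    """Keep refining until two consecutive outputs are nearly identical, so easy
    inputs use few refinement iterations and hard ones use more."""
    y = initial_guess()
    for _ in range(max_iters):
        y_new = refine(y)
        if jaccard_distance(y, y_new) < threshold:
            return y_new
        y = y_new
    return y
```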

  19. Experiments – Machine Translation
  • En↔De (WMT'15): moderate-scale machine translation
  Non-autoregressive?  Decoding   En→De (BLEU)  De→En (BLEU)  Speed CPU (toks/sec)  Speed GPU (toks/sec)
  No                   Greedy     23.40         26.49         15.8                  53.6
  No                   Beam (4)   24.12         27.05         6.7                   45.8
  Yes                  Iter 1     12.65         14.84         101.2                 536.5
  Yes                  Iter 2     15.03         17.15         56.4                  407.1
  Yes                  Iter 5     17.53         20.02         27.1                  173.4
  Yes                  Iter 10    18.48         21.10         13.1                  87.8
  Yes                  Adaptive   18.91         21.60         12.8                  90.9
  • 80% of the autoregressive translation quality with up to a 2x decoding speed-up (on GPU)

  20. Experiments – Machine Translation
  • Iterative refinement improves translation quality (almost) monotonically
  • Intermediate latent variables (translations) successfully capture the output dependencies.
  • Quality degradation with larger-scale data
  • Significant speed-up in decoding on GPU
  • Fits well with GTC!

  21. Experiments – Machine Translation
  Src: seitdem habe ich sieben Häuser in der Nachbarschaft mit den Lichtern versorgt und sie funktionieren wirklich gut .
  Iter 1: and I ’ve been seven homes since in neighborhood with the lights and they ’re really functional .
  Iter 4: and I ’ve been seven homes in neighborhood with the lights , and they ’re a really functional .
  Iter 8: and I ’ve been providing seven homes in the neighborhood with the lights and they ’re a really functional .
  Ref: since now , I ’ve set up seven homes around my community , and they ’re really working .
  Src: er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die Nachrichten meistens deprimierten .
  Iter 1: he looked very happy , which was pretty unusual the , because the news was were usually depressing .
  Iter 4: he looked very happy , which was pretty unusual at the , because news was mostly depressing .
  Iter 8: he looked very happy , which was pretty unusual at the time because the news was mostly depressing .
  Ref: there was a big smile on his face which was unusual then , because the news mostly depressed him .

  22. Experiments – Image Caption Generation
  • MS COCO: image caption generation
  Non-autoregressive?  Decoding   BLEU   Speed CPU (toks/sec)  Speed GPU (toks/sec)
  No                   Greedy     23.47  4.3                   2.1
  No                   Beam (4)   24.78  3.6                   1.0
  Yes                  Iter 1     20.12  17.1                  8.9
  Yes                  Iter 2     20.88  12.0                  5.7
  Yes                  Iter 5     21.12  6.2                   2.8
  Yes                  Iter 10    21.24  2.0                   1.2
  Yes                  Adaptive   21.12  10.8                  4.8
  • 85% of the autoregressive caption quality with up to a 5x decoding speed-up (on GPU)

  23. Non-autoregressive modeling by iterative refinement
  Input X: (image)
  Y^1: a woman standing on playing tennis on a tennis racquet .
  Y^2: a woman standing on a tennis court a tennis racquet .
  Y^3: a woman standing on a tennis court a a racquet .
  Y^4: a woman standing on a tennis court holding a racquet .

  24. Conclusion
  • Latent variables capture output dependencies more efficiently.
  • Different interpretations → different learning/decoding algorithms
  • Gu et al. [2018]: fertility → auxiliary supervision + noisy parallel decoding
  • Lee, Mansimov & Cho [2018]: iterative refinement → conditional denoising
  • Kaiser et al. [2018]: latent sequence → autoregressive inference
  • What else?
  • Generation quality closely tracks that of autoregressive models.
  • Decoding is significantly faster, especially on GPU.
  • Potentially even faster decoding with specialized hardware.
