Beyond Sequential Decoding: Toward Parallel Decoding in Neural Sequence Modelling
Kyunghyun Cho
New York University & Facebook AI Research
Joint work with Jason Lee and Elman Mansimov
Neural sequence modeling
• An arbitrary input X: e.g., another sequence, an image, a video, …
• A sequence output Y = (y_1, y_2, …, y_T)
  • e.g., a natural language sentence
  • Discrete y_t ∈ V
• Use a neural network to estimate a distribution over sequences: p_θ(Y | X)
• Machine translation, automatic speech recognition, …
Neural autoregressive sequence modeling
• Unlike classification, there are complex, strong dependencies among the y_t's
  • "More than half of residents in Korea speak ______."
  • Among millions of possible tokens, only one word (Korean) is likely above.
• Neural autoregressive sequence modelling:
  p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X)
• It explicitly models dependencies (decoding runs token by token, from <bos> to <eos>).
Neural autoregressive sequence modeling
• This has become a de facto standard in recent years
  • e.g., neural machine translation
• Neural autoregressive sequence modelling:
  p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X)
• Various options: recurrent nets, convolutional nets, transformers
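To make the factorization concrete, here is a minimal sketch of greedy autoregressive decoding (illustrative only, not code from the talk); `next_token_logits` is a hypothetical stand-in for any trained model of p(y_t | y_{<t}, X):

```python
import numpy as np

VOCAB_SIZE, BOS, EOS = 100, 0, 1
rng = np.random.default_rng(0)

def next_token_logits(prefix, x):
    """Stand-in for p(y_t | y_<t, X): any recurrent net, conv net, or transformer."""
    return rng.normal(size=VOCAB_SIZE)

def greedy_decode(x, max_len=20):
    y = [BOS]
    for _ in range(max_len):              # T strictly sequential steps
        logits = next_token_logits(y, x)
        y_t = int(np.argmax(logits))      # greedy choice of y_t given y_<t and X
        y.append(y_t)
        if y_t == EOS:
            break
    return y[1:]

print(greedy_decode(x="eine Quelle"))
```

Each token requires a full forward pass conditioned on everything generated so far, which is exactly the sequential bottleneck the next slide points out.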
Neural autoregressive sequence modeling
• Decoding is problematic:
  1. Exact decoding is intractable: arg max_Y ∏_{t=1}^{T} p(y_t | y_{<t}, X) = ?
  2. Decoding is inherently sequential: O(kT)
Neural non-autoregressive sequence modeling
• Conditional independence among the y_t's:
  p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X)   →   p(Y | X) = ∏_{t=1}^{T} p(y_t | X)
• Exact decoding is tractable: ŷ_t = arg max_{y_t} p(y_t | X)
• Decoding is highly parallelizable
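A minimal sketch of what conditional independence buys at decoding time (names are illustrative, not from the talk): every position is an independent arg max, so a single forward pass plus a position-wise arg max decodes the whole sequence.

```python
import numpy as np

VOCAB_SIZE = 100
rng = np.random.default_rng(0)

def all_position_logits(x, T):
    """Stand-in for the factorized model: one distribution p(y_t | X) per position."""
    return rng.normal(size=(T, VOCAB_SIZE))

def parallel_decode(x, T=8):
    logits = all_position_logits(x, T)    # one forward pass covers every position
    return np.argmax(logits, axis=-1)     # exact, position-wise, fully parallel arg max

print(parallel_decode(x="eine Quelle"))
```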
Neural non-autoregressive sequence modeling
• Too good to be true: dependencies must be modelled somehow
• Introduce a set of latent variables Z [Gu et al., 2018 ICLR]:
  p(Y | X) = Σ_Z p(Z | X) ∏_{t=1}^{T} p(y_t | Z, X)
Neural non-autoregressive sequence modeling
• Repetition as a latent variable [Gu et al., 2018 ICLR]
• Each latent variable z_t: # of repetitions of the input symbol x_t
• |Z| = |X|
  • (figure: with z_1 = 1 and z_2 = 3, the input x_1 x_2 is expanded to x_1 x_2 x_2 x_2 before predicting y_1 … y_4)
Neural non-autoregressive sequence modeling
• Repetition as a latent variable [Gu et al., 2018 ICLR]
• Each latent variable z_t: # of repetitions of the input symbol x_t
• Monte Carlo approximation with rescoring:
  1. Sample Z^m ~ p(Z | X), then Y^m = arg max_Y p(Y | Z^m, X)
  2. Pick the Y^m with the highest score under another model.
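A rough sketch of the Monte Carlo approximation with rescoring. All the helpers below are hypothetical placeholders: `sample_fertilities` stands in for the trained p(Z | X), `decode_given_fertilities` for the non-autoregressive decoder arg max_Y p(Y | Z, X) (here it merely copies each source token z_t times), and `rescore` for the external model that ranks candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_fertilities(src_tokens):
    """Stand-in for p(Z | X): one repeat count per source token."""
    return rng.integers(low=0, high=4, size=len(src_tokens))

def decode_given_fertilities(src_tokens, z):
    """Stand-in for arg max_Y p(Y | Z, X): here we just copy x_t z_t times."""
    return [tok for tok, reps in zip(src_tokens, z) for _ in range(reps)]

def rescore(candidate, src_tokens):
    """Stand-in for the external (e.g. autoregressive) scoring model."""
    return -abs(len(candidate) - len(src_tokens)) + rng.normal()

def noisy_parallel_decode(src_tokens, num_samples=10):
    candidates = []
    for _ in range(num_samples):          # each sample is independent -> parallelizable
        z = sample_fertilities(src_tokens)
        candidates.append(decode_given_fertilities(src_tokens, z))
    return max(candidates, key=lambda y: rescore(y, src_tokens))

print(noisy_parallel_decode(["ein", "gelber", "Bus"]))
```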
Neural non-autoregressive sequence modeling
• Repetition as a latent variable [Gu et al., 2018 ICLR]
• Each latent variable z_t: # of repetitions of the input symbol x_t
• For training, use an auxiliary task to train p(Z | X):
  1. Word alignment models
Neural non-autoregressive sequence modeling
• First convincing result! [Gu et al., 2018 ICLR]
• IWSLT'16 En→De

  Non-autoregressive? | Decoding             | BLEU  | Sentence latency (ms)
  No                  | Greedy               | 28.89 | 408
  No                  | Beam search (4)      | 29.70 | 607
  Yes                 | argmax               | 25.20 | 39
  Yes                 | MC + rescoring (10)  | 27.44 | 79
  Yes                 | MC + rescoring (100) | 28.16 | 257
Non-autoregressive modeling by iterative refinement [Lee, Mansimov & Cho, 2018]
• Any alternative interpretation of the latent variables?
  p(Y | X) = Σ_Z p(Z | X) ∏_{t=1}^{T} p(y_t | Z, X)
• Latent variables share the output semantics
  • They share the same vocabulary: z_t ∈ V, y_t ∈ V
• Multiple layers of latent variables:
  p(Y | X) = Σ_{Z^1, …, Z^L} (∏_{t=1}^{T} p(y_t | Z^L, X)) (∏_{t=1}^{T} p(z^L_t | Z^{L-1}, X)) ⋯ (∏_{t=1}^{T} p(z^1_t | X))
• Shared conditional distributions
Non-autoregressive modeling by iterative refinement
• Generative story: iterative refinement
  1. Refine*: generate an intermediate translation Y^l given the previous translation Y^{l-1} and the source sentence X
  2. Repeat step 1 for L iterations (or until convergence)
* As the latent variables share their semantics with the output, we use Z and Y interchangeably.
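A minimal sketch of this generative story, assuming toy `initial_draft` and `refine` functions in place of the trained networks; each call to `refine` re-predicts all positions in parallel, and decoding stops after a fixed budget or once the output stops changing:

```python
import numpy as np

VOCAB_SIZE, T = 100, 8
rng = np.random.default_rng(0)

def initial_draft(x):
    """Stand-in for p(Z^1 | X): the first, fully parallel guess."""
    return rng.integers(VOCAB_SIZE, size=T)

def refine(y_prev, x):
    """Stand-in for p(Z^l | Z^{l-1}, X): all positions are re-predicted in parallel."""
    noise = rng.integers(VOCAB_SIZE, size=T)
    keep = rng.random(T) < 0.7            # toy dynamics: most tokens settle over iterations
    return np.where(keep, y_prev, noise)

def iterative_refinement(x, max_iters=10):
    y = initial_draft(x)
    for _ in range(max_iters):
        y_new = refine(y, x)
        if np.array_equal(y_new, y):      # "until convergence" stopping rule
            break
        y = y_new
    return y

print(iterative_refinement(x="ein gelber Bus"))
```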
Non-autoregressive modeling by iterative refinement
  Y^1: a yellow bus parked on parked in of parking road .
  Y^2: a yellow and black on parked in a parking lot .
  Y^3: a yellow and black bus parked in a parking lot .
  Y^4: a yellow and black bus parked in a parking lot .
  Input X: (image)
Non-autoregressive modeling by iterative refinement
• Training 1: lower-bound maximization
  • The output of each iteration is encouraged to be the correct answer
  • (figure: a cross-entropy loss against y*_1 … y*_4 is applied to the prediction at every iteration)
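A sketch of training objective 1 under toy stand-ins (no real network or backpropagation here): the model is unrolled for a few refinement iterations and a cross-entropy term against the reference y* is accumulated at every iteration.

```python
import numpy as np

VOCAB_SIZE, T, NUM_ITERS = 100, 8, 4
rng = np.random.default_rng(0)

def refine_logits(y_prev, x):
    """Stand-in for one refinement step of the model (logits for every position)."""
    return rng.normal(size=(T, VOCAB_SIZE))

def cross_entropy(logits, targets):
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def lower_bound_loss(x, y_star):
    """Cross-entropy against the reference at *every* refinement iteration."""
    y_prev, total = None, 0.0
    for _ in range(NUM_ITERS):
        logits = refine_logits(y_prev, x)
        total += cross_entropy(logits, y_star)
        y_prev = logits.argmax(axis=-1)   # feed the current prediction into the next step
    return total / NUM_ITERS

y_star = rng.integers(VOCAB_SIZE, size=T)
print(lower_bound_loss(x="ein gelber Bus", y_star=y_star))
```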
Non-autoregressive modeling by iterative refinement
• Training 2: conditional denoising autoencoder
  • A denoising autoencoder learns to hill-climb [Alain & Bengio, 2013]: applying one step of the learned denoiser, Ŷ = F(Y, X), gives p(Ŷ | X) ≥ p(Y | X)
  • (figure: the reference is corrupted by a corruption function C, and the model is trained with cross-entropy to reconstruct y*_1 … y*_4 from the corrupted sequence z_1 … z_4)
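A sketch of training objective 2, again with illustrative placeholders: the reference is corrupted by a toy corruption function C (random token replacement here; the exact corruption used in the paper is not spelled out on this slide), and the model is trained to reconstruct the clean reference from it.

```python
import numpy as np

VOCAB_SIZE, T = 100, 8
rng = np.random.default_rng(0)

def corrupt(y_star, replace_prob=0.3):
    """Toy corruption function C: randomly replace a fraction of reference tokens."""
    noise = rng.integers(VOCAB_SIZE, size=len(y_star))
    mask = rng.random(len(y_star)) < replace_prob
    return np.where(mask, noise, y_star)

def refine_logits(y_corrupted, x):
    """Stand-in for the shared refinement model applied to the corrupted reference."""
    return rng.normal(size=(len(y_corrupted), VOCAB_SIZE))

def cross_entropy(logits, targets):
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def denoising_loss(x, y_star):
    y_tilde = corrupt(y_star)             # corrupted reference goes in ...
    logits = refine_logits(y_tilde, x)
    return cross_entropy(logits, y_star)  # ... clean reference is the target

y_star = rng.integers(VOCAB_SIZE, size=T)
print(denoising_loss(x="ein gelber Bus", y_star=y_star))
```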
Non-autoregressive modeling by iterative refinement
• Lower-bound maximization & conditional denoising
• Mixed training objective:
  • Consider L+1 iterations.
  • At each iteration, stochastically choose one of the two objectives.
  • Joint training from scratch
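A compact sketch of the mixed objective, with the two per-iteration losses reduced to placeholders (see the two sketches above): over L+1 iterations, each step stochastically uses either the lower-bound loss or the denoising loss.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ITERS = 5                              # L + 1 refinement iterations

def lower_bound_step_loss(y_prev, x, y_star):
    """Placeholder for the per-iteration cross-entropy (lower-bound) loss."""
    return float(rng.random())

def denoising_step_loss(y_prev, x, y_star):
    """Placeholder for the per-iteration conditional-denoising loss."""
    return float(rng.random())

def mixed_objective(x, y_star):
    y_prev, total = None, 0.0
    for _ in range(NUM_ITERS):
        # Stochastically pick one of the two objectives at every iteration.
        if rng.random() < 0.5:
            total += lower_bound_step_loss(y_prev, x, y_star)
        else:
            total += denoising_step_loss(y_prev, x, y_star)
    return total / NUM_ITERS

print(mixed_objective(x="ein gelber Bus", y_star=None))
```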
Experiments – Machine Translation
• En↔Ro (WMT'16): low-resource machine translation

  Non-autoregressive? | Decoding | En→Ro (BLEU) | Ro→En (BLEU) | CPU (toks/sec) | GPU (toks/sec)
  No                  | Greedy   | 31.93        | 31.55        | 15.7           | 55.6
  No                  | Beam (4) | 32.40        | 32.06        | 7.3            | 43.3
  Yes                 | Iter 1   | 24.45        | 25.73        | 98.6           | 694.2
  Yes                 | Iter 2   | 27.10        | 28.15        | 62.8           | 332.7
  Yes                 | Iter 5   | 28.86        | 29.72        | 29.0           | 194.4
  Yes                 | Iter 10  | 29.32        | 30.19        | 14.8           | 93.1
  Yes                 | Adaptive | 29.66        | 30.30        | 16.5           | 226.6

• 91.5% translation quality with up to 4x decoding speed-up (on GPU)
Experiments – Machine Translation
• En↔De (WMT'15): moderate-scale machine translation

  Non-autoregressive? | Decoding | En→De (BLEU) | De→En (BLEU) | CPU (toks/sec) | GPU (toks/sec)
  No                  | Greedy   | 23.40        | 26.49        | 15.8           | 53.6
  No                  | Beam (4) | 24.12        | 27.05        | 6.7            | 45.8
  Yes                 | Iter 1   | 12.65        | 14.84        | 101.2          | 536.5
  Yes                 | Iter 2   | 15.03        | 17.15        | 56.4           | 407.1
  Yes                 | Iter 5   | 17.53        | 20.02        | 27.1           | 173.4
  Yes                 | Iter 10  | 18.48        | 21.10        | 13.1           | 87.8
  Yes                 | Adaptive | 18.91        | 21.60        | 12.8           | 90.9

• 80% translation quality with up to 2x decoding speed-up (on GPU)
Experiments – Machine Translation
• Iterative refinement improves translation quality (almost) monotonically
  • The intermediate latent variables (translations) successfully capture dependencies.
• Quality degradation with larger data
• Significant speed-up in decoding on GPU
  • Fits well with GTC!
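The "Adaptive" rows in the tables above use a per-sentence number of refinement iterations. The exact stopping rule is not given on these slides, so the sketch below assumes a simple agreement-based criterion (stop once consecutive outputs no longer change), purely for illustration.

```python
import numpy as np

VOCAB_SIZE, T = 100, 8
rng = np.random.default_rng(0)

def refine(y_prev, x):
    """Stand-in for one parallel refinement step."""
    noise = rng.integers(VOCAB_SIZE, size=T)
    return np.where(rng.random(T) < 0.8, y_prev, noise)

def adaptive_decode(x, max_iters=10, threshold=1.0):
    """Stop refining once consecutive outputs agree on at least `threshold` of positions."""
    y = rng.integers(VOCAB_SIZE, size=T)   # initial fully parallel draft
    for i in range(max_iters):
        y_new = refine(y, x)
        agreement = float(np.mean(y_new == y))
        y = y_new
        if agreement >= threshold:         # adaptive number of iterations per sentence
            return y, i + 1
    return y, max_iters

print(adaptive_decode(x="ein gelber Bus"))
```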
Experiments – Machine Translation
• Src:    seitdem habe ich sieben Häuser in der Nachbarschaft mit den Lichtern versorgt und sie funktionieren wirklich gut .
  Iter 1: and I ’ve been seven homes since in neighborhood with the lights and they ’re really functional .
  Iter 4: and I ’ve been seven homes in neighborhood with the lights , and they ’re a really functional .
  Iter 8: and I ’ve been providing seven homes in the neighborhood with the lights and they ’re a really functional .
  Ref:    since now , I ’ve set up seven homes around my community , and they ’re really working .

• Src:    er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die Nachrichten meistens deprimierten .
  Iter 1: he looked very happy , which was pretty unusual the , because the news was were usually depressing .
  Iter 4: he looked very happy , which was pretty unusual at the , because news was mostly depressing .
  Iter 8: he looked very happy , which was pretty unusual at the time because the news was mostly depressing .
  Ref:    there was a big smile on his face which was unusual then , because the news mostly depressed him .
Experiments – Image Caption Generation
• MS COCO: image caption generation

  Non-autoregressive? | Decoding | BLEU  | CPU (toks/sec) | GPU (toks/sec)
  No                  | Greedy   | 23.47 | 4.3            | 2.1
  No                  | Beam (4) | 24.78 | 3.6            | 1.0
  Yes                 | Iter 1   | 20.12 | 17.1           | 8.9
  Yes                 | Iter 2   | 20.88 | 12.0           | 5.7
  Yes                 | Iter 5   | 21.12 | 6.2            | 2.8
  Yes                 | Iter 10  | 21.24 | 2.0            | 1.2
  Yes                 | Adaptive | 21.12 | 10.8           | 4.8

• 85% caption quality with up to 5x decoding speed-up (on GPU)
Non-autoregressive modeling by iterative refinement
  Y^1: a woman standing on playing tennis on a tennis racquet .
  Y^2: a woman standing on a tennis court a tennis racquet .
  Y^3: a woman standing on a tennis court a a racquet .
  Y^4: a woman standing on a tennis court holding a racquet .
  Input X: (image)
Conclusion
• Latent variables capture output dependencies more efficiently.
• Different interpretation → different learning/decoding algorithms
  • Gu et al. [2018]: fertility → auxiliary supervision + noisy parallel decoding
  • Lee, Mansimov & Cho [2018]: iterative refinement → conditional denoising
  • Kaiser et al. [2018]: latent sequence → autoregressive inference
  • What else?
• Generation quality closely tracks that of the autoregressive models.
• Decoding is significantly faster, especially with a GPU.
• Potentially even faster decoding with specialized hardware.