Beyond sequential decoding, toward parallel decoding: in the context of neural sequence modelling


  1. Beyond sequential decoding, toward parallel decoding: in the context of neural sequence modelling.
  Kyunghyun Cho, New York University & Facebook AI Research. Joint work with Jason Lee and Elman Mansimov.

  2. Neural sequence modeling
  • An arbitrary input X: e.g., another sequence, an image, a video, …
  • A sequence output Y = (y_1, y_2, ..., y_T), e.g., a natural language sentence, with discrete tokens y_t ∈ V
  • Use a neural network to estimate a distribution over sequences: p_θ(Y | X)
  • Machine translation, automatic speech recognition, …

  3. Neural autoregressive sequence modeling
  • Unlike classification, there are complex, strong dependencies among the y_t's
  • "More than half of residents in Korea speak ______."
  • Among millions of possible tokens, only one word (Korean) is likely above.
  • Neural autoregressive sequence modelling: p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X)
  • It explicitly models these dependencies.

  4. Neural autoregressive sequence modeling
  • This has become a de facto standard in recent years, e.g., in neural machine translation
  • Neural autoregressive sequence modelling: p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X) (a scoring sketch follows below)
  • Various options: recurrent nets, convolutional nets, transformers
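
The factorization above can be made concrete in a few lines. This is a minimal sketch, assuming a hypothetical `step_logprobs` callable that returns log p(· | y_<t, X) over the vocabulary; it stands in for any recurrent, convolutional, or transformer decoder and is assumed to capture the conditioning on X internally:

```python
from typing import Callable, Mapping, Sequence

def sequence_logprob(
    step_logprobs: Callable[[Sequence[int]], Mapping[int, float]],
    y: Sequence[int],
) -> float:
    """Chain-rule scoring with teacher forcing:
    log p(Y | X) = sum_t log p(y_t | y_<t, X)."""
    total = 0.0
    for t in range(len(y)):
        # condition on the ground-truth prefix y_<t and add log p(y_t | y_<t, X)
        total += step_logprobs(y[:t])[y[t]]
    return total
```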

  5. Neural autoregressive sequence modeling
  • Decoding is problematic (a greedy-decoding sketch follows below):
  1. Exact decoding is intractable: arg max_Y ∏_{t=1}^{T} p(y_t | y_{<t}, X) = ?
  2. Decoding is inherently sequential: O(kT)
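
To make the second problem concrete, here is a minimal greedy-decoding sketch: each step must wait for the previous token, so generating T tokens costs T dependent decoder calls (O(kT) with a beam of size k). The `step_logprobs` interface and the `BOS`/`EOS` ids are assumptions for illustration, not part of the original slides:

```python
from typing import Callable, List, Mapping, Sequence

BOS, EOS = 0, 1  # assumed special token ids

def greedy_decode(
    step_logprobs: Callable[[Sequence[int]], Mapping[int, float]],
    max_len: int = 100,
) -> List[int]:
    """Sequential greedy decoding: token t cannot be produced before token t-1."""
    y = [BOS]
    for _ in range(max_len):
        dist = step_logprobs(y)        # p(y_t | y_<t, X): one decoder call per token
        y_t = max(dist, key=dist.get)  # greedy argmax over the vocabulary
        y.append(y_t)
        if y_t == EOS:
            break
    return y[1:]
```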

  6. Neural non-autoregressive sequence modeling
  • Conditional independence among the y_t's: p(Y | X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X) becomes p(Y | X) = ∏_{t=1}^{T} p(y_t | X)
  • Exact decoding is tractable: ŷ_t = arg max_{y_t} p(y_t | X)
  • Decoding is highly parallelizable (sketch below)
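
Under the fully factorized model, decoding collapses to one argmax per position, and all positions can be handled in a single batched operation. A minimal sketch, assuming PyTorch and a hypothetical tensor of per-position logits produced by one decoder forward pass:

```python
import torch

def parallel_decode(position_logits: torch.Tensor) -> torch.Tensor:
    """With p(Y | X) = prod_t p(y_t | X), each position is decoded independently:
    one argmax over a (T, vocab_size) logit tensor yields the whole sequence."""
    return position_logits.argmax(dim=-1)

# e.g. parallel_decode(torch.randn(5, 32000)) returns 5 token ids at once
```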

  7. Neural non-autoregressive sequence modeling
  • Too good to be true: the dependencies must be modelled somehow
  • Introduce a set of latent variables [Gu et al., 2018 ICLR]: p(Y | X) = Σ_Z p(Z | X) ∏_{t=1}^{T} p(y_t | Z, X)

  8. Neural non-autoregressive sequence modeling
  • Repetition (fertility) as a latent variable [Gu et al., 2018 ICLR]
  • Each latent variable z_t: the number of repetitions of the input symbol x_t
  • |Z| = |X|; e.g., z_1 = 1 and z_2 = 3 expand (x_1, x_2) into (x_1, x_2, x_2, x_2) (sketch below)
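
The fertility interpretation is easy to state in code. A sketch of how the repetition counts expand the source into a decoder input of the target length; the token strings are placeholders:

```python
from typing import List, Sequence

def expand_by_fertility(x: Sequence[str], z: Sequence[int]) -> List[str]:
    """Each z_t gives the number of times source symbol x_t is copied into the
    decoder input, so |Z| = |X| and the output length is sum_t z_t."""
    assert len(x) == len(z)
    expanded: List[str] = []
    for token, fertility in zip(x, z):
        expanded.extend([token] * fertility)  # repeat x_t exactly z_t times
    return expanded

# Matching the slide's example, z_1 = 1 and z_2 = 3:
# expand_by_fertility(["x1", "x2"], [1, 3]) -> ["x1", "x2", "x2", "x2"]
```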

  9. Neural non-autoregressive sequence modeling
  • Repetition as a latent variable [Gu et al., 2018 ICLR]
  • Each latent variable z_t: the number of repetitions of the input symbol x_t
  • Monte Carlo approximation with rescoring (sketch below):
  1. Sample Z^m ∼ p(Z | X) and decode Y^m = arg max_Y p(Y | Z^m, X)
  2. Pick the Y^m with the highest score under another model.
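
A sketch of the sample-and-rescore procedure; the three callables are assumptions standing in for the fertility predictor, the parallel decoder, and the external scoring model:

```python
from typing import Callable, List, Sequence

def mc_rescoring_decode(
    sample_fertilities: Callable[[], Sequence[int]],       # draws Z^m ~ p(Z | X)
    decode_given_z: Callable[[Sequence[int]], List[str]],  # argmax_Y p(Y | Z^m, X), fully parallel
    rescore: Callable[[List[str]], float],                 # score from a separate (e.g. autoregressive) model
    num_samples: int = 10,
) -> List[str]:
    """Sample several fertility sequences, decode each candidate in parallel,
    and keep the candidate the rescoring model prefers."""
    scored = []
    for _ in range(num_samples):
        z_m = sample_fertilities()
        y_m = decode_given_z(z_m)
        scored.append((rescore(y_m), y_m))
    return max(scored, key=lambda pair: pair[0])[1]
```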

  10. Neural non-autoregressive sequence modeling
  • Repetition as a latent variable [Gu et al., 2018 ICLR]
  • Each latent variable z_t: the number of repetitions of the input symbol x_t
  • For training, use an auxiliary task to supervise p(Z | X), e.g., word alignment models

  11. Neural non-autoregressive sequence modeling
  • First convincing result! [Gu et al., 2018 ICLR]
  • IWSLT'16 En→De:
  Non-autoregressive?  Decoding               BLEU   Sentence latency (ms)
  No                   Greedy                 28.89  408
  No                   Beam search (4)        29.70  607
  Yes                  argmax                 25.20  39
  Yes                  MC + rescoring (10)    27.44  79
  Yes                  MC + rescoring (100)   28.16  257

  12. Non-autoregressive modeling by iterative refinement [Lee, Mansimov & Cho, 2018]
  • Any alternative interpretation of the latent variables in p(Y | X) = Σ_Z p(Z | X) ∏_{t=1}^{T} p(y_t | Z, X)?
  • Let the latent variables share the output semantics: they use the same vocabulary, z_t ∈ V and y_t ∈ V
  • Multiple layers of latent variables: p(Y | X) = Σ_{Z^1, ..., Z^L} (∏_{t=1}^{T} p(y_t | Z^L, X)) (∏_{t=1}^{T} p(z^L_t | Z^{L-1}, X)) ··· (∏_{t=1}^{T} p(z^1_t | X))
  • Shared conditional distributions across layers

  13. Non-autoregressive modeling by iterative refinement
  • Generative story: iterative refinement (a decoding sketch follows below)
  1. Refine*: generate an intermediate translation Y^l given the previous translation Y^{l-1} and the source sentence X
  2. Repeat step 1 for L iterations (or until convergence)
  * As the latent variables share their semantics with the output, we can use Z and Y interchangeably.
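
A minimal sketch of this generative story as a decoding loop; `initial_guess` and `refine` are hypothetical callables wrapping the parallel decoder, both conditioned on X internally:

```python
from typing import Callable, List

def iterative_refinement_decode(
    initial_guess: Callable[[], List[str]],    # Y^0 from the non-autoregressive decoder
    refine: Callable[[List[str]], List[str]],  # Y^l = refine(Y^{l-1})
    max_iters: int = 10,
) -> List[str]:
    """Start from a parallel first guess and repeatedly re-decode all positions given
    the previous translation, for `max_iters` steps or until the output stops changing."""
    y = initial_guess()
    for _ in range(max_iters):
        y_new = refine(y)
        if y_new == y:  # converged: a fixed point of the refinement step
            break
        y = y_new
    return y
```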

  14. Non-autoregressive modeling by iterative refinement
  Input X: (image)
  Y^1: a yellow bus parked on parked in of parking road .
  Y^2: a yellow and black on parked in a parking lot .
  Y^3: a yellow and black bus parked in a parking lot .
  Y^4: a yellow and black bus parked in a parking lot .

  15. Non-autoregressive modeling by iterative refinement
  • Training 1: lower-bound maximization
  • The output of every iteration is encouraged to match the correct answer: a cross-entropy (CE) loss against the reference tokens y*_1, ..., y*_T is applied at each refinement iteration (sketch below)
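
A sketch of this objective, assuming PyTorch and a hypothetical list of per-iteration logits; every refinement step gets its own cross-entropy term against the reference:

```python
import torch
import torch.nn.functional as F

def refinement_ce_loss(per_iteration_logits, target: torch.Tensor) -> torch.Tensor:
    """Sum of cross-entropy losses, one per refinement iteration.

    per_iteration_logits: list of (T, vocab_size) logit tensors, one per iteration.
    target: (T,) tensor of reference token ids y*_1, ..., y*_T.
    """
    return sum(F.cross_entropy(logits, target) for logits in per_iteration_logits)
```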

  16. Non-autoregressive modeling by iterative refinement
  • Training 2: conditional denoising autoencoder
  • A denoising autoencoder learns to hill-climb in log-probability [Alain & Bengio, 2013]
  • Corrupt the reference with a corruption function C, Ŷ = C(Y, X), and train the refinement step to reconstruct Y from Ŷ with CE losses against y*_1, ..., y*_T, so that each refinement step moves its input toward higher-probability outputs (a corruption sketch follows below)
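
The exact corruption function is not spelled out on the slide, so the sketch below uses a simple illustrative corruption (random token replacement and adjacent swaps); the reconstruction target is always the clean reference Y:

```python
import random
from typing import List

def corrupt(y: List[int], vocab_size: int, p: float = 0.3) -> List[int]:
    """Illustrative corruption C(Y): each position is, with small probability,
    replaced by a random vocabulary token or swapped with its right neighbour."""
    y = list(y)  # do not modify the caller's reference sequence
    for t in range(len(y)):
        r = random.random()
        if r < p / 2:
            y[t] = random.randrange(vocab_size)  # replace with a random token
        elif r < p and t + 1 < len(y):
            y[t], y[t + 1] = y[t + 1], y[t]      # swap adjacent tokens
    return y

# Training: feed corrupt(Y) into the refinement step and apply cross-entropy
# against the clean Y, i.e. a conditional denoising autoencoder.
```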

  17. Non-autoregressive modeling by iterative refinement
  • Lower-bound maximization & conditional denoising: a mixed training objective
  • Consider L + 1 iterations.
  • At each iteration, stochastically choose one of the two objectives (sketch below).
  • Joint training from scratch
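
One way to realize the mixed objective is sketched below; `model`, `model.initial_guess`, `corrupt`, and `ce_loss` are hypothetical stand-ins for the shared refinement network, its parallel first guess, the corruption function, and a per-iteration cross-entropy:

```python
import random

def mixed_training_step(y_ref, x, model, corrupt, ce_loss,
                        num_iters: int = 4, denoise_prob: float = 0.5):
    """Run the refinement chain for a few iterations; at each one, stochastically feed
    either the model's own previous output (lower-bound term) or a corrupted reference
    (denoising term). Both are scored against the clean reference."""
    prev = model.initial_guess(x)          # Y^0 from the parallel decoder (assumed API)
    total_loss = 0.0
    for _ in range(num_iters):
        if random.random() < denoise_prob:
            inp = corrupt(y_ref)           # denoising-autoencoder input
        else:
            inp = prev                     # own prediction: lower-bound maximization
        logits = model(inp, x)             # one refinement step, conditioned on X
        total_loss = total_loss + ce_loss(logits, y_ref)
        prev = logits.argmax(dim=-1)       # next iteration's input (no gradient through argmax)
    return total_loss
```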

  18. Experiments – Machine Translation
  • En↔Ro (WMT'16): low-resource machine translation
  Non-autoregressive?  Decoding   En→Ro (BLEU)  Ro→En (BLEU)  Speed CPU (toks/sec)  Speed GPU (toks/sec)
  No                   Greedy     31.93         31.55         15.7                  55.6
  No                   Beam (4)   32.40         32.06         7.3                   43.3
  Yes                  Iter 1     24.45         25.73         98.6                  694.2
  Yes                  Iter 2     27.10         28.15         62.8                  332.7
  Yes                  Iter 5     28.86         29.72         29.0                  194.4
  Yes                  Iter 10    29.32         30.19         14.8                  93.1
  Yes                  Adaptive   29.66         30.30         16.5                  226.6
  • 91.5% of the autoregressive translation quality with up to a 4x decoding speed-up (on GPU); a sketch of an adaptive stopping rule follows below
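
The "Adaptive" rows stop refining once consecutive outputs barely change, which is why their speed sits between the fixed-iteration settings. A sketch, with the similarity test (a token-set Jaccard distance and its threshold) chosen here purely for illustration:

```python
from typing import Callable, List

def jaccard_distance(a: List[str], b: List[str]) -> float:
    """Token-set Jaccard distance between two candidate translations."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def adaptive_refine(initial_guess: Callable[[], List[str]],
                    refine: Callable[[List[str]], List[str]],
                    max_iters: int = 10, threshold: float = 0.05) -> List[str]:
    """Keep refining until two consecutive outputs are nearly identical, so easy
    inputs use few refinement iterations and hard ones use more."""
    y = initial_guess()
    for _ in range(max_iters):
        y_new = refine(y)
        if jaccard_distance(y, y_new) < threshold:
            return y_new
        y = y_new
    return y
```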

  19. Experiments – Machine Translation
  • En↔De (WMT'15): moderate-scale machine translation
  Non-autoregressive?  Decoding   En→De (BLEU)  De→En (BLEU)  Speed CPU (toks/sec)  Speed GPU (toks/sec)
  No                   Greedy     23.40         26.49         15.8                  53.6
  No                   Beam (4)   24.12         27.05         6.7                   45.8
  Yes                  Iter 1     12.65         14.84         101.2                 536.5
  Yes                  Iter 2     15.03         17.15         56.4                  407.1
  Yes                  Iter 5     17.53         20.02         27.1                  173.4
  Yes                  Iter 10    18.48         21.10         13.1                  87.8
  Yes                  Adaptive   18.91         21.60         12.8                  90.9
  • 80% of the autoregressive translation quality with up to a 2x decoding speed-up (on GPU)

  20. Experiments – Machine Translation
  • Iterative refinement improves translation quality (almost) monotonically
  • Intermediate latent variables (translations) successfully capture the output dependencies.
  • Quality degradation with larger-scale data
  • Significant speed-up in decoding on GPU
  • Fits well with GTC!

  21. Experiments – Machine Translation
  Src: seitdem habe ich sieben Häuser in der Nachbarschaft mit den Lichtern versorgt und sie funktionieren wirklich gut .
  Iter 1: and I ’ve been seven homes since in neighborhood with the lights and they ’re really functional .
  Iter 4: and I ’ve been seven homes in neighborhood with the lights , and they ’re a really functional .
  Iter 8: and I ’ve been providing seven homes in the neighborhood with the lights and they ’re a really functional .
  Ref: since now , I ’ve set up seven homes around my community , and they ’re really working .
  Src: er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die Nachrichten meistens deprimierten .
  Iter 1: he looked very happy , which was pretty unusual the , because the news was were usually depressing .
  Iter 4: he looked very happy , which was pretty unusual at the , because news was mostly depressing .
  Iter 8: he looked very happy , which was pretty unusual at the time because the news was mostly depressing .
  Ref: there was a big smile on his face which was unusual then , because the news mostly depressed him .

  22. Experiments – Image Caption Generation
  • MS COCO: image caption generation
  Non-autoregressive?  Decoding   BLEU   Speed CPU (toks/sec)  Speed GPU (toks/sec)
  No                   Greedy     23.47  4.3                   2.1
  No                   Beam (4)   24.78  3.6                   1.0
  Yes                  Iter 1     20.12  17.1                  8.9
  Yes                  Iter 2     20.88  12.0                  5.7
  Yes                  Iter 5     21.12  6.2                   2.8
  Yes                  Iter 10    21.24  2.0                   1.2
  Yes                  Adaptive   21.12  10.8                  4.8
  • 85% of the autoregressive caption quality with up to a 5x decoding speed-up (on GPU)

  23. Non-autoregressive modeling by iterative refinement
  Input X: (image)
  Y^1: a woman standing on playing tennis on a tennis racquet .
  Y^2: a woman standing on a tennis court a tennis racquet .
  Y^3: a woman standing on a tennis court a a racquet .
  Y^4: a woman standing on a tennis court holding a racquet .

  24. Conclusion
  • Latent variables capture output dependencies more efficiently.
  • Different interpretations → different learning/decoding algorithms
  • Gu et al. [2018]: fertility → auxiliary supervision + noisy parallel decoding
  • Lee, Mansimov & Cho [2018]: iterative refinement → conditional denoising
  • Kaiser et al. [2018]: latent sequence → autoregressive inference
  • What else?
  • Generation quality closely tracks that of autoregressive models.
  • Decoding is significantly faster, especially on GPU.
  • Potentially even faster decoding with specialized hardware.
