Unfolding and Shrinking Neural Machine Translation Ensembles
Felix Stahlberg and Bill Byrne
Department of Engineering
Ensembling in neural machine translation
[Diagram: a single model produces one prediction; an ensemble runs Models 1-4 in parallel and averages their predictions]
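A minimal NumPy sketch of what the averaging amounts to at one decoding step; `model.predict(state)` and the per-model decoder states are illustrative assumptions of this sketch, not any particular toolkit's API.

```python
import numpy as np

# Sketch: average the per-step target-word distributions of an ensemble.
# Each model is assumed to return a probability distribution over the
# target vocabulary given its own decoder state.
def ensemble_step(models, states, vocab_size):
    avg = np.zeros(vocab_size)
    for model, state in zip(models, states):
        avg += model.predict(state)   # hypothetical per-model API
    return avg / len(models)          # averaged prediction used by beam search
```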
Gains through ensembling

WMT top systems (UEdin, http://matrix.statmt.org/):
• WMT’16 (En-De): Single 31.6, Ensemble 34.2 (+2.6 BLEU)
• WMT’17 (En-De): Single 26.6, Ensemble 28.3 (+1.7 BLEU)

Google’s NMT system (Wu et al., 2016):
• WMT’14 (En-De): Single 24.6, Ensemble 26.3 (+1.7 BLEU)
• WMT’14 (En-Fr): Single 40.0, Ensemble 41.2 (+1.2 BLEU)

Wu, Yonghui, et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv preprint arXiv:1609.08144 (2016).
Disadvantages of ensembling
• Decoding with an ensemble of n models is slow
  o More CPU/GPU switches
  o n times more passes through the network at each decoding step
  o Softmax function applied n times at each decoding step
• Ensembles are cumbersome
  o Often more difficult to implement
Unfolding and shrinking
[Diagram: the ensemble of Models 1-4 is unfolded into a single model that produces the same averaged prediction, and the unfolded model is then shrunk to a smaller model]
Unfolding a single layer
[Diagram: the weight matrices of the corresponding layer in the two component models are concatenated into a single weight matrix of the unfolded layer]
Unfolding multiple layers
[Diagram: for inner layers, the component weight matrices are arranged block-diagonally so the unfolded sub-networks do not interact; layers feeding a shared output are again concatenated]
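A minimal NumPy sketch of the two unfolding cases in the diagrams above, assuming plain feed-forward layers and a row-vector x·W convention; shapes and function names are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def unfold_shared_input(W1, W2):
    # Case 1: the layer of each component model reads the *same* input
    # (e.g. a shared input layer). Concatenating the weight matrices gives one
    # layer that outputs both models' activations side by side.
    return np.concatenate([W1, W2], axis=1)        # (d_in, d1_out + d2_out)

def unfold_duplicated(W1, W2):
    # Case 2: both input and output layers are duplicated in the unfolded
    # network. Arranging the matrices block-diagonally keeps the two
    # sub-networks from interacting.
    d1_in, d1_out = W1.shape
    d2_in, d2_out = W2.shape
    W = np.zeros((d1_in + d2_in, d1_out + d2_out))
    W[:d1_in, :d1_out] = W1
    W[d1_in:, d1_out:] = W2
    return W

# Example: two 3->4 layers unfold into one 6->8 layer.
W1, W2 = np.random.randn(3, 4), np.random.randn(3, 4)
assert unfold_duplicated(W1, W2).shape == (6, 8)
assert unfold_shared_input(W1, W2).shape == (3, 8)
```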
Shrinking – wish list
• Shrinking reduces the dimensionality of layers
  o Objective: do not affect the behavior of the next layer
• Remove whole neurons rather than individual weights
  o Smaller model and faster decoding
  o Network layout stays the same, i.e. inference code remains unchanged
• Previous work is unsuitable
  o Weight pruning (LeCun et al., 1989; Hassibi et al., 1993; Han et al., 2015; See et al., 2016; ...)
  o Approximating non-linear neurons with linear neurons (White, 2008)
  o Network compression based on low-rank matrix factorization (Denil et al., 2013; Denton et al., 2014; Xue et al., 2013; Prabhavalkar et al., 2016; Lu et al., 2016; ...)
Shrinking NMT (Bahdanau et al., 2015) networks
• Embedding layers: SVD-based shrinking
• Attention: data-free shrinking
• GRU cells: data-bound shrinking
Shrinking linear layers with low-rank matrix factorization
[Diagram: previous layer → linear embedding layer (dimensionality to be reduced) → next layer]
Since the embedding layer is linear, only the product VW = Y of its incoming weights V and outgoing weights W matters to the next layer. We approximate Y ≈ V'W' with V' and W' of low rank, and use truncated SVD for the factorization.
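A hedged NumPy sketch of this SVD-based shrinking of a linear layer; the matrix names follow the slide, while the shapes and the chosen rank are illustrative assumptions.

```python
import numpy as np

def shrink_linear_layer(V, W, rank):
    # V: weights into the linear embedding layer, W: weights out of it.
    # Replace (V, W) by low-rank factors (V', W') of the product Y = VW,
    # shrinking the embedding dimensionality to `rank` while changing what
    # the next layer sees only by the truncation error.
    Y = V @ W
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    V_new = U[:, :rank] * s[:rank]     # (rows of V) x rank
    W_new = Vt[:rank, :]               # rank x (columns of W)
    return V_new, W_new

# Toy example with assumed sizes: shrink an 800-dim unfolded embedding to 300.
V = np.random.randn(1000, 800)
W = np.random.randn(800, 500)
V_new, W_new = shrink_linear_layer(V, W, rank=300)
print(np.linalg.norm(V @ W - V_new @ W_new) / np.linalg.norm(V @ W))
```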
Approximating a neuron with its most similar neighbor (Srinivas and Babu, 2015)
Selection criteria: small outgoing weights; incoming weights similar to another neuron's
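A hedged NumPy sketch of one plausible reading of this pairwise criterion: remove the neuron whose outgoing weights are small and whose incoming weights are closest to some other neuron's, then fold its outgoing weights into that neighbor. The exact saliency score and the O(n²) search are assumptions of this sketch, not necessarily Srinivas and Babu's formulation.

```python
import numpy as np

def prune_most_redundant_neuron(V, W_out):
    # V: incoming weights, one column per neuron of the layer.
    # W_out: outgoing weights, one row per neuron of the layer.
    n = V.shape[1]
    best = None  # (saliency, keep_i, remove_j)
    for j in range(n):
        for i in range(n):
            if i == j:
                continue
            # Removing j is cheap if its outgoing weights are small and its
            # incoming weights are similar to neuron i's.
            saliency = np.sum(W_out[j] ** 2) * np.sum((V[:, i] - V[:, j]) ** 2)
            if best is None or saliency < best[0]:
                best = (saliency, i, j)
    _, i, j = best
    W_out = W_out.copy()
    W_out[i] += W_out[j]                        # let neuron i stand in for j
    return np.delete(V, j, axis=1), np.delete(W_out, j, axis=0)
```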
Approximating a neuron with a linear combination of its neighbors
[Diagram: a removed neuron is approximated by a linear combination, with weights μ, of k = 3 remaining neurons]
How to estimate μ?
Data-free and data-bound shrinking
V: incoming weight matrix, μ: interpolation weights, B: neuron activity matrix

Data-free shrinking: "Approximate incoming weights"
  Theory: set the expected error introduced by shrinking to zero by assuming a linear activation function.

Data-bound shrinking: "Directly approximate neuron activity"
  Theory: set the expected error introduced by shrinking to zero by estimating the expected neuron activities with importance sampling.
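A hedged NumPy sketch of how the interpolation weights μ for one removed neuron might be estimated under the two recipes above: data-free fits the removed neuron's incoming weight column from the kept neurons' columns (exact for a linear activation), while data-bound fits its recorded activities from the kept neurons' activities. The least-squares formulation and the folding step are assumptions of this sketch.

```python
import numpy as np

def mu_data_free(V, keep, removed):
    # Approximate the removed neuron's incoming weights as a linear
    # combination of the kept neurons' incoming weights (columns of V).
    mu, *_ = np.linalg.lstsq(V[:, keep], V[:, removed], rcond=None)
    return mu

def mu_data_bound(B, keep, removed):
    # B: neuron activity matrix recorded on sample data (one row per sample,
    # one column per neuron). Directly approximate the removed neuron's activity.
    mu, *_ = np.linalg.lstsq(B[:, keep], B[:, removed], rcond=None)
    return mu

def fold_into_outgoing(W_out, mu, keep, removed):
    # Redistribute the removed neuron's outgoing weights onto the kept
    # neurons according to mu, then drop its row. `keep` is an index array.
    W_out = W_out.copy()
    W_out[keep, :] += np.outer(mu, W_out[removed, :])
    return np.delete(W_out, removed, axis=0)
```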
Shrinking layers to their original size (Japanese-English)
Impact on BLEU of shrinking individual layers
• Individual layers can be shrunk even below their original size
• GRU layers are more sensitive to shrinking than embedding or attention layers
Designing three setups for Japanese-English
[Table: layer sizes of the three setups]
Designing three setups for Japanese-English
• (Unbatched) GPU decoding speed is roughly constant after unfolding, but shrinking makes batching more effective
Conclusion
• Unfolding yields ensemble-level performance with a single network
  o Often faster and easier to deploy
• Shrinking can reduce the size of unfolded networks significantly
  o Depending on the aggressiveness of pruning, unfolding+shrinking yields either +2.2 BLEU at the same decoding speed or a 3.4x CPU speed-up with only a minor drop in BLEU
• Our work indicates large amounts of wasted computation
  o High-dimensional embedding and attention layers may be needed for training, but are not necessary for inference
References
• Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR. Toulon, France.
• Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. 2013. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156.
• Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277.
• Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143.
• Babak Hassibi, David G. Stork, et al. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 164–171.
• Yann LeCun, John S. Denker, Sara A. Solla, Richard E. Howard, and Lawrence D. Jackel. 1989. Optimal brain damage. In NIPS, volume 2, pages 598–605.
• Zhiyun Lu, Vikas Sindhwani, and Tara N. Sainath. 2016. Learning compact recurrent neural networks. In ICASSP, pages 5960–5964.
• Rohit Prabhavalkar, Ouais Alsharif, Antoine Bruguier, and Ian McGraw. 2016. On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition. In ICASSP, pages 5970–5974.
• Abigail See, Minh-Thang Luong, and Christopher D. Manning. 2016. Compression of neural machine translation models via pruning. In CoNLL, pages 291–299.
• Suraj Srinivas and R. Venkatesh Babu. 2015. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149.
• Jian Xue, Jinyu Li, and Yifan Gong. 2013. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pages 2365–2369.
• Halbert White. 2008. Learning in artificial neural networks: A statistical perspective. Learning 1(4).
Thanks