Linearly Augmented Deep Neural Network
Pegah Ghahremani, Johns Hopkins University
Jasha Droppo, Michael L. Seltzer, Microsoft Research
Typical DNN Architecture

[Diagram: stack of alternating Linear and Non-Linear blocks, layers L1 to L4, from Features up to the Softmax output]

• Each layer is a composition of an affine and a non-linear function (a minimal sketch follows this slide):
  f_l(y) = σ(W_l y + b_l)
• A typical DNN is a composition of several such layers.
• It can be trained from random initialization, but pre-training can help.
• Does the DNN use all of its capacity? Can we reduce the model size?
  • Only a small fraction of the neurons are active.
  • High memory usage.
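As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of one such layer, assuming a sigmoid non-linearity and made-up dimensions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dnn_layer(y, W, b):
    """Typical DNN layer: affine transform followed by a non-linearity."""
    return sigmoid(W @ y + b)

# Hypothetical sizes: 440-dim input features, 2048 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(2048, 440))
b = np.zeros(2048)
y = rng.normal(size=440)
h = dnn_layer(y, W, b)   # 2048-dim hidden activation
```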
SVD-DNN Architecture

[Diagram: stack of alternating hourglass-shaped Linear and Non-Linear blocks, layers L1 to L4, from Features up to the Softmax output]

• Alternating hourglass-shaped linear and non-linear blocks.
• The layer concept is the same as in a typical DNN.
• Training:
  • Train a typical DNN.
  • Compress its linear transformations with the SVD (see the sketch after this slide).
  • Fine-tune the model to regain the lost accuracy.
• Cannot be trained from random initialization.
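A hedged sketch of the compression step, truncating the SVD of a trained weight matrix to rank k; the rank and matrix sizes here are illustrative, not taken from the slides:

```python
import numpy as np

def svd_compress(W, k):
    """Approximate W (m x n) by the product of two low-rank factors.

    Returns V_k (m x k) and U_k (k x n) with W ≈ V_k @ U_k, so the layer
    W @ y + b becomes V_k @ (U_k @ y) + b with far fewer parameters.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    V_k = U[:, :k] * s[:k]   # absorb the singular values into one factor
    U_k = Vt[:k, :]
    return V_k, U_k

# Illustrative example: a 2048x2048 weight matrix compressed to rank 256.
rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 2048))
V_k, U_k = svd_compress(W, k=256)
# Parameters: 2048*2048 ≈ 4.2M  vs  2*2048*256 ≈ 1.0M
```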
SVD-DNN Architecture (alternate view)

[Diagram: layers L1 to L3, each a Linear, Non-Linear, Linear sequence, from Features up to the Softmax output]

• Each layer is a composition of an affine, a non-linear, and a linear operation:
  f_l(y) = V_l σ(U_l y + b_l)
• The SVD-DNN is a composition of several of these layers.
• Each layer:
  • Maps one continuous vector-space embedding to another.
  • Is a general function approximator.
  • Is very inefficient at representing a linear transformation.
LA-DNN Architecture

[Diagram: the same Linear/Non-Linear stack as the SVD-DNN, with a linear bypass added to every layer]

• Similar to the SVD-DNN, but each layer is augmented with a linear term (a sketch follows this slide):
  f_l(y) = V_l σ(U_l y + b_l) + T_l y
• These layers:
  • Use T_l to model any linear component of the desired layer transformation.
  • Use U_l, V_l, b_l to model the non-linear residual.
  • Possess greater modeling power with a similar parameter count.
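A minimal sketch of the augmented layer under the same assumptions as above, with T_l taken to be diagonal (the configuration the slides report as best); names, sizes, and initial values are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def la_dnn_layer(y, U, V, b, t_diag):
    """LA-DNN layer: low-rank non-linear path plus a diagonal linear bypass.

    Computes V @ sigmoid(U @ y + b) + t_diag * y, i.e. T_l = diag(t_diag).
    The bypass requires the layer input and output to share one dimension.
    """
    return V @ sigmoid(U @ y + b) + t_diag * y

# Illustrative 1024-dim layer with a 512-dim bottleneck.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(512, 1024))
V = rng.normal(scale=0.01, size=(1024, 512))
b = np.zeros(512)
t_diag = np.ones(1024)   # illustrative initialization of the bypass weights
y = rng.normal(size=1024)
h = la_dnn_layer(y, U, V, b, t_diag)
```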
Network Type Comparison

| Network     | Train from Random | Pre-training | Compressed | Notes                                                                |
|-------------|-------------------|--------------|------------|----------------------------------------------------------------------|
| Typical DNN | Yes               | Available    | No         | Vanishing gradients; over-parameterized; large model; unused capacity |
| SVD-DNN     | No                | Required     | Yes        | DNN approximation; smaller model; difficult to train                  |
| LA-DNN      | Yes               | Unnecessary  | Yes        |                                                                      |
LA-DNN Linear Component Parameters

• Recall the formula for the LA-DNN layer:
  f_l(y) = V_l σ(U_l y + b_l) + T_l y
• The matrix T_l can be the identity matrix (all three options are sketched after this slide):
  • Fewest parameters, least flexible.
• It can be a full matrix:
  • Most flexible, but increases the parameter count considerably.
• It can be a diagonal matrix:
  • A balance between flexibility and parameter count.
  • The best configuration in our experiments.
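A hedged sketch of the three parameterizations of the bypass term and their per-layer parameter costs; the dimension d and the values used are purely illustrative:

```python
import numpy as np

d = 1024  # illustrative layer width

def bypass(y, mode, params=None):
    """Apply the linear bypass T_l @ y for the three choices of T_l."""
    if mode == "identity":      # 0 extra parameters
        return y
    if mode == "diagonal":      # d extra parameters
        return params * y       # params: vector of length d
    if mode == "full":          # d*d extra parameters
        return params @ y       # params: d x d matrix
    raise ValueError(mode)

rng = np.random.default_rng(0)
y = rng.normal(size=d)
print(np.allclose(bypass(y, "identity"), y))            # True
print(bypass(y, "diagonal", np.full(d, 0.5)).shape)     # (1024,)
print(bypass(y, "full", np.eye(d)).shape)               # (1024,)
```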
LA-DNN Linear Component Values

• Q: How strongly does the network weight the linear component of its transform?
• A: The linear component receives a lower weight in the higher layers.
TIMIT Results (baseline)

• The DNN-Sigmoid system size has been tuned to minimize TIMIT PER.
• The LA-DNN variants easily beat the tuned DNN system:
  • Better - improvements in all metrics.
  • Faster - drastically fewer parameters speed up training and evaluation.
  • Deeper - LA layers are able to benefit from a deeper network structure.

| Model            | Num of H. Layers | Layer Size | # Params | Training CE | Training Frame Err % | Validation CE | Validation Frame Err % | PER % |
|------------------|------------------|------------|----------|-------------|----------------------|---------------|------------------------|-------|
| DNN + Sigmoid    | 2                | 2048x2048  | 10.9M    | 0.66        | 21.39                | 1.23          | 37.67                  | 23.63 |
| LA-DNN + Sigmoid | 6                | 1024x512   | 8M       | 0.61        | 20.5                 | 1.18          | 35.8                   | 22.28 |
| LA-DNN + ReLU    | 6                | 1024x256   | 4.5M     | 0.54        | 18.6                 | 1.22          | 35.5                   | 22.08 |
TIMIT Results (Going Deeper)

• Keeping the parameter count well under the baseline (10.9M), all metrics continue to improve, to at least forty-eight layers deep.

LA-DNN with ReLU Units

| Num of H. Layers | Layer Size | # Params | Training CE | Training Frame Err % | Validation CE | Validation Frame Err % | PER % |
|------------------|------------|----------|-------------|----------------------|---------------|------------------------|-------|
| 3                | 1024x256   | 2.9M     | 0.61        | 20.7                 | 1.2           | 35.77                  | 22.39 |
| 6                | 1024x256   | 4.5M     | 0.54        | 18.6                 | 1.22          | 35.5                   | 22.08 |
| 12               | 512x256    | 3.8M     | 0.55        | 19.2                 | 1.21          | 35.5                   | 21.8  |
| 24               | 256x256    | 3.5M     | 0.55        | 19.31                | 1.21          | 35.3                   | 22.06 |
| 48               | 256x128    | 3.4M     | 0.56        | 19.5                 | 1.21          | 35.4                   | 21.7  |
AMI-IHM Results

• DNN+Sigmoid WER increases if the model size is reduced.
• LA-DNN+Sigmoid beats DNN+Sigmoid with fewer parameters.
• LA-DNN+ReLU beats DNN+ReLU with fewer parameters.

| Model          | Num of H. Layers | Layer Size | # Params | Training CE | Training Frame Err % | Validation CE | Validation Frame Err % | WER % |
|----------------|------------------|------------|----------|-------------|----------------------|---------------|------------------------|-------|
| DNN+Sigmoid    | 6                | 2048x2048  | 37.6M    | 1.46        | 37.83                | 2.11          | 49.3                   | 31.67 |
| DNN+Sigmoid    | 6                | 1024x1024  | 12.5M    | 1.59        | 40.75                | 2.13          | 50.0                   | 32.43 |
| DNN+ReLU       | 6                | 1024x1024  | 12.5M    | 1.45        | 40.47                | 2.00          | 47.5                   | 31.54 |
| LA-DNN+Sigmoid | 6                | 2048x512   | 18.4M    | 1.35        | 35.3                 |               |                        | 31.88 |
| LA-DNN+ReLU    | 6                | 1024x512   | 10.5M    | 1.34        | 35.7                 | 2.02          | 47.3                   | 30.68 |
AMI-IHM Results (Going Deeper)

• A deeper network, with fewer parameters, improves all validation metrics, to at least forty-eight layers.
• The larger 48-layer system is slightly better.

LA-DNN with ReLU Units

| Num of H. Layers | Layer Size | # Params | Training CE | Training Frame Err % | Validation CE | Validation Frame Err % | WER % |
|------------------|------------|----------|-------------|----------------------|---------------|------------------------|-------|
| 3                | 2048x512   | 12.1M    | 1.34        | 35.6                 | 2.03          | 47.8                   | 31.5  |
| 6                | 1024x512   | 10.5M    | 1.34        | 35.7                 | 2.00          | 47.3                   | 30.7  |
| 12               | 1024x256   | 8.9M     | 1.31        | 35.2                 | 2.01          | 47.2                   | 30.4  |
| 24               | 512x256    | 8.2M     | 1.34        | 35.7                 | 1.99          | 47.2                   | 30.2  |
| 48               | 256x256    | 7.9M     | 1.35        | 35.9                 | 1.97          | 47.0                   | 29.9  |
| 48               | 512x256    | 14M      | 1.25        | 33.9                 | 2.00          | 46.7                   | 29.7  |
Relation of LA-DNN to Pre-training

• Do we really need a deep architecture?
  • Complicated functions with high-level abstraction (Bengio and LeCun, 2007).
  • More complex functions for the same number of parameters.
  • Hierarchical representations.
• How can we train a deep network like a normal DNN?
  • Problem with training deeper networks: vanishing gradients.
  • DNN solution => unsupervised pre-training.
  • LA-DNN solution => bypass connection (see the derivation after this slide).
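To make the bypass argument concrete, here is a short derivation (mine, not from the slides) of one layer's Jacobian; it shows a direct linear path for the gradient alongside the possibly saturating non-linear path:

```latex
% Jacobian of one LA-DNN layer, f_l(y) = V_l \sigma(U_l y + b_l) + T_l y
\[
  \frac{\partial f_l(y)}{\partial y}
    = V_l \,\mathrm{diag}\!\left(\sigma'(U_l y + b_l)\right) U_l + T_l .
\]
% Even when \sigma' saturates toward zero, the additive T_l term still carries
% gradient to the lower layers, which is the "bypass connection" named above.
```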
Relation of LA-DNN to Pre-training

[Plot: training CE vs. epoch; the LA-DNN reaches a smaller CE after the 1st epoch and converges after 20 epochs]

• What problem does pre-training really tackle?
• Pre-training initializes the network in a region of the parameter space that is:
  • A better starting point for the non-convex optimization.
  • Easier to optimize.
  • Near a better local optimum.
• The LA-DNN is naturally initialized to a good information-preserving, gradient-passing starting point.
Conclusion

• Proposed a new layer structure for DNNs, including "linear augmentation", which:
  • Tackles the vanishing gradient problem.
  • Improves the initial gradient computation and results in faster convergence.
  • Provides higher modeling capacity with fewer parameters.
  • Enables training truly deep networks.
• Faster convergence, smaller networks, better results.
BONUS SLIDES
DNN vs. LA-DNN

[Plot: average learning rate per sample vs. epoch for Baseline (2x2048), Augmented-Sigmoid (6x1024x512), and Augmented-ReLU (6x1024x512)]

• The LA-DNN starts in a basin of attraction of gradient descent corresponding to better generalization performance.
• It needs a smaller initial learning rate (LR):
  • DNN initial LR is 0.8:3.2.
  • LA-DNN LR is 0.1:0.4.
• It starts closer to the final solution in the parameter space and therefore needs a smaller step size.
Linearly Augmented Model

[Plot callouts: smaller CE after the 1st epoch; the LA-DNN converges after 20 epochs]

• Better gradients for the initial steps.
• Faster convergence rate.
• Better error backpropagation.
• Better initial model.
LA-DNN versus Deep Stacking Network (DSN)

• In a DSN, the input of layer l is the outputs of all previous layers stacked together (illustrated below).
• In a DSN, the layer dimension therefore grows, especially in very deep networks.
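A hedged sketch of the DSN-style input growth described above; the dimensions, random projections, and non-linearity are purely illustrative, and the LA-DNN bypass (shown earlier) keeps the layer width fixed instead:

```python
import numpy as np

rng = np.random.default_rng(0)

def dsn_forward(x, num_layers, hidden_dim=256):
    """DSN-style stacking: each layer sees all earlier outputs concatenated.

    The input dimension grows by hidden_dim with every layer, which is the
    growth the slide points out for very deep networks.
    """
    stacked = x
    for _ in range(num_layers):
        W = rng.normal(scale=0.01, size=(hidden_dim, stacked.shape[0]))
        h = np.tanh(W @ stacked)                 # illustrative non-linearity
        stacked = np.concatenate([stacked, h])   # the input keeps growing
    return stacked

x = rng.normal(size=440)
print(dsn_forward(x, num_layers=4).shape)        # (440 + 4*256,) = (1464,)
```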