Parameterised Sigmoid and ReLU Hidden Activation Functions for DNN Acoustic Modelling
Chao Zhang & Phil Woodland
University of Cambridge
29 April 2015
Introduction
• The hidden activation function plays an important role in deep learning, e.g., Pretraining (PT) & Finetuning (FT), and ReLU.
• Recent studies on learning parameterised activation functions have resulted in improved performance.
• We study parameterised forms of the Sigmoid (p-Sigmoid) and ReLU (p-ReLU) functions for SI DNN acoustic model training.
Parameterised Sigmoid Function
• The generalised form of Sigmoid, or the logistic function, is

    f_i(a_i) = \eta_i \cdot \frac{1}{1 + e^{-\gamma_i a_i + \theta_i}}

• η_i, γ_i, and θ_i have different effects on f_i(a_i):
  ◦ η_i defines the boundaries of f_i(a_i). It Learns Hidden Unit i's (positive, zero, or negative) Contribution (LHUC);
  ◦ γ_i controls the steepness of the curve;
  ◦ θ_i applies a horizontal displacement to f_i(a_i).
• No bias term is added to f_i(a_i), since it would work the same as the bias of the layer.
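As a concrete illustration (not from the original slides), a minimal NumPy sketch of the p-Sigmoid forward pass; the function and argument names are assumptions made for this example:

```python
import numpy as np

def p_sigmoid(a, eta, gamma, theta):
    """p-Sigmoid forward pass: f_i(a_i) = eta_i / (1 + exp(-gamma_i * a_i + theta_i)).

    a, eta, gamma, theta are arrays of shape (num_hidden_units,):
    eta scales each unit's output range, gamma controls the steepness,
    and theta shifts the curve horizontally.
    """
    return eta / (1.0 + np.exp(-gamma * a + theta))
```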
Parameterised Sigmoid Function
• By varying the parameters, p-Sigmoid(η_i, γ_i, θ_i) can give a piecewise approximation to other functions, e.g., the step, ReLU, and Soft ReLU functions (see the numerical sketch below).
• It can also represent tanh, if the bias of the layer is taken into account.

Figure: Piecewise approximation by p-Sigmoid functions, showing p-Sigmoid(1, 1, 0), p-Sigmoid(1, 30, 0), p-Sigmoid(4, 1, 2), p-Sigmoid(3, -2, 3), and p-Sigmoid(2, 2, 0).
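A small, self-contained numerical check of this idea (illustrative only, not from the slides): with a large γ the curve approaches a step function, while η rescales the output range and θ shifts it.

```python
import numpy as np

def p_sigmoid(a, eta, gamma, theta):
    return eta / (1.0 + np.exp(-gamma * a + theta))

a = np.linspace(-5.0, 5.0, 11)
# p-Sigmoid(1, 30, 0): very steep, behaves almost like a 0/1 step at a = 0
print(np.round(p_sigmoid(a, 1.0, 30.0, 0.0), 3))
# p-Sigmoid(4, 1, 2): output bounded by eta = 4, shifted to the right by theta = 2
print(np.round(p_sigmoid(a, 4.0, 1.0, 2.0), 3))
```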
Parameterised ReLU Function
• Associate a scaling factor with each part of the function, to allow the two ends of the "hinge" to rotate separately around the "pin":

    f_i(a_i) = \begin{cases} \alpha_i \cdot a_i & \text{if } a_i > 0 \\ \beta_i \cdot a_i & \text{if } a_i \leq 0 \end{cases}

Figure: Illustration of the hinge-like shape of the p-ReLU function.
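For symmetry with the p-Sigmoid sketch above, a minimal (assumed-name) NumPy version of the p-ReLU forward pass:

```python
import numpy as np

def p_relu(a, alpha, beta):
    """p-ReLU forward pass: alpha_i * a_i for a_i > 0, beta_i * a_i for a_i <= 0."""
    return np.where(a > 0, alpha * a, beta * a)

# Initialisation used later in the experiments: alpha = 1.0, beta = 0.25,
# i.e. training starts from a leaky-ReLU-like shape with negative slope 0.25.
```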
EBP for Parameterised Activation Functions
• Assume
  ◦ F is the objective function;
  ◦ i and j are node indexes in the current layer and the following layer;
  ◦ a_i, f_i(·) are the activation value and activation function of node i;
  ◦ ϑ_i is a parameter of f_i(·), and w_{ji} is an (extended) weight.
• According to the chain rule,

    \frac{\partial F}{\partial \vartheta_i} = \frac{\partial f_i(a_i)}{\partial \vartheta_i} \sum_j w_{ji} \frac{\partial F}{\partial a_j}

• Therefore, we need to compute
  ◦ ∂f_i(a_i)/∂a_i for training the weights & biases;
  ◦ ∂f_i(a_i)/∂ϑ_i for the activation function parameter ϑ_i.
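A hedged sketch of how this chain rule could be applied for one hidden layer, using p-ReLU as the example activation (the array shapes and variable names are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def p_relu_layer_grads(a, alpha, beta, W_next, dF_da_next):
    """Gradients for one p-ReLU hidden layer via the chain rule above.

    a          : (H,)   activation inputs a_i of this layer
    alpha, beta: (H,)   per-unit p-ReLU parameters
    W_next     : (J, H) weights w_ji from this layer's outputs to the next layer
    dF_da_next : (J,)   dF/da_j at the next layer's input nodes

    Returns dF/dalpha, dF/dbeta, and dF/da_i for this layer.
    """
    # sum_j w_ji * dF/da_j : the error back-propagated to hidden unit i
    err = W_next.T @ dF_da_next               # shape (H,)

    df_da     = np.where(a > 0, alpha, beta)  # df_i(a_i)/da_i
    df_dalpha = np.where(a > 0, a, 0.0)       # df_i(a_i)/dalpha_i
    df_dbeta  = np.where(a > 0, 0.0, a)       # df_i(a_i)/dbeta_i

    return df_dalpha * err, df_dbeta * err, df_da * err
```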
Experiments
• CE DNN-HMMs were trained on 72 hours of Mandarin CTS data.
• Three test sets were used: dev04, eval03, and eval97.
  ◦ 42d CMLLR(HLDA(PLP_0_D_A_T_Z)+Pitch_D_A) features;
  ◦ context shift set c = [−4, −3, −2, −1, 0, +1, +2, +3, +4].
• 63k word dictionary and trigram LM trained on 1 billion words.
• DNN structure: 378 × 1000^5 × 6005.
• Improved NewBob learning rate scheduler:
  ◦ Sigmoid & p-Sigmoid: τ_0 = 2.0 × 10^-3, N_min = 12;
  ◦ ReLU & p-ReLU: τ_0 = 5.0 × 10^-4, N_min = 8.
• η_i, γ_i, θ_i, α_i, and β_i were initialised to 1.0, 1.0, 0.0, 1.0, and 0.25, respectively.
• All GMM-HMM and DNN-HMM acoustic model training and decoding used HTK.
Experiments with p-Sigmoid
• Learning η_i, γ_i, and θ_i in both PT & FT, and in FT only.
• Using combinations did not outperform using separate parameters.

  ID   Activation Function          dev04
  S0   Sigmoid                      27.9
  S1   + p-Sigmoid(η_i, 1, 0)       27.6
  S2   + p-Sigmoid(1, γ_i, 0)       27.7
  S3   + p-Sigmoid(1, 1, θ_i)       27.7
  S1   p-Sigmoid(η_i, 1, 0)         27.1
  S2   p-Sigmoid(1, γ_i, 0)         27.5
  S3   p-Sigmoid(1, 1, θ_i)         27.4
  S6   p-Sigmoid(η_i, γ_i, θ_i)     27.3

Table: dev04 %WER for the p-Sigmoid systems. + means the activation function parameters were trained in both PT and FT.
Experiments with p-ReLU
• For ReLU DNNs, it is not useful to avoid the impact on the other parameters at the beginning (i.e., to freeze the activation function parameters in the 1st epoch).
• α_i has more impact on training than β_i.

  ID   Activation Function      dev04
  R0   ReLU                     27.6
  R1   p-ReLU(α_i, 0)           26.8
  R2   p-ReLU(1, β_i)           27.0
  R3   p-ReLU(α_i, β_i)         27.1
  R1   − p-ReLU(α_i, 0)         27.4
  R2   − p-ReLU(1, β_i)         27.0

Table: dev04 %WER for the p-ReLU systems. − indicates the activation function parameters were frozen in the 1st epoch.
Results on All Test Sets
• S1 and R1 had 3.4% and 2.0% relatively lower WER than S0 and R0, while increasing the number of parameters by only 0.06%.
• p-Sigmoid obtains gains by making Sigmoid more similar to ReLU.
• Weighting the contribution of each hidden unit individually is quite useful.

  ID   Activation Function     eval97   eval03   dev04
  S0   Sigmoid                 34.1     29.7     27.9
  S1   p-Sigmoid(η_i, 1, 0)    32.9     28.6     27.1
  R0   ReLU                    33.3     29.1     27.6
  R1   p-ReLU(α_i, 0)          32.7     28.7     26.8

Table: %WER on all test sets.
Summary
• Different types of parameters of p-Sigmoid and p-ReLU, and their combinations, were analysed and compared.
• A scaling factor with no constraint imposed is the most useful. With the linear scaling factors,
  ◦ p-Sigmoid(η_i, 1, 0) resulted in a 3.4% relative WER reduction compared with Sigmoid;
  ◦ p-ReLU(α_i, 0) reduced the WER by 2.0% relative over the ReLU baseline.
• Learning different types of parameters simultaneously is difficult.
Appendix: Parameterised Sigmoid Function
• To compute the exact derivatives, it is necessary to store a_i.

    \frac{\partial f_i(a_i)}{\partial a_i} =
      \begin{cases} 0 & \text{if } \eta_i = 0 \\
                    \gamma_i f_i(a_i)\,(1 - \eta_i^{-1} f_i(a_i)) & \text{if } \eta_i \neq 0 \end{cases}

    \frac{\partial f_i(a_i)}{\partial \eta_i} =
      \begin{cases} (1 + e^{-\gamma_i a_i + \theta_i})^{-1} & \text{if } \eta_i = 0 \\
                    \eta_i^{-1} f_i(a_i) & \text{if } \eta_i \neq 0 \end{cases}

    \frac{\partial f_i(a_i)}{\partial \gamma_i} =
      \begin{cases} 0 & \text{if } \eta_i = 0 \\
                    a_i f_i(a_i)\,(1 - \eta_i^{-1} f_i(a_i)) & \text{if } \eta_i \neq 0 \end{cases}

    \frac{\partial f_i(a_i)}{\partial \theta_i} =
      \begin{cases} 0 & \text{if } \eta_i = 0 \\
                    -f_i(a_i)\,(1 - \eta_i^{-1} f_i(a_i)) & \text{if } \eta_i \neq 0 \end{cases}

• For p-Sigmoid(η_i, 1, 0), if ∂f_i(a_i)/∂η_i is set to 0 when η_i = 0, a_i does not need to be stored.
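A minimal NumPy sketch of these exact derivatives (names assumed; it rewrites 1 − η_i^{-1} f_i(a_i) as 1 − s with s the unscaled logistic, which covers the η_i = 0 case without dividing by zero):

```python
import numpy as np

def p_sigmoid_grads(a, eta, gamma, theta):
    """Exact derivatives of f_i(a_i) = eta_i / (1 + exp(-gamma_i * a_i + theta_i)).

    Requires the stored activation inputs a, as noted on the slide.
    """
    s = 1.0 / (1.0 + np.exp(-gamma * a + theta))   # unscaled logistic
    f = eta * s                                    # f_i(a_i)
    # When eta_i = 0, f = 0, so the a-, gamma- and theta-derivatives vanish,
    # matching the eta_i = 0 branches of the case expressions above.
    df_da     = gamma * f * (1.0 - s)
    df_deta   = s   # eta^{-1} f when eta != 0; (1 + e^{-gamma*a+theta})^{-1} when eta = 0
    df_dgamma = a * f * (1.0 - s)
    df_dtheta = -f * (1.0 - s)
    return df_da, df_deta, df_dgamma, df_dtheta
```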
Appendix: Parameterised ReLU Function
• Since α_i and β_i can be any real number, it is not possible to tell the sign of a_i from f_i(a_i).

    \frac{\partial f_i(a_i)}{\partial a_i} =
      \begin{cases} \alpha_i & \text{if } a_i > 0 \\ \beta_i & \text{if } a_i \leq 0 \end{cases}

    \frac{\partial f_i(a_i)}{\partial \alpha_i} =
      \begin{cases} a_i & \text{if } a_i > 0 \\ 0 & \text{if } a_i \leq 0 \end{cases}

    \frac{\partial f_i(a_i)}{\partial \beta_i} =
      \begin{cases} 0 & \text{if } a_i > 0 \\ a_i & \text{if } a_i \leq 0 \end{cases}

• For p-ReLU(α_i, 0), if ∂f_i(a_i)/∂α_i is set to 0 when α_i = 0, it is not necessary to store a_i.
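As a hedged reading of that last point (my reconstruction, not the authors' code): for p-ReLU(α_i, 0) the sign of a_i can be recovered from f_i(a_i) whenever α_i ≠ 0, and ∂f_i(a_i)/∂α_i is simply taken as 0 where α_i = 0, so both derivatives can be computed from the layer output alone.

```python
import numpy as np

def p_relu_alpha_grads_from_output(f, alpha):
    """Derivatives for p-ReLU(alpha_i, 0) computed from the output f alone.

    Since f = alpha * a for a > 0 and f = 0 otherwise, a_i > 0 exactly when
    f_i != 0 (for alpha_i != 0); where alpha_i = 0 the alpha-derivative is
    set to 0 as suggested on the slide, so a_i never has to be stored.
    """
    positive   = f != 0.0                            # recovers "a_i > 0" when alpha_i != 0
    df_da      = np.where(positive, alpha, 0.0)      # alpha_i if a_i > 0, else beta_i = 0
    safe_alpha = np.where(alpha != 0.0, alpha, 1.0)  # avoid division by zero
    df_dalpha  = np.where(alpha != 0.0, f / safe_alpha, 0.0)   # a_i if a_i > 0, else 0
    return df_da, df_dalpha
```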