Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions
Authors: S. Scardapane, S. Van Vaerenbergh, D. Comminiello, S. Totaro and A. Uncini
2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP'18)
Contents
1. Introduction: Overview
2. Gated recurrent networks: Formulation
3. Proposed gate with flexible sigmoid: Kernel activation function; KAF generalization for gates
4. Experimental validation: Experimental setup; Results
5. Conclusion and future works: Summary and future outline
Content at a glance
Setting: Gated units have become an integral part of deep learning (e.g., LSTMs, highway networks, ...).
State-of-the-art: Small number of studies on how to design more flexible gate architectures (e.g., Gao and Glowacka, ACML 2016).
Objective: Design an enhanced gate, with a small number of additional adaptable parameters, to model a wider range of gating functions.
Gated recurrent networks
Gated unit: basic model
Definition: (vanilla) gated unit. For a generic input $\mathbf{x}$ we have:
$$g(\mathbf{x}) = \sigma(\mathbf{W}\mathbf{x}) \odot f(\mathbf{x}), \qquad (1)$$
where $\sigma(\cdot)$ is the sigmoid function, $\odot$ is the element-wise multiplication, and $f(\mathbf{x})$ is a generic network component.
Notable examples:
- LSTM networks (Hochreiter and Schmidhuber, 1997).
- Gated recurrent units (Cho et al., 2014).
- Highway networks (Srivastava et al., 2015).
- Neural arithmetic logic units (Trask et al., 2018).
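As an illustration, here is a minimal NumPy sketch of the vanilla gated unit in Eq. (1). The weight matrix, the wrapped component f, and the toy data are placeholders chosen only for this example, not part of the original slides.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gated_unit(x, W, f):
    """Vanilla gated unit, Eq. (1): g(x) = sigmoid(W x) * f(x), element-wise."""
    return sigmoid(W @ x) * f(x)

# Toy usage: f is an arbitrary network component (here a fixed tanh layer).
rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d, d))
V = rng.standard_normal((d, d))
f = lambda x: np.tanh(V @ x)
x = rng.standard_normal(d)
print(gated_unit(x, W, f))
```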
Gated recurrent unit (GRU)
At each time step $t$ we receive $\mathbf{x}_t \in \mathbb{R}^d$ and update the internal state $\mathbf{h}_{t-1}$ as:
$$\mathbf{u}_t = \sigma(\mathbf{W}_u \mathbf{x}_t + \mathbf{V}_u \mathbf{h}_{t-1} + \mathbf{b}_u), \qquad (2)$$
$$\mathbf{r}_t = \sigma(\mathbf{W}_r \mathbf{x}_t + \mathbf{V}_r \mathbf{h}_{t-1} + \mathbf{b}_r), \qquad (3)$$
$$\mathbf{h}_t = (1 - \mathbf{u}_t) \circ \mathbf{h}_{t-1} + \mathbf{u}_t \circ \tanh\left(\mathbf{W}_h \mathbf{x}_t + \mathbf{V}_h (\mathbf{r}_t \circ \mathbf{h}_{t-1}) + \mathbf{b}_h\right), \qquad (4)$$
where (2)-(3) are the update gate and reset gate, respectively.
Cho, K. et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
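A minimal NumPy sketch of one GRU step implementing Eqs. (2)-(4). Parameter names follow the slide; the dictionary-based parameter container, dimensions, and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gru_step(x_t, h_prev, p):
    """One GRU update: update gate u_t (2), reset gate r_t (3), new state h_t (4)."""
    u_t = sigmoid(p["W_u"] @ x_t + p["V_u"] @ h_prev + p["b_u"])              # Eq. (2)
    r_t = sigmoid(p["W_r"] @ x_t + p["V_r"] @ h_prev + p["b_r"])              # Eq. (3)
    h_tilde = np.tanh(p["W_h"] @ x_t + p["V_h"] @ (r_t * h_prev) + p["b_h"])  # candidate state
    return (1.0 - u_t) * h_prev + u_t * h_tilde                               # Eq. (4)

# Toy usage with random parameters (d-dimensional input, m-dimensional state).
rng = np.random.default_rng(0)
d, m = 3, 5
p = {name: rng.standard_normal((m, d)) for name in ("W_u", "W_r", "W_h")}
p.update({name: rng.standard_normal((m, m)) for name in ("V_u", "V_r", "V_h")})
p.update({name: np.zeros(m) for name in ("b_u", "b_r", "b_h")})
h = np.zeros(m)
for x_t in rng.standard_normal((10, d)):  # process a sequence of 10 steps
    h = gru_step(x_t, h, p)
print(h)
```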
Training the network (classification)
We are given $N$ sequences $\{\mathbf{x}_t^i\}_{i=1}^{N}$ with labels $y_i \in \{1, \ldots, C\}$; $\mathbf{h}^i$ is the internal state of the GRU after processing the $i$-th sequence. This is fed through another layer with a softmax activation function for classification:
$$\hat{\mathbf{y}}^i = \mathrm{softmax}\left(\mathbf{A}\mathbf{h}^i + \mathbf{b}\right). \qquad (5)$$
We then minimize the average cross-entropy between the real classes and the predicted classes:
$$J(\boldsymbol{\theta}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \mathbb{1}\left[y_i = c\right] \log\left(\hat{y}_c^i\right). \qquad (6)$$
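A small NumPy sketch of the classification head and loss in Eqs. (5)-(6), assuming the final GRU states are already stacked row-wise in a matrix H; the matrix A, bias b, and toy data are illustrative placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(H, y, A, b):
    """Eq. (5): class probabilities from the final states; Eq. (6): average cross-entropy."""
    probs = softmax(H @ A.T + b)                      # shape (N, C)
    N = H.shape[0]
    return -np.mean(np.log(probs[np.arange(N), y]))   # picks log p(y_i) for each sequence i

# Toy usage: N=8 final states of dimension m=5, C=3 classes.
rng = np.random.default_rng(0)
N, m, C = 8, 5, 3
H = rng.standard_normal((N, m))
y = rng.integers(0, C, size=N)
A, b = rng.standard_normal((C, m)), np.zeros(C)
print(cross_entropy(H, y, A, b))
```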
Proposed gate with flexible sigmoid
Summary of the proposal
Key items of our proposal:
1. We maintain the linear component, but replace the sigmoid element-wise operation with a generalized sigmoid function.
2. We build on the kernel activation function (KAF), a recently proposed non-parametric activation function.
3. We modify the KAF to ensure that it behaves correctly as a gating function.
Basic structure of the KAF
A KAF models each activation function in terms of a kernel expansion over $D$ terms as:
$$\mathrm{KAF}(s) = \sum_{i=1}^{D} \alpha_i \kappa(s, d_i), \qquad (7)$$
where:
1. $\{\alpha_i\}_{i=1}^{D}$ are the mixing coefficients;
2. $\{d_i\}_{i=1}^{D}$ are the dictionary elements;
3. $\kappa(\cdot, \cdot) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a 1D kernel function.
Scardapane, S., Van Vaerenbergh, S., Totaro, S. and Uncini, A., 2017. Kafnets: kernel-based non-parametric activation functions for neural networks. arXiv preprint arXiv:1707.04035.
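A minimal NumPy sketch of Eq. (7) with the 1D Gaussian kernel κ(s, d_i) = exp(-γ (s - d_i)²). The dictionary, bandwidth γ, and mixing coefficients below are placeholders for illustration only.

```python
import numpy as np

def kaf(s, alpha, dictionary, gamma=1.0):
    """Kernel activation function, Eq. (7): sum_i alpha_i * exp(-gamma * (s - d_i)^2)."""
    s = np.asarray(s, dtype=float)[..., None]      # broadcast activations over the dictionary
    K = np.exp(-gamma * (s - dictionary) ** 2)     # Gaussian kernel values, shape (..., D)
    return K @ alpha                               # mix with the adaptable coefficients

# Toy usage: D=20 dictionary elements sampled uniformly around zero.
rng = np.random.default_rng(0)
D = 20
dictionary = np.linspace(-3.0, 3.0, D)
alpha = rng.standard_normal(D) * 0.3
print(kaf([-1.0, 0.0, 2.5], alpha, dictionary))
```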
Extending KAFs for gated units
We cannot use a KAF straightforwardly because it is unbounded and potentially vanishing to zero (e.g., with the Gaussian kernel). We use the following modified formulation for the flexible gate:
$$\sigma_{\mathrm{KAF}}(s) = \sigma\left(\frac{1}{2}\mathrm{KAF}(s) + \frac{1}{2}s\right). \qquad (8)$$
As in the original KAF, the dictionary elements are fixed (by uniform sampling around 0), while we adapt everything else.
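Building on the previous sketch, a hedged NumPy version of the flexible gate in Eq. (8): the KAF output and the raw activation are averaged and then squashed by the outer sigmoid, so the gate always stays in (0, 1). Dictionary, bandwidth, and coefficient values are again illustrative.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def kaf(s, alpha, dictionary, gamma=1.0):
    s = np.asarray(s, dtype=float)[..., None]
    return np.exp(-gamma * (s - dictionary) ** 2) @ alpha

def sigma_kaf(s, alpha, dictionary, gamma=1.0):
    """Flexible gate, Eq. (8): sigmoid(0.5 * KAF(s) + 0.5 * s), bounded in (0, 1)."""
    return sigmoid(0.5 * kaf(s, alpha, dictionary, gamma) + 0.5 * np.asarray(s, dtype=float))

# Toy usage: with alpha = 0 the gate reduces to sigmoid(s / 2).
dictionary = np.linspace(-3.0, 3.0, 20)
alpha = np.zeros(20)
print(sigma_kaf([-2.0, 0.0, 2.0], alpha, dictionary))
```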
Visualizing the new gates
Figure 1: Random samples of the proposed flexible gates with Gaussian kernel and different hyperparameters. Panels (a) γ = 1.0, (b) γ = 0.5, (c) γ = 0.1 plot the value of the gate against the activation.
Initializing the mixing coefficients
To simplify optimization we initialize the mixing coefficients to approximate the identity function:
$$\boldsymbol{\alpha} = \left(\mathbf{K} + \varepsilon \mathbf{I}\right)^{-1} \mathbf{d}, \qquad (9)$$
where $\varepsilon > 0$ is a small constant. We then use a different set of mixing coefficients for each gate (the update and reset gates of the GRU).
[Figure: gate output versus activation for the initialized flexible gate.]
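A small NumPy sketch of the identity-approximating initialization in Eq. (9), assuming K is the Gaussian kernel matrix evaluated on the dictionary elements and ε acts as a small ridge term; the concrete values below are illustrative.

```python
import numpy as np

def init_alpha(dictionary, gamma=1.0, eps=1e-4):
    """Eq. (9): alpha = (K + eps*I)^{-1} d, so that KAF(d_j) ~= d_j (identity) at init."""
    K = np.exp(-gamma * (dictionary[:, None] - dictionary[None, :]) ** 2)  # D x D kernel matrix
    return np.linalg.solve(K + eps * np.eye(len(dictionary)), dictionary)

# Toy check: the KAF with these coefficients approximates the identity on the dictionary.
dictionary = np.linspace(-2.0, 2.0, 20)
alpha = init_alpha(dictionary)
K = np.exp(-1.0 * (dictionary[:, None] - dictionary[None, :]) ** 2)
print(np.max(np.abs(K @ alpha - dictionary)))  # residual should be close to zero
```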
Experimental validation