Some RNN Variants
Arun Mallya
Outline

• Why Recurrent Neural Networks (RNNs)?
• The Vanilla RNN unit
• The RNN forward pass
• Backpropagation refresher
• The RNN backward pass
• Issues with the Vanilla RNN
• The Long Short-Term Memory (LSTM) unit
• The LSTM forward & backward pass
• LSTM variants and tips
  – Peephole LSTM
  – GRU
The Vanilla RNN Cell

[Figure: the vanilla RNN cell; inputs x_t and h_{t-1} pass through the shared weights W to produce h_t.]

h_t = \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right)
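A minimal NumPy sketch of this cell; the weight matrix W and the dimensions are illustrative assumptions, and the bias is omitted to match the slide's equation:

```python
import numpy as np

def vanilla_rnn_cell(x_t, h_prev, W):
    """One step of the vanilla RNN: h_t = tanh(W [x_t; h_{t-1}]).

    x_t:    input at time t, shape (input_dim,)
    h_prev: previous hidden state, shape (hidden_dim,)
    W:      weight matrix, shape (hidden_dim, input_dim + hidden_dim)
    """
    concat = np.concatenate([x_t, h_prev])  # stack x_t on top of h_{t-1}
    return np.tanh(W @ concat)

# Example usage with made-up sizes
input_dim, hidden_dim = 4, 8
W = np.random.randn(hidden_dim, input_dim + hidden_dim) * 0.1
h = np.zeros(hidden_dim)
x = np.random.randn(input_dim)
h = vanilla_rnn_cell(x, h, W)
```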
The Vanilla RNN Forward

[Figure: the RNN unrolled over three time steps; inputs (x_1, h_0), (x_2, h_1), (x_3, h_2) produce hidden states h_1, h_2, h_3, outputs y_1, y_2, y_3, and losses C_1, C_2, C_3. The same weights W are shared across all time steps.]

h_t = \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right)
y_t = F(h_t)
C_t = \mathrm{Loss}(y_t, GT_t)
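A hedged sketch of the unrolled forward pass. The linear readout for F and the squared-error loss are illustrative choices, since the slides leave F and Loss unspecified:

```python
import numpy as np

def rnn_forward(xs, h0, W, W_out, targets):
    """Unrolled forward pass: the same W is reused at every time step."""
    h, hs, ys, losses = h0, [], [], []
    for x_t, gt_t in zip(xs, targets):
        h = np.tanh(W @ np.concatenate([x_t, h]))     # h_t = tanh(W [x_t; h_{t-1}])
        y = W_out @ h                                  # y_t = F(h_t), here a linear map
        hs.append(h)
        ys.append(y)
        losses.append(0.5 * np.sum((y - gt_t) ** 2))   # C_t = Loss(y_t, GT_t)
    return hs, ys, losses
```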
The Vanilla RNN Backward

[Figure: the same unrolled network, with gradients flowing backward from each loss C_t to the earliest hidden state h_1.]

h_t = \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right), \quad y_t = F(h_t), \quad C_t = \mathrm{Loss}(y_t, GT_t)

\frac{\partial C_t}{\partial h_1} = \left( \frac{\partial C_t}{\partial y_t} \right) \left( \frac{\partial y_t}{\partial h_1} \right)
= \left( \frac{\partial C_t}{\partial y_t} \right) \left( \frac{\partial y_t}{\partial h_t} \right) \left( \frac{\partial h_t}{\partial h_{t-1}} \right) \cdots \left( \frac{\partial h_2}{\partial h_1} \right)
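A rough sketch of that chain of Jacobians for the vanilla RNN. Here W_h denotes the columns of W that multiply h_{t-1}, and the local Jacobian \partial h_k / \partial h_{k-1} = \mathrm{diag}(1 - h_k^2)\, W_h follows from the tanh update; names and shapes are illustrative assumptions:

```python
import numpy as np

def grad_ht_wrt_h1(hs, W_h):
    """Product of Jacobians dh_t/dh_1 = prod_k diag(1 - h_k^2) @ W_h,
    where hs = [h_2, ..., h_t] are the hidden states after h_1."""
    hidden_dim = W_h.shape[0]
    J = np.eye(hidden_dim)
    for h_k in reversed(hs):                      # chain dh_k/dh_{k-1} back toward h_1
        J = J @ (np.diag(1.0 - h_k ** 2) @ W_h)   # tanh'(a) = 1 - tanh(a)^2
    return J
```

Because this is a long product of factors, the gradient tends to shrink (vanish) or blow up (explode) with the sequence length, which is the issue the LSTM addresses next.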
The Popular LSTM Cell

[Figure: the LSTM cell with input gate i_t, output gate o_t, forget gate f_t, and cell state c_t; each gate sees (x_t, h_{t-1}). The dashed line indicates a time-lag (c_{t-1}).]

f_t = \sigma\left( W_f \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b_f \right), and similarly for i_t, o_t

c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right)

h_t = o_t \otimes \tanh(c_t)
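A minimal NumPy sketch of one LSTM step following these equations. The weight names and shapes are assumptions for illustration, and the candidate update has no bias, matching the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W, b_f, b_i, b_o):
    """One LSTM step; the ⊗ in the slides is element-wise multiplication."""
    z = np.concatenate([x_t, h_prev])           # [x_t; h_{t-1}]
    f_t = sigmoid(W_f @ z + b_f)                # forget gate
    i_t = sigmoid(W_i @ z + b_i)                # input gate
    o_t = sigmoid(W_o @ z + b_o)                # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W @ z)   # new cell state
    h_t = o_t * np.tanh(c_t)                    # new hidden state
    return h_t, c_t
```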
LSTM – Forward/Backward

Go to: Illustrated LSTM Forward and Backward Pass
Class Exercise

• Consider the problem of translation from English to French
• E.g. "What is your name" → "Comment tu t'appelles"
• Is the architecture below suitable for this problem?

[Figure: an RNN unrolled over three time steps that reads English words E_1, E_2, E_3 and emits French words F_1, F_2, F_3 at the same positions.]

• No: sentences might be of different lengths and words might not align, so we need to see the entire sentence before translating

Adapted from http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf
Class Exercise

• Sentences might be of different lengths and words might not align, so we need to see the entire sentence before translating

[Figure: an encoder-decoder architecture; the network first reads the full English sentence E_1, E_2, E_3, then generates the French output F_1, F_2, F_3, F_4.]

• The input-output structure depends on the structure of the problem at hand

Seq2Seq Learning with Neural Networks, Sutskever et al., 2014
Multi-layer RNNs

• We can of course design RNNs with multiple hidden layers

[Figure: a two-layer RNN unrolled over six time steps, mapping inputs x_1 … x_6 to outputs y_1 … y_6 through two stacked hidden layers.]

• Think exotic: skip connections across layers, across time, …
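A hedged sketch of stacking: each layer's hidden states become the next layer's input sequence. The per-layer weight list and the reuse of the vanilla cell above are illustrative assumptions:

```python
import numpy as np

def stacked_rnn_forward(xs, Ws, hidden_dim):
    """Multi-layer vanilla RNN: layer l consumes layer l-1's hidden states.

    xs: list of input vectors over time
    Ws: one weight matrix per layer, each of shape
        (hidden_dim, layer_input_dim + hidden_dim)
    """
    layer_inputs = xs
    for W in Ws:                                    # one pass per stacked layer
        h = np.zeros(hidden_dim)
        outputs = []
        for x_t in layer_inputs:
            h = np.tanh(W @ np.concatenate([x_t, h]))
            outputs.append(h)
        layer_inputs = outputs                      # feed hidden states upward
    return layer_inputs                             # top-layer hidden states
```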
Bi-directional RNNs

• RNNs can process the input sequence in the forward and in the reverse direction

[Figure: a bi-directional RNN unrolled over six time steps; a forward chain and a backward chain each read x_1 … x_6, and their hidden states are combined to produce y_1 … y_6.]

• Popular in speech recognition
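A minimal sketch of the idea, assuming two independent vanilla RNNs whose states are concatenated at each time step; the combination rule is an assumption, and other choices such as summation are also used in practice:

```python
import numpy as np

def bidirectional_rnn(xs, W_fwd, W_bwd, hidden_dim):
    """Run one RNN left-to-right and one right-to-left, then join their states."""
    def run(seq, W):
        h, states = np.zeros(hidden_dim), []
        for x_t in seq:
            h = np.tanh(W @ np.concatenate([x_t, h]))
            states.append(h)
        return states

    fwd = run(xs, W_fwd)               # reads x_1 ... x_T
    bwd = run(xs[::-1], W_bwd)[::-1]   # reads x_T ... x_1, then re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```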
Recap

• RNNs allow processing of variable-length inputs and outputs by maintaining state information across time steps
• Various input-output scenarios are possible (single/multiple inputs and outputs)
• RNNs can be stacked, or made bi-directional
• Vanilla RNNs are improved upon by LSTMs, which address the vanishing gradient problem through the CEC (constant error carousel)
• Exploding gradients are handled by gradient clipping
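Gradient clipping from the last bullet is typically a one-liner; a sketch of clipping by global norm, where the threshold value is an arbitrary illustrative choice:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```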
Extension I: Peephole LSTM

[Figure: the LSTM cell as before, but the gates now also see the cell state. The dashed line indicates a time-lag (c_{t-1}).]

f_t = \sigma\left( W_f \begin{bmatrix} x_t \\ h_{t-1} \\ c_{t-1} \end{bmatrix} + b_f \right), and similarly for i_t, o_t (o_t uses c_t)

c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right)

h_t = o_t \otimes \tanh(c_t)
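A sketch of the peephole gates on top of the earlier LSTM step; weight names are assumptions, and note that the output gate peeks at the freshly computed c_t:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def peephole_lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W, b_f, b_i, b_o):
    """LSTM step where the gates also see the cell state (peephole connections)."""
    z_prev = np.concatenate([x_t, h_prev, c_prev])    # f_t, i_t peek at c_{t-1}
    f_t = sigmoid(W_f @ z_prev + b_f)
    i_t = sigmoid(W_i @ z_prev + b_i)
    c_t = f_t * c_prev + i_t * np.tanh(W @ np.concatenate([x_t, h_prev]))
    z_new = np.concatenate([x_t, h_prev, c_t])        # o_t peeks at the new c_t
    o_t = sigmoid(W_o @ z_new + b_o)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```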
Peephole LSTM

• Without peepholes, the gates can only see the output from the previous time step, which is close to 0 if the output gate is closed; yet these gates control the CEC cell
• Peepholes helped the LSTM learn better timing on the problems tested: spike timing and counting spike time delays

Recurrent nets that time and count, Gers et al., 2000
Other minor variants

• Coupled input and forget gate: f_t = 1 - i_t

• Full gate recurrence (the gates also see the previous gate activations):

f_t = \sigma\left( W_f \begin{bmatrix} x_t \\ h_{t-1} \\ c_{t-1} \\ i_{t-1} \\ f_{t-1} \\ o_{t-1} \end{bmatrix} + b_f \right)
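The coupled variant only changes how f_t is obtained in the LSTM step sketched earlier; a minimal illustration, with names following the earlier assumed sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def coupled_gate_lstm_step(x_t, h_prev, c_prev, W_i, W_o, W, b_i, b_o):
    """CIFG variant: the forget gate is tied to the input gate, f_t = 1 - i_t."""
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W_i @ z + b_i)
    f_t = 1.0 - i_t                             # no separate forget-gate weights
    o_t = sigmoid(W_o @ z + b_o)
    c_t = f_t * c_prev + i_t * np.tanh(W @ z)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```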
LSTM: A Search Space Odyssey

• Tested the following variants, using the peephole LSTM as the standard:
  1. No Input Gate (NIG)
  2. No Forget Gate (NFG)
  3. No Output Gate (NOG)
  4. No Input Activation Function (NIAF)
  5. No Output Activation Function (NOAF)
  6. No Peepholes (NP)
  7. Coupled Input and Forget Gate (CIFG)
  8. Full Gate Recurrence (FGR)
• On the tasks of:
  – TIMIT speech recognition: audio frame to 1 of 61 phonemes
  – IAM Online handwriting recognition: sketch to characters
  – JSB Chorales: next-step music frame prediction

LSTM: A Search Space Odyssey, Greff et al., 2015
LSTM: A Search Space Odyssey

• The standard LSTM performed reasonably well on multiple datasets, and none of the modifications significantly improved performance
• Coupling the gates and removing peephole connections simplified the LSTM without hurting performance much
• The forget gate and the output activation function are crucial
• The interaction between learning rate and network size was found to be minimal, which indicates that calibration can be done with a small network first

LSTM: A Search Space Odyssey, Greff et al., 2015
Gated Recurrent Unit (GRU)

• A very simplified version of the LSTM
  – Merges the forget and input gates into a single 'update' gate
  – Merges the cell state and hidden state
• Has fewer parameters than an LSTM and has been shown to outperform the LSTM on some tasks

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., 2014
GRU

[Figure: the GRU cell, with reset gate r_t, update gate z_t, candidate state h'_t, and output h_t; inputs are x_t and h_{t-1}.]

r_t = \sigma\left( W_r \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b_r \right)

h'_t = \tanh\left( W \begin{bmatrix} x_t \\ r_t \otimes h_{t-1} \end{bmatrix} \right)

z_t = \sigma\left( W_z \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b_z \right)

h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes h'_t
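A minimal NumPy sketch of one GRU step following these equations; the weight names and shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W, b_r, b_z):
    """One GRU step: reset gate, candidate state, update gate, interpolation."""
    z_in = np.concatenate([x_t, h_prev])
    r_t = sigmoid(W_r @ z_in + b_r)                              # reset gate
    h_cand = np.tanh(W @ np.concatenate([x_t, r_t * h_prev]))    # candidate h'_t
    z_t = sigmoid(W_z @ z_in + b_z)                              # update gate
    return (1.0 - z_t) * h_prev + z_t * h_cand                   # new hidden state
```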