Understanding LSTM Networks
Recurrent Neural Networks
An unrolled recurrent neural network
The Problem of Long-Term Dependencies
RNN short-term dependencies: a language model trying to predict the next word based on the previous ones, e.g. "the clouds are in the sky." [Figure: RNN unrolled over inputs x_0 … x_4 with hidden states h_0 … h_4; the relevant context is only a few steps back.]
RNN long-term dependencies: a language model trying to predict the next word based on the previous ones, e.g. "I grew up in India… I speak fluent Hindi." [Figure: RNN unrolled over inputs x_0 … x_{t-1}, x_t with hidden states h_0 … h_t; the relevant context lies many steps back.]
Standard RNN
Backpropagation Through Time (BPTT)
RNN forward pass
$$s_t = \tanh(U x_t + W s_{t-1})$$
$$\hat{y}_t = \mathrm{softmax}(V s_t)$$
$$E_t(y_t, \hat{y}_t) = -y_t \log \hat{y}_t, \qquad E(y, \hat{y}) = \sum_t E_t(y_t, \hat{y}_t)$$
[Figure: unrolled RNN with the weight matrices U (input), W (recurrent), and V (output) shared across all time steps.]
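As a concrete illustration of these equations, here is a minimal sketch of the forward pass in NumPy (the dimensions, initialisation, and function names are assumptions for illustration, not taken from the slides):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                   # subtract max for numerical stability
    return e / e.sum()

def rnn_forward(x_seq, U, W, V):
    """Run the RNN over a sequence of input vectors x_seq.
    Returns the hidden states s_t and the predictions y_hat_t."""
    s_prev = np.zeros(W.shape[0])
    states, outputs = [], []
    for x_t in x_seq:
        s_t = np.tanh(U @ x_t + W @ s_prev)   # s_t = tanh(U x_t + W s_{t-1})
        y_hat = softmax(V @ s_t)              # y_hat_t = softmax(V s_t)
        states.append(s_t)
        outputs.append(y_hat)
        s_prev = s_t
    return states, outputs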
Backpropagation Through Time
$$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$$
$$\frac{\partial E_3}{\partial W} = \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\,\frac{\partial s_3}{\partial W}, \qquad s_3 = \tanh(U x_3 + W s_2)$$
But $s_3$ depends on $s_2$, which depends on $W$ and $s_1$, and so on:
$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\,\frac{\partial s_3}{\partial s_k}\,\frac{\partial s_k}{\partial W}$$
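In practice the sum over k is accumulated by stepping backwards through the sequence. A sketch of that computation for ∂E_3/∂W, assuming the softmax output and cross-entropy loss above (the names, shapes, and helper structure are illustrative, not from the slides):

import numpy as np

def bptt_grads_step3(x_seq, states, y_hat3, y3, U, W, V):
    """Gradients of E_3 w.r.t. U, W, V for a 4-step sequence (t = 0..3).
    states[t] are the hidden states s_t saved from the forward pass."""
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dout = y_hat3 - y3                         # dE_3/d(V s_3) for softmax + cross-entropy
    dV += np.outer(dout, states[3])
    delta = V.T @ dout                         # dE_3/ds_3
    for k in range(3, -1, -1):                 # the sum over k = 3, 2, 1, 0
        dpre = (1.0 - states[k] ** 2) * delta  # back through tanh at step k
        s_km1 = states[k - 1] if k > 0 else np.zeros_like(states[0])
        dW += np.outer(dpre, s_km1)            # contribution of ds_k/dW
        dU += np.outer(dpre, x_seq[k])
        delta = W.T @ dpre                     # dE_3/ds_{k-1} via ds_k/ds_{k-1}
    return dU, dW, dV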
The Vanishing Gradient Problem
$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\,\frac{\partial s_3}{\partial s_k}\,\frac{\partial s_k}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\left(\prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial W}$$
● The derivative of a vector with respect to a vector is a matrix called the Jacobian.
● The 2-norm of the above Jacobian matrix has an upper bound of 1.
● tanh maps all values into the range (-1, 1), and its derivative is bounded by 1.
● With repeated matrix multiplications, the gradient values shrink exponentially.
● Gradient contributions from "far away" steps become zero.
● Depending on the activation functions and network parameters, the gradients could explode instead of vanish.
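A small numerical illustration of these points: each factor ∂s_j/∂s_{j-1} equals diag(1 − s_j²)·W, and the norm of the product of many such Jacobians shrinks rapidly. The hidden size, weight scale, and random states below are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
hidden = 50
W = rng.normal(scale=0.1, size=(hidden, hidden))  # small recurrent weights

s = rng.uniform(-1.0, 1.0, size=hidden)
bound = 1.0
for j in range(1, 21):
    s = np.tanh(W @ s)                        # next state (inputs omitted for brevity)
    jac = np.diag(1.0 - s ** 2) @ W           # Jacobian ds_j/ds_{j-1}
    bound *= np.linalg.norm(jac, 2)           # product of 2-norms bounds the product's norm
    print(f"j = {j:2d}: norm bound on the Jacobian product = {bound:.3e}")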
Activation function
Basic LSTM
Unrolling the LSTM through time
Constant error carousel
The simple RNN state update $s_t = \tanh(U x_t + W s_{t-1})$ is replaced by an additive memory-cell update:
$$C_t = \tilde{C}_t \cdot i_t + C_{t-1}$$
[Figure: memory cell with candidate value $\tilde{C}_t$, input gate $i_t$, and output gate $o_t$ (sigmoid units feeding $\Pi$ product nodes); the cell output is $C_t \cdot o_t$, and the recurrent edge from the previous to the next time step carries $C_{t-1}$ with its weight fixed at 1.]
Input gate
● Uses contextual information to decide when to store the input into memory.
● Protects the memory from being overwritten by other, irrelevant inputs.
[Figure: the same memory-cell diagram, with the input gate $i_t$ highlighted.]
Output gate
● Uses contextual information to decide when to access the information in memory.
● Blocks irrelevant information.
[Figure: the same memory-cell diagram, with the output gate $o_t$ highlighted.]
Forget or reset gate
$$C_t = \tilde{C}_t \cdot i_t + C_{t-1} \cdot f_t$$
[Figure: the memory-cell diagram extended with a forget gate $f_t$ (sigmoid and $\Pi$ node) that scales the recurrent edge whose weight was previously fixed at 1.]
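Putting the three gates together, a minimal sketch of one LSTM cell step in NumPy (peepholes omitted; the weight layout, names, and use of a concatenated [h_{t-1}, x_t] input follow the formulation in the cited colah.github.io post and are otherwise illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One time step. W and b are dicts holding the weight matrix and bias for
    each of the four layers: 'f' (forget), 'i' (input), 'c' (candidate), and
    'o' (output), each acting on the concatenated [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    C_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell value
    C_t = f_t * C_prev + i_t * C_tilde        # C_t = C~_t * i_t + C_{t-1} * f_t
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    h_t = o_t * np.tanh(C_t)                  # gated cell output
    return h_t, C_t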
LSTM with four interacting layers
The cell state
Gates: sigmoid layer
Step-by-Step LSTM Walk Through
Forget gate layer
Input gate layer
The current state
Output layer
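For reference, the equations behind these walk-through slides, as given in the colah.github.io post listed below ($\sigma$ is the logistic sigmoid and $[h_{t-1}, x_t]$ denotes concatenation):

\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) && \text{forget gate layer} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) && \text{input gate layer} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) && \text{candidate values} \\
C_t &= f_t \ast C_{t-1} + i_t \ast \tilde{C}_t && \text{current cell state} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \ast \tanh(C_t) && \text{output}
\end{aligned}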
References
● http://colah.github.io/posts/2015-08-Understanding-LSTMs/
● http://www.wildml.com/
● http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/
● http://deeplearning.net/tutorial/lstm.html
● https://theclevermachine.files.wordpress.com/2014/09/act-funs.png
● http://blog.terminal.com/demistifying-long-short-term-memory-lstm-recurrent-neural-networks/
● Lipton, Zachary C. and Berkowitz, John. A Critical Review of Recurrent Neural Networks for Sequence Learning.
● Hochreiter, Sepp and Schmidhuber, Jürgen (1997). Long Short-Term Memory. Neural Computation 9 (8), 1735-1780.
● Gers, F. A.; Schmidhuber, J. and Cummins, F. A. (2000). Learning to Forget: Continual Prediction with LSTM. Neural Computation 12 (10), 2451-2471.