Recurrent Neural Networks. M. Soleymani, Sharif University of Technology, Fall 2017. Most slides have been adopted from the lectures of Fei-Fei Li and colleagues (cs231n, Stanford, 2017) and some from Socher's lectures (cs224d, Stanford, 2017).
“Vanilla” Neural Networks
Recurrent Neural Networks: Process Sequences. One-to-one: “Vanilla” NN. One-to-many: e.g. image captioning (image -> sequence of words). Many-to-one: e.g. sentiment classification (sequence of words -> sentiment). Many-to-many: e.g. machine translation (sequence of words -> sequence of words). Many-to-many: e.g. video classification at the frame level.
Recurrent Neural Network: we usually want to predict a vector y at some time steps (x → RNN → y).
Recurrent Neural Network: we can process a sequence of vectors x by applying a recurrence formula at every time step: $h_t = f_W(h_{t-1}, x_t)$, where $h_t$ is the new state, $h_{t-1}$ the old state, $x_t$ the input vector at time step t, and $f_W$ some function with parameters W.
Recurrent Neural Network: we can process a sequence of vectors x by applying a recurrence formula at every time step. Notice: the same function and the same set of parameters are used at every time step.
(Vanilla) Recurrent Neural Network: the state consists of a single “hidden” vector h.
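As a concrete illustration, here is a minimal NumPy sketch of the vanilla RNN step using the standard formulation $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, $y_t = W_{hy} h_t$; the sizes and the tiny random weights below are illustrative only.

```python
import numpy as np

# One step of a vanilla RNN (sketch):
#   h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t)
#   y_t = W_hy @ h_t
# The same weights are reused at every time step.

hidden_size, input_size, output_size = 4, 3, 3   # illustrative sizes
rng = np.random.default_rng(0)
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.01
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.01

def rnn_step(h_prev, x):
    """Apply the recurrence h_t = f_W(h_{t-1}, x_t) once."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    y = W_hy @ h
    return h, y

# Process a toy sequence of 5 input vectors with the same parameters at each step.
h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):
    h, y = rnn_step(h, x)
```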
RNN: Computational Graph
RNN: Computational Graph Re-use the same weight matrix at every time-step
RNN: Computational Graph: Many to One
RNN: Computational Graph: Many to Many
RNN: Computational Graph: Many to Many
RNN: Computational Graph: Many to Many
RNN: Computational Graph: One to Many
Sequence to Sequence: Many-to-one + one-to-many
Character-level language model example. Vocabulary: [h, e, l, o]. Example training sequence: “hello”.
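To make the setup concrete, here is a small sketch of how the training sequence “hello” over the vocabulary [h, e, l, o] can be encoded: each input character becomes a one-hot vector and the target at each step is the index of the next character. Names and shapes are illustrative, not taken from min-char-rnn.py.

```python
import numpy as np

# Encoding the slide's example: vocabulary [h, e, l, o], training sequence "hello".
# Inputs are one-hot vectors; the target at each step is the next character's index.
vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

seq = "hello"
inputs  = [char_to_ix[ch] for ch in seq[:-1]]   # h, e, l, l
targets = [char_to_ix[ch] for ch in seq[1:]]    # e, l, l, o

def one_hot(ix, size=len(vocab)):
    v = np.zeros(size)
    v[ix] = 1.0
    return v

X = np.stack([one_hot(ix) for ix in inputs])    # shape (4, 4): one row per time step
```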
Example: Character-level Language Model Sampling • Vocabulary: [h, e, l, o] • At test time, sample characters one at a time and feed them back into the model.
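A sketch of this sampling loop in NumPy: the tiny random weights here stand in for a trained model, and the vocabulary follows the slide's example.

```python
import numpy as np

# Test-time sampling sketch: feed one character, sample from the softmax over
# the output scores, and feed the sampled character back as the next input.
# The small random weights are placeholders for a trained model.

vocab = ['h', 'e', 'l', 'o']
V, H = len(vocab), 8
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = (rng.standard_normal(s) * 0.01
                    for s in [(H, V), (H, H), (V, H)])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(seed_ix, n):
    h = np.zeros(H)
    x = np.eye(V)[seed_ix]                 # one-hot input for the seed character
    out = [vocab[seed_ix]]
    for _ in range(n):
        h = np.tanh(W_hh @ h + W_xh @ x)   # recurrence
        p = softmax(W_hy @ h)              # distribution over the next character
        ix = rng.choice(V, p=p)            # sample, then feed it back in
        x = np.eye(V)[ix]
        out.append(vocab[ix])
    return ''.join(out)

print(sample(seed_ix=0, n=10))             # starts from 'h'
```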
min-char-rnn.py gist: 112 lines of Python (https://gist.github.com/karpathy/d4dee566867f8291f086)
Language Modeling: Example
Generated samples: at first, then after training more and more (several training stages).
Open-source textbook on algebraic geometry (LaTeX source)
Generated C code
Searching for interpretable cells
Searching for interpretable cells quote detection cell
Searching for interpretable cells line length tracking cell
Searching for interpretable cells if statement cell
Searching for interpretable cells quote/comment cell if statement cell
Searching for interpretable cells code depth cell
Backpropagation through time Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient
Truncated Backpropagation through time Run forward and backward through chunks of the sequence instead of whole sequence
Truncated Backpropagation through time Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
Truncated Backpropagation through time
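A minimal sketch of truncated backpropagation through time, written with PyTorch here for brevity (the model, chunk length, and hyperparameters are illustrative): the hidden state is carried forward across chunks, but the computation graph is cut so gradients only flow back through one chunk.

```python
import torch
import torch.nn as nn

# Truncated BPTT sketch: carry the hidden state forward across chunks,
# but detach it so gradients only flow back through `chunk_len` steps.

vocab_size, hidden_size, chunk_len = 4, 16, 25
embed = nn.Embedding(vocab_size, hidden_size)
rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A toy stream of token ids standing in for a long training sequence.
stream = torch.randint(0, vocab_size, (1, 1000))

h = torch.zeros(1, 1, hidden_size)          # (num_layers, batch, hidden)
for start in range(0, stream.size(1) - chunk_len - 1, chunk_len):
    x = stream[:, start:start + chunk_len]              # inputs for this chunk
    y = stream[:, start + 1:start + chunk_len + 1]      # next-token targets
    out, h = rnn(embed(x), h)
    loss = loss_fn(head(out).reshape(-1, vocab_size), y.reshape(-1))
    opt.zero_grad()
    loss.backward()                          # backprop only through this chunk
    opt.step()
    h = h.detach()                           # carry state forward, cut the graph
```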
Example: Language Models • A language model computes a probability for a sequence of words: $P(x_1, \dots, x_T)$ • Useful for machine translation, spelling correction, and … – Word ordering: p(the cat is small) > p(small the is cat) – Word choice: p(walking home after school) > p(walking house after school)
Example: RNN language model
• Given a list of word vectors: $x_1, x_2, \dots, x_T$
• At a single time step: $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1})$
• Output: $\hat{y}_t = \mathrm{softmax}(W_{hy} h_t)$, with $P(x_{t+1} = v_j \mid x_1, \dots, x_t) \approx \hat{y}_{t,j}$
• $h_0$ is some initialization vector for the hidden layer at time step 0; $x_t$ is the column (word) vector at time step t
Example: RNN language model loss
• $\hat{y}_t \in \mathbb{R}^{|V|}$ is a probability distribution over the vocabulary
• Cross-entropy loss at position t of the sequence: $E_t = -\sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$, where $y_{t,j} = 1$ when the true word at step t is word j of the vocabulary
• Cost function over the entire sequence: $E = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$
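A small NumPy sketch of this cost: with one-hot targets, the double sum reduces to the mean negative log-probability assigned to the correct word at each time step (the toy scores and targets below are made up).

```python
import numpy as np

# Sequence cross-entropy cost from the slide:
#   E = -(1/T) * sum_t sum_j y_{t,j} * log(yhat_{t,j})
# With one-hot targets this is -(1/T) * sum_t log(yhat_{t, target_t}).

rng = np.random.default_rng(0)
T, V = 6, 4                               # sequence length, vocabulary size
logits = rng.standard_normal((T, V))      # toy unnormalized scores
targets = rng.integers(0, V, size=T)      # toy true word indices

# softmax over the vocabulary at each time step
yhat = np.exp(logits - logits.max(axis=1, keepdims=True))
yhat /= yhat.sum(axis=1, keepdims=True)

E = -np.mean(np.log(yhat[np.arange(T), targets]))
print(E)
```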
Training RNN
Training RNN
$h_t = W_{hh} f(h_{t-1}) + W_{xh} x_t$
$\frac{\partial h_t}{\partial h_{t-1}} = W_{hh}^{T}\, \mathrm{diag}\!\left[ f'(h_{t-1}) \right]$
$\left\| \frac{\partial h_t}{\partial h_{t-1}} \right\| \le \left\| W_{hh}^{T} \right\| \left\| \mathrm{diag}\!\left[ f'(h_{t-1}) \right] \right\| \le \gamma_W \gamma_h$
$\left\| \frac{\partial h_t}{\partial h_k} \right\| = \left\| \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right\| \le (\gamma_W \gamma_h)^{t-k}$
• This can become very small or very large quickly (vanishing/exploding gradients) [Bengio et al., 1994].
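A tiny numerical illustration of the bound above, using an orthogonal matrix scaled slightly below or above 1 so that its largest singular value is exactly that scale (the matrix size and the choice of 50 steps are arbitrary).

```python
import numpy as np

# The backpropagated gradient contains a product of (t-k) Jacobians, each
# involving W_hh. If the largest singular value of W_hh is below 1 the product
# vanishes; above 1 it explodes. A scaled orthogonal matrix makes this exact.

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))   # orthogonal: singular values all 1

for scale in (0.9, 1.1):
    W_hh = scale * Q
    prod = np.eye(10)
    for _ in range(50):                               # 50 time steps back
        prod = W_hh.T @ prod                          # ignore diag(f'), which is <= 1 for tanh
    print(scale, np.linalg.norm(prod, 2))             # 0.9**50 ~ 5e-3, 1.1**50 ~ 117
```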
Training RNNs is hard • Multiply the same matrix at each time step during forward prop • Ideally inputs from many time steps ago can modify output y
The vanishing gradient problem: Example • In the case of language modeling, words from time steps far away are not taken into consideration when training to predict the next word • Example: Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ____
Vanilla RNN Gradient Flow
Vanilla RNN Gradient Flow Computing gradient of h0 involves many factors of W (and repeated tanh) Largest singular value > 1: Exploding gradients Largest singular value < 1: Vanishing gradients
Trick for exploding gradients: the clipping trick • The solution, first introduced by Mikolov, is to clip gradients to a maximum value. • Makes a big difference in RNNs.
Gradient clipping intuition • Error surface of a single-hidden-unit RNN – high-curvature walls • Solid lines: standard gradient descent trajectories • Dashed lines: gradients rescaled to a fixed size
Vanilla RNN Gradient Flow. Computing the gradient of h0 involves many factors of W (and repeated tanh). Largest singular value > 1: exploding gradients → gradient clipping: scale the gradient if its norm is too big. Largest singular value < 1: vanishing gradients → change the RNN architecture.
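A sketch of gradient clipping by global norm, as one plausible way to implement the "scale the gradient if its norm is too big" rule (the threshold of 5 is illustrative).

```python
import numpy as np

# Clipping trick: if the combined gradient norm exceeds a threshold,
# rescale all gradients so that the combined norm equals the threshold.

def clip_gradient(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their global norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: a huge gradient gets scaled down to norm 5.
g = [np.array([300.0, 400.0])]                 # norm 500
print(np.linalg.norm(clip_gradient(g)[0]))     # 5.0
```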
For vanishing gradients: Initialization + ReLUs! • Initialize the Ws to the identity matrix I and use ReLU activations • New experiments with recurrent neural nets: Le et al., "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units", 2015.
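A minimal sketch of this initialization idea: the recurrent weight matrix starts as the identity and the nonlinearity is ReLU, so at initialization the hidden state is (roughly) copied forward; the sizes and the small input-weight scale are illustrative.

```python
import numpy as np

# IRNN-style initialization (Le et al., 2015, sketch): identity recurrent
# weights plus ReLU, so the hidden state is preserved across steps at init.

hidden_size, input_size = 8, 3
W_hh = np.eye(hidden_size)                                        # identity initialization
W_xh = np.random.default_rng(0).standard_normal((hidden_size, input_size)) * 0.001
b_h = np.zeros(hidden_size)

def irnn_step(h_prev, x):
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x + b_h)        # ReLU recurrence
```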
Better units for recurrent models • More complex hidden unit computation in the recurrence! – $h_t = \mathrm{LSTM}(x_t, h_{t-1})$ – $h_t = \mathrm{GRU}(x_t, h_{t-1})$ • Main ideas: – keep around memories to capture long-distance dependencies – allow error messages to flow at different strengths depending on the inputs
Long Short-Term Memory (LSTM)
Long Short-Term Memories (LSTMs)
• Input gate (current cell matters): $i_t = \sigma\left( W_i \left[ h_{t-1}; x_t \right] + b_i \right)$
• Forget gate (0 means forget the past): $f_t = \sigma\left( W_f \left[ h_{t-1}; x_t \right] + b_f \right)$
• Output gate (how much the cell is exposed): $o_t = \sigma\left( W_o \left[ h_{t-1}; x_t \right] + b_o \right)$
• New memory cell: $\tilde{c}_t = \tanh\left( W_c \left[ h_{t-1}; x_t \right] + b_c \right)$
• Final memory cell: $c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}$
• Final hidden state: $h_t = o_t \circ \tanh(c_t)$
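For reference, a NumPy sketch of one LSTM step that follows the gate equations above; each gate has its own weight matrix applied to the concatenation [h_{t-1}; x_t], and `*` denotes elementwise (Hadamard) multiplication. The shapes and weight scales are illustrative.

```python
import numpy as np

# One LSTM step following the equations above.

H, D = 8, 4                                    # hidden size, input size (illustrative)
rng = np.random.default_rng(0)
W_i, W_f, W_o, W_c = (rng.standard_normal((H, H + D)) * 0.1 for _ in range(4))
b_i = b_f = b_o = b_c = np.zeros(H)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x):
    z = np.concatenate([h_prev, x])            # [h_{t-1}; x_t]
    i = sigmoid(W_i @ z + b_i)                 # input gate
    f = sigmoid(W_f @ z + b_f)                 # forget gate
    o = sigmoid(W_o @ z + b_o)                 # output gate
    c_tilde = np.tanh(W_c @ z + b_c)           # new (candidate) memory cell
    c = i * c_tilde + f * c_prev               # final memory cell
    h = o * np.tanh(c)                         # final hidden state
    return h, c
```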
Some visualization by Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM Gates • Gates are ways to let information through (or not): – Forget gate: look at the previous cell state and the current input, and decide which information to throw away. – Input gate: decide which information in the current state we want to update. – Output gate: filter the cell state and output the filtered result. – Gate (update) gate: propose new values for the cell state. • For instance: store the gender of the subject until another subject is seen.