

  1. CSEP 517 Natural Language Processing Recurrent Neural Networks Luke Zettlemoyer (Slides adapted from Danqi Chen, Chris Manning, Abigail See, Andrej Karpathy)

  2. Overview • What is a recurrent neural network (RNN)? • Simple RNNs • Backpropagation through time • Long short-term memory networks (LSTMs) • Applications • Variants: Stacked RNNs, Bidirectional RNNs

  3. Recurrent neural networks (RNNs) A class of neural networks that can handle variable-length inputs. A function: y = RNN(x_1, x_2, …, x_n) ∈ ℝ^d, where x_1, …, x_n ∈ ℝ^{d_in}

  4. Recurrent neural networks (RNNs) Proven to be a highly effective approach to language modeling, sequence tagging, and text classification tasks: Language modeling, Sequence tagging, Text classification (e.g., "The movie sucks .")

  5. Recurrent neural networks (RNNs) Form the basis for modern approaches to machine translation, question answering, and dialogue:

  6. Why variable-length? Recall the feedforward neural LMs we learned: for "The dogs are barking" with fixed window size 3, the input is x = [e_the, e_dogs, e_are] ∈ ℝ^{3d}. But a fixed window cannot capture longer contexts such as "the dogs in the neighborhood are ___".
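To make the fixed-window setup concrete, here is a minimal NumPy sketch (not from the slides; the embedding table E, the vocabulary map, and all sizes are made up for illustration) of how the window of 3 embeddings is concatenated into a single input vector:

```python
# Illustrative fixed-window input construction for a feedforward neural LM.
import numpy as np

d, V = 4, 10
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))                           # hypothetical embedding table in R^{|V| x d}
vocab = {"the": 0, "dogs": 1, "are": 2, "in": 3, "neighborhood": 4}

window = ["the", "dogs", "are"]                       # fixed window size = 3
x = np.concatenate([E[vocab[w]] for w in window])     # x in R^{3d}
assert x.shape == (3 * d,)

# With window size 3, predicting the blank in
# "the dogs in the neighborhood are ___" only sees the last 3 words,
# so the subject "dogs" falls outside the input, which motivates RNNs.
```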

  7. Simple RNNs h_0 ∈ ℝ^d is an initial state. h_t = f(h_{t−1}, x_t) ∈ ℝ^d. h_t: hidden states which store information from x_1 to x_t. Simple RNNs: h_t = g(W h_{t−1} + U x_t + b) ∈ ℝ^d, where g is a nonlinearity (e.g. tanh), W ∈ ℝ^{d×d}, U ∈ ℝ^{d×d_in}, b ∈ ℝ^d
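A minimal NumPy sketch of the recurrence above, with illustrative names and sizes (nothing here comes from the lecture code); it shows the same W, U, b being reused at every timestep over a variable-length input:

```python
# h_t = tanh(W h_{t-1} + U x_t + b), with W in R^{d x d}, U in R^{d x d_in}, b in R^d.
import numpy as np

def simple_rnn(xs, W, U, b, h0):
    """Run the recurrence over a variable-length list of input vectors."""
    h = h0
    states = []
    for x in xs:                      # the same W, U, b are reused at every step
        h = np.tanh(W @ h + U @ x + b)
        states.append(h)
    return states                     # h_1, ..., h_n, each in R^d

d, d_in, n = 8, 5, 6
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(d, d)), rng.normal(size=(d, d_in)), np.zeros(d)
xs = [rng.normal(size=d_in) for _ in range(n)]
hs = simple_rnn(xs, W, U, b, h0=np.zeros(d))
print(len(hs), hs[-1].shape)          # 6 (8,)
```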

  8. Simple RNNs h_t = g(W h_{t−1} + U x_t + b) ∈ ℝ^d Key idea: apply the same weights W repeatedly

  9. RNNs vs Feedforward NNs

  10. Recurrent Neural Language Models (RNNLMs) P(w_1, w_2, …, w_n) = P(w_1) × P(w_2 | w_1) × P(w_3 | w_1, w_2) × … × P(w_n | w_1, w_2, …, w_{n−1}) = P(w_1 | h_0) × P(w_2 | h_1) × P(w_3 | h_2) × … × P(w_n | h_{n−1}) • Denote ŷ_t = softmax(W_o h_t), W_o ∈ ℝ^{|V|×d} • Cross-entropy loss: L(θ) = −(1/n) Σ_{t=1}^{n} log ŷ_{t−1}(w_t), with θ = {W, U, b, W_o, E} • Example: "the students opened their exams …"
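The loss above can be sketched in a few lines of NumPy. This is an assumed implementation, not the lecture's code: `E` is a hypothetical embedding table in ℝ^{|V|×d_in}, and the loop predicts w_t from the previous hidden state before consuming w_t:

```python
# Sketch of the RNNLM loss L(theta) = -(1/n) * sum_t log yhat_{t-1}(w_t).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnnlm_loss(word_ids, E, W, U, b, W_o, h0):
    h, losses = h0, []
    for w in word_ids:                      # predict w_t from h_{t-1}
        yhat = softmax(W_o @ h)             # distribution over the vocabulary
        losses.append(-np.log(yhat[w]))     # cross-entropy for the true next word
        h = np.tanh(W @ h + U @ E[w] + b)   # then consume w_t to get h_t
    return np.mean(losses)

d, d_in, V = 8, 5, 12
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d_in))
W, U, b = rng.normal(size=(d, d)), rng.normal(size=(d, d_in)), np.zeros(d)
W_o = rng.normal(size=(V, d))
print(rnnlm_loss([3, 7, 1, 0], E, W, U, b, W_o, h0=np.zeros(d)))
```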

  11. Training RNNLMs • Backpropagation? Yes, but not that simple! • The algorithm is called Backpropagation Through Time (BPTT).

  12. Backpropagation through time h_1 = g(W h_0 + U x_1 + b), h_2 = g(W h_1 + U x_2 + b), h_3 = g(W h_2 + U x_3 + b), L_3 = −log ŷ_3(w_4). You should know how to compute: ∂L_3/∂W = (∂L_3/∂h_3)(∂h_3/∂W) + (∂L_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W) + (∂L_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W). In general, ∂L/∂W = (1/n) Σ_{t=1}^{n} Σ_{k=1}^{t} (∂L_t/∂h_t) (∏_{j=k+1}^{t} ∂h_j/∂h_{j−1}) (∂h_k/∂W)
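A small PyTorch sketch (assumed, not the lecture's code) of the same computation: autograd differentiates through the unrolled recurrence, so a single `backward()` call accumulates in W.grad the sum over k of the terms written above:

```python
# Unroll three steps, compute L_3, and backpropagate through time with autograd.
import torch

d, d_in, V = 8, 5, 12
torch.manual_seed(0)
W = torch.randn(d, d, requires_grad=True)
U = torch.randn(d, d_in, requires_grad=True)
b = torch.zeros(d, requires_grad=True)
W_o = torch.randn(V, d, requires_grad=True)

xs = [torch.randn(d_in) for _ in range(3)]
h = torch.zeros(d)
for x in xs:                                  # unroll h_1, h_2, h_3
    h = torch.tanh(W @ h + U @ x + b)

target = torch.tensor(4)                      # index of w_4
L3 = torch.nn.functional.cross_entropy((W_o @ h).unsqueeze(0), target.unsqueeze(0))
L3.backward()                                 # backpropagation through time
print(W.grad.shape)                           # torch.Size([8, 8]): one gradient, summed over timesteps
```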

  13. Truncated backpropagation through time • Backpropagation is very expensive if you handle long sequences • Run forward and backward through chunks of the sequence instead of the whole sequence • Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
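A sketch of truncated BPTT under the same toy setup as the previous snippet (all names and sizes are illustrative): the hidden state is carried across chunks, but `detach()` cuts the graph so gradients only flow back through the current chunk:

```python
# Truncated backpropagation through time: forward over chunks, backprop per chunk.
import torch

d, d_in, V, chunk_len = 8, 5, 12, 4
torch.manual_seed(0)
W = torch.randn(d, d, requires_grad=True)
U = torch.randn(d, d_in, requires_grad=True)
b = torch.zeros(d, requires_grad=True)
W_o = torch.randn(V, d, requires_grad=True)
opt = torch.optim.SGD([W, U, b, W_o], lr=0.1)

data = [(torch.randn(d_in), torch.randint(V, (1,))) for _ in range(16)]  # toy (x_t, w_{t+1}) pairs
h = torch.zeros(d)                                   # carried forward across all chunks
for start in range(0, len(data), chunk_len):
    h = h.detach()                                   # cut the graph: no backprop into earlier chunks
    loss = 0.0
    for x, target in data[start:start + chunk_len]:
        h = torch.tanh(W @ h + U @ x + b)
        loss = loss + torch.nn.functional.cross_entropy((W_o @ h).unsqueeze(0), target)
    opt.zero_grad()
    (loss / chunk_len).backward()                    # backprop only through this chunk
    opt.step()
```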

  14. Progress on language models On the Penn Treebank (PTB) dataset Metric: perplexity KN5: Kneser-Ney 5-gram (Mikolov and Zweig, 2012): Context dependent recurrent neural network language model

  15. Progress on language models On the Penn Treebank (PTB) dataset Metric: perplexity (Yang et al, 2018): Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

  16. (advanced) Vanishing/exploding gradients • Consider the gradient of L_t at step t, with respect to the hidden state h_k at some previous step k (k < t): ∂L_t/∂h_k = (∂L_t/∂h_t) ∏_{t ≥ j > k} ∂h_j/∂h_{j−1} = (∂L_t/∂h_t) ∏_{t ≥ j > k} ( diag(g′(W h_{j−1} + U x_j + b)) W ) • (Pascanu et al, 2013) showed that for g = tanh, if the largest eigenvalue of W is less than 1, then the gradient will shrink exponentially. This problem is called vanishing gradients. • In contrast, if the gradients are getting too large, it is called exploding gradients.
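A numerical illustration of the eigenvalue argument (an assumption for exposition, not code from the lecture): since |tanh′| ≤ 1, the norm of the Jacobian product is bounded by ‖W‖ raised to the number of steps, which shrinks when the largest eigenvalue of W is below 1 and grows when it is above 1:

```python
# Repeatedly multiplying by W (a bound on diag(g'(.)) W for g = tanh) over many steps.
import numpy as np

d, steps = 8, 60
rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]              # random orthogonal basis
for top_eig in (0.9, 1.1):
    W = Q @ np.diag(np.linspace(0.1, top_eig, d)) @ Q.T   # symmetric W, largest eigenvalue = top_eig
    grad = np.eye(d)                                      # stands in for dL_t/dh_t
    for _ in range(steps):
        grad = grad @ W                                   # bound on the product of dh_j/dh_{j-1}
    print(top_eig, np.linalg.norm(grad))                  # roughly 0.9^60 (vanishes) vs 1.1^60 (explodes)
```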

  17. Why is exploding gradient a problem? • Gradients become too big and we take a very large step in SGD. • Solution : Gradient clipping — if the norm of the gradient is greater than some threshold, scale it down before applying SGD update.
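A minimal sketch of gradient clipping; the hand-rolled NumPy version below is illustrative, and the commented PyTorch call is the usual library route:

```python
# Rescale the gradient if its norm exceeds a threshold, keeping its direction.
import numpy as np

def clip_gradient(grad, threshold):
    """Scale the gradient down if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])              # norm 50
print(clip_gradient(g, threshold=5.0))  # [3. 4.]: same direction, norm 5

# PyTorch equivalent, applied just before the optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```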

  18. Why is vanishing gradient a problem? • If the gradient becomes vanishingly small over long distances (from step k to step t), then we can't tell whether: • We don't need long-term dependencies, or • We have the wrong parameters to capture the true dependency. Example: "the dogs in the neighborhood are ___" is still difficult to predict ("barking"). • How to fix the vanishing gradient problem? • LSTMs: Long short-term memory networks • GRUs: Gated recurrent units

  19. Long Short-term Memory (LSTM) • A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem • Works extremely well in practice • Basic idea: turn multiplication into addition • Use "gates" to control how much information to add/erase • At each timestep, there is a hidden state h_t ∈ ℝ^d and also a cell state c_t ∈ ℝ^d, with h_t = f(h_{t−1}, x_t) ∈ ℝ^d • c_t stores long-term information • We write/erase c_t after each step • We read h_t from c_t

  20. Long Short-term Memory (LSTM) There are 4 gates: • Input gate (how much to write): i_t = σ(W^(i) h_{t−1} + U^(i) x_t + b^(i)) ∈ ℝ^d • Forget gate (how much to erase): f_t = σ(W^(f) h_{t−1} + U^(f) x_t + b^(f)) ∈ ℝ^d • Output gate (how much to reveal): o_t = σ(W^(o) h_{t−1} + U^(o) x_t + b^(o)) ∈ ℝ^d • New memory cell (what to write): c̃_t = tanh(W^(c) h_{t−1} + U^(c) x_t + b^(c)) ∈ ℝ^d • Final memory cell: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t • Final hidden state: h_t = o_t ⊙ c_t (⊙ = element-wise product) How many parameters in total?
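A NumPy sketch of one LSTM step following the slide's equations (parameter names are made up; note that most library implementations output h_t = o_t ⊙ tanh(c_t) rather than o_t ⊙ c_t). It also answers the parameter-count question: four weight groups give 4·(d·d + d·d_in + d) parameters:

```python
# One LSTM timestep with explicit gates, plus the total parameter count.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, p):
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x + p["bi"])        # input gate: how much to write
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x + p["bf"])        # forget gate: how much to erase
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x + p["bo"])        # output gate: how much to reveal
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x + p["bc"])  # new memory candidate
    c = f * c_prev + i * c_tilde       # addition, not repeated multiplication: easier gradient flow
    h = o * c                          # as on the slide (most libraries use o * tanh(c))
    return h, c

d, d_in = 8, 5
rng = np.random.default_rng(0)
p = {f"W{g}": rng.normal(size=(d, d)) for g in "ifoc"}
p.update({f"U{g}": rng.normal(size=(d, d_in)) for g in "ifoc"})
p.update({f"b{g}": np.zeros(d) for g in "ifoc"})

h, c = lstm_step(np.zeros(d), np.zeros(d), rng.normal(size=d_in), p)
n_params = sum(v.size for v in p.values())
print(n_params, 4 * (d * d + d * d_in + d))   # both 448
```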

  21. Long Short-term Memory (LSTM) • The LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies • LSTMs were invented in 1997 but only started working well around 2013-2015.

  22. Is the LSTM architecture optimal? (Jozefowicz et al, 2015): An Empirical Exploration of Recurrent Network Architectures

  23. Overview • What is a recurrent neural network (RNN)? • Simple RNNs • Backpropagation through time • Long short-term memory networks (LSTMs) • Applications • Variants: Stacked RNNs, Bidirectional RNNs

  24. Application: Text Generation You can generate text by repeated sampling. Sampled output is next step’s input.
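A sketch of the sampling loop with the same illustrative RNNLM parameters as before (assumed names, not the lecture's code): at each step the model samples the next word from softmax(W_o h_t) and feeds it back in as the next input:

```python
# Generate text by repeated sampling: the sampled word becomes the next input.
import numpy as np

def generate(E, W, U, b, W_o, start_id, length, rng):
    h, w = np.zeros(W.shape[0]), start_id
    out = [w]
    for _ in range(length):
        h = np.tanh(W @ h + U @ E[w] + b)        # consume the previous word
        z = W_o @ h
        probs = np.exp(z - z.max())
        probs /= probs.sum()                     # softmax over the vocabulary
        w = rng.choice(len(probs), p=probs)      # sample; sampled output is next step's input
        out.append(w)
    return out

d, d_in, V = 8, 5, 12
rng = np.random.default_rng(0)
E, W_o = rng.normal(size=(V, d_in)), rng.normal(size=(V, d))
W, U, b = rng.normal(size=(d, d)), rng.normal(size=(d, d_in)), np.zeros(d)
print(generate(E, W, U, b, W_o, start_id=0, length=10, rng=rng))
```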

  25. Fun with RNNs: Obama speeches, LaTeX generation. Andrej Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks"

  26. Application: Sequence Tagging Input: a sentence of n words: x_1, …, x_n. Output: y_1, …, y_n, y_i ∈ {1, …, C}. P(y_i = k) = softmax_k(W_o h_i), W_o ∈ ℝ^{C×d}. Loss: L = −(1/n) Σ_{i=1}^{n} log P(y_i)
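A short NumPy sketch of the tagging loss with the slide's shapes (hidden states here are random stand-ins for the RNN outputs; names are illustrative):

```python
# Per-token softmax over C tags, loss averaged over the n positions.
import numpy as np

def tag_loss(hs, tags, W_o):
    """hs: hidden states h_1..h_n (each in R^d); tags: gold labels y_1..y_n in {0..C-1}."""
    losses = []
    for h, y in zip(hs, tags):
        z = W_o @ h
        probs = np.exp(z - z.max())
        probs /= probs.sum()                    # softmax_k(W_o h_i)
        losses.append(-np.log(probs[y]))
    return np.mean(losses)                      # L = -(1/n) sum_i log P(y_i)

d, C, n = 8, 4, 6
rng = np.random.default_rng(0)
hs = [rng.normal(size=d) for _ in range(n)]     # e.g. the states from the simple_rnn sketch above
W_o = rng.normal(size=(C, d))
print(tag_loss(hs, tags=[0, 1, 3, 2, 0, 1], W_o=W_o))
```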

  27. Application: Text Classification Input: a sentence of n words (e.g., "the movie was terribly exciting !"). Output: y ∈ {1, 2, …, C}. Use the final hidden state h_n: P(y = k) = softmax_k(W_o h_n), W_o ∈ ℝ^{C×d}
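A matching sketch for classification (again with illustrative names): a single softmax over C classes applied to the final hidden state h_n:

```python
# Classify the whole sentence from the last hidden state h_n.
import numpy as np

def classify(hs, W_o):
    z = W_o @ hs[-1]                             # W_o in R^{C x d}, applied to h_n only
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs          # predicted class and P(y = k)

d, C = 8, 2
rng = np.random.default_rng(0)
hs = [rng.normal(size=d) for _ in range(6)]      # hidden states for "the movie was terribly exciting !"
print(classify(hs, rng.normal(size=(C, d))))
```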
