

  1. Recurrent Networks, and LSTMs, for NLP Michael Collins, Columbia University

  2. Representing Sequences
  ◮ Often we want to map a sequence x_{[1:n]} = x_1 ... x_n to a label y or a distribution p(y | x_{[1:n]})
  ◮ Examples:
    ◮ Language modeling: x_{[1:n]} is the first n words in a document, y is the (n+1)'th word
    ◮ Sentiment analysis: x_{[1:n]} is a sentence (or document), y is a label indicating whether the sentence is positive/neutral/negative about a particular topic (e.g., a particular restaurant)
    ◮ Machine translation: x_{[1:n]} is a source-language sentence, y is a target-language sentence (or the first word of the target-language sentence)

  3. Representing Sequences (continued)
  ◮ Slightly more generally: map a sequence x_{[1:n]} and a position i ∈ {1 ... n} to a label y or a distribution p(y | x_{[1:n]}, i)
  ◮ Examples:
    ◮ Tagging: x_{[1:n]} is a sentence, i is a position in the sentence, y is the tag for position i
    ◮ Dependency parsing: x_{[1:n]} is a sentence, i is a position in the sentence, y ∈ {1 ... n}, y ≠ i is the head for word x_i in the dependency parse

  4. A Simple Recurrent Network
  Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A label y ∈ {1 ... K}. An integer m defining the size of the hidden dimension.
  Parameters: W^{hh} ∈ R^{m × m}, W^{hx} ∈ R^{m × d}, b^h ∈ R^m, h^0 ∈ R^m, V ∈ R^{K × m}, γ ∈ R^K. Transfer function g : R^m → R^m.
  Definitions:
    θ = {W^{hh}, W^{hx}, b^h, h^0}
    R(x^{(t)}, h^{(t-1)}; θ) = g(W^{hx} x^{(t)} + W^{hh} h^{(t-1)} + b^h)
  Computational Graph:
  ◮ For t = 1 ... n
    ◮ h^{(t)} = R(x^{(t)}, h^{(t-1)}; θ)
  ◮ l = V h^{(n)} + γ,  q = LS(l),  o = -q_y
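
As a concrete illustration (not part of the slides), here is a minimal NumPy sketch of this computational graph. It assumes g = tanh and that LS is the log-softmax function; all variable names, shapes, and the random initialization are illustrative.

```python
import numpy as np

def log_softmax(l):
    # LS(l): numerically stable log-softmax over the K label scores
    m = l - np.max(l)
    return m - np.log(np.sum(np.exp(m)))

def simple_rnn_loss(xs, y, W_hh, W_hx, b_h, h0, V, gamma, g=np.tanh):
    """Forward pass of the simple recurrent network; returns the loss o = -q_y."""
    h = h0
    for x in xs:
        # h^(t) = R(x^(t), h^(t-1); theta) = g(W^hx x^(t) + W^hh h^(t-1) + b^h)
        h = g(W_hx @ x + W_hh @ h + b_h)
    l = V @ h + gamma      # l = V h^(n) + gamma
    q = log_softmax(l)     # q = LS(l)
    return -q[y]           # o = -q_y (y is 0-indexed here)

# Illustrative usage: d = 4 inputs, m = 8 hidden units, K = 3 labels, n = 5 steps
rng = np.random.default_rng(0)
d, m, K, n = 4, 8, 3, 5
xs = [rng.normal(size=d) for _ in range(n)]
loss = simple_rnn_loss(
    xs, y=1,
    W_hh=rng.normal(scale=0.1, size=(m, m)),
    W_hx=rng.normal(scale=0.1, size=(m, d)),
    b_h=np.zeros(m), h0=np.zeros(m),
    V=rng.normal(scale=0.1, size=(K, m)), gamma=np.zeros(K),
)
print(loss)
```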

  5. The Computational Graph

  6. A Problem in Training: Exploding and Vanishing Gradients
  ◮ Calculation of gradients involves multiplication of long chains of Jacobians
  ◮ This leads to exploding and vanishing gradients
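
A small illustration of the effect (my own sketch, not from the slides): repeatedly applying a fixed matrix, as a stand-in for the chain of Jacobians in a linear recurrent network, makes the result shrink toward zero or blow up depending on the matrix's spectral radius.

```python
import numpy as np

def chain_length(W, n):
    # Norm after applying W n times to a fixed unit vector: a stand-in for
    # the product of n identical Jacobians dh^(t)/dh^(t-1).
    v = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(n):
        v = W @ v
    return np.linalg.norm(v)

rng = np.random.default_rng(0)
m = 50
W = rng.normal(size=(m, m)) / np.sqrt(m)   # spectral radius roughly 1
for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    norms = [chain_length(scale * W, n) for n in (1, 10, 50, 100)]
    print(label, ["%.2e" % x for x in norms])
```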

  7. LSTMs (Long Short-Term Memory units)
  ◮ Old definitions of the recurrent update:
    θ = {W^{hh}, W^{hx}, b^h, h^0}
    R(x^{(t)}, h^{(t-1)}; θ) = g(W^{hx} x^{(t)} + W^{hh} h^{(t-1)} + b^h)
  ◮ LSTMs give an alternative definition of R(x^{(t)}, h^{(t-1)}; θ).

  8. Definition of Sigmoid Function, Element-Wise Product
  ◮ Given any integer d ≥ 1, σ_d : R^d → R^d is the function that maps a vector v to a vector σ_d(v) such that for i = 1 ... d,
    σ_{d,i}(v) = e^{v_i} / (1 + e^{v_i})
  ◮ Given vectors a ∈ R^d and b ∈ R^d, c = a ⊙ b has components c_i = a_i × b_i for i = 1 ... d
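
Both operations are one-liners in NumPy; a small sanity check, assuming nothing beyond the definitions above:

```python
import numpy as np

def sigma(v):
    # sigma_d(v)_i = e^{v_i} / (1 + e^{v_i}), applied element-wise
    return 1.0 / (1.0 + np.exp(-v))

a = np.array([1.0, -2.0, 0.0])
b = np.array([3.0,  4.0, 5.0])
print(sigma(a))   # element-wise logistic sigmoid
print(a * b)      # a ⊙ b: element-wise (Hadamard) product
```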

  9. LSTM Equations (from Ilya Sutskever, PhD thesis)
  Maintain s_t, s̃_t, h_t as the hidden state at position t. s_t is the memory; intuitively it allows long-term memory. The function
    s_t, s̃_t, h_t = LSTM(x_t, s_{t-1}, s̃_{t-1}, h_{t-1}; θ)
  is defined as:
    u_t = CONCAT(h_{t-1}, x_t, s̃_{t-1})
    h_t = g(W^h u_t + b^h)        (hidden state)
    i_t = g(W^i u_t + b^i)        ("input")
    ι_t = σ(W^ι u_t + b^ι)        ("input gate")
    o_t = σ(W^o u_t + b^o)        ("output gate")
    f_t = σ(W^f u_t + b^f)        ("forget gate")
    s_t = s_{t-1} ⊙ f_t + i_t ⊙ ι_t     (forget and input gates control the update of the memory)
    s̃_t = s_t ⊙ o_t                      (output gate controls the information that can leave the unit)
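
A direct NumPy transcription of these equations, assuming g = tanh; the parameter names follow the slide, while the shapes and random initialization below are purely illustrative.

```python
import numpy as np

def sigma(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell(x_t, s_prev, s_tilde_prev, h_prev, theta, g=np.tanh):
    """One step of the LSTM update above; returns (s_t, s_tilde_t, h_t)."""
    u_t = np.concatenate([h_prev, x_t, s_tilde_prev])        # u_t = CONCAT(h_{t-1}, x_t, s~_{t-1})
    h_t    = g(theta["W_h"] @ u_t + theta["b_h"])            # hidden state
    i_t    = g(theta["W_i"] @ u_t + theta["b_i"])            # "input"
    iota_t = sigma(theta["W_iota"] @ u_t + theta["b_iota"])  # "input gate"
    o_t    = sigma(theta["W_o"] @ u_t + theta["b_o"])        # "output gate"
    f_t    = sigma(theta["W_f"] @ u_t + theta["b_f"])        # "forget gate"
    s_t = s_prev * f_t + i_t * iota_t     # forget and input gates control the memory update
    s_tilde_t = s_t * o_t                 # output gate controls what leaves the unit
    return s_t, s_tilde_t, h_t

# Illustrative shapes: d-dimensional inputs, m-dimensional states, so each W is m x (2m + d)
rng = np.random.default_rng(0)
d, m = 4, 8
theta = {}
for name in ("h", "i", "iota", "o", "f"):
    theta["W_" + name] = rng.normal(scale=0.1, size=(m, 2 * m + d))
    theta["b_" + name] = np.zeros(m)
s, s_tilde, h = np.zeros(m), np.zeros(m), np.zeros(m)
s, s_tilde, h = lstm_cell(rng.normal(size=d), s, s_tilde, h, theta)
print(h)
```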

  10. An LSTM-based Recurrent Network
  Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A label y ∈ {1 ... K}.
  Computational Graph:
  ◮ h^{(0)}, s^{(0)}, s̃^{(0)} are set to some initial values.
  ◮ For t = 1 ... n
    ◮ s^{(t)}, s̃^{(t)}, h^{(t)} = LSTM(x^{(t)}, s^{(t-1)}, s̃^{(t-1)}, h^{(t-1)}; θ)
  ◮ l = V^{lh} h^{(n)} + V^{ls} s̃^{(n)} + γ,  q = LS(l),  o = -q_y
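
A hedged sketch of this graph in NumPy, assuming an LSTM step function with the interface of the lstm_cell sketch above (passed in as an argument) and LS = log-softmax; the names are illustrative.

```python
import numpy as np

def log_softmax(l):
    m = l - np.max(l)
    return m - np.log(np.sum(np.exp(m)))

def lstm_network_loss(xs, y, cell, theta, V_lh, V_ls, gamma, m):
    """Loss o = -q_y for a label y on the whole sequence.

    cell(x, s, s_tilde, h, theta) -> (s, s_tilde, h) is one LSTM step,
    e.g. the lstm_cell sketch above.
    """
    s, s_tilde, h = np.zeros(m), np.zeros(m), np.zeros(m)   # initial values
    for x in xs:
        s, s_tilde, h = cell(x, s, s_tilde, h, theta)
    l = V_lh @ h + V_ls @ s_tilde + gamma   # l = V^{lh} h^(n) + V^{ls} s~^(n) + gamma
    q = log_softmax(l)                      # q = LS(l)
    return -q[y]                            # o = -q_y
```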

  11. The Computational Graph

  12. An LSTM-based Recurrent Network for Tagging
  Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A sequence y_1 ... y_n of tags.
  Computational Graph:
  ◮ h^{(0)}, s^{(0)}, s̃^{(0)} are set to some initial values.
  ◮ For t = 1 ... n
    ◮ s^{(t)}, s̃^{(t)}, h^{(t)} = LSTM(x^{(t)}, s^{(t-1)}, s̃^{(t-1)}, h^{(t-1)}; θ)
  ◮ For t = 1 ... n
    ◮ l_t = V × CONCAT(h^{(t)}, s̃^{(t)}) + γ,  q_t = LS(l_t),  o_t = -q_{t, y_t}
  ◮ o = Σ_{t=1}^{n} o_t
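
A sketch of the per-position loss computation, assuming the state sequences h^{(1)} ... h^{(n)} and s̃^{(1)} ... s̃^{(n)} have already been produced by an LSTM cell such as the sketch above; V, γ, and the random inputs below are illustrative.

```python
import numpy as np

def log_softmax(l):
    m = l - np.max(l)
    return m - np.log(np.sum(np.exp(m)))

def tagging_loss(hs, s_tildes, ys, V, gamma):
    """o = sum_t -q_{t, y_t}, given the LSTM state sequences and the gold tags."""
    o = 0.0
    for h_t, st_t, y_t in zip(hs, s_tildes, ys):
        l_t = V @ np.concatenate([h_t, st_t]) + gamma   # l_t = V CONCAT(h^(t), s~^(t)) + gamma
        q_t = log_softmax(l_t)                          # q_t = LS(l_t)
        o += -q_t[y_t]                                  # o_t = -q_{t, y_t}
    return o

# Illustrative usage with random state vectors standing in for the LSTM outputs
rng = np.random.default_rng(0)
m, K, n = 8, 5, 4
hs  = [rng.normal(size=m) for _ in range(n)]
sts = [rng.normal(size=m) for _ in range(n)]
V, gamma = rng.normal(scale=0.1, size=(K, 2 * m)), np.zeros(K)
print(tagging_loss(hs, sts, [0, 2, 1, 4], V, gamma))
```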

  13. The Computational Graph

  14. A bi-directional LSTM (bi-LSTM) for tagging
  Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A sequence y_1 ... y_n of tags.
  Definitions: θ^F and θ^B are the parameters of a forward and a backward LSTM.
  Computational Graph:
  ◮ h^{(0)}, s^{(0)}, s̃^{(0)}, η^{(n+1)}, α^{(n+1)}, α̃^{(n+1)} are set to some initial values.
  ◮ For t = 1 ... n
    ◮ s^{(t)}, s̃^{(t)}, h^{(t)} = LSTM(x^{(t)}, s^{(t-1)}, s̃^{(t-1)}, h^{(t-1)}; θ^F)
  ◮ For t = n ... 1
    ◮ α^{(t)}, α̃^{(t)}, η^{(t)} = LSTM(x^{(t)}, α^{(t+1)}, α̃^{(t+1)}, η^{(t+1)}; θ^B)
  ◮ For t = 1 ... n
    ◮ l_t = V × CONCAT(h^{(t)}, s̃^{(t)}, η^{(t)}, α̃^{(t)}) + γ,  q_t = LS(l_t),  o_t = -q_{t, y_t}
  ◮ o = Σ_{t=1}^{n} o_t
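
A sketch of the bi-LSTM wiring, again assuming an LSTM step function cell(x, s, s̃, h, θ) -> (s, s̃, h) such as the lstm_cell sketch above; everything else is illustrative.

```python
import numpy as np

def log_softmax(l):
    m = l - np.max(l)
    return m - np.log(np.sum(np.exp(m)))

def bilstm_tagging_loss(xs, ys, cell, theta_F, theta_B, V, gamma, m):
    """Bi-LSTM tagging loss o = sum_t -q_{t, y_t}.

    cell(x, s, s_tilde, h, theta) -> (s, s_tilde, h) is one LSTM step,
    e.g. the lstm_cell sketch above; theta_F / theta_B are the forward /
    backward parameter sets.
    """
    n = len(xs)
    # Forward LSTM over t = 1 ... n
    s, st, h = np.zeros(m), np.zeros(m), np.zeros(m)
    fwd = []
    for t in range(n):
        s, st, h = cell(xs[t], s, st, h, theta_F)
        fwd.append((h, st))
    # Backward LSTM over t = n ... 1
    a, at, eta = np.zeros(m), np.zeros(m), np.zeros(m)
    bwd = [None] * n
    for t in reversed(range(n)):
        a, at, eta = cell(xs[t], a, at, eta, theta_B)
        bwd[t] = (eta, at)
    # Per-position losses from the concatenated forward and backward states
    o = 0.0
    for t in range(n):
        (h_t, st_t), (eta_t, at_t) = fwd[t], bwd[t]
        l_t = V @ np.concatenate([h_t, st_t, eta_t, at_t]) + gamma
        o += -log_softmax(l_t)[ys[t]]
    return o
```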

  15. The Computational Graph

  16. Results on Language Modeling
  ◮ Results from One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling, Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants.

  17. Results on Dependency Parsing
  ◮ Deep Biaffine Attention for Neural Dependency Parsing, Dozat and Manning.
  ◮ Uses a bidirectional LSTM to represent each word
  ◮ Uses LSTM representations to predict the head for each word in the sentence
  ◮ Unlabeled dependency accuracy: 95.75%

  18. Conclusions
  ◮ Recurrent units map input sequences x_1 ... x_n to representations h_1 ... h_n. The vector h_n can be used to predict a label for the entire sentence. Each vector h_i for i = 1 ... n can be used to make a prediction for position i
  ◮ LSTMs are recurrent units that make use of more involved recurrent updates. They maintain a "memory" state. Empirically they perform extremely well
  ◮ Bi-directional LSTMs allow representation of both the information before and after a position i in the sentence
  ◮ Many applications: language modeling, tagging, parsing, speech recognition; we will soon see machine translation
