

  1. Recurrent Networks, and LSTMs, for NLP Michael Collins, Columbia University

  2. Representing Sequences
  ◮ Often we want to map a sequence x_{[1:n]} = x_1 ... x_n to a label y or a distribution p(y | x_{[1:n]})
  ◮ Examples:
    ◮ Language modeling: x_{[1:n]} is the first n words in a document, y is the (n+1)'th word
    ◮ Sentiment analysis: x_{[1:n]} is a sentence (or document), y is a label indicating whether the sentence is positive/neutral/negative about a particular topic (e.g., a particular restaurant)
    ◮ Machine translation: x_{[1:n]} is a source-language sentence, y is a target-language sentence (or the first word of the target-language sentence)

  3. Representing Sequences (continued)
  ◮ Slightly more generally: map a sequence x_{[1:n]} and a position i ∈ {1 ... n} to a label y or a distribution p(y | x_{[1:n]}, i)
  ◮ Examples:
    ◮ Tagging: x_{[1:n]} is a sentence, i is a position in the sentence, y is the tag for position i
    ◮ Dependency parsing: x_{[1:n]} is a sentence, i is a position in the sentence, y ∈ {1 ... n}, y ≠ i is the head for word x_i in the dependency parse

  4. A Simple Recurrent Network
  Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A label y ∈ {1 ... K}. An integer m defining the size of the hidden dimension.
  Parameters: W^{hh} ∈ R^{m × m}, W^{hx} ∈ R^{m × d}, b^h ∈ R^m, h^0 ∈ R^m, V ∈ R^{K × m}, γ ∈ R^K. Transfer function g : R^m → R^m.
  Definitions:
    θ = {W^{hh}, W^{hx}, b^h, h^0}
    R(x^{(t)}, h^{(t-1)}; θ) = g(W^{hx} x^{(t)} + W^{hh} h^{(t-1)} + b^h)
  Computational Graph:
  ◮ For t = 1 ... n
    ◮ h^{(t)} = R(x^{(t)}, h^{(t-1)}; θ)
  ◮ l = V h^{(n)} + γ,  q = LS(l),  o = -q_y
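
As a concrete illustration (not part of the slides), here is a minimal NumPy sketch of this computational graph. It assumes g = tanh and that LS is the log-softmax function; all variable names, shapes, and the random initialization are illustrative.

```python
import numpy as np

def log_softmax(l):
    # LS(l): numerically stable log-softmax over the K label scores
    m = l - np.max(l)
    return m - np.log(np.sum(np.exp(m)))

def simple_rnn_loss(xs, y, W_hh, W_hx, b_h, h0, V, gamma, g=np.tanh):
    """Forward pass of the simple recurrent network; returns the loss o = -q_y."""
    h = h0
    for x in xs:
        # h^(t) = R(x^(t), h^(t-1); theta) = g(W^hx x^(t) + W^hh h^(t-1) + b^h)
        h = g(W_hx @ x + W_hh @ h + b_h)
    l = V @ h + gamma      # l = V h^(n) + gamma
    q = log_softmax(l)     # q = LS(l)
    return -q[y]           # o = -q_y (y is 0-indexed here)

# Illustrative usage: d = 4 inputs, m = 8 hidden units, K = 3 labels, n = 5 steps
rng = np.random.default_rng(0)
d, m, K, n = 4, 8, 3, 5
xs = [rng.normal(size=d) for _ in range(n)]
loss = simple_rnn_loss(
    xs, y=1,
    W_hh=rng.normal(scale=0.1, size=(m, m)),
    W_hx=rng.normal(scale=0.1, size=(m, d)),
    b_h=np.zeros(m), h0=np.zeros(m),
    V=rng.normal(scale=0.1, size=(K, m)), gamma=np.zeros(K),
)
print(loss)
```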

  5. The Computational Graph

  6. A Problem in Training: Exploding and Vanishing Gradients
  ◮ Calculation of gradients involves multiplication of long chains of Jacobians
  ◮ This leads to exploding and vanishing gradients
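
A small illustration of the effect (my own sketch, not from the slides): repeatedly applying a fixed matrix, as a stand-in for the chain of Jacobians in a linear recurrent network, makes the result shrink toward zero or blow up depending on the matrix's spectral radius.

```python
import numpy as np

def chain_length(W, n):
    # Norm after applying W n times to a fixed unit vector: a stand-in for
    # the product of n identical Jacobians dh^(t)/dh^(t-1).
    v = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(n):
        v = W @ v
    return np.linalg.norm(v)

rng = np.random.default_rng(0)
m = 50
W = rng.normal(size=(m, m)) / np.sqrt(m)   # spectral radius roughly 1
for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    norms = [chain_length(scale * W, n) for n in (1, 10, 50, 100)]
    print(label, ["%.2e" % x for x in norms])
```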

  7. LSTMs (Long Short-Term Memory units)
  ◮ Old definitions of the recurrent update:
    θ = {W^{hh}, W^{hx}, b^h, h^0}
    R(x^{(t)}, h^{(t-1)}; θ) = g(W^{hx} x^{(t)} + W^{hh} h^{(t-1)} + b^h)
  ◮ LSTMs give an alternative definition of R(x^{(t)}, h^{(t-1)}; θ).

  8. Definition of Sigmoid Function, Element-Wise Product
  ◮ Given any integer d ≥ 1, σ_d : R^d → R^d is the function that maps a vector v to a vector σ_d(v) such that for i = 1 ... d,
    σ_{d,i}(v) = e^{v_i} / (1 + e^{v_i})
  ◮ Given vectors a ∈ R^d and b ∈ R^d, c = a ⊙ b has components c_i = a_i × b_i for i = 1 ... d
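
Both operations are one-liners in NumPy; a small sanity check, assuming nothing beyond the definitions above:

```python
import numpy as np

def sigma(v):
    # sigma_d(v)_i = e^{v_i} / (1 + e^{v_i}), applied element-wise
    return 1.0 / (1.0 + np.exp(-v))

a = np.array([1.0, -2.0, 0.0])
b = np.array([3.0,  4.0, 5.0])
print(sigma(a))   # element-wise logistic sigmoid
print(a * b)      # a ⊙ b: element-wise (Hadamard) product
```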

  9. LSTM Equations (from Ilya Sutskever, PhD thesis)
  Maintain s_t, s̃_t, h_t as the hidden state at position t. s_t is the memory; intuitively it allows long-term memory. The function
    s_t, s̃_t, h_t = LSTM(x_t, s_{t-1}, s̃_{t-1}, h_{t-1}; θ)
  is defined as:
    u_t = CONCAT(h_{t-1}, x_t, s̃_{t-1})
    h_t = g(W^h u_t + b^h)        (hidden state)
    i_t = g(W^i u_t + b^i)        ("input")
    ι_t = σ(W^ι u_t + b^ι)        ("input gate")
    o_t = σ(W^o u_t + b^o)        ("output gate")
    f_t = σ(W^f u_t + b^f)        ("forget gate")
    s_t = s_{t-1} ⊙ f_t + i_t ⊙ ι_t     (forget and input gates control the update of the memory)
    s̃_t = s_t ⊙ o_t                      (output gate controls the information that can leave the unit)
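
A direct NumPy transcription of these equations, assuming g = tanh; the parameter names follow the slide, while the shapes and random initialization below are purely illustrative.

```python
import numpy as np

def sigma(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell(x_t, s_prev, s_tilde_prev, h_prev, theta, g=np.tanh):
    """One step of the LSTM update above; returns (s_t, s_tilde_t, h_t)."""
    u_t = np.concatenate([h_prev, x_t, s_tilde_prev])        # u_t = CONCAT(h_{t-1}, x_t, s~_{t-1})
    h_t    = g(theta["W_h"] @ u_t + theta["b_h"])            # hidden state
    i_t    = g(theta["W_i"] @ u_t + theta["b_i"])            # "input"
    iota_t = sigma(theta["W_iota"] @ u_t + theta["b_iota"])  # "input gate"
    o_t    = sigma(theta["W_o"] @ u_t + theta["b_o"])        # "output gate"
    f_t    = sigma(theta["W_f"] @ u_t + theta["b_f"])        # "forget gate"
    s_t = s_prev * f_t + i_t * iota_t     # forget and input gates control the memory update
    s_tilde_t = s_t * o_t                 # output gate controls what leaves the unit
    return s_t, s_tilde_t, h_t

# Illustrative shapes: d-dimensional inputs, m-dimensional states, so each W is m x (2m + d)
rng = np.random.default_rng(0)
d, m = 4, 8
theta = {}
for name in ("h", "i", "iota", "o", "f"):
    theta["W_" + name] = rng.normal(scale=0.1, size=(m, 2 * m + d))
    theta["b_" + name] = np.zeros(m)
s, s_tilde, h = np.zeros(m), np.zeros(m), np.zeros(m)
s, s_tilde, h = lstm_cell(rng.normal(size=d), s, s_tilde, h, theta)
print(h)
```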

  10. An LSTM-based Recurrent Network
  Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A label y ∈ {1 ... K}.
  Computational Graph:
  ◮ h^{(0)}, s^{(0)}, s̃^{(0)} are set to some initial values.
  ◮ For t = 1 ... n
    ◮ s^{(t)}, s̃^{(t)}, h^{(t)} = LSTM(x^{(t)}, s^{(t-1)}, s̃^{(t-1)}, h^{(t-1)}; θ)
  ◮ l = V^{lh} h^{(n)} + V^{ls} s̃^{(n)} + γ,  q = LS(l),  o = -q_y
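
A hedged sketch of this graph in NumPy, assuming an LSTM step function with the interface of the lstm_cell sketch above (passed in as an argument) and LS = log-softmax; the names are illustrative.

```python
import numpy as np

def log_softmax(l):
    m = l - np.max(l)
    return m - np.log(np.sum(np.exp(m)))

def lstm_network_loss(xs, y, cell, theta, V_lh, V_ls, gamma, m):
    """Loss o = -q_y for a label y on the whole sequence.

    cell(x, s, s_tilde, h, theta) -> (s, s_tilde, h) is one LSTM step,
    e.g. the lstm_cell sketch above.
    """
    s, s_tilde, h = np.zeros(m), np.zeros(m), np.zeros(m)   # initial values
    for x in xs:
        s, s_tilde, h = cell(x, s, s_tilde, h, theta)
    l = V_lh @ h + V_ls @ s_tilde + gamma   # l = V^{lh} h^(n) + V^{ls} s~^(n) + gamma
    q = log_softmax(l)                      # q = LS(l)
    return -q[y]                            # o = -q_y
```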

  11. The Computational Graph

  12. An LSTM-based Recurrent Network for Tagging
  Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A sequence y_1 ... y_n of tags.
  Computational Graph:
  ◮ h^{(0)}, s^{(0)}, s̃^{(0)} are set to some initial values.
  ◮ For t = 1 ... n
    ◮ s^{(t)}, s̃^{(t)}, h^{(t)} = LSTM(x^{(t)}, s^{(t-1)}, s̃^{(t-1)}, h^{(t-1)}; θ)
  ◮ For t = 1 ... n
    ◮ l_t = V × CONCAT(h^{(t)}, s̃^{(t)}) + γ,  q_t = LS(l_t),  o_t = -q_{t, y_t}
  ◮ o = Σ_{t=1}^{n} o_t
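
A sketch of the per-position loss computation, assuming the state sequences h^{(1)} ... h^{(n)} and s̃^{(1)} ... s̃^{(n)} have already been produced by an LSTM cell such as the sketch above; V, γ, and the random inputs below are illustrative.

```python
import numpy as np

def log_softmax(l):
    m = l - np.max(l)
    return m - np.log(np.sum(np.exp(m)))

def tagging_loss(hs, s_tildes, ys, V, gamma):
    """o = sum_t -q_{t, y_t}, given the LSTM state sequences and the gold tags."""
    o = 0.0
    for h_t, st_t, y_t in zip(hs, s_tildes, ys):
        l_t = V @ np.concatenate([h_t, st_t]) + gamma   # l_t = V CONCAT(h^(t), s~^(t)) + gamma
        q_t = log_softmax(l_t)                          # q_t = LS(l_t)
        o += -q_t[y_t]                                  # o_t = -q_{t, y_t}
    return o

# Illustrative usage with random state vectors standing in for the LSTM outputs
rng = np.random.default_rng(0)
m, K, n = 8, 5, 4
hs  = [rng.normal(size=m) for _ in range(n)]
sts = [rng.normal(size=m) for _ in range(n)]
V, gamma = rng.normal(scale=0.1, size=(K, 2 * m)), np.zeros(K)
print(tagging_loss(hs, sts, [0, 2, 1, 4], V, gamma))
```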

  13. The Computational Graph

  14. A bi-directional LSTM (bi-LSTM) for tagging
  Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A sequence y_1 ... y_n of tags.
  Definitions: θ^F and θ^B are the parameters of a forward and a backward LSTM.
  Computational Graph:
  ◮ h^{(0)}, s^{(0)}, s̃^{(0)}, η^{(n+1)}, α^{(n+1)}, α̃^{(n+1)} are set to some initial values.
  ◮ For t = 1 ... n
    ◮ s^{(t)}, s̃^{(t)}, h^{(t)} = LSTM(x^{(t)}, s^{(t-1)}, s̃^{(t-1)}, h^{(t-1)}; θ^F)
  ◮ For t = n ... 1
    ◮ α^{(t)}, α̃^{(t)}, η^{(t)} = LSTM(x^{(t)}, α^{(t+1)}, α̃^{(t+1)}, η^{(t+1)}; θ^B)
  ◮ For t = 1 ... n
    ◮ l_t = V × CONCAT(h^{(t)}, s̃^{(t)}, η^{(t)}, α̃^{(t)}) + γ,  q_t = LS(l_t),  o_t = -q_{t, y_t}
  ◮ o = Σ_{t=1}^{n} o_t
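
A sketch of the bi-LSTM wiring, again assuming an LSTM step function cell(x, s, s̃, h, θ) -> (s, s̃, h) such as the lstm_cell sketch above; everything else is illustrative.

```python
import numpy as np

def log_softmax(l):
    m = l - np.max(l)
    return m - np.log(np.sum(np.exp(m)))

def bilstm_tagging_loss(xs, ys, cell, theta_F, theta_B, V, gamma, m):
    """Bi-LSTM tagging loss o = sum_t -q_{t, y_t}.

    cell(x, s, s_tilde, h, theta) -> (s, s_tilde, h) is one LSTM step,
    e.g. the lstm_cell sketch above; theta_F / theta_B are the forward /
    backward parameter sets.
    """
    n = len(xs)
    # Forward LSTM over t = 1 ... n
    s, st, h = np.zeros(m), np.zeros(m), np.zeros(m)
    fwd = []
    for t in range(n):
        s, st, h = cell(xs[t], s, st, h, theta_F)
        fwd.append((h, st))
    # Backward LSTM over t = n ... 1
    a, at, eta = np.zeros(m), np.zeros(m), np.zeros(m)
    bwd = [None] * n
    for t in reversed(range(n)):
        a, at, eta = cell(xs[t], a, at, eta, theta_B)
        bwd[t] = (eta, at)
    # Per-position losses from the concatenated forward and backward states
    o = 0.0
    for t in range(n):
        (h_t, st_t), (eta_t, at_t) = fwd[t], bwd[t]
        l_t = V @ np.concatenate([h_t, st_t, eta_t, at_t]) + gamma
        o += -log_softmax(l_t)[ys[t]]
    return o
```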

  15. The Computational Graph

  16. Results on Language Modeling
  ◮ Results from One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling, Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants.

  17. Results on Dependency Parsing
  ◮ Deep Biaffine Attention for Neural Dependency Parsing, Dozat and Manning.
  ◮ Uses a bidirectional LSTM to represent each word
  ◮ Uses LSTM representations to predict the head for each word in the sentence
  ◮ Unlabeled dependency accuracy: 95.75%

  18. Conclusions
  ◮ Recurrent units map input sequences x_1 ... x_n to representations h_1 ... h_n. The vector h_n can be used to predict a label for the entire sentence. Each vector h_i for i = 1 ... n can be used to make a prediction for position i
  ◮ LSTMs are recurrent units that make use of more involved recurrent updates. They maintain a "memory" state. Empirically they perform extremely well
  ◮ Bi-directional LSTMs allow representation of both the information before and after a position i in the sentence
  ◮ Many applications: language modeling, tagging, parsing, speech recognition; we will soon see machine translation
