SFU NatLangLab
Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University
November 6, 2019
Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University

Part 1: Long distance dependencies
Long distance dependencies
Example
◮ He doesn't have very much confidence in himself
◮ She doesn't have very much confidence in herself

n-gram Language Models: P(w_i | w_{i-n+1}, ..., w_{i-1})
  P(himself | confidence, in)
  P(herself | confidence, in)

What we want: P(w_i | w_{<i})
  P(himself | He, ..., confidence)
  P(herself | She, ..., confidence)
Long distance dependencies
Other examples
◮ Selectional preferences: I ate lunch with a fork vs. I ate lunch with a backpack
◮ Topic: Babe Ruth was able to touch the home plate yet again vs. Lucy was able to touch the home audiences with her humour
◮ Register: consistency of register across the entire sentence, e.g. informal (Twitter) vs. formal (scientific articles)
Language Models
Chain rule and ignore some history: the trigram model

  p(w_1, ..., w_n) ≈ p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ... p(w_n | w_{n-2}, w_{n-1})
                   ≈ ∏_t p(w_{t+1} | w_{t-1}, w_t)

How can we address the long-distance issues?
◮ Skip n-gram models: skip an arbitrary distance for the n-gram context.
◮ Variable, adaptive n in n-gram models.
◮ Problems: still "all or nothing"; categorical rather than soft.
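As a concrete illustration of the trigram approximation above, here is a minimal maximum-likelihood trigram model in Python; the toy corpus, the function names, and the unsmoothed MLE estimate are illustrative assumptions, not code from the course.

```python
from collections import defaultdict

def train_trigram_lm(sentences):
    """Count trigrams and bigram contexts to build an (unsmoothed) MLE trigram model."""
    trigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            trigram_counts[(context, tokens[i])] += 1
            context_counts[context] += 1

    def prob(w, w_prev2, w_prev1):
        # p(w | w_prev2, w_prev1) = count(w_prev2, w_prev1, w) / count(w_prev2, w_prev1)
        context = (w_prev2, w_prev1)
        if context_counts[context] == 0:
            return 0.0
        return trigram_counts[(context, w)] / context_counts[context]

    return prob

# Toy corpus: the pronoun depends on the subject, but the trigram context
# ("confidence", "in") is identical in both sentences, so the model cannot tell them apart.
corpus = [
    "He does not have very much confidence in himself".split(),
    "She does not have very much confidence in herself".split(),
]
p = train_trigram_lm(corpus)
print(p("himself", "confidence", "in"))  # 0.5
print(p("herself", "confidence", "in"))  # 0.5
```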
Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University

Part 2: Neural Language Models
Neural Language Models
Use the chain rule and approximate using a neural network:

  p(w_1, ..., w_n) ≈ ∏_t p(w_{t+1} | φ(w_1, ..., w_t))

where φ(w_1, ..., w_t) captures the history with a vector s(t).

Recurrent Neural Network
◮ Let y be the output w_{t+1} for the current word w_t and history w_1, ..., w_t
◮ s(t) = f(U_xh · w(t) + W_hh · s(t-1)) where f is sigmoid / tanh
◮ s(t) encapsulates the history using a single vector of size h
◮ The output distribution over the next word w_{t+1} is given by y(t)
◮ y(t) = g(V_hy · s(t)) where g is softmax
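A minimal NumPy sketch of one RNN time step as defined above; the tanh choice for f, the row-vector shapes (which follow the parameter dimensions on the next slide), and the toy sizes are assumptions for illustration, not code from the lecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

def rnn_step(w_t, s_prev, U_xh, W_hh, V_hy):
    """One RNN time step: hidden state s(t) and next-word distribution y(t)."""
    s_t = np.tanh(w_t @ U_xh + s_prev @ W_hh)   # s(t) = f(U_xh · w(t) + W_hh · s(t-1))
    y_t = softmax(s_t @ V_hy)                   # y(t) = g(V_hy · s(t)), g = softmax
    return s_t, y_t

# Toy dimensions: embedding size x = 4, hidden size h = 3, vocabulary size |V| = 5.
rng = np.random.default_rng(0)
x, h, vocab = 4, 3, 5
U_xh = rng.normal(size=(x, h))
W_hh = rng.normal(size=(h, h))
V_hy = rng.normal(size=(h, vocab))

s = np.zeros(h)                     # initial hidden state
w = rng.normal(size=x)              # embedding of the current word w(t)
s, y = rnn_step(w, s, U_xh, W_hh, V_hy)
print(y.sum())                      # 1.0: y(t) is a probability distribution over the vocabulary
```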
Neural Language Models
Recurrent Neural Network
[Figure: single time step in an RNN]
◮ The input layer is a one-hot vector, and the output layer y has the same dimensionality as the vocabulary (10K-200K).
◮ The one-hot vector is used to look up the word embedding w
◮ The "hidden" layer s is orders of magnitude smaller (50-1K neurons)
◮ U is the matrix of weights between the input and hidden layers
◮ V is the matrix of weights between the hidden and output layers
◮ Without the recurrent weights W, this is equivalent to a bigram feedforward language model
Neural Language Models
Recurrent Neural Network
[Figure: RNN unrolled over six time steps, with inputs w(1), ..., w(6), hidden states s(1), ..., s(6), outputs y(1), ..., y(6), and shared weights U_xh, W_hh, V_hy]

What is stored and what is computed:
◮ Model parameters: w ∈ R^x (word embeddings); U_xh ∈ R^{x×h}; W_hh ∈ R^{h×h}; V_hy ∈ R^{h×y} where y = |V|.
◮ Vectors computed during the forward pass: s(t) ∈ R^h; y(t) ∈ R^y, and each y(t) is a probability distribution over the vocabulary V.
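Extending the single-step sketch from the previous slide, a forward pass over a whole sentence could look as follows; the embedding matrix E (standing in for the one-hot lookup), the toy vocabulary indices, and the sentence log-probability computation are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_pass(word_ids, E, U_xh, W_hh, V_hy):
    """Run the RNN over a sentence; return hidden states, output distributions,
    and the log-probability the model assigns to the observed next words."""
    s = np.zeros(W_hh.shape[0])
    states, outputs, log_prob = [], [], 0.0
    for t in range(len(word_ids) - 1):
        w_t = E[word_ids[t]]                        # one-hot lookup = row of embedding matrix E
        s = np.tanh(w_t @ U_xh + s @ W_hh)          # s(t)
        y = softmax(s @ V_hy)                       # y(t), a distribution over the vocabulary
        log_prob += np.log(y[word_ids[t + 1]])      # log p(w_{t+1} | history)
        states.append(s)
        outputs.append(y)
    return states, outputs, log_prob

# Toy shapes matching the slide: x = 4, h = 3, |V| = 5.
rng = np.random.default_rng(1)
E = rng.normal(size=(5, 4))         # word embeddings, one row per vocabulary item
U_xh = rng.normal(size=(4, 3))
W_hh = rng.normal(size=(3, 3))
V_hy = rng.normal(size=(3, 5))
_, _, lp = forward_pass([0, 2, 4, 1], E, U_xh, W_hh, V_hy)
print(lp)                           # total log-probability of the toy "sentence"
```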
Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University

Part 3: Training RNN Language Models
Neural Language Models
Recurrent Neural Network
[Figure: computational graph for an RNN language model]
Training of RNNLM
◮ Training is performed using Stochastic Gradient Descent (SGD)
◮ We go through all the training data iteratively and update the weight matrices U, W and V (after processing every word)
◮ Training is performed in several "epochs" (usually 5-10)
◮ An epoch is one pass through the training data
◮ As with feedforward networks we have two passes:
  Forward pass: collect the values to make a prediction (for each time step)
  Backward pass: back-propagate the error gradients (through each time step)
Training of RNNLM
Forward pass
◮ In the forward pass we compute a hidden state s(t) based on the previous states s(1), ..., s(t-1)
◮ s(t) = f(U_xh · w(t) + W_hh · s(t-1))
◮ s(t) = f(U_xh · w(t) + W_hh · f(U_xh · w(t-1) + W_hh · s(t-2)))
◮ s(t) = f(U_xh · w(t) + W_hh · f(U_xh · w(t-1) + W_hh · f(U_xh · w(t-2) + W_hh · s(t-3))))
◮ etc.
◮ Let us assume f is linear, e.g. f(x) = x.
◮ Notice how we have to compute W_hh · W_hh · ... = ∏_i W_hh
◮ By examining this repeated matrix multiplication we can show that the norm of the product of W_hh terms can go to ∞ (explode)
◮ This is why f is set to a function that returns a bounded value (sigmoid / tanh)
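A quick numerical illustration of this point (not from the slides): with a linear f, the recurrence multiplies by W_hh at every step, and the norm of the repeated product typically explodes (or vanishes) depending on the largest singular value of the randomly chosen W_hh below.

```python
import numpy as np

rng = np.random.default_rng(0)
W_hh = rng.normal(size=(3, 3))
print("largest singular value:", np.linalg.svd(W_hh, compute_uv=False)[0])

product = np.eye(3)
for k in range(1, 21):
    product = product @ W_hh          # the W_hh · W_hh · ... term in the unrolled recurrence
    if k in (1, 5, 10, 20):
        print(k, np.linalg.norm(product))
# With a largest singular value above 1 the norm grows roughly geometrically (explodes);
# below 1 it would instead shrink towards zero (vanish).
```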
Training of RNNLM
Backward pass
◮ The gradient of the error vector in the output layer, e_o(t), is computed using a cross-entropy criterion: e_o(t) = d(t) - y(t)
◮ d(t) is a target vector that represents the word w(t+1) as a one-hot (1-of-|V|) vector
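For concreteness, a tiny NumPy example of this error vector; the predicted distribution and the target index are made-up values.

```python
import numpy as np

vocab_size = 5
y_t = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # model's predicted distribution y(t)
next_word = 2                                # index of the observed next word w(t+1)

d_t = np.zeros(vocab_size)
d_t[next_word] = 1.0                         # one-hot (1-of-|V|) target vector d(t)

e_o = d_t - y_t                              # output-layer error e_o(t) = d(t) - y(t)
print(e_o)                                   # [-0.1 -0.2  0.6 -0.2 -0.1]
```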
Training of RNNLM
Backward pass
◮ The weights V between the hidden layer s(t) and the output layer y(t) are updated as
    V(t+1) = V(t) + s(t) · e_o(t) · α
◮ where α is the learning rate
Training of RNNLM
Backward pass
◮ Next, the error gradients are propagated from the output layer to the hidden layer:
    e_h(t) = d_h(e_o(t) · V, t)
◮ where the error vector is obtained using the function d_h() applied element-wise:
    d_hj(x, t) = x_j · s_j(t)(1 - s_j(t))
Training of RNNLM
Backward pass
◮ The weights U between the input layer w(t) and the hidden layer s(t) are then updated as
    U(t+1) = U(t) + w(t) · e_h(t) · α
◮ Similarly, the word embeddings w can also be updated using the error gradient.
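A minimal sketch of these output-layer and input-layer updates in NumPy, assuming the sigmoid hidden activation implied by the derivative s(1-s) on the previous slide and the row-vector shape convention used earlier (U_xh ∈ R^{x×h}, V_hy ∈ R^{h×|V|}); the toy values and the omission of the recurrent term are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(2)
x, h, vocab, alpha = 4, 3, 5, 0.1

U_xh = rng.normal(size=(x, h))
V_hy = rng.normal(size=(h, vocab))
w_t = rng.normal(size=x)                    # embedding of the current word w(t)
s_t = 1.0 / (1.0 + np.exp(-(w_t @ U_xh)))   # sigmoid hidden state s(t) (recurrent part omitted here)

d_t = np.zeros(vocab); d_t[2] = 1.0         # one-hot target for the next word
y_t = np.exp(s_t @ V_hy); y_t /= y_t.sum()  # softmax output y(t)
e_o = d_t - y_t                             # output-layer error e_o(t) = d(t) - y(t)

# Propagate the error to the hidden layer: e_h(t) = (e_o · V) ⊙ s(t)(1 - s(t))
e_h = (V_hy @ e_o) * s_t * (1.0 - s_t)

# V update: V(t+1) = V(t) + s(t) · e_o(t) · alpha
V_hy += alpha * np.outer(s_t, e_o)

# U update: U(t+1) = U(t) + w(t) · e_h(t) · alpha
U_xh += alpha * np.outer(w_t, e_h)
```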
Training of RNNLM: Backpropagation through time
Backward pass
◮ The recurrent weights W are updated by unfolding them in time and training the network as a deep feedforward neural network.
◮ The process of propagating errors back through the recurrent weights is called Backpropagation Through Time (BPTT).
Training of RNNLM: Backpropagation through time
[Fig. from [1]: RNN unfolded as a deep feedforward network, 3 time steps back in time]
Training of RNNLM: Backpropagation through time
Backward pass
◮ Error propagation is done recursively as follows (it requires the hidden-layer states from the previous time steps τ to be stored):
    e_h(t - τ - 1) = d_h(e_h(t - τ) · W, t - τ - 1)
◮ The error gradients quickly vanish as they get backpropagated in time (exploding is less likely when we use sigmoid / tanh)
◮ We use gated RNNs to stop gradients from vanishing or exploding.
◮ Popular gated RNNs are long short-term memory RNNs (LSTMs) and gated recurrent units (GRUs).
Training of RNNLM: Backpropagation through time
Backward pass
◮ The recurrent weights W are updated as:
    W(t+1) = W(t) + Σ_{z=0}^{T} s(t - z - 1) · e_h(t - z) · α
◮ Note that the matrix W is changed in one single update, not during the backpropagation of the errors.
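A compact sketch of this truncated BPTT update, assuming sigmoid hidden units (so the derivative is s(1-s)), the row-vector shapes used earlier, and made-up stored hidden states and output error; it is meant to mirror the two formulas above, not to reproduce an actual RNNLM implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
h, vocab, alpha, T = 3, 5, 0.1, 2            # truncate BPTT at T steps back in time

W_hh = rng.normal(size=(h, h))
V_hy = rng.normal(size=(h, vocab))

# Hidden states s(t-3), ..., s(t) stored during the forward pass (random stand-ins in (0, 1));
# key 0 stands for time t, -1 for t-1, and so on.
s = {k: rng.uniform(0.1, 0.9, size=h) for k in range(-3, 1)}
e_o = rng.normal(size=vocab)                 # output-layer error e_o(t) at the current step

# Error at the hidden layer for the current step: e_h(t) = (e_o · V) ⊙ s(t)(1 - s(t))
e_h = {0: (V_hy @ e_o) * s[0] * (1 - s[0])}

# Recursion from the previous slide: e_h(t - tau - 1) = d_h(e_h(t - tau) · W, t - tau - 1)
for tau in range(T):
    x = e_h[-tau] @ W_hh
    e_h[-(tau + 1)] = x * s[-(tau + 1)] * (1 - s[-(tau + 1)])

# Single update of W: W(t+1) = W(t) + sum_{z=0}^{T} s(t - z - 1) · e_h(t - z) · alpha
delta_W = sum(np.outer(s[-(z + 1)], e_h[-z]) for z in range(T + 1))
W_hh += alpha * delta_W
print(np.linalg.norm(delta_W))               # size of the accumulated BPTT update
```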
Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University

Part 4: Gated Recurrent Units
Interpolation for hidden units
u: use history or forget history
◮ For the RNN state s(t) ∈ R^h create a binary vector u ∈ {0,1}^h:
    u_i = 1: use the new hidden state (standard RNN update)
    u_i = 0: copy the previous hidden state and ignore the RNN update
◮ Create an intermediate hidden state s̃(t) where f is tanh:
    s̃(t) = f(U_xh · w(t) + W_hh · s(t-1))
◮ Use the binary vector u to interpolate between copying the prior state s(t-1) and using the new state s̃(t):
    s(t) = (1 - u) ⊙ s(t-1) + u ⊙ s̃(t)
  where ⊙ is elementwise multiplication
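A tiny NumPy illustration of this interpolation with a hand-picked binary u (the vectors are made up): positions where u_i = 0 simply copy the old state, positions where u_i = 1 take the new candidate state.

```python
import numpy as np

s_prev = np.array([0.5, -0.2, 0.9])     # s(t-1)
s_tilde = np.array([0.1, 0.7, -0.3])    # candidate state s̃(t) from the standard RNN update
u = np.array([0.0, 1.0, 0.0])           # binary gate: only the middle unit is updated

s_new = (1 - u) * s_prev + u * s_tilde  # s(t) = (1 - u) ⊙ s(t-1) + u ⊙ s̃(t)
print(s_new)                            # [0.5 0.7 0.9]
```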
Interpolation for hidden units
r: reset or retain each element of the hidden state vector
◮ For the RNN state s(t-1) ∈ R^h create a binary vector r ∈ {0,1}^h:
    r_i = 1 if s_i(t-1) should be used
    r_i = 0 if s_i(t-1) should be ignored
◮ Modify the intermediate hidden state s̃(t) where f is tanh:
    s̃(t) = f(U_xh · w(t) + W_hh · (r ⊙ s(t-1)))
◮ Use the binary vector u to interpolate between s(t-1) and s̃(t):
    s(t) = (1 - u) ⊙ s(t-1) + u ⊙ s̃(t)
Interpolation for hidden units
Learning u and r
◮ Instead of binary vectors u ∈ {0,1}^h and r ∈ {0,1}^h we want to learn u and r
◮ Let u ∈ [0,1]^h and r ∈ [0,1]^h
◮ Learn these two h-dimensional vectors using equations similar to the RNN hidden state equation:
    u(t) = σ(U^u_xh · w(t) + W^u_hh · s(t-1))
    r(t) = σ(U^r_xh · w(t) + W^r_hh · s(t-1))
◮ The sigmoid function σ ensures that each element of u and r is between 0 and 1
◮ The use-history vector u and the reset vector r use different parameters U^u, W^u and U^r, W^r
Interpolation for hidden units
Gated Recurrent Unit (GRU)
◮ Putting it all together:
    u(t) = σ(U^u_xh · w(t) + W^u_hh · s(t-1))
    r(t) = σ(U^r_xh · w(t) + W^r_hh · s(t-1))
    s̃(t) = tanh(U_xh · w(t) + W_hh · (r(t) ⊙ s(t-1)))
    s(t) = (1 - u(t)) ⊙ s(t-1) + u(t) ⊙ s̃(t)
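The four equations map directly onto a single step function; this is a minimal NumPy sketch assuming the row-vector shape convention used earlier (U matrices of shape x×h, W matrices of shape h×h) and randomly initialized toy parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(w_t, s_prev, params):
    """One GRU time step following the four equations on the slide."""
    u = sigmoid(w_t @ params["U_u"] + s_prev @ params["W_u"])           # update gate u(t)
    r = sigmoid(w_t @ params["U_r"] + s_prev @ params["W_r"])           # reset gate r(t)
    s_tilde = np.tanh(w_t @ params["U"] + (r * s_prev) @ params["W"])   # candidate state s̃(t)
    return (1 - u) * s_prev + u * s_tilde                               # s(t)

# Toy dimensions: embedding size x = 4, hidden size h = 3.
rng = np.random.default_rng(4)
x, h = 4, 3
params = {name: rng.normal(size=(x, h)) for name in ("U", "U_u", "U_r")}
params.update({name: rng.normal(size=(h, h)) for name in ("W", "W_u", "W_r")})

s = np.zeros(h)
for _ in range(5):                      # run the GRU over five random "word embeddings"
    s = gru_step(rng.normal(size=x), s, params)
print(s)                                # hidden state after five time steps
```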