NLP Programming Tutorial 8 – Recurrent Neural Nets
Graham Neubig
Nara Institute of Science and Technology (NAIST)
Feed Forward Neural Nets
● All connections point forward, from the input ϕ(x) to the output y
● It is a directed acyclic graph (DAG)
Recurrent Neural Nets (RNN)
● Some of the node outputs are fed back as input: the hidden state h_{t−1} is combined with ϕ_t(x) to produce y
● Why? This makes it possible to “memorize” information across time steps
RNN in Sequence Modeling
[Figure: an unrolled RNN; inputs x_1 … x_4 each feed a NET unit that produces outputs y_1 … y_4, with the hidden state passed from one time step to the next]
Example: POS Tagging
[Figure: the input words “natural language processing is” are fed to the unrolled RNN, which outputs the tags JJ NN NN VBZ]
Multi-class Prediction with Neural Networks
Review: Prediction Problems
Given x, predict y:
● Binary prediction (2 choices): x = a book review (“Oh, man I love this book!” / “This book is so boring...”), y = is it positive? (yes / no)
● Multi-class prediction (several choices): x = a tweet (“On the way to the park!” / “公園に行くなう!”), y = its language (English / Japanese)
● Structured prediction (millions of choices): x = a sentence (“I read a book”), y = its syntactic parse (S → NP VP; N VBD DET NN)
Review: Sigmoid Function
● The sigmoid softens the step function:
  P(y=1|x) = e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)})
[Plots: the sigmoid function vs. the step function, p(y|x) plotted against w·ϕ(x)]
softmax Function
● The sigmoid function generalized to multiple classes:
  P(y|x) = e^{w·ϕ(x,y)} / Σ_ỹ e^{w·ϕ(x,ỹ)}   (score of the current class over the sum over all classes)
● Can be expressed using matrix/vector operations:
  r = exp(W·ϕ(x))
  p = r / Σ_{r̃∈r} r̃
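As an illustration, here is a minimal NumPy sketch of the matrix/vector form above (the function name softmax and the subtraction of the maximum score for numerical stability are my additions, not part of the slides):

import numpy as np

def softmax(W, phi):
    # r = exp(W·ϕ(x)): one unnormalized score per class
    scores = np.dot(W, phi)
    r = np.exp(scores - np.max(scores))   # shift by the max for numerical stability
    return r / np.sum(r)                  # p = r / Σ r̃

# Example with made-up sizes: 3 classes, 4 features
W = np.array([[0.1, 0.2, 0.0, -0.1],
              [0.0, -0.3, 0.5, 0.2],
              [0.3, 0.1, -0.2, 0.0]])
phi = np.array([1.0, 0.0, 1.0, 0.0])
p = softmax(W, phi)                       # a probability distribution that sums to 1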
Selecting the Best Value from a Probability Distribution
● Find the index y with the highest probability:
find_best(p):
    y = 0
    for each element i in 1 .. len(p)-1:
        if p[i] > p[y]:
            y = i
    return y
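In NumPy the same operation is np.argmax; a tiny comparison with made-up values:

import numpy as np

p = np.array([0.1, 0.7, 0.2])
y = int(np.argmax(p))   # 1, the same index find_best(p) would return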
softmax Function Gradient
● The difference between the true and estimated probability distributions:
  −d err / d ϕ_out = p' − p
● The true distribution p' is expressed as a vector with only the y-th element set to 1 (a one-hot vector):
  p' = {0, 0, …, 1, …, 0}
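A quick numerical check of this fact, treating err as the negative log probability of the correct class y (the scores and class index below are made-up values):

import numpy as np

np.random.seed(0)
s = np.random.randn(4)                       # pre-softmax scores ϕ_out
y = 2                                        # correct class
p = np.exp(s) / np.sum(np.exp(s))
p_true = np.zeros(4); p_true[y] = 1          # one-hot true distribution p'

analytic = p_true - p                        # claimed value of −d err / d ϕ_out
eps = 1e-6
numeric = np.zeros(4)
for i in range(4):
    s_hi, s_lo = s.copy(), s.copy()
    s_hi[i] += eps; s_lo[i] -= eps
    err_hi = -np.log(np.exp(s_hi[y]) / np.sum(np.exp(s_hi)))
    err_lo = -np.log(np.exp(s_lo[y]) / np.sum(np.exp(s_lo)))
    numeric[i] = -(err_hi - err_lo) / (2 * eps)   # central finite difference
print(np.allclose(analytic, numeric))        # True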
Creating a 1-hot Vector
create_one_hot(id, size):
    vec = np.zeros(size)
    vec[id] = 1
    return vec
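For example, calling the function above with a made-up id and size (the definition is repeated here only so the snippet runs on its own):

import numpy as np

def create_one_hot(id, size):
    vec = np.zeros(size)
    vec[id] = 1
    return vec

print(create_one_hot(2, 5))   # [0. 0. 1. 0. 0.]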
Forward Propagation in Recurrent Nets
Review: Forward Propagation Code
forward_nn(network, φ_0):
    φ = [φ_0]   # Output of each layer
    for each layer i in 1 .. len(network):
        w, b = network[i-1]
        # Calculate the value based on the previous layer
        φ[i] = np.tanh(np.dot(w, φ[i-1]) + b)
    return φ   # Return the values of all layers
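A runnable NumPy version of the same review code, with a made-up two-layer network for illustration:

import numpy as np

def forward_nn(network, phi0):
    phi = [phi0]                               # Output of each layer
    for w, b in network:                       # Each layer is a (weights, bias) pair
        phi.append(np.tanh(np.dot(w, phi[-1]) + b))
    return phi                                 # Values of all layers

# Made-up sizes: 3 inputs -> 4 hidden units -> 2 outputs
np.random.seed(0)
network = [(np.random.randn(4, 3) * 0.1, np.zeros(4)),
           (np.random.randn(2, 4) * 0.1, np.zeros(2))]
phi = forward_nn(network, np.array([1.0, 0.0, -1.0]))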
RNN Calculation
[Figure: at each time step, the input x_t and the previous hidden state h_{t−1} feed a tanh hidden unit h_t (with bias b_r), which feeds a softmax output p_t (with bias b_o)]
h_t = tanh(w_{r,h}·h_{t−1} + w_{r,x}·x_t + b_r)
p_t = softmax(w_{o,h}·h_t + b_o)
RNN Forward Calculation
forward_rnn(w_{r,x}, w_{r,h}, b_r, w_{o,h}, b_o, x):
    h = []   # Hidden layers (at time t)
    p = []   # Output probability distributions (at time t)
    y = []   # Output values (at time t)
    for each time t in 0 .. len(x)-1:
        if t > 0:
            h[t] = tanh(w_{r,x} x[t] + w_{r,h} h[t-1] + b_r)
        else:
            h[t] = tanh(w_{r,x} x[t] + b_r)
        p[t] = softmax(w_{o,h} h[t] + b_o)   # softmax output, matching the equations above
        y[t] = find_best(p[t])
    return h, p, y
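A runnable NumPy sketch of the same forward pass (the weight shapes, one-hot inputs, and random initialization below are assumptions for illustration, not part of the pseudocode):

import numpy as np

def forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x):
    h, p, y = [], [], []
    for t in range(len(x)):
        if t > 0:
            h.append(np.tanh(np.dot(w_rx, x[t]) + np.dot(w_rh, h[t-1]) + b_r))
        else:
            h.append(np.tanh(np.dot(w_rx, x[t]) + b_r))
        score = np.dot(w_oh, h[t]) + b_o
        e = np.exp(score - np.max(score))      # softmax over the output scores
        p.append(e / np.sum(e))
        y.append(int(np.argmax(p[t])))         # best output value
    return h, p, y

# Made-up sizes: vocabulary 5, hidden units 4, output tags 3
np.random.seed(0)
w_rx, w_rh = np.random.randn(4, 5) * 0.1, np.random.randn(4, 4) * 0.1
b_r, w_oh, b_o = np.zeros(4), np.random.randn(3, 4) * 0.1, np.zeros(3)
x = [np.eye(5)[i] for i in [0, 2, 1]]          # three one-hot input vectors
h, p, y = forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x)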
Review: Back Propagation in Feed-forward Nets
Stochastic Gradient Descent
● Online training algorithm for probabilistic models (including logistic regression):
w = 0
for I iterations:
    for each labeled pair x, y in the data:
        w += α * dP(y|x)/dw
● In other words:
  ● For every training example, calculate the gradient (the direction that will increase the probability of y)
  ● Move in that direction, multiplied by the learning rate α
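A minimal sketch of this loop for the sigmoid model, using the gradient derived on the next slide (the toy data, number of iterations, and learning rate are made up):

import numpy as np

def train_sgd(data, feat_size, iterations=10, alpha=0.1):
    w = np.zeros(feat_size)
    for _ in range(iterations):
        for phi, y in data:                     # y is +1 or -1
            e = np.exp(np.dot(w, phi))
            grad = y * phi * e / (1 + e) ** 2   # dP(y|x)/dw for the sigmoid model
            w += alpha * grad                   # move towards higher P(y|x)
    return w

# Toy data: two 2-dimensional feature vectors with opposite labels
data = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]
w = train_sgd(data, feat_size=2)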
Gradient of the Sigmoid Function
● Take the derivative of the probability:
  d/dw P(y=1|x) = d/dw [ e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)}) ]
                = ϕ(x) e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)})²
  d/dw P(y=−1|x) = d/dw [ 1 − e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)}) ]
                 = −ϕ(x) e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)})²
[Plot: dp(y|x)/dw·ϕ(x) against w·ϕ(x), peaking at w·ϕ(x) = 0]
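A quick finite-difference check of the derivative for P(y=1|x) (the weight and feature values are arbitrary):

import numpy as np

w, phi = np.array([0.5, -0.3]), np.array([1.0, 2.0])

def p_pos(w):                                       # P(y=1|x) = e^{w·ϕ} / (1 + e^{w·ϕ})
    e = np.exp(np.dot(w, phi))
    return e / (1 + e)

e = np.exp(np.dot(w, phi))
analytic = phi * e / (1 + e) ** 2                   # ϕ(x) e^{w·ϕ} / (1 + e^{w·ϕ})²
eps = 1e-6
numeric = np.array([(p_pos(w + eps * np.eye(2)[i]) - p_pos(w - eps * np.eye(2)[i])) / (2 * eps)
                    for i in range(2)])
print(np.allclose(analytic, numeric))               # True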
Learning: Don't Know Derivative for Hidden Units!
● For neural nets, we only know the correct output for the last layer:
  d/dw_4 P(y=1|x) = h(x) e^{w_4·h(x)} / (1 + e^{w_4·h(x)})²
  d/dw_1 P(y=1|x) = ?   d/dw_2 P(y=1|x) = ?   d/dw_3 P(y=1|x) = ?
[Figure: ϕ(x) feeds hidden units h(x) through weights w_1, w_2, w_3, which feed the output y=1 through w_4]
Answer: Back-Propagation
● Calculate the derivative with the chain rule:
  d/dw_1 P(y=1|x) = dP(y=1|x)/d(w_4·h(x)) × d(w_4·h(x))/dh_1(x) × dh_1(x)/dw_1
                  = [ e^{w_4·h(x)} / (1 + e^{w_4·h(x)})² ] × w_{1,4} × dh_1(x)/dw_1
    (error of the next unit, δ_4)  ×  (weight)  ×  (gradient of this unit)
● In general, calculate the error of unit i based on the next units j:
  d/dw_i P(y=1|x) = dh_i(x)/dw_i × Σ_j δ_j w_{i,j}
Conceptual Picture
● Send errors back through the net
[Figure: the error δ_4 at the output y is sent back through weights w_4, w_1, w_2, w_3 to give errors δ_1, δ_2, δ_3 at the hidden units above ϕ(x)]
Back Propagation in Recurrent Nets
What Errors do we Know?
[Figure: unrolled RNN with output errors δ_{o,1} … δ_{o,4} at the outputs y_1 … y_4 and recurrent errors δ_{r,1} … δ_{r,3} passed between time steps]
● We know the output errors δ_o
● We must use back-propagation to find the recurrent errors δ_r
How to Back-Propagate?
● Standard back-propagation through time (BPTT):
  ● For each δ_o, calculate n steps of δ_r
● Full gradient calculation:
  ● Use dynamic programming to calculate the whole sequence
Back Propagation through Time
[Figure: unrolled RNN; a single output error δ_{o,t} is propagated back through the recurrent connections for a limited number of steps]
● Use only one output error at a time
● Stop after n steps (here, n=2)
Full Gradient Calculation
[Figure: unrolled RNN; the errors δ_{o,1} … δ_{o,4} from all outputs are combined in a single backward pass over the whole sequence]
● First, calculate the result of the whole net in the forward direction
● Then, calculate the gradients backwards
BPTT? Full Gradient?
● Full gradient:
  ● + Faster, and no limit on how far back dependencies can reach
  ● − Must save the results of the whole sequence in memory
● BPTT:
  ● + Only needs to remember the results of the past few steps
  ● − Slower, and less accurate for long dependencies
Vanishing Gradient in Neural Nets
[Figure: as the output error δ_{o,4} is propagated back through time, it shrinks from medium to small to tiny to very tiny]
● “Long Short-Term Memory” (LSTM) is designed to solve this
RNN Full Gradient Calculation
gradient_rnn(w_{r,x}, w_{r,h}, b_r, w_{o,h}, b_o, x, h, p, y'):
    initialize Δw_{r,x}, Δw_{r,h}, Δb_r, Δw_{o,h}, Δb_o
    δ_r' = np.zeros(len(b_r))   # Error from the following time step
    for each time t in len(x)-1 .. 0:
        p' = create_one_hot(y'[t], len(b_o))
        δ_o' = p' − p[t]   # Output error
        Δw_{o,h} += np.outer(h[t], δ_o'); Δb_o += δ_o'   # Output gradient
        δ_r = np.dot(δ_r', w_{r,h}) + np.dot(δ_o', w_{o,h})   # Backprop
        δ_r' = δ_r * (1 − h[t]²)   # tanh gradient
        Δw_{r,x} += np.outer(x[t], δ_r'); Δb_r += δ_r'   # Hidden gradient
        if t != 0:
            Δw_{r,h} += np.outer(h[t-1], δ_r')
    return Δw_{r,x}, Δw_{r,h}, Δb_r, Δw_{o,h}, Δb_o
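A runnable NumPy translation of this backward pass, matching the forward_rnn sketch given earlier; note that the outer products here are oriented to fit the weight shapes assumed in that sketch (each weight matrix has shape output × input), which is an assumption on my part rather than something the pseudocode specifies:

import numpy as np

def gradient_rnn(w_rx, w_rh, b_r, w_oh, b_o, x, h, p, y_true):
    # Gradient accumulators, one per parameter
    dw_rx, dw_rh, db_r = np.zeros_like(w_rx), np.zeros_like(w_rh), np.zeros_like(b_r)
    dw_oh, db_o = np.zeros_like(w_oh), np.zeros_like(b_o)
    delta_rp = np.zeros(len(b_r))                      # error from the following time step
    for t in range(len(x) - 1, -1, -1):
        p_true = np.zeros(len(b_o)); p_true[y_true[t]] = 1      # one-hot correct output
        delta_op = p_true - p[t]                                # output error
        dw_oh += np.outer(delta_op, h[t]); db_o += delta_op     # output gradient
        delta_r = np.dot(delta_rp, w_rh) + np.dot(delta_op, w_oh)   # backprop into hidden layer
        delta_rp = delta_r * (1 - h[t] ** 2)                    # tanh gradient
        dw_rx += np.outer(delta_rp, x[t]); db_r += delta_rp     # hidden gradient
        if t != 0:
            dw_rh += np.outer(delta_rp, h[t-1])
    return dw_rx, dw_rh, db_r, dw_oh, db_o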
Weight Update
update_weights(w_{r,x}, w_{r,h}, b_r, w_{o,h}, b_o, Δw_{r,x}, Δw_{r,h}, Δb_r, Δw_{o,h}, Δb_o, λ):
    w_{r,x} += λ * Δw_{r,x}
    w_{r,h} += λ * Δw_{r,h}
    b_r += λ * Δb_r
    w_{o,h} += λ * Δw_{o,h}
    b_o += λ * Δb_o
Overall Training Algorithm
# Create features
create map x_ids, y_ids, array data
for each labeled pair x, y in the data:
    add (create_ids(x, x_ids), create_ids(y, y_ids)) to data
initialize net randomly
# Perform training
for I iterations:
    for each labeled pair x, y' in data:
        h, p, y = forward_rnn(net, x)
        Δ = gradient_rnn(net, x, h, p, y')
        update_weights(net, Δ, λ)
print net to weight_file
print x_ids, y_ids to id_file
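Tying the pieces together, a minimal end-to-end sketch on a toy tagging problem; it reuses the forward_rnn and gradient_rnn sketches above, and the toy data, dimensions, iteration count, and learning rate λ are all made up for illustration:

import numpy as np

# Toy data: each x is a list of one-hot word vectors, each y' a list of tag ids
np.random.seed(0)
vocab, tags, hidden = 5, 3, 4
data = [([np.eye(vocab)[i] for i in [0, 2, 1]], [0, 1, 2]),
        ([np.eye(vocab)[i] for i in [3, 4]], [2, 0])]

# Initialize the net randomly
w_rx, w_rh = np.random.randn(hidden, vocab) * 0.1, np.random.randn(hidden, hidden) * 0.1
b_r = np.zeros(hidden)
w_oh, b_o = np.random.randn(tags, hidden) * 0.1, np.zeros(tags)

lam = 0.1                                              # learning rate λ
for _ in range(20):                                    # I iterations
    for x, y_true in data:
        h, p, y = forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x)
        dw_rx, dw_rh, db_r, dw_oh, db_o = gradient_rnn(w_rx, w_rh, b_r, w_oh, b_o, x, h, p, y_true)
        w_rx += lam * dw_rx; w_rh += lam * dw_rh; b_r += lam * db_r
        w_oh += lam * dw_oh; b_o += lam * db_o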
Exercise