IN5550: Neural Methods in Natural Language Processing
Recurrent Neural Networks
Stephan Oepen, University of Oslo, March 10, 2019
Our Roadmap

Today
◮ Language structure: sequences, trees, graphs
◮ Recurrent Neural Networks
◮ Different types of sequence labeling
◮ Learning to forget: gated RNNs

Next Week
◮ RNNs for structured prediction
◮ Recursive RNN variants
◮ A selection of RNN applications

Later
◮ Contextualized embeddings and transfer learning
◮ Conditioned generation and attention
◮ A CNN & RNN marriage: transformer architectures
Recap: CNN Pros and Cons
◮ Can learn to represent large n-grams efficiently, without blowing up the parameter space and without having to represent the whole vocabulary: parameter sharing.
◮ Easily parallelizable: each ‘region’ that a convolutional filter operates on is independent of the others; the entire input can be processed concurrently. (Each filter is also independent.)
◮ The cost of this is that we have to stack convolutions into deep layers in order to ‘view’ the entire input, and each of those layers is in fact calculated independently.
◮ Not designed for modeling sequential language data: does not offer a very natural way of modeling long-range and structured dependencies.
But Language is So Rich in Structure
A similar technique is almost impossible to apply to other crops.
http://mrp.nlpl.eu/index.php?page=2
Okay, Maybe Start with Somewhat Simpler Structures
A similar technique is almost impossible to apply to other crops.

[Universal Dependencies tree over the sentence, with relations root, punct, nsubj, obl, det, cop, ccomp, case, amod, advmod, mark, amod]

A/DET  similar/ADJ  technique/NOUN  is/AUX  almost/ADV  impossible/ADJ  to/PART  apply/VERB  to/ADP  other/ADJ  crops/NOUN  ./PUNCT

http://epe.nlpl.eu/index.php?page=1
Recurrent Neural Networks in the Abstract
◮ Recurrent Neural Networks (RNNs) take variable-length sequences as input
◮ are highly sensitive to linear order; need not make any Markov assumptions
◮ map an input sequence x_{1:n} to an output sequence y_{1:n}
◮ internal state sequence s_{1:n} serves as ‘history’

RNN(x_{1:n}, s_0) = y_{1:n}
s_i = R(s_{i-1}, x_i)
y_i = O(s_i)

x_i ∈ R^{d_x};  y_i ∈ R^{d_y};  s_i ∈ R^{f(d_y)}
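As a concrete illustration, a minimal Python/NumPy sketch of this abstraction; the particular R and O below are placeholders chosen only to make the recurrence runnable, not any of the RNN variants defined later:

```python
import numpy as np

def rnn(x_seq, s_0, R, O):
    """Abstract RNN: s_i = R(s_{i-1}, x_i), y_i = O(s_i)."""
    s = s_0
    ys = []
    for x in x_seq:        # strict left-to-right scan: linear order matters
        s = R(s, x)        # the new state summarizes the full history so far
        ys.append(O(s))    # one output per input position
    return ys

# Placeholder choices of R and O, just so the sketch runs end to end:
d = 4
R = lambda s, x: np.tanh(s + x)
O = lambda s: s
y_seq = rnn([np.random.randn(d) for _ in range(6)], np.zeros(d), R, O)
```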
Still High-Level: The RNN Abstraction Unrolled
◮ Each state s_i and output y_i depend on the full previous context, e.g.
  s_4 = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
◮ Functions R(·) and O(·) are shared across time points; fewer parameters
Implementing the RNN Abstraction
◮ We have yet to define the nature of the R(·) and O(·) functions
◮ RNNs are actually a family of architectures; much variation in R(·)

Arguably the Most Basic RNN Implementation
s_i = R(s_{i-1}, x_i) = s_{i-1} + x_i
y_i = O(s_i) = s_i

◮ Does this maybe look familiar? Merely a continuous bag of words
◮ order-insensitive: Cisco acquired Tandberg ≡ Tandberg acquired Cisco
◮ actually has no parameters of its own: θ = {}; thus, no learning ability
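A tiny sketch of why this addition ‘RNN’ is merely a continuous bag of words: summing the input vectors gives the same final state for both word orders (the embeddings below are random placeholders):

```python
import numpy as np

def sum_rnn_final_state(x_seq, d=3):
    s = np.zeros(d)
    for x in x_seq:
        s = s + x          # s_i = s_{i-1} + x_i: no parameters, no non-linearity
    return s

emb = {w: np.random.randn(3) for w in ["Cisco", "acquired", "Tandberg"]}
a = sum_rnn_final_state([emb[w] for w in ["Cisco", "acquired", "Tandberg"]])
b = sum_rnn_final_state([emb[w] for w in ["Tandberg", "acquired", "Cisco"]])
print(np.allclose(a, b))   # True: the two orders are indistinguishable
```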
The ‘Simple’ RNN (Elman, 1990)
◮ Want to learn the dependencies between elements of the sequence
◮ the nature of the R(·) function needs to be determined during training

The Elman RNN
s_i = R(s_{i-1}, x_i) = g(s_{i-1} W^s + x_i W^x + b)
y_i = O(s_i) = s_i

x_i ∈ R^{d_x};  s_i, y_i ∈ R^{d_s};  W^x ∈ R^{d_x × d_s};  W^s ∈ R^{d_s × d_s};  b ∈ R^{d_s}

◮ Linear transformations of states and inputs; non-linear activation g
◮ alternative, equivalent definition of R(·): s_i = g([s_{i-1}; x_i] W + b)
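A NumPy sketch of this Elman update, assuming g = tanh; the dimensions and the small random initialization are arbitrary choices for illustration:

```python
import numpy as np

class ElmanRNN:
    """s_i = g(s_{i-1} W^s + x_i W^x + b), y_i = s_i, with g = tanh."""
    def __init__(self, d_x, d_s, seed=0):
        rng = np.random.default_rng(seed)
        self.W_x = rng.normal(scale=0.1, size=(d_x, d_s))
        self.W_s = rng.normal(scale=0.1, size=(d_s, d_s))
        self.b = np.zeros(d_s)

    def step(self, s_prev, x):
        return np.tanh(s_prev @ self.W_s + x @ self.W_x + self.b)

    def forward(self, x_seq):
        s = np.zeros(self.b.shape[0])
        states = []
        for x in x_seq:
            s = self.step(s, x)
            states.append(s)     # here O is the identity, so y_i = s_i
        return states

cell = ElmanRNN(d_x=5, d_s=8)
ys = cell.forward([np.random.randn(5) for _ in range(4)])
```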
Training Recurrent Neural Networks
◮ Embed the RNN in an end-to-end task, e.g. classification from the output states y_i
◮ standard loss functions, backpropagation, and optimizers apply; backpropagation through the unrolled recurrence is known as backpropagation through time (BPTT)
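One way such an end-to-end setup might look in PyTorch (a sketch, not the course code): an Elman-style nn.RNN inside a per-token classifier, trained with cross-entropy; calling backward() on the loss backpropagates through the unrolled recurrence, i.e. BPTT. Vocabulary size, dimensions, and the random batch are placeholders.

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):
    def __init__(self, vocab=1000, d_emb=32, d_s=64, n_tags=17):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.RNN(d_emb, d_s, batch_first=True)   # Elman-style cell
        self.out = nn.Linear(d_s, n_tags)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        states, _ = self.rnn(self.emb(tokens))  # states: (batch, seq_len, d_s)
        return self.out(states)                 # one score vector per position

model = Tagger()
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (8, 20))        # fake batch, just for the shapes
tags = torch.randint(0, 17, (8, 20))
logits = model(tokens)
loss = loss_fn(logits.reshape(-1, 17), tags.reshape(-1))
loss.backward()                                 # backpropagation through time
opt.step()
```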
An Alternate Training Regime
◮ Focus on the final output state: y_n as an encoding of the full sequence x_{1:n}
◮ looking familiar? map a variable-length sequence to a fixed-size vector
◮ sentence-level classification; or as input to a conditioned generator
◮ aka sequence-to-sequence model; e.g. translation or summarization
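A sketch of this regime in PyTorch (again with placeholder dimensions): only the final state is kept and fed to a classifier, so the whole variable-length sequence is compressed into one fixed-size vector.

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Encode the whole sequence in the final RNN state, then classify."""
    def __init__(self, vocab=1000, d_emb=32, d_s=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.RNN(d_emb, d_s, batch_first=True)
        self.out = nn.Linear(d_s, n_classes)

    def forward(self, tokens):
        _, s_n = self.rnn(self.emb(tokens))   # s_n: (1, batch, d_s), final state
        return self.out(s_n.squeeze(0))       # fixed-size vector -> class scores

model = SentenceClassifier()
scores = model(torch.randint(0, 1000, (4, 12)))   # (4, n_classes)
```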
Unrolled RNNs, in a Sense, are Very Deep MLPs
s_i = R(s_{i-1}, x_i) = g(s_{i-1} W^s + x_i W^x + b)
    = g(g(s_{i-2} W^s + x_{i-1} W^x + b) W^s + x_i W^x + b)

◮ W^s, W^x are shared across all ‘layers’ → exploding or vanishing gradients
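A small numerical illustration of the effect (the dimensions and weight scales are arbitrary): repeatedly multiplying a gradient by the same recurrence matrix, as the backward pass through the unrolled network roughly does, either shrinks it towards zero or blows it up, depending on the spectrum of W^s.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
grad = rng.normal(size=d)

for scale in (0.05, 0.2):                 # two arbitrary weight scales
    W_s = rng.normal(scale=scale, size=(d, d))
    g = grad.copy()
    for _ in range(50):                   # 50 steps back through the unrolled net
        g = W_s.T @ g                     # ignoring the activation's Jacobian
    print(scale, np.linalg.norm(g))       # one norm vanishes, the other explodes
```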
Variants: Bi-Directional Recurrent Networks
◮ Capture full left and right context: ‘history’ and ‘future’ for each x_i
◮ moderate increase in parameters (roughly double); still linear-time computation
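In PyTorch terms (a sketch with made-up dimensions), a bidirectional RNN runs one left-to-right and one right-to-left pass and concatenates the two states at every position:

```python
import torch
import torch.nn as nn

d_emb, d_s = 32, 64
birnn = nn.RNN(d_emb, d_s, batch_first=True, bidirectional=True)

x = torch.randn(8, 20, d_emb)   # (batch, seq_len, d_emb), placeholder input
states, _ = birnn(x)
print(states.shape)             # (8, 20, 2 * d_s): [forward; backward] per token
# Roughly double the parameters of one direction, but each direction is still
# a linear-time scan over the sequence.
```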
Variants: ‘Deep’ (Stacked) Recurrent Networks
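A sketch of stacking in PyTorch (placeholder sizes): the output sequence of each RNN layer becomes the input sequence of the next; num_layers controls the depth.

```python
import torch
import torch.nn as nn

d_emb, d_s = 32, 64
deep_rnn = nn.RNN(d_emb, d_s, num_layers=3, batch_first=True)   # 3 stacked layers

x = torch.randn(8, 20, d_emb)
states, s_n = deep_rnn(x)
print(states.shape)   # (8, 20, d_s): outputs of the top layer only
print(s_n.shape)      # (3, 8, d_s): final state of each of the 3 layers
```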
A Note on Architecture Design
"While it is not theoretically clear what is the additional power gained by the deeper architectures, it was observed empirically that deep RNNs work better than shallower ones on some tasks. [...] Many works report results using layered RNN architectures, but do not explicitly compare to one-layer RNNs. In the experiments of my research group, using two or more layers indeed often improves over using a single one." (Goldberg, 2017, p. 172)