IN5550: Neural Methods in Natural Language Processing
Recurrent Neural Networks
Stephan Oepen, University of Oslo, March 10, 2019
Our Roadmap

Today
◮ Language structure: sequences, trees, graphs
◮ Recurrent Neural Networks
◮ Different types of sequence labeling
◮ Learning to forget: gated RNNs

Next Week
◮ RNNs for structured prediction
◮ Recursive RNN variants
◮ A selection of RNN applications

Later
◮ Contextualized embeddings and transfer learning
◮ Conditioned generation and attention
◮ A CNN & RNN marriage: transformer architectures
Recap: CNN Pros and Cons
◮ Can learn to represent large n-grams efficiently, without blowing up the parameter space and without having to represent the whole vocabulary: parameter sharing.
◮ Easily parallelizable: each ‘region’ that a convolutional filter operates on is independent of the others; the entire input can be processed concurrently. (Each filter is also independent.)
◮ The cost of this is that we have to stack convolutions into deep layers in order to ‘view’ the entire input, and each of those layers is in fact calculated independently.
◮ Not designed for modeling sequential language data: does not offer a very natural way of modeling long-range and structured dependencies.
But Language is So Rich in Structure
A similar technique is almost impossible to apply to other crops.
http://mrp.nlpl.eu/index.php?page=2
Okay, Maybe Start with Somewhat Simpler Structures
A similar technique is almost impossible to apply to other crops.

[Universal Dependencies tree over the sentence, with relations root, punct, nsubj, obl, det, cop, ccomp, case, amod, advmod, mark, amod]

A/DET  similar/ADJ  technique/NOUN  is/AUX  almost/ADV  impossible/ADJ  to/PART  apply/VERB  to/ADP  other/ADJ  crops/NOUN  ./PUNCT

http://epe.nlpl.eu/index.php?page=1
Recurrent Neural Networks in the Abstract
◮ Recurrent Neural Networks (RNNs) take variable-length sequences as input
◮ are highly sensitive to linear order; need not make any Markov assumptions
◮ map an input sequence x_{1:n} to an output sequence y_{1:n}
◮ internal state sequence s_{1:n} serves as ‘history’

RNN(x_{1:n}, s_0) = y_{1:n}
s_i = R(s_{i-1}, x_i)
y_i = O(s_i)

x_i ∈ R^{d_x};  y_i ∈ R^{d_y};  s_i ∈ R^{f(d_y)}
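As a concrete illustration, a minimal Python/NumPy sketch of this abstraction; the particular R and O below are placeholders chosen only to make the recurrence runnable, not any of the RNN variants defined later:

```python
import numpy as np

def rnn(x_seq, s_0, R, O):
    """Abstract RNN: s_i = R(s_{i-1}, x_i), y_i = O(s_i)."""
    s = s_0
    ys = []
    for x in x_seq:        # strict left-to-right scan: linear order matters
        s = R(s, x)        # the new state summarizes the full history so far
        ys.append(O(s))    # one output per input position
    return ys

# Placeholder choices of R and O, just so the sketch runs end to end:
d = 4
R = lambda s, x: np.tanh(s + x)
O = lambda s: s
y_seq = rnn([np.random.randn(d) for _ in range(6)], np.zeros(d), R, O)
```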
Still High-Level: The RNN Abstraction Unrolled
◮ Each state s_i and output y_i depend on the full previous context, e.g.
  s_4 = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
◮ Functions R(·) and O(·) are shared across time points; fewer parameters
Implementing the RNN Abstraction
◮ We have yet to define the nature of the R(·) and O(·) functions
◮ RNNs are actually a family of architectures; much variation in R(·)

Arguably the Most Basic RNN Implementation
s_i = R(s_{i-1}, x_i) = s_{i-1} + x_i
y_i = O(s_i) = s_i

◮ Does this maybe look familiar? Merely a continuous bag of words
◮ order-insensitive: Cisco acquired Tandberg ≡ Tandberg acquired Cisco
◮ actually has no parameters of its own: θ = {}; thus, no learning ability
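A tiny sketch of why this addition ‘RNN’ is merely a continuous bag of words: summing the input vectors gives the same final state for both word orders (the embeddings below are random placeholders):

```python
import numpy as np

def sum_rnn_final_state(x_seq, d=3):
    s = np.zeros(d)
    for x in x_seq:
        s = s + x          # s_i = s_{i-1} + x_i: no parameters, no non-linearity
    return s

emb = {w: np.random.randn(3) for w in ["Cisco", "acquired", "Tandberg"]}
a = sum_rnn_final_state([emb[w] for w in ["Cisco", "acquired", "Tandberg"]])
b = sum_rnn_final_state([emb[w] for w in ["Tandberg", "acquired", "Cisco"]])
print(np.allclose(a, b))   # True: the two orders are indistinguishable
```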
The ‘Simple’ RNN (Elman, 1990)
◮ Want to learn the dependencies between elements of the sequence
◮ the nature of the R(·) function needs to be determined during training

The Elman RNN
s_i = R(s_{i-1}, x_i) = g(s_{i-1} W^s + x_i W^x + b)
y_i = O(s_i) = s_i

x_i ∈ R^{d_x};  s_i, y_i ∈ R^{d_s};  W^x ∈ R^{d_x × d_s};  W^s ∈ R^{d_s × d_s};  b ∈ R^{d_s}

◮ Linear transformations of states and inputs; non-linear activation g
◮ alternative, equivalent definition of R(·): s_i = g([s_{i-1}; x_i] W + b)
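A NumPy sketch of this Elman update, assuming g = tanh; the dimensions and the small random initialization are arbitrary choices for illustration:

```python
import numpy as np

class ElmanRNN:
    """s_i = g(s_{i-1} W^s + x_i W^x + b), y_i = s_i, with g = tanh."""
    def __init__(self, d_x, d_s, seed=0):
        rng = np.random.default_rng(seed)
        self.W_x = rng.normal(scale=0.1, size=(d_x, d_s))
        self.W_s = rng.normal(scale=0.1, size=(d_s, d_s))
        self.b = np.zeros(d_s)

    def step(self, s_prev, x):
        return np.tanh(s_prev @ self.W_s + x @ self.W_x + self.b)

    def forward(self, x_seq):
        s = np.zeros(self.b.shape[0])
        states = []
        for x in x_seq:
            s = self.step(s, x)
            states.append(s)     # here O is the identity, so y_i = s_i
        return states

cell = ElmanRNN(d_x=5, d_s=8)
ys = cell.forward([np.random.randn(5) for _ in range(4)])
```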
Training Recurrent Neural Networks
◮ Embed the RNN in an end-to-end task, e.g. classification from the output states y_i
◮ standard loss functions, backpropagation, and optimizers apply; backpropagation through the unrolled recurrence is known as backpropagation through time (BPTT)
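One way such an end-to-end setup might look in PyTorch (a sketch, not the course code): an Elman-style nn.RNN inside a per-token classifier, trained with cross-entropy; calling backward() on the loss backpropagates through the unrolled recurrence, i.e. BPTT. Vocabulary size, dimensions, and the random batch are placeholders.

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):
    def __init__(self, vocab=1000, d_emb=32, d_s=64, n_tags=17):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.RNN(d_emb, d_s, batch_first=True)   # Elman-style cell
        self.out = nn.Linear(d_s, n_tags)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        states, _ = self.rnn(self.emb(tokens))  # states: (batch, seq_len, d_s)
        return self.out(states)                 # one score vector per position

model = Tagger()
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (8, 20))        # fake batch, just for the shapes
tags = torch.randint(0, 17, (8, 20))
logits = model(tokens)
loss = loss_fn(logits.reshape(-1, 17), tags.reshape(-1))
loss.backward()                                 # backpropagation through time
opt.step()
```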
An Alternate Training Regime
◮ Focus on the final output state: y_n as an encoding of the full sequence x_{1:n}
◮ looking familiar? map a variable-length sequence to a fixed-size vector
◮ sentence-level classification; or as input to a conditioned generator
◮ aka sequence-to-sequence model; e.g. translation or summarization
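A sketch of this regime in PyTorch (again with placeholder dimensions): only the final state is kept and fed to a classifier, so the whole variable-length sequence is compressed into one fixed-size vector.

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Encode the whole sequence in the final RNN state, then classify."""
    def __init__(self, vocab=1000, d_emb=32, d_s=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.RNN(d_emb, d_s, batch_first=True)
        self.out = nn.Linear(d_s, n_classes)

    def forward(self, tokens):
        _, s_n = self.rnn(self.emb(tokens))   # s_n: (1, batch, d_s), final state
        return self.out(s_n.squeeze(0))       # fixed-size vector -> class scores

model = SentenceClassifier()
scores = model(torch.randint(0, 1000, (4, 12)))   # (4, n_classes)
```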
Unrolled RNNs, in a Sense, are Very Deep MLPs
s_i = R(s_{i-1}, x_i) = g(s_{i-1} W^s + x_i W^x + b)
    = g(g(s_{i-2} W^s + x_{i-1} W^x + b) W^s + x_i W^x + b)

◮ W^s, W^x are shared across all ‘layers’ → exploding or vanishing gradients
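A small numerical illustration of the effect (the dimensions and weight scales are arbitrary): repeatedly multiplying a gradient by the same recurrence matrix, as the backward pass through the unrolled network roughly does, either shrinks it towards zero or blows it up, depending on the spectrum of W^s.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
grad = rng.normal(size=d)

for scale in (0.05, 0.2):                 # two arbitrary weight scales
    W_s = rng.normal(scale=scale, size=(d, d))
    g = grad.copy()
    for _ in range(50):                   # 50 steps back through the unrolled net
        g = W_s.T @ g                     # ignoring the activation's Jacobian
    print(scale, np.linalg.norm(g))       # one norm vanishes, the other explodes
```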
Variants: Bi-Directional Recurrent Networks
◮ Capture full left and right context: ‘history’ and ‘future’ for each x_i
◮ moderate increase in parameters (roughly double); still linear-time computation
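In PyTorch terms (a sketch with made-up dimensions), a bidirectional RNN runs one left-to-right and one right-to-left pass and concatenates the two states at every position:

```python
import torch
import torch.nn as nn

d_emb, d_s = 32, 64
birnn = nn.RNN(d_emb, d_s, batch_first=True, bidirectional=True)

x = torch.randn(8, 20, d_emb)   # (batch, seq_len, d_emb), placeholder input
states, _ = birnn(x)
print(states.shape)             # (8, 20, 2 * d_s): [forward; backward] per token
# Roughly double the parameters of one direction, but each direction is still
# a linear-time scan over the sequence.
```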
Variants: ‘Deep’ (Stacked) Recurrent Networks
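A sketch of stacking in PyTorch (placeholder sizes): the output sequence of each RNN layer becomes the input sequence of the next; num_layers controls the depth.

```python
import torch
import torch.nn as nn

d_emb, d_s = 32, 64
deep_rnn = nn.RNN(d_emb, d_s, num_layers=3, batch_first=True)   # 3 stacked layers

x = torch.randn(8, 20, d_emb)
states, s_n = deep_rnn(x)
print(states.shape)   # (8, 20, d_s): outputs of the top layer only
print(s_n.shape)      # (3, 8, d_s): final state of each of the 3 layers
```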
A Note on Architecture Design
"While it is not theoretically clear what is the additional power gained by the deeper architectures, it was observed empirically that deep RNNs work better than shallower ones on some tasks. [...] Many works report results using layered RNN architectures, but do not explicitly compare to one-layer RNNs. In the experiments of my research group, using two or more layers indeed often improves over using a single one." (Goldberg, 2017, p. 172)