

  1. DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING Lecture 2: Recurrent Neural Networks (RNNs) Caio Corro

  2. LECTURE 1 RECALL
Language modeling with a multi-layer perceptron
➤ 2nd order Markov chain: p(y_1, ..., y_n) = p(y_1) p(y_2 | y_1) ∏_{i=3}^{n} p(y_i | y_{i-1}, y_{i-2})
➤ x = [embedding of y_{i-2} ; embedding of y_{i-1}]   (concatenate the embeddings of the two previous words)
➤ z = σ(U^(1) x + b^(1))   (hidden representation)
➤ w = U^(2) z + b^(2)   (output projection)
➤ p(y_i | y_{i-1}, y_{i-2}) = exp(w_{y_i}) / ∑_{y'} exp(w_{y'})   (probability distribution)
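A minimal sketch of this multi-layer perceptron language model written in PyTorch (the lecture does not prescribe a framework; the class name TrigramLM, the embedding and hidden sizes, and the choice of the logistic sigmoid for σ are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TrigramLM(nn.Module):
    """MLP language model: p(y_i | y_{i-1}, y_{i-2}) from the two previous words."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(2 * emb_dim, hidden_dim)   # U^(1), b^(1)
        self.output = nn.Linear(hidden_dim, vocab_size)    # U^(2), b^(2)

    def forward(self, prev2, prev1):
        # Concatenate the embeddings of the two previous words
        x = torch.cat([self.emb(prev2), self.emb(prev1)], dim=-1)
        z = torch.sigmoid(self.hidden(x))                  # hidden representation
        w = self.output(z)                                 # output projection (word weights)
        return torch.softmax(w, dim=-1)                    # probability distribution
```

In practice one would usually return the raw weights w and train with a cross-entropy loss; the explicit softmax here simply mirrors the formula on the slide.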

  3. LECTURE 1 RECALL (continued)
Sentence classification with a Convolutional Neural Network
1. Convolution: sliding window of fixed size over the input sentence
2. Mean/max pooling over the convolution outputs
3. Multi-layer perceptron
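Similarly, a minimal sketch of the convolutional sentence classifier in PyTorch; the layer sizes, kernel_size=3 and the use of max pooling are assumed values for illustration, not figures from the lecture:

```python
import torch
import torch.nn as nn

class CNNSentenceClassifier(nn.Module):
    """Convolution over word embeddings, max pooling, then a multi-layer perceptron."""
    def __init__(self, vocab_size, n_classes, emb_dim=64, n_filters=100, kernel_size=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size)   # sliding window of fixed size
        self.mlp = nn.Sequential(
            nn.Linear(n_filters, n_filters), nn.ReLU(),
            nn.Linear(n_filters, n_classes),
        )

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        e = self.emb(tokens).transpose(1, 2)     # (batch, emb_dim, seq_len)
        c = torch.relu(self.conv(e))             # one output per window position
        pooled = c.max(dim=2).values             # max pooling over positions
        return self.mlp(pooled)                  # class weights
```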

  4. LECTURE 1 RECALL (continued)
Main issue
➤ These two networks only use local word-order information
➤ No long range dependencies

  5. LONG RANGE DEPENDENCIES
Today: recurrent neural networks
➤ Inputs are fed sequentially
➤ The state representation is updated at each input
(example sentence: "The dog is eating")

  6. LONG RANGE DEPENDENCIES (continued)
Next week: attention networks
➤ Inputs contain position information
➤ At each position, look at any input in the sentence
(example sentence with positions: The_1 dog_2 is_3 eating_4)

  7. RECURRENT NEURAL NETWORK
Recurrent neural network cell
(figure: the cell takes the input x^(n) and the incoming recurrent connection r^(n-1), and produces the output h^(n) and the outgoing recurrent connection r^(n))

  8. RECURRENT NEURAL NETWORK (continued)
Dynamic neural network
➤ All cells share the same parameters
(figure: the same cell unrolled over "The dog is eating", producing h^(1), h^(2), h^(3), h^(4))

  9. LANGUAGE MODEL
Why do we usually make independence assumptions?
➤ Fewer parameters to learn
➤ Less sparsity
Non-neural language model
➤ 1st order Markov chain: p(y_1, ..., y_n) = p(y_1) ∏_{i=2}^{n} p(y_i | y_{i-1})   →   |V| × |V| parameters
➤ 2nd order Markov chain: p(y_1, ..., y_n) = p(y_1) p(y_2 | y_1) ∏_{i=3}^{n} p(y_i | y_{i-1}, y_{i-2})   →   |V| × |V| × |V| parameters
Multi-layer perceptron language model
➤ No sparsity issue thanks to word embeddings
➤ Independence assumption, so no long range dependencies
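To make the parameter counts concrete, a worked example with an assumed vocabulary of |V| = 10,000 words (the lecture gives no specific number):

```latex
% Assumed vocabulary size: |V| = 10^4
% 1st order Markov chain: one distribution over V per previous word
|V| \times |V| = 10^4 \times 10^4 = 10^{8} \text{ parameters}
% 2nd order Markov chain: one distribution over V per pair of previous words
|V| \times |V| \times |V| = (10^4)^3 = 10^{12} \text{ parameters}
```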

  10. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS
p(y_1, ..., y_n) = p(y_1, ..., y_{n-1}) p(y_n | y_1, ..., y_{n-1})
No independence assumption!

  11. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS (continued)
(figure: the RNN reads <BOS> and outputs p(y_1))

  12. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS (continued)
(figure: the RNN reads <BOS>, "The" and outputs p(y_1), p(y_2 | y_1))

  13. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS (continued)
(figure: the RNN reads <BOS>, "The", "dog" and outputs p(y_1), p(y_2 | y_1), p(y_3 | y_1, y_2))

  14. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS (continued)
(figure: the RNN reads <BOS>, "The", "dog", "is" and outputs p(y_1), p(y_2 | y_1), p(y_3 | y_1, y_2), p(y_4 | y_1, y_2, y_3))
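A minimal sketch of such an RNN language model in PyTorch: each input word updates the recurrent state, and the state is projected to a distribution over the next word. The class name, the dimensions and the use of nn.RNNCell (a simple tanh cell, of the kind defined later in the deck) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """At each position i, output the distribution p(y_i | y_1, ..., y_{i-1})."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.RNNCell(emb_dim, hidden_dim)   # same parameters at every position
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        """tokens: (seq_len,) starting with <BOS>; returns one distribution per position."""
        h = torch.zeros(1, self.cell.hidden_size)
        distributions = []
        for t in range(tokens.size(0)):
            h = self.cell(self.emb(tokens[t:t+1]), h)          # read the word, update the state
            distributions.append(torch.softmax(self.output(h), dim=-1))
        return torch.cat(distributions)                        # (seq_len, vocab_size)
```

Feeding <BOS> yields p(y_1), feeding <BOS>, "The" yields p(y_2 | y_1), and so on, matching the figures above.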

  15. SENTENCE CLASSIFICATION
Neural architecture
1. A recurrent neural network (RNN) computes a context-sensitive representation of the sentence
2. A multi-layer perceptron takes this representation as input and outputs class weights

  16. SENTENCE CLASSIFICATION (continued)
(figure, step 1: the RNN reads "The dog is eating" and produces the context-sensitive representation z^(1))

  17. SENTENCE CLASSIFICATION (continued)
(figure, step 2: a multi-layer perceptron on top of z^(1))
MLP hidden layer: z^(2) = σ(U^(1) z^(1) + b^(1))
Output weights: w = U^(2) z^(2) + b^(2)
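A minimal sketch of this classifier in PyTorch, using the last hidden state of the RNN as the context-sensitive representation z^(1) (class name and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class RNNSentenceClassifier(nn.Module):
    """An RNN produces z^(1); an MLP turns it into class weights."""
    def __init__(self, vocab_size, n_classes, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.RNNCell(emb_dim, hidden_dim)
        self.U1 = nn.Linear(hidden_dim, hidden_dim)   # U^(1), b^(1)
        self.U2 = nn.Linear(hidden_dim, n_classes)    # U^(2), b^(2)

    def forward(self, tokens):                        # tokens: (seq_len,)
        h = torch.zeros(1, self.cell.hidden_size)
        for t in range(tokens.size(0)):
            h = self.cell(self.emb(tokens[t:t+1]), h)
        z1 = h                                        # context-sensitive representation
        z2 = torch.sigmoid(self.U1(z1))               # MLP hidden layer
        return self.U2(z2)                            # class weights w
```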

  18. MACHINE TRANSLATION
Neural architecture: encoder-decoder
1. Encoder: a recurrent neural network (RNN) computes a context-sensitive representation of the sentence
2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word
Conditional language model

  19. MACHINE TRANSLATION (continued)
(figure, step 1: the encoder RNN reads "The dog is running" and produces the representation z)

  20. MACHINE TRANSLATION (continued)
(figure, step 2: the decoder RNN starts from z and the begin-of-sentence token <BOS>, and generates "le")

  21. MACHINE TRANSLATION (continued)
(figure, step 2 continued: the decoder reads <BOS>, "le" and generates "le chien")

  22. MACHINE TRANSLATION (continued)
(figure, step 2 continued: the decoder reads <BOS>, "le", "chien" and generates "le chien court")

  23. MACHINE TRANSLATION (continued)
Stop the translation when the end-of-sentence token is generated
(figure: the decoder reads <BOS>, "le", "chien", "court" and generates "le chien court <EOS>")
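A minimal sketch of the encoder-decoder with greedy decoding, in PyTorch; all names, dimensions and the use of nn.RNNCell are illustrative assumptions rather than the lecture's exact formulation:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.RNNCell(emb_dim, hidden_dim)   # first RNN
        self.decoder = nn.RNNCell(emb_dim, hidden_dim)   # a different RNN
        self.output = nn.Linear(hidden_dim, tgt_vocab)

    def translate(self, src_tokens, bos_id, eos_id, max_len=50):
        # 1. Encoder: context-sensitive representation z of the source sentence
        z = torch.zeros(1, self.encoder.hidden_size)
        for t in range(src_tokens.size(0)):
            z = self.encoder(self.src_emb(src_tokens[t:t+1]), z)

        # 2. Decoder: conditional language model, one target word at a time
        h, y, translation = z, torch.tensor([bos_id]), []
        for _ in range(max_len):
            h = self.decoder(self.tgt_emb(y), h)
            y = self.output(h).argmax(dim=-1)            # greedy choice of the next word
            if y.item() == eos_id:                       # stop at the end-of-sentence token
                break
            translation.append(y.item())
        return translation
```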

  24. SIMPLE RECURRENT NEURAL NETWORK

  25. MULTI-LAYER PERCEPTRON RECURRENT NETWORK
Multi-layer perceptron cell
➤ Input: the current word and the previous output
➤ Output: the hidden representation
The recurrent connection is just the output at each position
(figure: the same cell applied at each word of "The dog is eating", producing h^(1), h^(2), h^(3), h^(4))
h^(n) = tanh(U [h^(n-1) ; x^(n)] + b)
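A minimal sketch of this cell implemented directly from the formula, plus the unrolling loop; the dimensions and the random stand-in embeddings are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    """h^(n) = tanh(U [h^(n-1); x^(n)] + b), as on the slide."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim + input_dim, hidden_dim)  # U and b

    def forward(self, h_prev, x):
        return torch.tanh(self.linear(torch.cat([h_prev, x], dim=-1)))

# Unrolling: the same cell (same parameters) is applied at every position.
cell = SimpleRNNCell(input_dim=64, hidden_dim=128)
h = torch.zeros(1, 128)
for x in torch.randn(4, 1, 64):       # stand-in embeddings for "The dog is eating"
    h = cell(h, x)                    # h^(1), h^(2), h^(3), h^(4) in turn
```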

  26. GRADIENT BASED LEARNING PROBLEM
Does it work?
➤ In theory: yes
➤ In practice: no, gradient-based learning of RNNs fails to learn long range dependencies!
(figure: "The dog , I was told by my friend , is ..." with states h^(1), h^(2), h^(3), h^(4), ..., h^(11); it is difficult to propagate influence across the long span between "dog" and "is")

  27. GRADIENT BASED LEARNING PROBLEM (continued)
Deep learning is not a « single tool fits all problems » solution
➤ You need to understand your data and prediction task
➤ You need to understand why a given neural architecture may fail for a given task
➤ You need to be able to design tailored neural architectures for a given task
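A small numerical illustration of the problem (not from the lecture): a hand-rolled tanh RNN with deliberately small random weights, where the gradient of the final state with respect to the first input comes out vastly smaller than with respect to the last input:

```python
import torch

torch.manual_seed(0)
d, T = 16, 50
U = 0.1 * torch.randn(d, 2 * d)                  # small shared recurrent weights
xs = [torch.randn(d, requires_grad=True) for _ in range(T)]

h = torch.zeros(d)
for x in xs:                                     # unroll the simple tanh RNN
    h = torch.tanh(U @ torch.cat([h, x]))

grads = torch.autograd.grad(h.sum(), xs)         # d(final state) / d(each input)
print("gradient norm w.r.t. the first input:", grads[0].norm().item())
print("gradient norm w.r.t. the last input: ", grads[-1].norm().item())
# With these small weights, the gradient through 50 tanh steps shrinks towards 0:
# the influence of early inputs vanishes, so long range dependencies are hard to learn.
# (With large weights the opposite problem, exploding gradients, can occur.)
```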

  28. LONG SHORT-TERM MEMORY NETWORKS

  29. LONG SHORT-TERM MEMORY NETWORKS (LSTM)
Intuition: a memory vector c
➤ The memory vector is passed along the sequence
➤ At each time step, the network selects which cells of the memory to modify
The network can learn to keep track of long distance relationships
LSTM cell
➤ The recurrent connection passes the memory vector to the next cell
(figure: the cell takes the input x and the incoming pair (h, c), and produces the output h and the outgoing pair (h, c))

  30. ERASING/WRITING VALUES IN A VECTOR
Erasing values in the memory
(figure: the memory vector [3.02, −4.11, 21.00, 4.44, −6.9] becomes [0, 0, 21.00, 4.44, −6.9] ⇒ « forget » the first two cells)
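One way to realize this erasure, sketched in PyTorch: multiply the memory elementwise by a gate vector that is 0 for the cells to forget and 1 for the cells to keep. In the real LSTM this gate is computed from the current input and the previous state; here it is hard-coded purely for illustration:

```python
import torch

c = torch.tensor([3.02, -4.11, 21.00, 4.44, -6.9])   # memory vector from the slide
f = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0])          # hard-coded "forget gate" for illustration
print(c * f)    # -> [0, 0, 21.00, 4.44, -6.9]: the first two cells are forgotten
```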
