Recurrent Neural Network Rachel Hu and Zhi Zhang Amazon AI d2l.ai
Outline • Dependent Random Variables • Text Preprocessing • Language Modeling • Recurrent Neural Networks (RNN) • LSTM • Bidirectional RNN • Deep RNN d2l.ai
Dependent Random Variables d2l.ai
Time matters (Koren, 2009) Netflix changed the labels of its rating system (Yehuda Koren, 2009) d2l.ai
Time matters (Koren, 2009) Selection Bias Yehuda Koren, 2009 d2l.ai
Kahneman & Krueger, 2006 d2l.ai
TL;DR - Data usually isn't IID (figure: event-driven spikes such as rating agencies, rate cuts, Q2 earnings, orange hair tweets, Black Friday, Christmas inventory, back to school, Prime Day) d2l.ai
Data
• So far …
  • Collect observation pairs $(x_i, y_i) \sim p(x, y)$ for training
  • Estimate $y \mid x' \sim p(y \mid x')$ for unseen $x' \sim p(x)$
• Examples
  • Image classification & object recognition
  • Disease prediction
  • Housing price prediction
• The order of the data does not matter
d2l.ai
Text Preprocessing d2l.ai
Text Preprocessing
• Sequence data has long-range dependencies (very costly)
• Truncate into shorter fragments
• Transform examples into minibatches of ndarrays
  • Images: (batch size, width, height, channel)
  • Text: (batch size, sentence length)
d2l.ai
Tokenization
• Basic idea: map text into a sequence of tokens
• "Deep learning is fun." -> ["Deep", "learning", "is", "fun", "."]
• Character encoding (each character as a token)
  • Small vocabulary
  • Doesn't work so well (needs to learn spelling)
• Word encoding (each word as a token)
  • Accurate spelling
  • Doesn't work so well (huge vocabulary = costly multinomial)
• Byte Pair Encoding (Goldilocks zone)
  • Frequent subsequences (like syllables)
d2l.ai
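A minimal sketch of the two simplest schemes in plain Python; the helper names (`tokenize_words`, `tokenize_chars`) are illustrative, not part of the d2l library.

```python
def tokenize_words(text):
    # Word-level tokens: accurate spelling, but a huge vocabulary.
    return text.replace(".", " .").split()

def tokenize_chars(text):
    # Character-level tokens: tiny vocabulary, but the model must learn spelling.
    return list(text)

print(tokenize_words("Deep learning is fun."))  # ['Deep', 'learning', 'is', 'fun', '.']
print(tokenize_chars("fun."))                   # ['f', 'u', 'n', '.']
```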
Vocabulary
• Find unique tokens, map each one to a numerical index
• "Deep": 1, "learning": 2, "is": 3, "fun": 4, ".": 5
• The frequency of words often follows a power-law distribution
• Map the tail tokens, e.g. those appearing < 5 times, to a special "unknown" token
d2l.ai
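A minimal vocabulary sketch, assuming index 0 is reserved for the unknown token and that tokens appearing fewer than `min_freq` times are mapped to it; `corpus_tokens` and the function names are illustrative.

```python
from collections import Counter

def build_vocab(tokens, min_freq=5):
    """Map each frequent token to an index; rare tokens share '<unk>' (index 0)."""
    counts = Counter(tokens)
    idx_to_token = ['<unk>'] + [t for t, c in counts.most_common() if c >= min_freq]
    token_to_idx = {t: i for i, t in enumerate(idx_to_token)}
    return token_to_idx, idx_to_token

token_to_idx, idx_to_token = build_vocab(corpus_tokens)    # corpus_tokens: list of tokens
ids = [token_to_idx.get(t, 0) for t in corpus_tokens]      # rare/unseen tokens map to 0
```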
Minibatch Generation d2l.ai
Text Preprocessing Notebook d2l.ai
Language Models d2l.ai
Language Models
• Tokens, not real values (the domain is countably finite)
  $p(w_1, w_2, \ldots, w_T) = p(w_1) \prod_{t=2}^{T} p(w_t \mid w_1, \ldots, w_{t-1})$
• e.g., $p(\text{deep, learning, is, fun, .}) = p(\text{deep})\, p(\text{learning} \mid \text{deep})\, p(\text{is} \mid \text{deep, learning})\, p(\text{fun} \mid \text{deep, learning, is})\, p(\text{.} \mid \text{deep, learning, is, fun})$
• Estimating it (needs smoothing): $\hat{p}(\text{learning} \mid \text{deep}) = \frac{n(\text{deep, learning})}{n(\text{deep})}$
d2l.ai
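A count-based sketch of the estimator above with simple add-one (Laplace) smoothing; `tokens` (the tokenized corpus) and `V` (the vocabulary size) are assumed to come from the preprocessing step, and the function names are illustrative.

```python
from collections import Counter

def bigram_estimator(tokens, V, alpha=1.0):
    """Return p_hat(w2 | w1) = (n(w1, w2) + alpha) / (n(w1) + alpha * V)."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens[:-1], tokens[1:]))
    def p_hat(w2, w1):
        return (bigram[(w1, w2)] + alpha) / (unigram[w1] + alpha * V)
    return p_hat

p_hat = bigram_estimator(tokens, V)
print(p_hat('learning', 'deep'))  # smoothed estimate of p(learning | deep)
```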
Language Modeling
• Goal: predict the probability of a sentence, e.g. p(Deep, learning, is, fun, .)
• NLP fundamental tasks
  • Typing: predict the next word
  • Machine translation: "dog bites man" vs "man bites dog"
  • Speech recognition: "to recognize speech" vs "to wreck a nice beach"
d2l.ai
Language Modeling • NLP fundamental tasks • Named-entity recognition • Part-of-speech tagging • Machine translation • Question answering • Automatic Summarization • … d2l.ai
Recurrent Neural Networks d2l.ai
RNN with Hidden States
• 2-layer MLP (no recurrence): $H_t = \phi(W_{hx} X_{t-1} + b_h)$, $o_t = W_{ho} H_t + b_o$
• RNN hidden state update: $H_t = \phi(W_{hh} H_{t-1} + W_{hx} X_{t-1} + b_h)$
• RNN observation update: $o_t = W_{ho} H_t + b_o$
d2l.ai
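A NumPy sketch of one update, written in row-vector form (inputs and states as rows, weights on the right) and following the slide's convention of feeding $X_{t-1}$ into the update for $H_t$; shapes and parameter names are illustrative.

```python
import numpy as np

def rnn_step(X_prev, H_prev, params):
    """H_t = tanh(H_{t-1} W_hh + X_{t-1} W_hx + b_h);  o_t = H_t W_ho + b_o."""
    W_hh, W_hx, b_h, W_ho, b_o = params
    H = np.tanh(H_prev @ W_hh + X_prev @ W_hx + b_h)  # (batch, hidden)
    o = H @ W_ho + b_o                                # (batch, vocab) scores
    return H, o
```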
Next word prediction d2l.ai
Input Encoding • Need to map input numerical indices to vectors • Pick granularity (words, characters, subwords) • Map to indicator vectors d2l.ai
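A small sketch of the indicator (one-hot) mapping; `vocab_size` and the function name are illustrative.

```python
import numpy as np

def one_hot(indices, vocab_size):
    """Map a list of token indices to one-hot row vectors of length vocab_size."""
    out = np.zeros((len(indices), vocab_size))
    out[np.arange(len(indices)), indices] = 1.0
    return out

X = one_hot([2, 0, 3], vocab_size=5)  # shape (3, 5), a single 1.0 per row
```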
RNN with hidden state mechanics
• Input: vector sequence $x_1, \ldots, x_T \in \mathbb{R}^d$
• Hidden states: $h_1, \ldots, h_T \in \mathbb{R}^h$ where $h_t = f(h_{t-1}, x_t)$
• Output: vector sequence $o_1, \ldots, o_T \in \mathbb{R}^p$ where $o_t = g(h_t)$
• $p$ is the vocabulary size
• $o_{t,j}$ is the confidence score that the $t$-th timestep of the sequence equals the $j$-th token in the vocabulary
• Loss: measures the classification error over the $T$ tokens (see the sketch below)
d2l.ai
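An unrolled forward pass with the average softmax cross-entropy over the T tokens, reusing the illustrative `one_hot` and `rnn_step` helpers sketched above (they are not part of the original slides).

```python
import numpy as np

def rnn_forward_loss(input_ids, target_ids, params, hidden_size, vocab_size):
    """Run the RNN over a sequence and average the cross-entropy loss over T tokens."""
    H = np.zeros((1, hidden_size))
    loss = 0.0
    for x_id, y_id in zip(input_ids, target_ids):
        X = one_hot([x_id], vocab_size)          # current token as an indicator vector
        H, o = rnn_step(X, H, params)            # scores over the vocabulary
        probs = np.exp(o - o.max()) / np.exp(o - o.max()).sum()
        loss -= np.log(probs[0, y_id])           # cross-entropy at this timestep
    return loss / len(input_ids)
```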
Gradient Clipping
• Long chain of dependencies for backprop
• Need to keep a lot of intermediate values in memory
• Butterfly-effect style dependencies
• Gradients can vanish or explode
• Clipping prevents divergence: $g \leftarrow \min\left(1, \frac{\theta}{\|g\|}\right) g$ rescales the gradient to norm at most $\theta$
d2l.ai
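A sketch of the clipping rule applied to a list of parameter gradients; in PyTorch the same rescaling is available as `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=theta)`.

```python
import numpy as np

def clip_gradients(grads, theta):
    """Rescale all gradients jointly so that their global L2 norm is at most theta."""
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm > theta:
        grads = [g * (theta / norm) for g in grads]
    return grads
```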
RNN Notebook d2l.ai
Paying attention to a sequence • Not all observations are equally relevant d2l.ai
Paying attention to a sequence
• Not all observations are equally relevant
• Need a mechanism to pay attention (update gate), e.g. an early observation is highly significant for predicting all future observations. We would like to have some mechanism for storing/updating vital early information in a memory cell.
d2l.ai
Paying attention to a sequence
• Not all observations are equally relevant
• Need a mechanism to forget (reset gate), e.g. there is a logical break between parts of a sequence. For instance, there might be a transition between chapters in a book, or a transition between a bear and a bull market for securities.
d2l.ai
From RNN to GRU
• RNN: $H_t = \phi(W_{hh} H_{t-1} + W_{hx} X_{t-1} + b_h)$, $o_t = W_{ho} H_t + b_o$
• GRU gates: $R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$, $Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$
• GRU candidate hidden state: $\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$
• GRU hidden state: $H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$
• Output (unchanged): $o_t = W_{ho} H_t + b_o$
d2l.ai
GRU - Gates
$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$, $Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$
d2l.ai
GRU - Candidate Hidden State
$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$
d2l.ai
Hidden State
$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$
d2l.ai
Summary
$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$, $Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$
$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$
$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$
d2l.ai
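A NumPy sketch of one GRU step, following the equations above in row-vector form; the parameter dictionary `p` is assumed to hold pre-initialized weights and biases with matching shapes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X, H_prev, p):
    """One GRU update for a minibatch X (batch, inputs) and hidden state H_prev (batch, hidden)."""
    R = sigmoid(X @ p['W_xr'] + H_prev @ p['W_hr'] + p['b_r'])              # reset gate
    Z = sigmoid(X @ p['W_xz'] + H_prev @ p['W_hz'] + p['b_z'])              # update gate
    H_tilde = np.tanh(X @ p['W_xh'] + (R * H_prev) @ p['W_hh'] + p['b_h'])  # candidate state
    return Z * H_prev + (1.0 - Z) * H_tilde                                 # new hidden state
```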
Long Short Term Memory d2l.ai
GRU and LSTM
• GRU:
  $R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$, $Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$
  $\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$
  $H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$
• LSTM:
  $I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i)$, $F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f)$, $O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o)$
  $\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)$
  $C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$
  $H_t = O_t \odot \tanh(C_t)$
d2l.ai
Long Short Term Memory
• Forget gate: decides whether to reset the memory cell values
• Input gate: decides whether we should ignore the input data
• Output gate: decides whether the hidden state is used for the output generated by the LSTM
• Hidden state and memory cell
d2l.ai
Gates
$I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i)$
$F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f)$
$O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o)$
d2l.ai
Candidate Memory Cell
$\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)$
d2l.ai
Memory Cell
$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$
d2l.ai
Hidden State / Output
$H_t = O_t \odot \tanh(C_t)$
d2l.ai
Hidden State / Output
$I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i)$, $F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f)$, $O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o)$
$\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)$
$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$
$H_t = O_t \odot \tanh(C_t)$
d2l.ai
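A NumPy sketch of one LSTM step mirroring the equations above; as before, the parameter dictionary `p` is assumed to hold pre-initialized weights and biases.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X, H_prev, C_prev, p):
    """One LSTM update; X is (batch, inputs), H_prev and C_prev are (batch, hidden)."""
    I = sigmoid(X @ p['W_xi'] + H_prev @ p['W_hi'] + p['b_i'])        # input gate
    F = sigmoid(X @ p['W_xf'] + H_prev @ p['W_hf'] + p['b_f'])        # forget gate
    O = sigmoid(X @ p['W_xo'] + H_prev @ p['W_ho'] + p['b_o'])        # output gate
    C_tilde = np.tanh(X @ p['W_xc'] + H_prev @ p['W_hc'] + p['b_c'])  # candidate memory cell
    C = F * C_prev + I * C_tilde                                      # memory cell
    H = O * np.tanh(C)                                                # hidden state / output
    return H, C
```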
LSTM Notebook d2l.ai
Bidirectional RNNs d2l.ai
The Future Matters
• I am _____ . (happy / hungry)
• I am _____ very hungry, (not)
• I am _____ very hungry, I could eat half a pig. (very)
• Very different words to fill in, depending on the past and future context of a word.
• RNNs so far only look at the past.
• In interpolation (fill in) we can use the future, too.
d2l.ai
Bidirectional RNN • One RNN forward • Another one backward • Combine both hidden states for output generation d2l.ai
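A sketch of the bidirectional idea: run one RNN left-to-right, another right-to-left, and concatenate their hidden states at each timestep. It reuses the illustrative `rnn_step` helper from earlier; in PyTorch the same behavior comes from passing `bidirectional=True` to `nn.RNN`, `nn.GRU`, or `nn.LSTM`.

```python
import numpy as np

def birnn_forward(X_seq, fwd_params, bwd_params, hidden_size):
    """X_seq: list of (1, inputs) arrays. Returns one (1, 2*hidden) state per timestep."""
    H_f, H_b = np.zeros((1, hidden_size)), np.zeros((1, hidden_size))
    fwd, bwd = [], []
    for X in X_seq:                      # forward RNN over time
        H_f, _ = rnn_step(X, H_f, fwd_params)
        fwd.append(H_f)
    for X in reversed(X_seq):            # backward RNN over time
        H_b, _ = rnn_step(X, H_b, bwd_params)
        bwd.append(H_b)
    bwd.reverse()                        # align backward states with forward time order
    return [np.concatenate([f, b], axis=1) for f, b in zip(fwd, bwd)]
```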
Using RNNs
• Question answering, poetry generation, sentiment analysis, named entity tagging, machine translation, document classification
(image courtesy of karpathy.github.io) d2l.ai
Recall: RNN Architecture
• Hidden state update: $H_t = \phi(W_{hh} H_{t-1} + W_{hx} X_{t-1} + b_h)$ (how to make it more nonlinear?)
• Observation update: $o_t = W_{ho} H_t + b_o$
d2l.ai
We go deeper d2l.ai
We go deeper
• Shallow RNN: input, one hidden layer, output
  $H_t = f(H_{t-1}, X_t)$, $O_t = g(H_t)$
• Deep RNN: input, hidden layer, …, hidden layer, output
  $H_t^1 = f_1(H_{t-1}^1, X_t)$
  $H_t^j = f_j(H_{t-1}^j, H_t^{j-1})$
  $O_t = g(H_t^L)$
d2l.ai
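A sketch of stacking L recurrent layers: layer 1 reads the input, each deeper layer reads the hidden state of the layer below at the same timestep, and the output comes from the top layer. `cell` is any single-step function that returns the new hidden state (e.g. the `gru_step` sketch above); all names are illustrative.

```python
import numpy as np

def deep_rnn_forward(X_seq, layer_params, hidden_size, cell):
    """Run L stacked recurrent layers; return the top-layer hidden state at each timestep."""
    L = len(layer_params)
    H = [np.zeros((1, hidden_size)) for _ in range(L)]  # one hidden state per layer
    outputs = []
    for X in X_seq:
        inp = X
        for j in range(L):               # layer j consumes the layer below (or the input)
            H[j] = cell(inp, H[j], layer_params[j])
            inp = H[j]
        outputs.append(H[-1])            # top-layer state drives the output
    return outputs
```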
Summary • Dependent Random Variables • Text Preprocessing • Language Modeling • Recurrent Neural Networks (RNN) • LSTM • Bidirectional RNN • Deep RNN d2l.ai