IBM Model 1 (1993) (visualization of alignment) — f : vector of French words: Le chat est sur la chaise verte; e : vector of English words: The cat is on the green chair; a : vector of alignment indices: 0 1 2 3 4 6 5 Slides courtesy Rebecca Knowles
IBM Model 1 (1993) (same alignment visualization) — t(f_j | e_i) : translation probability of the word f_j given the word e_i Slides courtesy Rebecca Knowles
Model and Parameters — Want: P(f | e). But we don't know how to train this directly… Solution: use P(a, f | e), where a is an alignment. Remember: P(f | e) = Σ_a P(a, f | e), i.e., marginalize over all alignments. Slides courtesy Rebecca Knowles
Model and Parameters: Intuition — Translation prob.: t(f_j | e_i). Interpretation: how probable is it that we see f_j given e_i. Slides courtesy Rebecca Knowles
Model and Parameters: Intuition — Alignment/translation prob.: P(a, f | e). Example (visual representation of a): P("le chat" with one alignment | "the cat") < P("le chat" with another, better alignment | "the cat"). Interpretation: how probable are the alignment a and the translation f (given e). Slides courtesy Rebecca Knowles
Model and Parameters: Intuition — Alignment prob.: P(a | e, f). Example: P(one alignment | "le chat", "the cat") < P(another, better alignment | "le chat", "the cat"). Interpretation: how probable is alignment a (given e and f). Slides courtesy Rebecca Knowles
Model and Parameters How to compute: Slides courtesy Rebecca Knowles
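For reference, the standard IBM Model 1 decomposition (Brown et al., 1993) computes P(a, f | e) as a product of word translation probabilities; here l_e and l_f are the English and French sentence lengths, ε is a normalization constant, and the +1 accounts for a NULL English word:

P(a, f | e) = ε / (l_e + 1)^{l_f} · ∏_{j=1}^{l_f} t(f_j | e_{a_j}),   and   P(f | e) = Σ_a P(a, f | e)

(The simplified in-class version, per the later "All Possible Alignments" slide, additionally assumes each word is aligned exactly once.)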
Parameters For IBM model 1, we can compute all parameters given translation parameters: How many of these are there? Slides courtesy Rebecca Knowles
Parameters For IBM model 1, we can compute all parameters given translation parameters: How many of these are there? | French vocabulary | x | English vocabulary | Slides courtesy Rebecca Knowles
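For example, with hypothetical vocabularies of 50,000 French words and 50,000 English words, that is 50,000 × 50,000 = 2.5 billion t(f | e) parameters.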
Data — Two sentence pairs (English ↔ French): "b c" ↔ "x y" and "b" ↔ "y". Slides courtesy Rebecca Knowles
All Possible Alignments (figure: the possible alignments between French x, y and English b, c for the first pair, and between y and b for the second). Remember: simplifying assumption that each word must be aligned exactly once. Slides courtesy Rebecca Knowles
Expectation Maximization (EM) — a two-step, iterative algorithm. 0. Assume some value for the translation parameters t(f | e) and compute the other parameter values. 1. E-step: count alignments and translations under uncertainty, assuming these parameters (e.g., weighting the different "le chat" / "the cat" alignments by P(a, f | "the cat")). 2. M-step: maximize log-likelihood (update parameters), using the estimated (uncertain) counts. Slides courtesy Rebecca Knowles
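As a concrete illustration, here is a minimal Python sketch of EM for standard (no-NULL) IBM Model 1, run on the toy corpus from the earlier Data slide. Variable names are mine, and it omits the class's exactly-once simplification:

```python
# Minimal IBM Model 1 EM sketch on the toy corpus from the "Data" slide.
# Corpus: ("b c" <-> "x y"), ("b" <-> "y"); English is the conditioning side.
corpus = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]

e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}

# 0. Initialize t(f|e) uniformly.
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for iteration in range(10):
    count = {pair: 0.0 for pair in t}   # expected counts c(f, e)
    total = {e: 0.0 for e in e_vocab}   # expected counts c(e)

    # 1. E-step: collect expected (fractional) counts under the current parameters.
    for es, fs in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)    # normalizer over the English words in this sentence
            for e in es:
                frac = t[(f, e)] / norm          # expected probability that f aligns to e
                count[(f, e)] += frac
                total[e] += frac

    # 2. M-step: re-estimate t(f|e) from the expected counts.
    t = {(f, e): count[(f, e)] / total[e] for (f, e) in t}

for (f, e), p in sorted(t.items()):
    print(f"t({f}|{e}) = {p:.3f}")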
Review of IBM Model 1 & EM Iteratively learned an alignment/translation model from sentence-aligned text (without “gold standard” alignments) Model can now be used for alignment and/or word-level translation We explored a simplified version of this; IBM Model 1 allows more types of alignments Slides courtesy Rebecca Knowles
Why is Model 1 insufficient? Why won’t this produce great translations? Indifferent to order (language model may help?) Translates one word at a time Translates each word in isolation ... Slides courtesy Rebecca Knowles
Uses for Alignments Component of machine translation systems Produce a translation lexicon automatically Cross-lingual projection/extraction of information Supervision for training other models (for example, neural MT systems) Slides courtesy Rebecca Knowles
Evaluating Machine Translation Human evaluations: Test set (source, human reference translations, MT output) Humans judge the quality of MT output (in one of several possible ways) Koehn (2017), http://mt-class.org/jhu/slides/lecture-evaluation.pdf Slides courtesy Rebecca Knowles
Evaluating Machine Translation — Automatic evaluations: test set (source, human reference translations, MT output); aim to mimic (correlate with) human evaluations. Many metrics: TER (Translation Error/Edit Rate), HTER (Human-Targeted Translation Edit Rate), BLEU (Bilingual Evaluation Understudy), METEOR (Metric for Evaluation of Translation with Explicit Ordering). Slides courtesy Rebecca Knowles
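As one concrete example of such a metric, BLEU combines modified n-gram precisions p_n (typically up to n = 4) over the test set with a brevity penalty BP, where c is the total length of the MT output and r the total reference length:

BLEU = BP · exp( Σ_{n=1}^{4} (1/4) · log p_n ),   BP = min(1, exp(1 − r / c))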
Machine Translation Alignment Now Explicitly with fancier IBM models Implicitly/learned jointly with attention in recurrent neural networks (RNNs)
Outline Review: EM for HMMs Machine Translation Alignment Limited Sequence Models Maximum Entropy Markov Models Conditional Random Fields Recurrent Neural Networks Basic Definitions Example in PyTorch
Recall: N-gram to Maxent to Neural Language Models — given some context… w_{i-3} w_{i-2} w_{i-1} … compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i) … predict the next word w_i
Recall: N-gram to Maxent to Neural Language Models — given some context… w_{i-3} w_{i-2} w_{i-1} … compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)) … predict the next word w_i
Hidden Markov Model Representation — p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N-1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1}), with transition probabilities/parameters p(z_i | z_{i-1}) and emission probabilities/parameters p(w_i | z_i). Graph: z_1 → z_2 → z_3 → z_4 → …, each z_i emitting w_i — represent the probabilities and independence assumptions in a graph
A Different Model’s Representation … z 1 z 2 z 3 z 4 w 1 w 2 w 3 w 4 represent the probabilities and independence assumptions in a graph
A Different Model's Representation — p(z_1, z_2, …, z_N | w_1, w_2, …, w_N) = p(z_1 | z_0, w_1) ⋯ p(z_N | z_{N-1}, w_N) = ∏_i p(z_i | z_{i-1}, w_i). Graph: z_1 → z_2 → z_3 → z_4 → …, each z_i conditioned on w_i — represent the probabilities and independence assumptions in a graph
Maximum Entropy Markov Model (MEMM) — p(z_1, z_2, …, z_N | w_1, w_2, …, w_N) = ∏_i p(z_i | z_{i-1}, w_i), with p(z_i | z_{i-1}, w_i) ∝ exp(θ^T f(w_i, z_{i-1}, z_i)). Graph: z_1 → z_2 → z_3 → z_4 → …, each z_i conditioned on w_i — represent the probabilities and independence assumptions in a graph
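A small, hypothetical sketch of that locally normalized distribution (the tag set, features, and weights below are invented for illustration): at each step the scores are renormalized over the candidate tags z_i only, given the previous tag and the current word.

```python
import numpy as np

TAGS = ["Noun", "Verb", "Det"]

def features(w_i, z_prev, z_i):
    """Toy binary feature vector f(w_i, z_{i-1}, z_i); a real MEMM uses many such features."""
    return np.array([
        z_prev == "Det" and z_i == "Noun",    # Det -> Noun transition
        w_i.endswith("s") and z_i == "Verb",  # crude morphology cue
        w_i[0].isupper() and z_i == "Noun",   # capitalization cue
    ], dtype=float)

def memm_local_probs(theta, w_i, z_prev):
    """p(z_i | z_{i-1}, w_i) ∝ exp(theta · f): normalized over the tags at this step only."""
    scores = np.array([theta @ features(w_i, z_prev, z) for z in TAGS])
    scores -= scores.max()                    # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

theta = np.array([2.0, 1.5, 1.0])
print(dict(zip(TAGS, memm_local_probs(theta, "runs", "Det"))))
```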
MEMMs … z 1 z 2 z 3 z 4 w 1 w 2 w 3 w 4 Discriminative: don’t care about generating observed sequence at all Maxent: use features Problem: Label-Bias problem
Label-Bias Problem (figure: a state z_i with its observation w_i) — the incoming probability mass at z_i must sum to 1, and the outgoing probability mass must sum to 1; the model observes w_i but does not generate (explain) it. Take-aways: • the model can learn to ignore observations • the model can get itself stuck on "bad" paths
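A tiny numeric illustration of the second take-away (all numbers hypothetical): because p(z_i | z_{i-1}, w_i) must sum to 1 over z_i, a state with a single outgoing transition assigns that transition probability 1 no matter which word w_i is observed. So a path that reaches such a state with probability 0.4 keeps probability 0.4 even if it explains the next observation terribly, while a competing path whose state has five plausible successors gets its mass split (say 0.6 × 0.2 = 0.12) even when it explains the observation well — the low-entropy path wins for the wrong reason.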
Outline Review: EM for HMMs Machine Translation Alignment Limited Sequence Models Maximum Entropy Markov Models Conditional Random Fields Recurrent Neural Networks Basic Definitions Example in PyTorch
(Linear Chain) Conditional Random Fields … z 1 z 2 z 3 z 4 w 1 w 2 w 3 w 4 … Discriminative: don’t care about generating observed sequence at all Condition on the entire observed word sequence w 1 … w N Maxent: use features Solves the label-bias problem
(Linear Chain) Conditional Random Fields — p(z_1, …, z_N | w_1, …, w_N) ∝ ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N)): the features at every position condition on the entire observed sequence.
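Spelling out the proportionality: the distribution is normalized once, globally, by a partition function that sums over all possible tag sequences — in contrast to the MEMM's per-position normalization, which is what removes the label-bias problem:

p(z_1, …, z_N | w_1, …, w_N) = (1 / Z(w)) ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N)),   Z(w) = Σ_{z'_1, …, z'_N} ∏_i exp(θ^T f(z'_{i-1}, z'_i, w_1, …, w_N))

Z(w) is computed efficiently with the forward algorithm, just as for HMMs.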
CRFs are Very Popular for {POS, NER, other sequence tasks} — p(z_1, …, z_N | w_1, …, w_N) ∝ ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N))
• POS: f(z_{i-1}, z_i, w) = (z_{i-1} == Noun & z_i == Verb & (w_{i-2} in list of adjectives or determiners))
• NER: f_{path p}(z_{i-1}, z_i, w) = (z_{i-1} == Per & z_i == Per & (syntactic path p involving w_i exists))
Can't easily do these with an HMM ➔ conditional models can allow richer features. We'll cover syntactic paths next class.
CRFs can be used in neural networks too: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/contrib/crf/CrfForwardRnnCell and https://pytorch-crf.readthedocs.io/en/stable/
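A minimal usage sketch of the linked pytorch-crf package (the tag count and tensor shapes are made up; consult the package docs for the exact, current interface):

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf

num_tags, seq_len, batch = 5, 4, 2
crf = CRF(num_tags, batch_first=True)

# Emission scores would normally come from an RNN/BiLSTM over w_1..w_N.
emissions = torch.randn(batch, seq_len, num_tags)
tags = torch.randint(num_tags, (batch, seq_len))

log_likelihood = crf(emissions, tags)   # training objective: log p(z | w)
loss = -log_likelihood
best_paths = crf.decode(emissions)      # Viterbi decoding: most likely tag sequences
```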
Conditional vs. Sequence We’ll cover these in 691: Graphical and Statistical Models of Learning CRF Tutorial, Fig 1.2, Sutton & McCallum (2012)
Outline Review: EM for HMMs Machine Translation Alignment Limited Sequence Models Maximum Entropy Markov Models Conditional Random Fields Recurrent Neural Networks Basic Definitions Example in PyTorch
Recall: N-gram to Maxent to Neural Language Models — given some context… w_{i-3} w_{i-2} w_{i-1} … create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1} … combine these representations (matrix-vector product), C = f_θ … compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) … predict the next word w_i
A More Typical View of Recurrent Neural Language Modeling (diagram) — observe the words w_{i-3} w_{i-2} w_{i-1} w_i one at a time; each feeds a hidden state ("cell") h_{i-3} h_{i-2} h_{i-1} h_i; from these hidden states, predict the next word w_{i-2} w_{i-1} w_i w_{i+1}.
A Recurrent Neural Network Cell (diagram) — the input words w_{i-1}, w_i are encoded (matrix U) into the hidden states h_{i-1}, h_i; W carries the hidden state forward from h_{i-1} to h_i; S decodes each hidden state into a prediction of the next word (w_i, w_{i+1}).
A Simple Recurrent Neural Network Cell — h_i = σ(W h_{i-1} + U w_i), where σ(x) = 1 / (1 + exp(−x)), and ŵ_{i+1} = softmax(S h_i).
We must learn the matrices U, S, W. Suggested solution: gradient descent on prediction ability. Problem: they are tied across inputs/timesteps. Good news for you: many toolkits do this automatically.
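A direct numpy transcription of those equations, as a sketch of the math rather than of any particular toolkit's cell (all sizes and the random initialization are arbitrary):

```python
import numpy as np

V, H = 10, 4                     # vocabulary size, hidden size
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (H, V))   # encoding: word -> hidden
W = rng.normal(0, 0.1, (H, H))   # recurrence: previous hidden -> hidden
S = rng.normal(0, 0.1, (V, H))   # decoding: hidden -> scores over the next word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    x = x - x.max()              # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def rnn_step(h_prev, w_onehot):
    """h_i = sigmoid(W h_{i-1} + U w_i);  w-hat_{i+1} = softmax(S h_i)"""
    h = sigmoid(W @ h_prev + U @ w_onehot)
    next_word_dist = softmax(S @ h)
    return h, next_word_dist

h = np.zeros(H)
for word_id in [3, 7, 1]:        # a toy sentence of word ids
    w = np.zeros(V)
    w[word_id] = 1.0
    h, p_next = rnn_step(h, w)
print(p_next.round(3))
```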
Why Is Training RNNs Hard? Conceptually, it can get strange But really getting the gradient just requires many applications of the chain rule for derivatives
Why Is Training RNNs Hard? Conceptually, it can get strange But really getting the gradient just requires many applications of the chain rule for derivatives Vanishing gradients Multiply the same matrices at each timestep ➔ multiply many matrices in the gradients
Why Is Training RNNs Hard? Conceptually, it can get strange, but really getting the gradient just requires many applications of the chain rule for derivatives. Vanishing (and exploding) gradients: we multiply the same matrices at each timestep ➔ multiply many matrices in the gradients. One solution (mainly for the exploding case): clip the gradients to a max value.
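In PyTorch, clipping is a single call between backward() and the optimizer step; a self-contained sketch with a dummy model and loss:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16)           # any model with parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 1, 8)                               # (seq_len, batch, input_size)
output, h_n = model(x)
loss = output.pow(2).mean()                             # dummy loss, just to get gradients

optimizer.zero_grad()
loss.backward()
# Clip: rescale gradients so their global norm is at most 5.0 (guards against exploding gradients).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```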
Outline Review: EM for HMMs Machine Translation Alignment Limited Sequence Models Maximum Entropy Markov Models Conditional Random Fields Recurrent Neural Networks Basic Definitions Example in PyTorch
Natural Language Processing from keras import * from torch import *
Pick Your Toolkit PyTorch Keras Deeplearning4j MxNet TensorFlow Gluon DyNet CNTK Caffe … Comparisons: https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software https://deeplearning4j.org/compare-dl4j-tensorflow-pytorch https://github.com/zer0n/deepframeworks (older---2015)
Defining A Simple RNN in Python (Modified Very Slightly) — http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html (the slides step through the cell's code against the unrolled diagram: the encode step combines each observed input with the previous hidden state to produce h_i, and the decode step maps that to the prediction)
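The code on these slides comes from the linked tutorial (a character-level classifier); the sketch below is a lightly commented reconstruction in the same spirit, not necessarily the slide's exact "modified very slightly" version:

```python
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        # "encode": combine the current input with the previous hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        # "decode": map the combined representation to output scores
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)                 # new hidden state h_i
        output = self.softmax(self.i2o(combined))   # log-probabilities over outputs
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)
```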
Training A Simple RNN in Python (Modified Very Slightly) — http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html The training code uses a negative log-likelihood loss and, for each example: get predictions, eval predictions (compute the loss), compute gradient, perform SGD.
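And a matching training-step sketch, again adapted from the linked tutorial and assuming the RNN class from the previous sketch is in scope; the per-parameter update at the end is plain SGD written out by hand (sizes follow that tutorial's character/category setup):

```python
import torch
import torch.nn as nn

criterion = nn.NLLLoss()        # negative log-likelihood loss
learning_rate = 0.005
rnn = RNN(input_size=57, hidden_size=128, output_size=18)   # 57 characters, 18 categories in the tutorial

def train_step(target_tensor, input_sequence):
    hidden = rnn.init_hidden()
    rnn.zero_grad()

    # get predictions: run the cell over the sequence one step at a time
    for i in range(input_sequence.size(0)):
        output, hidden = rnn(input_sequence[i], hidden)

    # eval predictions: negative log-likelihood of the gold label
    loss = criterion(output, target_tensor)

    # compute gradient
    loss.backward()

    # perform SGD: p <- p - lr * grad
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()

# Toy call with a made-up one-hot input sequence and label.
line = torch.zeros(5, 1, 57)
line[0, 0, 3] = 1.0
target = torch.tensor([2])
output, loss = train_step(target, line)
print(loss)
```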
Another Solution: LSTMs/GRUs — LSTM: Long Short-Term Memory (Hochreiter & Schmidhuber, 1997); GRU: Gated Recurrent Unit (Cho et al., 2014). Basic Ideas: learn to forget (figure: an LSTM cell with a "forget" line and a representation/cell-state line). http://colah.github.io/posts/2015-08-Understanding-LSTMs/
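In practice these gated cells are rarely written by hand; e.g., in PyTorch (a sketch with made-up sizes):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=50, hidden_size=100, num_layers=1)
gru = nn.GRU(input_size=50, hidden_size=100, num_layers=1)

x = torch.randn(7, 1, 50)           # (seq_len, batch, input_size)
outputs, (h_n, c_n) = lstm(x)       # LSTM keeps both a hidden state and a cell ("memory") state
outputs_gru, h_n_gru = gru(x)       # GRU keeps only a hidden state
```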