
Latent Models: Sequence Models Beyond HMMs and Machine Translation Alignment
CMSC 473/673, UMBC

Outline: Review: EM for HMMs; Machine Translation Alignment; Limited Sequence Models: Maximum Entropy Markov Models, Conditional Random Fields


  1. IBM Model 1 (1993). f: vector of French words, "Le chat est sur la chaise verte"; e: vector of English words, "The cat is on the green chair"; a: vector of alignment indices, 0 1 2 3 4 6 5 (visualization of alignment). Slides courtesy Rebecca Knowles

  2. IBM Model 1 (1993). f: vector of French words, "Le chat est sur la chaise verte"; e: vector of English words, "The cat is on the green chair"; a: vector of alignment indices, 0 1 2 3 4 6 5 (visualization of alignment). t(f_j | e_i): translation probability of the word f_j given the word e_i. Slides courtesy Rebecca Knowles

  3. Model and Parameters. Want: P(f | e), but we don't know how to train this directly… Solution: use P(a, f | e), where a is an alignment. Remember: P(f | e) = Σ_a P(a, f | e). Slides courtesy Rebecca Knowles

  4. Model and Parameters: Intuition. Translation probability t(f_j | e_i). Interpretation: how probable is it that we see f_j given e_i. Slides courtesy Rebecca Knowles

  5. Model and Parameters: Intuition. Alignment/translation probability P(a, f | e). Example (visual representation of a): one alignment of "le chat" to "the cat" is less probable than another. Interpretation: how probable are the alignment a and the translation f (given e). Slides courtesy Rebecca Knowles

  6. Model and Parameters: Intuition. Alignment probability P(a | e, f). Example: one alignment given "le chat" and "the cat" is less probable than another. Interpretation: how probable is the alignment a (given e and f). Slides courtesy Rebecca Knowles

  7. Model and Parameters: how to compute these quantities (equations shown on slide). Slides courtesy Rebecca Knowles

  8. Parameters For IBM model 1, we can compute all parameters given translation parameters: How many of these are there? Slides courtesy Rebecca Knowles

  9. Parameters For IBM model 1, we can compute all parameters given translation parameters: How many of these are there? | French vocabulary | x | English vocabulary | Slides courtesy Rebecca Knowles

  10. Data. Two sentence pairs (English / French): "b c" / "x y"; "b" / "y". Slides courtesy Rebecca Knowles

  11. All Possible Alignments (French: x, y; English: b, c): diagrams of the possible alignments between "x y" and "b c", and between "y" and "b". Remember: simplifying assumption that each word must be aligned exactly once. Slides courtesy Rebecca Knowles

  12. Expectation Maximization (EM): a two-step, iterative algorithm. 0. Assume some value for the translation parameters t(f | e) and compute the other parameter values. 1. E-step: count alignments and translations under uncertainty, assuming these parameters (e.g., the fractional counts of each alignment of "le chat" with "the cat"). 2. M-step: maximize the log-likelihood (update the parameters), using the uncertain/estimated counts. Slides courtesy Rebecca Knowles

  13. Review of IBM Model 1 & EM Iteratively learned an alignment/translation model from sentence-aligned text (without “gold standard” alignments) Model can now be used for alignment and/or word-level translation We explored a simplified version of this; IBM Model 1 allows more types of alignments Slides courtesy Rebecca Knowles

  14. Why is Model 1 insufficient? Why won’t this produce great translations? Indifferent to order (language model may help?) Translates one word at a time Translates each word in isolation ... Slides courtesy Rebecca Knowles

  15. Uses for Alignments Component of machine translation systems Produce a translation lexicon automatically Cross-lingual projection/extraction of information Supervision for training other models (for example, neural MT systems) Slides courtesy Rebecca Knowles

  16. Evaluating Machine Translation Human evaluations: Test set (source, human reference translations, MT output) Humans judge the quality of MT output (in one of several possible ways) Koehn (2017), http://mt-class.org/jhu/slides/lecture-evaluation.pdf Slides courtesy Rebecca Knowles

  17. Evaluating Machine Translation. Automatic evaluations: test set (source, human reference translations, MT output); aim to mimic (correlate with) human evaluations. Many metrics: TER (Translation Error/Edit Rate), HTER (Human-Targeted Translation Edit Rate), BLEU (Bilingual Evaluation Understudy), METEOR (Metric for Evaluation of Translation with Explicit Ordering). Slides courtesy Rebecca Knowles
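For a sense of how one of these automatic metrics is used in practice, here is a small sketch with NLTK's sentence-level BLEU. The toy reference/hypothesis pair is made up, and sentence-level BLEU with smoothing is only a rough proxy for the corpus-level score.

```python
# A quick look at one automatic metric (BLEU) via NLTK; a sketch, assuming nltk is installed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "is", "on", "the", "green", "chair"]     # human reference
hypothesis = ["the", "cat", "sits", "on", "the", "green", "chair"]  # MT output

# BLEU compares n-gram overlap between the MT output and the reference(s).
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"sentence-level BLEU: {score:.3f}")
```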

  18. Machine Translation Alignment Now Explicitly with fancier IBM models Implicitly/learned jointly with attention in recurrent neural networks (RNNs)

  19. Outline Review: EM for HMMs Machine Translation Alignment Limited Sequence Models Maximum Entropy Markov Models Conditional Random Fields Recurrent Neural Networks Basic Definitions Example in PyTorch

  20. Recall: N-gram to Maxent to Neural Language Models. Given some context (w_{i-3}, w_{i-2}, w_{i-1}), compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i), and predict the next word w_i.

  21. Recall: N-gram to Maxent to Neural Language Models. Given some context (w_{i-3}, w_{i-2}, w_{i-1}), compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i.

  22. Hidden Markov Model Representation. p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N-1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1}), with emission probabilities/parameters p(w_i | z_i) and transition probabilities/parameters p(z_i | z_{i-1}). (Graph: states z_1 → z_2 → z_3 → z_4 → …, each emitting w_1, w_2, w_3, w_4; represent the probabilities and independence assumptions in a graph.)
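As a quick illustration of that factorization, the sketch below multiplies out transition and emission terms for a toy tag sequence. The probability tables are invented for illustration and are not from the lecture.

```python
# Sketch: the HMM joint probability p(z_1, w_1, ..., z_N, w_N) as a product of
# transition terms p(z_i | z_{i-1}) and emission terms p(w_i | z_i).
# The tables below are made-up numbers purely for illustration.
transition = {("START", "N"): 0.6, ("START", "V"): 0.4,
              ("N", "V"): 0.7, ("N", "N"): 0.3,
              ("V", "N"): 0.8, ("V", "V"): 0.2}
emission = {("N", "cat"): 0.5, ("N", "chair"): 0.5,
            ("V", "sits"): 1.0}

def hmm_joint(tags, words):
    prob = 1.0
    prev = "START"                      # z_0 is a fixed start state
    for z, w in zip(tags, words):
        prob *= transition[(prev, z)] * emission[(z, w)]
        prev = z
    return prob

print(hmm_joint(["N", "V"], ["cat", "sits"]))  # 0.6 * 0.5 * 0.7 * 1.0 = 0.21
```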

  23. A Different Model’s Representation … z 1 z 2 z 3 z 4 w 1 w 2 w 3 w 4 represent the probabilities and independence assumptions in a graph

  24. A Different Model's Representation. p(z_1, z_2, …, z_N | w_1, w_2, …, w_N) = p(z_1 | z_0, w_1) ⋯ p(z_N | z_{N-1}, w_N) = ∏_i p(z_i | z_{i-1}, w_i). (Graph over states z_1 … z_4 and words w_1 … w_4: represent the probabilities and independence assumptions in a graph.)

  25. A Different Model's Representation. p(z_1, z_2, …, z_N | w_1, w_2, …, w_N) = p(z_1 | z_0, w_1) ⋯ p(z_N | z_{N-1}, w_N) = ∏_i p(z_i | z_{i-1}, w_i), where p(z_i | z_{i-1}, w_i) ∝ exp(θ^T f(w_i, z_{i-1}, z_i)). (Graph over states z_1 … z_4 and words w_1 … w_4: represent the probabilities and independence assumptions in a graph.)

  26. Maximum Entropy Markov Model (MEMM). p(z_1, z_2, …, z_N | w_1, w_2, …, w_N) = p(z_1 | z_0, w_1) ⋯ p(z_N | z_{N-1}, w_N) = ∏_i p(z_i | z_{i-1}, w_i), where p(z_i | z_{i-1}, w_i) ∝ exp(θ^T f(w_i, z_{i-1}, z_i)). (Graph over states z_1 … z_4 and words w_1 … w_4: represent the probabilities and independence assumptions in a graph.)

  27. MEMMs (graph over z_1 … z_4 and w_1 … w_4). Discriminative: don't care about generating the observed sequence at all. Maxent: use features. Problem: the label-bias problem.
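Here is a tiny sketch of the MEMM's locally normalized distribution p(z_i | z_{i-1}, w_i) ∝ exp(θ^T f(w_i, z_{i-1}, z_i)). The tag set, feature templates, and weights are invented for illustration; the per-position normalizer Z is exactly what the label-bias discussion below is about.

```python
# Sketch of the MEMM's local distribution p(z_i | z_{i-1}, w_i).
# Feature names and weights here are invented for illustration.
import math

TAGS = ["N", "V"]
theta = {("prev=N", "V"): 1.2, ("word=cat", "N"): 2.0, ("word=sits", "V"): 1.5}

def features(word, prev_tag, tag):
    return [("prev=" + prev_tag, tag), ("word=" + word, tag)]

def memm_local(word, prev_tag):
    scores = {t: math.exp(sum(theta.get(f, 0.0) for f in features(word, prev_tag, t)))
              for t in TAGS}
    Z = sum(scores.values())    # local normalizer: each position's mass must sum to 1
    return {t: s / Z for t, s in scores.items()}

print(memm_local("cat", "N"))
```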

  28. Label-Bias Problem. (Diagram: a state z_i with incoming and outgoing transitions, and its observation w_i.)

  29. Label-Bias Problem. (Diagram: state z_i and observation w_i.) Incoming mass must sum to 1.

  30. Label-Bias Problem. (Diagram: state z_i and observation w_i.) Incoming mass must sum to 1; outgoing mass must sum to 1.

  31. Label-Bias Problem. (Diagram: state z_i and observation w_i.) Incoming mass must sum to 1; outgoing mass must sum to 1; the state observes, but does not generate (explain), the observation. Take-aways: the model can learn to ignore observations, and the model can get itself stuck on "bad" paths.

  32. Outline Review: EM for HMMs Machine Translation Alignment Limited Sequence Models Maximum Entropy Markov Models Conditional Random Fields Recurrent Neural Networks Basic Definitions Example in PyTorch

  33. (Linear Chain) Conditional Random Fields (graph over z_1 … z_4 and w_1 … w_4). Discriminative: don't care about generating the observed sequence at all. Condition on the entire observed word sequence w_1 … w_N. Maxent: use features. Solves the label-bias problem.

  34. (Linear Chain) Conditional Random Fields (graph over z_1 … z_4 and w_1 … w_4). p(z_1, …, z_N | w_1, …, w_N) ∝ ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N))

  35. (Linear Chain) Conditional Random Fields (graph over z_1 … z_4 and w_1 … w_4). p(z_1, …, z_N | w_1, …, w_N) ∝ ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N)): condition on the entire sequence.

  36. CRFs are Very Popular for {POS, NER, other sequence tasks} (graph over z_1 … z_4 and w_1 … w_4). p(z_1, …, z_N | w_1, …, w_N) ∝ ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N)). POS: f(z_{i-1}, z_i, w) = (z_{i-1} == Noun & z_i == Verb & (w_{i-2} in a list of adjectives or determiners))

  37. CRFs are Very Popular for {POS, NER, other sequence tasks} (graph over z_1 … z_4 and w_1 … w_4). p(z_1, …, z_N | w_1, …, w_N) ∝ ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N)). POS: f(z_{i-1}, z_i, w) = (z_{i-1} == Noun & z_i == Verb & (w_{i-2} in a list of adjectives or determiners)). NER: f_{path p}(z_{i-1}, z_i, w) = (z_{i-1} == Per & z_i == Per & (syntactic path p involving w_i exists))

  38. CRFs are Very Popular for {POS, NER, other sequence tasks} (graph over z_1 … z_4 and w_1 … w_4). p(z_1, …, z_N | w_1, …, w_N) ∝ ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N)). POS: f(z_{i-1}, z_i, w) = (z_{i-1} == Noun & z_i == Verb & (w_{i-2} in a list of adjectives or determiners)). NER: f_{path p}(z_{i-1}, z_i, w) = (z_{i-1} == Per & z_i == Per & (syntactic path p involving w_i exists)). Can't easily do these with an HMM ➔ conditional models can allow richer features.

  39. CRFs are Very Popular for {POS, NER, other sequence tasks} (graph over z_1 … z_4 and w_1 … w_4). p(z_1, …, z_N | w_1, …, w_N) ∝ ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N)). POS: f(z_{i-1}, z_i, w) = (z_{i-1} == Noun & z_i == Verb & (w_{i-2} in a list of adjectives or determiners)). NER: f_{path p}(z_{i-1}, z_i, w) = (z_{i-1} == Per & z_i == Per & (syntactic path p involving w_i exists)). Can't easily do these with an HMM ➔ conditional models can allow richer features. We'll cover syntactic paths next class.

  40. CRFs are Very Popular for {POS, NER, other sequence tasks} (graph over z_1 … z_4 and w_1 … w_4). p(z_1, …, z_N | w_1, …, w_N) ∝ ∏_i exp(θ^T f(z_{i-1}, z_i, w_1, …, w_N)). POS: f(z_{i-1}, z_i, w) = (z_{i-1} == Noun & z_i == Verb & (w_{i-2} in a list of adjectives or determiners)). NER: f_{path p}(z_{i-1}, z_i, w) = (z_{i-1} == Per & z_i == Per & (syntactic path p involving w_i exists)). Can't easily do these with an HMM ➔ conditional models can allow richer features. CRFs can be used in neural networks too: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/contrib/crf/CrfForwardRnnCell and https://pytorch-crf.readthedocs.io/en/stable/
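Since the slide points at pytorch-crf, here is a minimal usage sketch under the assumption that the package is installed (`pip install pytorch-crf`). Random emission scores stand in for whatever feature function or neural encoder would sit underneath in a real system.

```python
# Minimal sketch of a linear-chain CRF layer with the pytorch-crf package linked above.
# Random emission scores stand in for a real encoder (maxent features, BiLSTM, ...).
import torch
from torchcrf import CRF

num_tags, batch_size, seq_len = 5, 2, 4
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(batch_size, seq_len, num_tags)   # per-position, per-tag scores
tags = torch.randint(num_tags, (batch_size, seq_len))    # gold label sequences

loss = -crf(emissions, tags)        # negative log-likelihood of the whole sequence
loss.backward()

best_paths = crf.decode(emissions)  # Viterbi decoding: most likely tag sequence per sentence
print(best_paths)
```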

  41. Conditional vs. Sequence We’ll cover these in 691: Graphical and Statistical Models of Learning CRF Tutorial, Fig 1.2, Sutton & McCallum (2012)

  42. Outline Review: EM for HMMs Machine Translation Alignment Limited Sequence Models Maximum Entropy Markov Models Conditional Random Fields Recurrent Neural Networks Basic Definitions Example in PyTorch

  43. Recall: N-gram to Maxent to Neural Language Models. Given some context (w_{i-3}, w_{i-2}, w_{i-1}), create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1}; combine these representations (matrix-vector product) into C = f(…); compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})); and predict the next word w_i.

  44. A More Typical View of Recurrent Neural Language Modeling w i-2 w i-1 w i w i+1 h i-3 h i-2 h i-1 h i w i-3 w i-2 w i-1 w i

  45. A More Typical View of Recurrent Neural Language Modeling w i-2 w i-1 w i w i+1 h i-3 h i-2 h i-1 h i w i-3 w i-2 w i-1 w i observe these words one at a time

  46. A More Typical View of Recurrent Neural Language Modeling predict the next word w i-2 w i-1 w i w i+1 h i-3 h i-2 h i-1 h i w i-3 w i-2 w i-1 w i observe these words one at a time

  47. A More Typical View of Recurrent Neural Language Modeling predict the next word w i-2 w i-1 w i w i+1 h i-3 h i-2 h i-1 h i from these hidden states w i-3 w i-2 w i-1 w i observe these words one at a time

  48. A More Typical View of Recurrent Neural Language Modeling predict the next word “cell” w i-2 w i-1 w i w i+1 h i-3 h i-2 h i-1 h i from these hidden states w i-3 w i-2 w i-1 w i observe these words one at a time

  49. A Recurrent Neural Network Cell w i w i+1 h i-1 h i w i-1 w i

  50. A Recurrent Neural Network Cell w i w i+1 W W h i-1 h i w i-1 w i

  51. A Recurrent Neural Network Cell w i w i+1 W W h i-1 h i U encoding U w i-1 w i

  52. A Recurrent Neural Network Cell w i w i+1 S decoding S W W h i-1 h i U encoding U w i-1 w i

  53. A Simple Recurrent Neural Network Cell (encoding U, recurrence W, decoding S). h_i = σ(W h_{i-1} + U w_i)

  54. A Simple Recurrent Neural Network Cell (encoding U, recurrence W, decoding S). h_i = σ(W h_{i-1} + U w_i), where σ(x) = 1 / (1 + exp(-x))

  55. A Simple Recurrent Neural Network Cell (encoding U, recurrence W, decoding S). h_i = σ(W h_{i-1} + U w_i), where σ(x) = 1 / (1 + exp(-x))

  56. A Simple Recurrent Neural Network Cell (encoding U, recurrence W, decoding S). h_i = σ(W h_{i-1} + U w_i), where σ(x) = 1 / (1 + exp(-x))

  57. A Simple Recurrent Neural Network Cell (encoding U, recurrence W, decoding S). h_i = σ(W h_{i-1} + U w_i), where σ(x) = 1 / (1 + exp(-x)); ŵ_{i+1} = softmax(S h_i)

  58. A Simple Recurrent Neural Network Cell. h_i = σ(W h_{i-1} + U w_i); ŵ_{i+1} = softmax(S h_i). Must learn the matrices U, S, W.

  59. A Simple Recurrent Neural Network Cell. h_i = σ(W h_{i-1} + U w_i); ŵ_{i+1} = softmax(S h_i). Must learn the matrices U, S, W. Suggested solution: gradient descent on prediction ability.

  60. A Simple Recurrent Neural Network Cell. h_i = σ(W h_{i-1} + U w_i); ŵ_{i+1} = softmax(S h_i). Must learn the matrices U, S, W. Suggested solution: gradient descent on prediction ability. Problem: they're tied across inputs/timesteps.

  61. A Simple Recurrent Neural Network Cell. h_i = σ(W h_{i-1} + U w_i); ŵ_{i+1} = softmax(S h_i). Must learn the matrices U, S, W. Suggested solution: gradient descent on prediction ability. Problem: they're tied across inputs/timesteps. Good news for you: many toolkits do this automatically.
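The cell equations above translate almost line-for-line into code. Here is a NumPy sketch with one-hot inputs and randomly initialized U, W, S; the sizes and the toy word-id sequence are illustrative only.

```python
# A direct NumPy transcription of the cell equations above:
#   h_i = sigmoid(W h_{i-1} + U w_i),  w_hat_{i+1} = softmax(S h_i)
import numpy as np

vocab_size, hidden_size = 10, 8
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # encoding (input -> hidden)
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrence (hidden -> hidden)
S = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # decoding (hidden -> vocab)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_step(h_prev, w_onehot):
    h = sigmoid(W @ h_prev + U @ w_onehot)   # same U, W, S reused at every timestep
    w_hat = softmax(S @ h)                   # distribution over the next word
    return h, w_hat

h = np.zeros(hidden_size)
for word_id in [3, 1, 4]:                    # a toy word-id sequence
    x = np.eye(vocab_size)[word_id]
    h, w_hat = rnn_step(h, x)
print(w_hat.round(3))
```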

  62. Why Is Training RNNs Hard? Conceptually, it can get strange But really getting the gradient just requires many applications of the chain rule for derivatives

  63. Why Is Training RNNs Hard? Conceptually, it can get strange But really getting the gradient just requires many applications of the chain rule for derivatives Vanishing gradients Multiply the same matrices at each timestep ➔ multiply many matrices in the gradients

  64. Why Is Training RNNs Hard? Conceptually, it can get strange But really getting the gradient just requires many applications of the chain rule for derivatives Vanishing gradients Multiply the same matrices at each timestep ➔ multiply many matrices in the gradients One solution: clip the gradients to a max value
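Here is a hedged sketch of gradient clipping in PyTorch, applied between `backward()` and the optimizer step; the model, loss, and threshold are placeholders. `clip_grad_value_` caps each gradient component at a max value, which is the variant the slide describes; `clip_grad_norm_` rescales by the total gradient norm and is the other common choice.

```python
# Sketch of gradient clipping in PyTorch, applied between backward() and the optimizer step.
# The model, input, loss, and thresholds here are placeholders for illustration.
import torch
import torch.nn as nn

model = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10, 16)                 # (batch, time, features), dummy input
output, _ = model(x)
loss = output.pow(2).mean()                # stand-in loss just to get gradients flowing

optimizer.zero_grad()
loss.backward()
# Clip: either cap each component's absolute value, or cap the total gradient norm.
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```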

  65. Outline Review: EM for HMMs Machine Translation Alignment Limited Sequence Models Maximum Entropy Markov Models Conditional Random Fields Recurrent Neural Networks Basic Definitions Example in PyTorch

  66. Natural Language Processing from keras import * from torch import *

  67. Pick Your Toolkit PyTorch Keras Deeplearning4j MxNet TensorFlow Gluon DyNet CNTK Caffe … Comparisons: https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software https://deeplearning4j.org/compare-dl4j-tensorflow-pytorch https://github.com/zer0n/deepframeworks (older---2015)

  68. Defining A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html w i-1 w i w i+1 h i-2 h i-1 h i w i-2 w i-1 w i

  69. Defining A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html w i-1 w i w i+1 h i-2 h i-1 h i w i-2 w i-1 w i

  70. Defining A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html w i-1 w i w i+1 h i-2 h i-1 h i w i-2 w i-1 w i

  71. Defining A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

  72. Defining A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html w i-1 w i w i+1 h i-2 h i-1 h i w i-2 w i-1 w i encode

  73. Defining A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html w i+1 w i-1 w i h i-2 h i-1 h i w i-2 w i-1 w i decode
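The code on these slides is shown as screenshots. As a stand-in, here is a sketch in the spirit of the linked tutorial: a cell that concatenates the input with the previous hidden state, encodes that into a new hidden state, and decodes it into log-probabilities over the output classes. Layer names and sizes may differ from the exact slides.

```python
# Sketch of a simple RNN cell in PyTorch, along the lines of the linked
# char_rnn_classification tutorial (details may differ from the slides' screenshots).
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # encode: (input, previous hidden) -> new hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        # decode: (input, previous hidden) -> scores over output classes
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), dim=1)
        hidden = self.i2h(combined)                 # new hidden state h_i
        output = self.softmax(self.i2o(combined))   # log-probabilities over classes
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

# Illustrative sizes (the tutorial uses character one-hots and language categories).
rnn = RNN(input_size=57, hidden_size=128, output_size=18)
output, hidden = rnn(torch.zeros(1, 57), rnn.init_hidden())
```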

  74. Training A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

  75. Training A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html Negative log-likelihood

  76. Training A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html w i+1 w i-1 w i Negative log-likelihood h i-2 h i-1 h i w i-2 w i-1 w i get predictions

  77. Training A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html Negative log-likelihood get predictions eval predictions

  78. Training A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html Negative log-likelihood get predictions eval predictions compute gradient

  79. Training A Simple RNN in Python (Modified Very Slightly) http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html Negative log-likelihood get predictions eval predictions compute gradient perform SGD
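Again the slides show screenshots; below is a hedged sketch of the training step they annotate (negative log-likelihood loss, get predictions, evaluate them, compute the gradient, take an SGD step), assuming the `RNN` cell sketched above and tutorial-style tensor shapes.

```python
# Sketch of one training step for the RNN cell above, matching the slides' annotations.
# `rnn`, `line_tensor` (one input vector per timestep), and `target` are assumed to exist
# in the shapes used by the linked tutorial.
import torch
import torch.nn as nn

criterion = nn.NLLLoss()          # negative log-likelihood
learning_rate = 0.005

def train_step(rnn, line_tensor, target):
    hidden = rnn.init_hidden()
    rnn.zero_grad()
    # get predictions: run the cell over the sequence, one timestep at a time
    for i in range(line_tensor.size(0)):
        output, hidden = rnn(line_tensor[i], hidden)
    loss = criterion(output, target)      # eval predictions against the gold label
    loss.backward()                       # compute gradient
    with torch.no_grad():                 # perform (plain) SGD on every parameter
        for p in rnn.parameters():
            p -= learning_rate * p.grad
    return output, loss.item()
```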

  80. Another Solution: LSTMs/GRUs. LSTM: Long Short-Term Memory (Hochreiter & Schmidhuber, 1997). GRU: Gated Recurrent Unit (Cho et al., 2014). Basic idea: learn to forget (diagram with a forget line and a representation line). http://colah.github.io/posts/2015-08-Understanding-LSTMs/
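In most toolkits, swapping the simple cell for a gated one is a one-line change. A minimal PyTorch sketch follows, with arbitrary sizes; `nn.GRU` is the analogous drop-in alternative.

```python
# Minimal sketch of PyTorch's built-in LSTM (nn.GRU is analogous); sizes are arbitrary.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 10, 16)                  # (batch, time, features), dummy embeddings
output, (h_n, c_n) = lstm(x)                # c_n is the extra memory ("forget line") state
print(output.shape, h_n.shape, c_n.shape)   # per-step outputs, final hidden, final cell
```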
