Recurrent Neural Models: Language Models, and Sequence Prediction and Generation - PowerPoint PPT Presentation


  1. Recurrent Neural Models: Language Models, and Sequence Prediction and Generation CMSC 473/673 Frank Ferraro

  2. WARNING: Neural methods are NOT the only way to do sequence prediction:
  • Structured Perceptron (478/678)
  • Hidden Markov Models (473/673, 678, 691 GSML)
  • Conditional Random Fields (473/673, 678, 691 GSML)
  • (and others)

  3. CRFs are Very Popular for {POS, NER, other sequence tasks}
  [Diagram: label sequence z_1, z_2, z_3, z_4, … over observed words w_1, w_2, w_3, w_4, …]
  p(z_1, …, z_N | x_1, …, x_N) ∝ ∏_i exp(θᵀ f(z_{i-1}, z_i, x_1, …, x_N))
  Conditional models can allow richer features; we can't easily do these with an HMM:
  • POS: f(z_{i-1}, z_i, x) = (z_{i-1} == Noun & z_i == Verb & (x_{i-2} in list of adjectives or determiners))
  • NER: f_path_p(z_{i-1}, z_i, x) = (z_{i-1} == Per & z_i == Per & (syntactic path p involving x_i exists))
  CRFs can be used in neural networks too:
  • https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/contrib/crf/CrfForwardRnnCell
  • https://pytorch-crf.readthedocs.io/en/stable/
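As a concrete illustration, here is a minimal sketch of pairing a recurrent encoder with the pytorch-crf package linked above. The GRU encoder, all dimensions, and the random data are illustrative assumptions, not something fixed by the slides.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

num_tags, emb_dim, hid_dim, vocab_size = 5, 50, 64, 1000

embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
to_emissions = nn.Linear(hid_dim, num_tags)   # per-token tag scores ("emissions")
crf = CRF(num_tags, batch_first=True)

tokens = torch.randint(0, vocab_size, (2, 7))  # batch of 2 sentences, 7 tokens each
gold = torch.randint(0, num_tags, (2, 7))      # gold tag sequences

h, _ = encoder(embed(tokens))                  # contextual representation of each token
emissions = to_emissions(h)

loss = -crf(emissions, gold)                   # negative log-likelihood of the gold tag sequences
best_paths = crf.decode(emissions)             # Viterbi decoding: best tag sequence per sentence
```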

  4. Outline
  • Types of networks
  • Basic cell definition
  • Example in PyTorch

  5. A Note on Graphical Notation
  y: Output (label, sequence of labels, generated text, etc.)
  h
  x: Input (could be BOW, sequence of items, structured input, etc.)

  6. A Note on Graphical Notation
  y: Output (label, sequence of labels, generated text, etc.)
  h: Hidden state/representation
  x: Input (could be BOW, sequence of items, structured input, etc.)

  7. A Note on Graphical Notation
  y: Output (label, sequence of labels, generated text, etc.)
  h: Hidden state/representation
  • h is computed by a neural cell, or factor; this is called the encoder
  x: Input (could be BOW, sequence of items, structured input, etc.)

  8. A Note on Graphical Notation
  y: Output (label, sequence of labels, generated text, etc.)
  • y is predicted/generated from h by another neural cell, or factor; this is called the decoder
  h: Hidden state/representation
  • h is computed by a neural cell, or factor; this is called the encoder
  x: Input (could be BOW, sequence of items, structured input, etc.)

  9. A Note on Graphical Notation
  y: Output (label, sequence of labels, generated text, etc.)
  • y is predicted/generated from h by another neural cell, or factor; this is called the decoder
  h: Hidden state/representation
  • h is computed by a neural cell, or factor; this is called the encoder
  x: Input (could be BOW, sequence of items, structured input, etc.)
  The red arrows indicate parameters to learn.

  10. Five Broad Categories of Neural Networks
  • Single Input, Single Output
  • Single Input, Multiple Outputs
  • Multiple Inputs, Single Output
  • Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
  • Multiple Inputs, Multiple Outputs ("sequence-to-sequence": with time delay)

  11. Five Broad Categories of Neural Networks
  "Single": fixed number of items; "Multiple": variable number of items
  • Single Input, Single Output
  • Single Input, Multiple Outputs
  • Multiple Inputs, Single Output
  • Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
  • Multiple Inputs, Multiple Outputs ("sequence-to-sequence": with time delay)

  12. Network Types: Single Input, Single Output
  [Diagram: x → h → y]
  1. Feed forward
  • Linearizable feature input
  • Bag-of-items classification/regression
  • Basic non-linear model
  We've already seen some instances of this.

  13. Terminology (recall from the maxent slides)
  • Log-Linear Models: common NLP term
  • (Multinomial) logistic regression / Softmax regression: as statistical regression
  • Maximum Entropy models (MaxEnt): based in information theory
  • a form of Generalized Linear Models
  • viewed as Discriminative Naïve Bayes
  • Very shallow (sigmoidal) neural nets: to be cool today :)

  14. Recall: N-gram to Maxent to Neural Language Models
  given some context… w_{i-3}, w_{i-2}, w_{i-1}
  compute beliefs about what is likely…
  p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))
  predict the next word: w_i

  15. Recall: N-gram to Maxent to Neural Language Models
  given some context… w_{i-3}, w_{i-2}, w_{i-1} (this is the input x; there is no learned representation h)
  compute beliefs about what is likely…
  p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))
  predict the next word: w_i (this is the output y)

  16. Recall: N-gram to Maxent to Neural Language Models
  given some context… w_{i-3}, w_{i-2}, w_{i-1}
  create/use "distributed representations"… e_{i-3}, e_{i-2}, e_{i-1} (word embeddings e_w)
  combine these representations (matrix-vector product)… C = f_θ
  compute beliefs about what is likely…
  p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))
  predict the next word: w_i

  17. Recall: N-gram to Maxent to Neural Language Models
  given some context… w_{i-3}, w_{i-2}, w_{i-1} (x)
  create/use "distributed representations"… e_{i-3}, e_{i-2}, e_{i-1} (word embeddings e_w)
  combine these representations (matrix-vector product)… C = f_θ (h)
  compute beliefs about what is likely…
  p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))
  predict the next word: w_i (y)
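To make the pipeline above concrete, here is a minimal sketch of a feed-forward (Bengio-style) trigram neural language model in PyTorch. The embedding/hidden sizes and the tanh nonlinearity are illustrative assumptions, not taken from the slides.

```python
import torch
import torch.nn as nn

class TrigramNeuralLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)     # e_{i-3}, e_{i-2}, e_{i-1}
        self.combine = nn.Linear(3 * emb_dim, hid_dim)     # matrix-vector product C
        self.out = nn.Linear(hid_dim, vocab_size)          # θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})

    def forward(self, context):                            # context: (batch, 3) word ids
        e = self.embed(context).flatten(1)                 # concatenate the three embeddings
        h = torch.tanh(self.combine(e))                    # f(w_{i-3}, w_{i-2}, w_{i-1})
        return self.out(h)                                 # softmax over these scores gives p(w_i | context)

lm = TrigramNeuralLM(vocab_size=1000)
ctx = torch.randint(0, 1000, (8, 3))                       # a batch of 8 trigram contexts
probs = lm(ctx).softmax(dim=-1)                            # p(w_i | w_{i-3}, w_{i-2}, w_{i-1})
```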

  18. Common Types of Single Input, Single Output
  • Feed forward networks
  • Multilayer perceptrons (MLPs)
  General Formulation:
    Input: x
    Compute:
      h_0 = x
      for layer l = 1 to L:
        h_l = f_l(W_l h_{l-1} + b_l)   (linear layer W_l h_{l-1} + b_l, then activation function f_l; h_l is the hidden state at layer l)
      return argmax_y softmax(θ h_L)

  19. Common Types of Single Input, Single Output (build slide: repeats the general formulation above)

  20. Common Types of Single Input, Single Output
  • Feed forward networks
  • Multilayer perceptrons (MLPs)
  General Formulation:
    Input: x
    Compute:
      h_0 = x
      for layer l = 1 to L:
        h_l = f_l(W_l h_{l-1} + b_l)   (linear layer, then activation function f_l; h_l is the hidden state at layer l)
      return argmax_y softmax(θ h_L)
  In PyTorch (torch.nn):
  • Activation functions: https://pytorch.org/docs/stable/nn.html?highlight=activation#non-linear-activations-weighted-sum-nonlinearity
  • Linear layer: https://pytorch.org/docs/stable/nn.html#linear-layers
    torch.nn.Linear(in_features=<dim of h_{l-1}>, out_features=<dim of h_l>, bias=<Boolean: include bias b_l>)
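A minimal sketch of the general formulation above in PyTorch, using torch.nn.Linear and an activation as the slide suggests. The layer sizes, the number of layers, and the ReLU activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        layers = []
        prev = in_dim
        for _ in range(num_layers):
            layers.append(nn.Linear(in_features=prev, out_features=hidden_dim, bias=True))  # W_l h_{l-1} + b_l
            layers.append(nn.ReLU())                                                        # activation function f_l
            prev = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(prev, num_classes)   # θ h_L

    def forward(self, x):
        h = self.hidden(x)     # h_L
        return self.out(h)     # logits; softmax/argmax is applied at prediction time

model = MLP(in_dim=100, hidden_dim=64, num_classes=5)
x = torch.randn(8, 100)                 # batch of 8 bag-of-items feature vectors
pred = model(x).argmax(dim=-1)          # argmax_y softmax(θ h_L)
```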

  21. Network Types: Single Input, Multiple Outputs
  [Diagram: x → h_0 → h_1 → h_2, with outputs y_0, y_1, y_2]
  Recursive: one input, sequence output
  • Label-based generation
  • Automated caption generation
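A minimal sketch of a single-input, multiple-output network: an RNN generator whose initial hidden state is conditioned on the single input (here, a label). The GRUCell, the greedy decoding loop, the BOS/EOS token ids, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LabelConditionedGenerator(nn.Module):
    def __init__(self, num_labels, vocab_size, emb_dim=50, hid_dim=64):
        super().__init__()
        self.label_embed = nn.Embedding(num_labels, hid_dim)   # the single input conditions h_0
        self.word_embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.GRUCell(emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def generate(self, label, bos=1, eos=2, max_len=20):
        h = self.label_embed(torch.tensor([label]))             # h_0 from the label y
        word = torch.tensor([bos])
        generated = []
        for _ in range(max_len):
            h = self.cell(self.word_embed(word), h)             # h_t from h_{t-1} and the last word
            word = self.out(h).argmax(dim=-1)                   # greedy choice of the next output y_t
            if word.item() == eos:
                break
            generated.append(word.item())
        return generated

gen = LabelConditionedGenerator(num_labels=3, vocab_size=1000)
print(gen.generate(label=0))   # token ids for a text conditioned on label 0
```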

  22. Label-Based Generation
  Given a label y, generate an entire text 🗏: model p(🗏 | y)
  argmax_🗏 p(🗏 | y) = argmax_{w_1, …, w_N} p(w_1, …, w_N | y)
  Performing this argmax is difficult, and often requires an approximate search technique called beam search.
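Since the slide above notes that this argmax needs approximate search, here is a minimal, model-agnostic beam-search sketch. The scoring function log_prob_next(label, prefix, word) is a hypothetical stand-in for whatever model provides log p(word | label y, words so far); the toy uniform model at the end only demonstrates the interface.

```python
import math

def beam_search(log_prob_next, label, vocab, max_len=20, beam_size=4, eos="</s>"):
    # Each hypothesis is (total log probability, list of words generated so far).
    beams = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            if prefix and prefix[-1] == eos:        # finished hypotheses carry over unchanged
                candidates.append((score, prefix))
                continue
            for w in vocab:
                candidates.append((score + log_prob_next(label, prefix, w), prefix + [w]))
        # keep only the beam_size highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(prefix and prefix[-1] == eos for _, prefix in beams):
            break
    return max(beams, key=lambda c: c[0])[1]        # approximate argmax over word sequences

# toy usage with a uniform "model", just to show the interface
uniform = lambda label, prefix, w: math.log(1 / 3)
print(beam_search(uniform, label="HAPPY", vocab=["good", "day", "</s>"], max_len=5))
```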

  23. Example: Sentiment-based Tweet Generation
  Given a sentiment label y (e.g., HAPPY, SAD, ANGRY, etc.), generate a tweet that would be expressing that sentiment:
  argmax_🗏 p(🗏 | y) = argmax_{w_1, …, w_N} p(w_1, …, w_N | y)
  Q: Why might you want to do this?
  Q: What ethical aspects should you consider?
  Q: What is the potential harm?

  24. Example: Image Caption Generation
  Show and Tell: A Neural Image Caption Generator (CVPR 2015)
  Slide credit: Arun Mallya

  25. Network Types: Multiple Inputs, Single Output
  [Diagram: inputs x_0, x_1, x_2 → h_0 → h_1 → h_2 → single output y]
  Recursive: sequence input, one output
  • Document classification
  • Action recognition in video (high-level)

  26. Network Types: Multiple Inputs, Single Output
  [Diagram: inputs x_0, x_1, x_2 → h_0 → h_1 → h_2 → single output y]
  Recursive: sequence input, one output
  • Document classification
  • Action recognition in video (high-level)
  Think of this as generalizing using maxent models to build discriminatively trained classifiers:
  p(y | x) = maxent(x, y) ➔ p(y | x) = recurrent_classifier(x, y)
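A minimal sketch of such a multiple-input, single-output recurrent classifier (e.g., document classification). The GRU encoder, all sizes, and the random token ids are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hid_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)   # h_0, h_1, ..., h_T over the inputs
        self.out = nn.Linear(hid_dim, num_classes)              # single output y from the last hidden state

    def forward(self, tokens):
        _, h_last = self.rnn(self.embed(tokens))   # h_last: (1, batch, hid_dim)
        return self.out(h_last.squeeze(0))         # one score vector per document

model = RecurrentClassifier(vocab_size=1000)
docs = torch.randint(0, 1000, (4, 30))             # 4 documents, 30 tokens each
label_scores = model(docs)                         # softmax over these gives p(y | x)
```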

  27. Example: RTE (many options)
  s: Michael Jordan, coach Phil Jackson and the star cast, including Scottie Pippen, took the Chicago Bulls to six National Basketball Association championships.
  z: The Bulls basketball team is based in Chicago.
  Model p(ENTAILED | s, z).
  [Diagram: encode s with h_{s,0} … h_{s,N} over s_0 … s_N, encode z with h_{z,0} … h_{z,M} over z_0 … z_M, then predict y]

  28. Example: RTE (many options) (build slide: same example as above)
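One of the many possible options for the RTE example above: encode the premise s and the hypothesis z with separate recurrent encoders, then classify from the two final hidden states. The GRU encoders, all sizes, and the random token ids are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RTEClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hid_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encode_s = nn.GRU(emb_dim, hid_dim, batch_first=True)   # h_{s,0} ... h_{s,N}
        self.encode_z = nn.GRU(emb_dim, hid_dim, batch_first=True)   # h_{z,0} ... h_{z,M}
        self.out = nn.Linear(2 * hid_dim, num_classes)               # y from both encodings

    def forward(self, s_tokens, z_tokens):
        _, h_s = self.encode_s(self.embed(s_tokens))
        _, h_z = self.encode_z(self.embed(z_tokens))
        pair = torch.cat([h_s.squeeze(0), h_z.squeeze(0)], dim=-1)
        return self.out(pair)                                        # scores for p(ENTAILED | s, z)

model = RTEClassifier(vocab_size=1000)
s = torch.randint(0, 1000, (1, 25))   # premise token ids
z = torch.randint(0, 1000, (1, 9))    # hypothesis token ids
scores = model(s, z)
```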

  29. Reminder! Many (but not all) of these tasks fall into the Multiple Inputs, Single Output regime.
  GLUE: https://gluebenchmark.com/
  SuperGLUE: https://super.gluebenchmark.com/

  30. Network Types: Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
  [Diagram: inputs x_0, x_1, x_2 → h_0 → h_1 → h_2 → outputs y_0, y_1, y_2]
  Recursive: sequence input, sequence output
  • Part of speech tagging
  • Named entity recognition
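A minimal sketch of this "no time delay" setting: one tag y_i is predicted for each input token x_i, as in POS tagging or NER. The GRU, all sizes, and the random data are illustrative assumptions (see the CRF example earlier for adding structured output on top of these per-token scores).

```python
import torch
import torch.nn as nn

vocab_size, num_tags = 1000, 10
embed = nn.Embedding(vocab_size, 50)
rnn = nn.GRU(50, 64, batch_first=True)
tag_scores = nn.Linear(64, num_tags)

tokens = torch.randint(0, vocab_size, (3, 12))   # 3 sentences, 12 tokens each
h, _ = rnn(embed(tokens))                        # one hidden state h_i per input token
logits = tag_scores(h)                           # one score vector per token
tags = logits.argmax(dim=-1)                     # y_0, y_1, ..., y_T: one tag per input position
```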
