  1. Maxent Models (III) & Neural Language Models. CMSC 473/673, UMBC, September 25th, 2017. Some slides adapted from 3SLP.

  2. Recap from last time…

  3. Maximum Entropy Models: a more general language model, argmax_X p(Y | X) ∗ p(X); classify in one go, argmax_X p(X | Y).

  4. Maximum Entropy Models: feature weights (a.k.a. natural parameters, distribution parameters) and feature function(s) (a.k.a. sufficient statistics, "strength" function(s)).

  5. What if you can't find the roots? Follow the derivative. Set t = 0 and pick a starting value θ_t. Until converged: (1) get value y_t = F(θ_t); (2) get derivative g_t = F'(θ_t); (3) get scaling factor ρ_t; (4) set θ_{t+1} = θ_t + ρ_t · g_t; (5) set t += 1. [Plot: F(θ) and the derivative of F w.r.t. θ, with iterates θ_0, θ_1, θ_2, θ_3 approaching the optimum θ*.]

  6. What if you can't find the roots? Follow the derivative (same procedure as above, showing the successive gradients g_0, g_1, g_2 moving θ_0, θ_1, θ_2, θ_3 toward θ*).
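
A minimal sketch of this follow-the-derivative loop (plain gradient ascent). The objective F, its derivative, and the 1/(t+1) step-size schedule are illustrative assumptions rather than anything fixed by the slides:

```python
# Hedged sketch of the loop on slides 5-6: follow the derivative uphill.
# F, F_prime, and the step-size schedule are toy choices for illustration.

def gradient_ascent(F, F_prime, theta0, max_iters=1000, tol=1e-6):
    theta = theta0
    for t in range(max_iters):
        y_t = F(theta)               # 1. get value y_t = F(theta_t)
        g_t = F_prime(theta)         # 2. get derivative g_t = F'(theta_t)
        rho_t = 1.0 / (t + 1)        # 3. get a scaling factor rho_t (one common choice)
        theta = theta + rho_t * g_t  # 4. theta_{t+1} = theta_t + rho_t * g_t
        if abs(g_t) < tol:           # "until converged"
            break
    return theta

# Toy objective with its maximum at theta = 3.
print(gradient_ascent(lambda th: -(th - 3) ** 2, lambda th: -2 * (th - 3), theta0=0.0))
```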

  7. Connections to Other Techniques: Log-Linear Models can be viewed as (multinomial) logistic regression / softmax regression (as statistical regression), Maximum Entropy models (MaxEnt) (based in information theory), a form of Generalized Linear Models, discriminative Naïve Bayes, or very shallow (sigmoidal) neural nets (to be cool today :) ).

  8. https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo

  9. Objective = Full Likelihood?

  10. Objective = Full Likelihood? Differentiating this product could be a pain, and these values can have very small magnitude → underflow.

  11. Logarithms: (0, 1] → (−∞, 0]. Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b). Inverse of exp: log(exp(x)) = x.

  12. Log-Likelihood: wide range of (negative) numbers; sums are more stable. Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b). Differentiating this becomes nicer (even though Z depends on θ).

  13. Log-Likelihood: wide range of (negative) numbers; sums are more stable. Inverse of exp: log(exp(x)) = x, with p(y | x) ∝ exp(θ · f(x, y)). Differentiating this becomes nicer (even though Z depends on θ).

  14. Log-Likelihood (same slide, build step): p(y | x) ∝ exp(θ · f(x, y)); differentiating this becomes nicer (even though Z depends on θ).

  15. Log-Likelihood: wide range of (negative) numbers; sums are more stable. Differentiating this becomes nicer (even though Z depends on θ).
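
To make the numerical point concrete, here is a hedged sketch that evaluates log p(y | x) = θ · f(x, y) − log Z(x) entirely in log space with a log-sum-exp, instead of multiplying small probabilities; the scores are made-up numbers:

```python
import math

def log_prob(scores, idx):
    """log p(label idx) for a log-linear model, where scores[j] = theta . f(x, y_j)."""
    m = max(scores)                                             # shift for stability
    log_Z = m + math.log(sum(math.exp(s - m) for s in scores))  # log of the normalizer
    return scores[idx] - log_Z                                  # theta.f minus log Z

scores = [2.0, -1.0, 0.5]             # toy unnormalized scores for three labels
print(log_prob(scores, 0))            # a well-behaved negative number
print(math.exp(log_prob(scores, 0)))  # the probability itself, if you ever need it
```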

  16. Expectations: number of pieces of candy ∈ {1, 2, 3, 4, 5, 6}, each with probability 1/6: (1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6 = 3.5.

  17. Expectations: same values, but with p(1) = 1/2 and p(2) = … = p(6) = 1/10: (1/2)·1 + (1/10)·2 + (1/10)·3 + (1/10)·4 + (1/10)·5 + (1/10)·6 = 2.5.

  18. Expectations (same skewed-distribution example, expected value 2.5).

  19. Expectations (same skewed-distribution example, expected value 2.5).
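
The same two candy expectations as a quick sketch (probabilities copied from the slides):

```python
# E[X] = sum_x p(x) * x for the two distributions on slides 16-19.
values = [1, 2, 3, 4, 5, 6]
uniform = [1/6] * 6                 # fair die over candy counts
skewed = [1/2] + [1/10] * 5         # half the mass on "1 piece"

def expect(probs, xs):
    return sum(p * x for p, x in zip(probs, xs))

print(expect(uniform, values))      # 3.5
print(expect(skewed, values))       # 2.5
```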

  20. Log-Likelihood Gradient Each component k is the difference between:

  21. Log-Likelihood Gradient. Each component k is the difference between: the total value of feature f_k in the training data

  22. Log-Likelihood Gradient. Each component k is the difference between: the total value of feature f_k in the training data, and the total value the current model p_θ expects for feature f_k.

  23. Log-Likelihood Gradient ("moment matching"). Each component k is the difference between: the total value of feature f_k in the training data, and the total value the current model p_θ expects for feature f_k.
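
A self-contained sketch of that observed-minus-expected ("moment matching") gradient for a toy log-linear classifier; the featurized data, the label set, and the function names are all invented for illustration:

```python
import math

def p_y_given_x(theta, feats_for_x):
    """feats_for_x[y] is the feature vector f(x, y); returns p(y | x) for every y."""
    scores = [sum(t * f for t, f in zip(theta, fy)) for fy in feats_for_x]
    m = max(scores)                                    # stable softmax
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

def log_lik_gradient(theta, data):
    """data is a list of (feats_for_x, gold_y) pairs; returns d(log-lik)/d(theta)."""
    grad = [0.0] * len(theta)
    for feats_for_x, gold_y in data:
        probs = p_y_given_x(theta, feats_for_x)
        for k in range(len(theta)):
            observed = feats_for_x[gold_y][k]          # total value of f_k in the data
            expected = sum(p * fy[k] for p, fy in zip(probs, feats_for_x))
            grad[k] += observed - expected             # "moment matching" difference
    return grad

# Two toy instances, each with two candidate labels and two features.
data = [([[1.0, 0.0], [0.0, 1.0]], 0),
        ([[0.0, 1.0], [1.0, 0.0]], 1)]
print(log_lik_gradient([0.0, 0.0], data))              # [1.0, -1.0] at theta = 0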

  24. https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 6

  25. Log-Likelihood Gradient Derivation

  26. Log-Likelihood Gradient Derivation (Z depends on θ)

  27. Log-Likelihood Gradient Derivation (Z depends on θ)

  28. Log-Likelihood Gradient Derivation (Z depends on θ)

  29. Log-Likelihood Gradient Derivation

  30. Log-Likelihood Gradient Derivation: use the (calculus) chain rule, ∂/∂θ log g(h(θ)) = (∂ log g/∂h) · (∂h(θ)/∂θ).

  31. Log-Likelihood Gradient Derivation: use the (calculus) chain rule, ∂/∂θ log g(h(θ)) = (∂ log g/∂h) · (∂h(θ)/∂θ), where the ∂ log g/∂h term is a scalar (it becomes p(y′ | x_i)) and ∂h(θ)/∂θ is a vector of functions.

  32. Log-Likelihood Gradient Derivation

  33. Log-Likelihood Derivative Derivation: ∂F/∂θ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_{y′} p(y′ | x_i) · f_k(x_i, y′).
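
The step that makes this derivation work is that the derivative of log Z with respect to θ_k is exactly the model's expected value of feature k. Here is a hedged numerical check on a toy three-label, two-feature model (all numbers are assumptions):

```python
import math

feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # toy f(x, y') for three labels y'

def log_Z(theta):
    scores = [sum(t * f for t, f in zip(theta, fy)) for fy in feats]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def expected_features(theta):
    """E_{p_theta}[f_k] for each feature k under the model distribution."""
    scores = [sum(t * f for t, f in zip(theta, fy)) for fy in feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    probs = [e / Z for e in exps]
    return [sum(p * fy[k] for p, fy in zip(probs, feats)) for k in range(len(feats[0]))]

theta, eps = [0.3, -0.2], 1e-5
finite_diff = [
    (log_Z([theta[0] + eps, theta[1]]) - log_Z([theta[0] - eps, theta[1]])) / (2 * eps),
    (log_Z([theta[0], theta[1] + eps]) - log_Z([theta[0], theta[1] - eps])) / (2 * eps),
]
print(finite_diff)                 # numerical d(log Z)/d(theta_k)
print(expected_features(theta))    # model expectation of each feature; should match
```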

  34. Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

  35. Preventing Extreme Values. Naïve Bayes: extreme values are 0 probabilities.

  36. Preventing Extreme Values. Naïve Bayes: extreme values are 0 probabilities. Log-linear models: extreme values are large θ values.

  37. Preventing Extreme Values. Naïve Bayes: extreme values are 0 probabilities. Log-linear models: extreme values are large θ values. The remedy: regularization.

  38. (Squared) L2 Regularization
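
One standard way to write it: maximize F(θ) = log-likelihood(θ) − (λ/2)·‖θ‖², so the gradient just subtracts λθ and pulls weights back toward zero. A tiny hedged sketch (λ and the numbers are placeholders):

```python
def l2_penalty(theta, lam):
    """(lam / 2) * ||theta||^2, the squared L2 regularizer."""
    return 0.5 * lam * sum(t * t for t in theta)

def regularized_gradient(loglik_grad, theta, lam):
    """Gradient of (log-likelihood - penalty): subtract lam * theta_k from each component."""
    return [g - lam * t for g, t in zip(loglik_grad, theta)]

print(l2_penalty([2.0, -3.0], lam=0.1))                          # 0.65
print(regularized_gradient([1.0, -1.0], [2.0, -3.0], lam=0.1))   # [0.8, -0.7]
```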

  39. https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 8

  40. (More on) Connections to Other Machine Learning Techniques

  41. Classification: Discriminative Naïve Bayes. Label/class and observed features: Naïve Bayes.

  42. Classification: Discriminative Naïve Bayes. Label/class and observed features: Naïve Bayes vs. Maxent/Logistic Regression.

  43. Multinomial Logistic Regression

  44. Multinomial Logistic Regression (in one dimension)

  45. Multinomial Logistic Regression

  46. Understanding Conditioning Is this a good language model?

  47. Understanding Conditioning Is this a good language model?

  48. Understanding Conditioning Is this a good language model? (no)

  49. Understanding Conditioning Is this a good posterior classifier? (no)

  50. https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 11

  51. Connections to Other Techniques (recap): Log-Linear Models can be viewed as (multinomial) logistic regression / softmax regression (as statistical regression), Maximum Entropy models (MaxEnt) (based in information theory), a form of Generalized Linear Models, discriminative Naïve Bayes, or very shallow (sigmoidal) neural nets (to be cool today :) ).

  52. Revisiting the SNAP Function: softmax.

  53. Revisiting the SNAP Function: softmax (continued).
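
A small, numerically stable softmax sketch (the max-subtraction trick is standard; the example scores are made up):

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(scores)                          # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

print(softmax([2.0, 1.0, -1.0]))             # nonnegative, sums to 1, order-preserving
```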

  54. N-gram Language Models: given some context w_{i−3}, w_{i−2}, w_{i−1}, predict the next word w_i.

  55. N-gram Language Models: given some context w_{i−3}, w_{i−2}, w_{i−1}, compute beliefs about what is likely, p(w_i | w_{i−3}, w_{i−2}, w_{i−1}) ∝ count(w_{i−3}, w_{i−2}, w_{i−1}, w_i), and predict the next word w_i.

  56. N-gram Language Models (same slide, build step): p(w_i | w_{i−3}, w_{i−2}, w_{i−1}) ∝ count(w_{i−3}, w_{i−2}, w_{i−1}, w_i).
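
A hedged sketch of the count-based estimate above on a tiny invented corpus, with no smoothing (a real n-gram model would need it):

```python
from collections import Counter

# p(w_i | w_{i-3}, w_{i-2}, w_{i-1})  proportional to  count(w_{i-3}, w_{i-2}, w_{i-1}, w_i).
corpus = "the cat sat on the mat the cat sat on the rug".split()   # toy data

four_grams = Counter(tuple(corpus[j:j + 4]) for j in range(len(corpus) - 3))
contexts = Counter(tuple(corpus[j:j + 3]) for j in range(len(corpus) - 3))

def p_next(w3, w2, w1, w):
    denom = contexts[(w3, w2, w1)]
    return four_grams[(w3, w2, w1, w)] / denom if denom else 0.0

print(p_next("sat", "on", "the", "mat"))   # 0.5 in this toy corpus
print(p_next("sat", "on", "the", "rug"))   # 0.5
```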

  57. Maxent Language Models: given some context w_{i−3}, w_{i−2}, w_{i−1}, compute beliefs about what is likely, p(w_i | w_{i−3}, w_{i−2}, w_{i−1}) ∝ softmax(θ · f(w_{i−3}, w_{i−2}, w_{i−1}, w_i)), and predict the next word w_i.

  58. Neural Language Models: given some context w_{i−3}, w_{i−2}, w_{i−1}, can we learn the feature function(s)? Compute beliefs about what is likely, p(w_i | w_{i−3}, w_{i−2}, w_{i−1}) ∝ softmax(θ · f(w_{i−3}, w_{i−2}, w_{i−1}, w_i)) with a learned f, and predict the next word w_i.

  59. Neural Language Models: can we learn the feature function(s) for just the context, and word-specific weights (by type)? Compute beliefs about what is likely, p(w_i | w_{i−3}, w_{i−2}, w_{i−1}) ∝ softmax(θ_{w_i} · f(w_{i−3}, w_{i−2}, w_{i−1})), and predict the next word w_i.

  60. Neural Language Models: given some context w_{i−3}, w_{i−2}, w_{i−1}, create/use "distributed representations" e_w (here e_{i−3}, e_{i−2}, e_{i−1}), compute beliefs about what is likely, p(w_i | w_{i−3}, w_{i−2}, w_{i−1}) ∝ softmax(θ_{w_i} · f(w_{i−3}, w_{i−2}, w_{i−1})), and predict the next word w_i.

  61. Neural Language Models: given some context w_{i−3}, w_{i−2}, w_{i−1}, create/use "distributed representations" e_w (here e_{i−3}, e_{i−2}, e_{i−1}), combine these representations via a matrix-vector product, C = f(…), compute beliefs about what is likely, p(w_i | w_{i−3}, w_{i−2}, w_{i−1}) ∝ softmax(θ_{w_i} · f(w_{i−3}, w_{i−2}, w_{i−1})), and predict the next word w_i.
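
A compact sketch of the whole pipeline on these last slides: look up distributed representations for the context words, combine them with a learned transformation, and score every vocabulary word via a softmax over word-specific weights θ_w. The sizes, the tanh nonlinearity, the random initialization, and the use of NumPy are my assumptions, and no training loop is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 32, 64                  # vocab size, embedding dim, hidden dim (toy sizes)

E = rng.normal(size=(V, d))             # distributed representations e_w
W = rng.normal(size=(h, 3 * d))         # combines the three context embeddings
theta = rng.normal(size=(V, h))         # word-specific output weights theta_{w_i}

def predict_next(context_ids):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) over the whole vocabulary (forward pass only)."""
    x = np.concatenate([E[i] for i in context_ids])   # stack e_{i-3}, e_{i-2}, e_{i-1}
    f = np.tanh(W @ x)                                # learned feature function of the context
    scores = theta @ f                                # theta_w . f for every word w
    scores -= scores.max()                            # stable softmax
    p = np.exp(scores)
    return p / p.sum()

probs = predict_next([17, 42, 7])       # arbitrary toy context word ids
print(probs.shape, round(probs.sum(), 6))             # (1000,) 1.0
```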
