Naïve Bayes, Maxent and Neural Models


Naïve Bayes, Maxent and Neural Models. CMSC 473/673, UMBC. Some slides adapted from 3SLP. Outline: Recap: classification (MAP vs. noisy channel) & evaluation; Naïve Bayes (NB) classification; terminology: bag-of-words; the "naïve" assumption.


  1. We need to score the different combinations. Example document, labeled ATTACK: "Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region."

  2. Score and Combine Our Possibilities: score_1(fatally shot, ATTACK), score_2(seriously wounded, ATTACK), score_3(Shining Path, ATTACK), …, score_k(department, ATTACK) ➔ COMBINE ➔ posterior probability of ATTACK. Are all of these uncorrelated?

  3. Score and Combine Our Possibilities: score_1(fatally shot, ATTACK), score_2(seriously wounded, ATTACK), score_3(Shining Path, ATTACK), … ➔ COMBINE ➔ posterior probability of ATTACK. Q: What are the score and combine functions for Naïve Bayes?

  4. Scoring Our Possibilities: score(doc, ATTACK), where doc is the attack document above, decomposes into score_1(fatally shot, ATTACK), score_2(seriously wounded, ATTACK), score_3(Shining Path, ATTACK), …
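As one possible answer to the question on slide 3, here is a minimal sketch (not from the slides) of the standard Naïve Bayes choice: each score_i is a log conditional probability log p(feature_i | label), and the combine step is a sum of those scores plus the label's log prior. The helper name and the toy parameters below are illustrative only.

    import math

    # Sketch: Naive Bayes scoring. Each score_i is log p(feature_i | label);
    # "combine" is a sum of those scores plus the label's log prior.
    def nb_log_posterior(features, label, log_prior, log_likelihood):
        score = log_prior[label]
        for f in features:
            # crude floor for unseen features, standing in for real smoothing
            score += log_likelihood[label].get(f, math.log(1e-10))
        return score

    # Hypothetical toy parameters, for illustration only.
    log_prior = {"ATTACK": math.log(0.3), "OTHER": math.log(0.7)}
    log_likelihood = {
        "ATTACK": {"fatally shot": math.log(0.05), "Shining Path": math.log(0.02)},
        "OTHER":  {"fatally shot": math.log(0.001), "Shining Path": math.log(0.0005)},
    }

    doc_features = ["fatally shot", "seriously wounded", "Shining Path"]
    best_label = max(log_prior,
                     key=lambda y: nb_log_posterior(doc_features, y, log_prior, log_likelihood))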

  5. https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/BQCdH9 Lesson 1

  6. Maxent Modeling: p(ATTACK | doc) ∝ SNAP( score(doc, ATTACK) ), where doc is the attack document above.

  7. What function… operates on any real number? is never less than 0?

  8. What function… operates on any real number? is never less than 0? f(x) = exp(x)

  9. Maxent Modeling: p(ATTACK | doc) ∝ exp( score(doc, ATTACK) ).

  10. Maxent Modeling: p(ATTACK | doc) ∝ exp( score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + … ).

  11. Maxent Modeling: p(ATTACK | doc) ∝ exp( score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + … ). Learn the scores (but we'll declare what combinations should be looked at).

  12. Maxent Modeling: p(ATTACK | doc) ∝ exp( weight_1 * occurs_1(fatally shot, ATTACK) + weight_2 * occurs_2(seriously wounded, ATTACK) + weight_3 * occurs_3(Shining Path, ATTACK) + … ).

  13. Maxent Modeling: Feature Functions. p(ATTACK | doc) ∝ exp( weight_1 * occurs_1(fatally shot, ATTACK) + weight_2 * occurs_2(seriously wounded, ATTACK) + weight_3 * occurs_3(Shining Path, ATTACK) + … ). Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued. Example: occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise.

  14. More on Feature Functions. Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued. Templated, binary: occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise. Templated, real-valued: occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type). Non-templated, real-valued: occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK). Non-templated, count-valued: ???

  15. More on Feature Functions. Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued. Templated, binary: occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise. Templated, real-valued: occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type). Non-templated, real-valued: occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK). Non-templated, count-valued: occurs(fatally shot, ATTACK) = count(fatally shot | ATTACK).
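A minimal sketch (not from the slides) of how a templated, binary feature function like the one above might look in code: the template (target, type) is instantiated once per (phrase, label) pair, and each instantiation fires 0 or 1. The helper names are illustrative.

    # Instantiate the occurs_{target,type} template for one (phrase, label) pair.
    def make_occurs_feature(target_phrase, event_type):
        def occurs(doc_text, label):
            # binary: fires only if the phrase is present AND the label matches
            return 1 if (target_phrase in doc_text and label == event_type) else 0
        return occurs

    occurs_fatally_shot_attack = make_occurs_feature("fatally shot", "ATTACK")

    doc = ("Three people have been fatally shot, and five people, including a mayor, "
           "were seriously wounded as a result of a Shining Path attack ...")
    assert occurs_fatally_shot_attack(doc, "ATTACK") == 1
    assert occurs_fatally_shot_attack(doc, "KIDNAP") == 0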

  16. Maxent Modeling: p(ATTACK | doc) = (1/Z) exp( weight_1 * applies_1(fatally shot, ATTACK) + weight_2 * applies_2(seriously wounded, ATTACK) + weight_3 * applies_3(Shining Path, ATTACK) + … ). Q: How do we define Z?

  17. Normalization for Classification: Z = Σ_{label x} exp( weight_1 * occurs_1(fatally shot, x) + weight_2 * occurs_2(seriously wounded, x) + weight_3 * occurs_3(Shining Path, x) + … ). Compactly, p(x | y) ∝ exp(θ ⋅ f(x, y)): classify doc y with label x in one go.
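A minimal sketch (not the slides' code) of this normalization for classification: score every label with θ ⋅ f(x, y), exponentiate, and divide by Z summed over the labels. The feature extractor and weights below are illustrative stand-ins.

    import math

    def label_posterior(doc, labels, weights, feature_fn):
        # theta . f(x, y) for every candidate label x
        scores = {x: sum(weights.get(name, 0.0) * value
                         for name, value in feature_fn(x, doc).items())
                  for x in labels}
        z = sum(math.exp(s) for s in scores.values())   # Z for this document
        return {x: math.exp(s) / z for x, s in scores.items()}

    def feature_fn(label, doc):
        # binary "occurs" features keyed by (phrase, label); illustrative only
        phrases = ["fatally shot", "seriously wounded", "Shining Path"]
        return {(p, label): 1.0 for p in phrases if p in doc}

    weights = {("fatally shot", "ATTACK"): 2.0, ("Shining Path", "ATTACK"): 1.5}
    posterior = label_posterior("... fatally shot ... Shining Path ...",
                                ["ATTACK", "KIDNAP", "OTHER"], weights, feature_fn)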

  18. Normalization for Language Model general class-based (X) language model of doc y

  19. Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case

  20. Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case Simplifying assumption: maxent n-grams!
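A sketch of the "maxent n-grams" simplification, under the assumption that it means normalizing over the vocabulary separately for each n-gram context rather than over whole documents. The feature templates, weights, and toy vocabulary are illustrative.

    import math

    def maxent_ngram_prob(word, context, vocab, weights, feats):
        # one Z per context: sum over the vocabulary, not over all documents
        scores = {w: sum(weights.get(f, 0.0) for f in feats(w, context)) for w in vocab}
        z = sum(math.exp(s) for s in scores.values())
        return math.exp(scores[word]) / z

    def feats(word, context):
        # illustrative n-gram feature templates
        return [("bigram", context[-1], word), ("unigram", word)]

    vocab = ["attack", "mayor", "the", "</s>"]
    weights = {("bigram", "shining", "attack"): 1.2, ("unigram", "the"): 0.5}
    p = maxent_ngram_prob("attack", ("a", "shining"), vocab, weights, feats)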

  21. Understanding Conditioning Is this a good language model?

  22. Understanding Conditioning Is this a good language model?

  23. Understanding Conditioning Is this a good language model? (no)

  24. Understanding Conditioning Is this a good posterior classifier? (no)

  25. https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/BQCdH9 Lesson 11

  26. Outline: Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification (terminology: bag-of-words; the "naïve" assumption; training & performance; NB as a language model). Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation). Neural (language) models.

  27. Probabilistic model p_θ(x | y) ➔ objective (given observations).

  28. Objective = Full Likelihood? Differentiating this product could be a pain. These values can have very small magnitude ➔ underflow.

  29. Logarithms: (0, 1] ➔ (-∞, 0]. Products ➔ sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b). Inverse of exp: log(exp(x)) = x.

  30. Log-Likelihood: wide range of (negative) numbers; sums are more stable. Products ➔ sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b). Differentiating this becomes nicer (even though Z depends on θ).

  31. Log-Likelihood: wide range of (negative) numbers; sums are more stable. Inverse of exp: log(exp(x)) = x. Differentiating this becomes nicer (even though Z depends on θ).

  32. Log-Likelihood: wide range of (negative) numbers; sums are more stable. This log-likelihood is our objective, = F(θ).
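A minimal sketch of computing this log-likelihood objective for a maxent classifier, assuming F(θ) = Σ_i [ θ ⋅ f(x_i, y_i) − log Z(y_i) ], where Z(y_i) sums exp(θ ⋅ f(x, y_i)) over all labels x. The feature extractor, weights, and data are illustrative; the log-sum-exp trick keeps log Z numerically stable.

    import math

    def log_likelihood(data, labels, weights, feature_fn):
        total = 0.0
        for gold_label, doc in data:
            # theta . f(x, y) for each candidate label x
            scores = {x: sum(weights.get(name, 0.0) * value
                             for name, value in feature_fn(x, doc).items())
                      for x in labels}
            m = max(scores.values())                      # log-sum-exp trick
            log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
            total += scores[gold_label] - log_z
        return total

    def feature_fn(label, doc):
        # toy word-presence features, illustrative only
        return {("has", w, label): 1.0 for w in doc.split()}

    data = [("ATTACK", "fatally shot"), ("OTHER", "city council meeting")]
    weights = {("has", "fatally", "ATTACK"): 1.0}
    F = log_likelihood(data, ["ATTACK", "OTHER"], weights, feature_fn)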

  33. Outline: Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification (terminology: bag-of-words; the "naïve" assumption; training & performance; NB as a language model). Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation). Neural (language) models.

  34. How will we optimize F( θ )? Calculus

  35. [Plot: the objective F(θ) as a function of θ.]

  36. [Plot: F(θ) vs. θ, with the maximizer θ* marked.]

  37. [Plot: F(θ) and its derivative F'(θ) (the derivative of F with respect to θ), with θ* marked.]

  38. Example: F(x) = -(x-2)^2. Differentiate: F'(x) = -2x + 4. Solve F'(x) = 0: x = 2.

  39. Common Derivative Rules

  40. What if you can't find the roots? Follow the derivative. [Plot: F(θ) and its derivative F'(θ), with θ* marked.]

  41. What if you can't find the roots? Follow the derivative. Set t = 0; pick a starting value θ_t. Until converged: 1. Get value y_t = F(θ_t). [Plot: starting point θ_0 with value y_0.]

  42. … adds step 2. Get derivative g_t = F'(θ_t). [Plot: derivative g_0 at θ_0.]

  43. … adds steps 3. Get scaling factor ρ_t; 4. Set θ_{t+1} = θ_t + ρ_t * g_t; 5. Set t += 1. [Plot: step from θ_0 to θ_1.]

  44. Same algorithm. [Plot: second step, from θ_1 to θ_2, with value y_1 and derivative g_1.]

  45. Same algorithm. [Plot: third step, from θ_2 to θ_3, with values y_2, y_3 and derivative g_2.]

  46. Full algorithm: Set t = 0; pick a starting value θ_t. Until converged: 1. Get value y_t = F(θ_t). 2. Get derivative g_t = F'(θ_t). 3. Get scaling factor ρ_t. 4. Set θ_{t+1} = θ_t + ρ_t * g_t. 5. Set t += 1.
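A minimal sketch (not from the slides) transcribing this loop into code, run on the earlier example F(x) = -(x-2)^2 with F'(x) = -2x + 4. The fixed scaling factor ρ and the convergence test are illustrative choices.

    def gradient_ascent_1d(F, F_prime, theta0, rho=0.1, tol=1e-8, max_iters=10000):
        t, theta = 0, theta0
        while t < max_iters:
            y = F(theta)                  # 1. get value y_t = F(theta_t) (tracked on the slides)
            g = F_prime(theta)            # 2. get derivative g_t = F'(theta_t)
            theta_next = theta + rho * g  # 3.-4. scale the derivative and step uphill
            if abs(theta_next - theta) < tol:   # converged
                return theta_next
            theta, t = theta_next, t + 1  # 5. t += 1
        return theta

    theta_star = gradient_ascent_1d(lambda x: -(x - 2) ** 2, lambda x: -2 * x + 4, theta0=0.0)
    # theta_star ≈ 2, matching the closed-form solution of F'(x) = 0 on slide 38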

  47. Gradient = Multi-variable derivative K-dimensional input K-dimensional output

  48.–53. Gradient Ascent. [Figure-only slides.]

  54. Outline: Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification (terminology: bag-of-words; the "naïve" assumption; training & performance; NB as a language model). Maximum Entropy classifiers (defining the model; defining the objective; learning: optimizing the objective; math: gradient derivation). Neural (language) models.

  55. Expectations: number of pieces of candy, outcomes 1 2 3 4 5 6, each with probability 1/6: 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5.

  56.–58. Expectations: number of pieces of candy, outcomes 1 2 3 4 5 6, with probability 1/2 for 1 and 1/10 for each of 2–6: 1/2·1 + 1/10·2 + 1/10·3 + 1/10·4 + 1/10·5 + 1/10·6 = 2.5.
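A tiny sketch of the expectation computed on these slides: a probability-weighted average of the outcomes. The function name is illustrative.

    def expectation(outcomes, probs):
        # E[X] = sum_i p_i * x_i; probabilities must sum to 1
        assert abs(sum(probs) - 1.0) < 1e-9
        return sum(p * x for p, x in zip(probs, outcomes))

    pieces = [1, 2, 3, 4, 5, 6]
    fair   = [1/6] * 6                                    # ➔ 3.5
    skewed = [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]          # ➔ 2.5
    print(expectation(pieces, fair), expectation(pieces, skewed))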
