
Maxent and Neural Language Models (part 1)
CMSC 473/673, UMBC
Some slides adapted from 3SLP

Outline
Maximum Entropy models
  Defining the model
  Defining the objective
  Learning: Optimizing the objective
  Math: gradient derivation
Neural (language) models


  1. Maxent Modeling
p(ATTACK | x) ∝ exp( weight_1 * applies_1(x, ATTACK) + weight_2 * applies_2(x, ATTACK) + weight_3 * applies_3(x, ATTACK) + … )
Example document x: "Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region."

  2. Maxent Modeling
Same model: K different weights for K different features.

  3. Maxent Modeling
Same model: K different weights for K different features… multiplied and then summed.

  4. Maxent Modeling
p(ATTACK | x) ∝ exp( dot_product(weight_vec, feature_vec(x, ATTACK)) )

  5. Maxent Modeling
p(ATTACK | x) ∝ exp( θᵀ f(x, ATTACK) )

  6. Maxent Modeling
p(ATTACK | x) = (1/Z) exp( θᵀ f(x, ATTACK) )
Q: How do we define Z?

  7. Normalization for Classification
Z = Σ_{label y} exp( θᵀ f(x, y) )
p(y | x) ∝ exp( θᵀ f(x, y) )
classify doc x with label y in one go

  8. Normalization for Classification (long form)
Z = Σ_{label y} exp( weight_1 * applies_1(x, y) + weight_2 * applies_2(x, y) + weight_3 * applies_3(x, y) + … )
p(y | x) ∝ exp( θᵀ f(x, y) )
classify doc x with label y in one go

  9. Core Aspects to Maxent Classifier p(y | x)
• features f(x, y) between x and y that are meaningful;
• weights θ (one per feature) to say how important each feature is; and
• a way to form probabilities from f and θ:
p(y | x) = exp( θᵀ f(x, y) ) / Σ_{y'} exp( θᵀ f(x, y') )
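The three ingredients above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation; the feature template, weights, and example text below are made-up assumptions.

```python
import math

def maxent_prob(theta, feats, x, labels, y):
    """p(y | x) = exp(theta . f(x, y)) / sum over y' of exp(theta . f(x, y'))."""
    def score(label):
        return sum(theta.get(k, 0.0) * v for k, v in feats(x, label).items())
    scores = {lab: score(lab) for lab in labels}
    m = max(scores.values())                       # subtract max for numerical stability
    exps = {lab: math.exp(s - m) for lab, s in scores.items()}
    Z = sum(exps.values())                         # the normalizer Z
    return exps[y] / Z

def feats(x, label):
    # Hypothetical templated binary features: one per (word, label) pair.
    return {(w, label): 1.0 for w in x.split()}

theta = {("shot", "ATTACK"): 2.0, ("wounded", "ATTACK"): 1.5}
p = maxent_prob(theta, feats, "people shot and wounded", ["ATTACK", "OTHER"], "ATTACK")
```

Because Z sums over every label, the probabilities over all labels are guaranteed to sum to 1.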

  10. Outline
Maximum Entropy models
  Defining the model
    1. Defining Appropriate Features
    2. Understanding features in conditional models
  Defining the objective
  Learning: Optimizing the objective
  Math: gradient derivation
Neural (language) models

  11. Defining Appropriate Features in a Maxent Model
Feature functions help extract useful features (characteristics) of the data.
They turn data into numbers.
Features that are not 0 are said to have fired.

  12. Defining Appropriate Features in a Maxent Model
Feature functions help extract useful features (characteristics) of the data.
They turn data into numbers.
Features that are not 0 are said to have fired.
Generally templated.
Often binary-valued (0 or 1), but can be real-valued.

  13. Templated Features
Define a feature f_clue(x, label) for each clue you want to consider.
The feature f_clue fires if the clue applies to / can be found in the (x, label) pair.
A clue is often a target phrase (an n-gram) and a label.

  14. Templated Features
Define a feature f_clue(x, label) for each clue you want to consider.
The feature f_clue fires if the clue applies to / can be found in the (x, label) pair.
A clue is often a target phrase (an n-gram) and a label.
Q: For a classifier p(label | x), are clues that depend only on x useful?

  15. Maxent Modeling: Templated Binary Feature Functions
p(ATTACK | x) ∝ exp( weight_1 * applies_1(x, ATTACK) + weight_2 * applies_2(x, ATTACK) + weight_3 * applies_3(x, ATTACK) + … )
applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise    (binary)

  16. Example of a Templated Binary Feature Function
applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise
Instantiated: applies_{hurt,ATTACK}(x, ATTACK) = 1 if "hurt" occurs in x and ATTACK == ATTACK; 0 otherwise

  17. Example of a Templated Binary Feature Function
applies_{hurt,ATTACK}(x, ATTACK) = 1 if "hurt" occurs in x and ATTACK == ATTACK; 0 otherwise
Q: What does this function check?

  18. Q: If there are V vocab types and L label types:
1. How many features are defined if unigram targets are used?

  19. A1: VL

  20. 2. How many features are defined if bigram targets are used?

  21. A2: V²L

  22. 3. How many features are defined if unigram and bigram targets are used?

  23. A3: (V + V²)L
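The counts above can be verified by enumerating the templates over a toy vocabulary (the vocabulary and labels below are assumptions for illustration):

```python
from itertools import product

V = ["shot", "wounded", "attack"]   # V = 3 vocab types (toy)
L = ["ATTACK", "OTHER"]             # L = 2 label types (toy)

# One feature per (unigram target, label) pair: V * L of them.
unigram_feats = list(product(V, L))
# One feature per (bigram target, label) pair: V^2 * L of them.
bigram_feats = list(product(V, V, L))

assert len(unigram_feats) == len(V) * len(L)              # VL = 6
assert len(bigram_feats) == len(V) ** 2 * len(L)          # V^2 L = 18
both = len(unigram_feats) + len(bigram_feats)             # (V + V^2) L = 24
```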

  24. More on Feature Functions
binary: applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise
Non-templated real-valued: ???
Templated real-valued: ???

  25. More on Feature Functions
binary: applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise
Non-templated real-valued: applies(x, ATTACK) = log p(x | ATTACK)
Templated real-valued: ???

  26. More on Feature Functions
binary: applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise
Non-templated real-valued: applies(x, ATTACK) = log p(x | ATTACK)
Templated real-valued: applies_{target,type}(x, ATTACK) = log p(x | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)

  27. Understanding Conditioning
p(y | x) ∝ count(x)
Q: Is this a good model?

  28. Understanding Conditioning
p(y | x) ∝ exp( θ · f(x) )
Q: Is this a good model?

  29. https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/ Lesson 11

  30. Earlier, I said: Maxent Models as Featureful n-gram Language Models of text x
p(Colorless green ideas sleep furiously | Label) = p(Colorless | Label, <BOS>) * … * p(<EOS> | Label, furiously)
Model each n-gram term with a maxent model:
p(w_i | Label, w_{i−n+1:i−1}) = maxent(Label, w_{i−n+1:i−1}, w_i)
Q: What would this look like?

  31. Language Model with Maxent n-grams
p_n(x | label) = ∏_{i=1}^{N} maxent(label, w_{i−n+1:i−1}, w_i)
where x = w_1 … w_N and each factor is an n-gram term

  32. Language Model with Maxent n-grams
p_n(x | label) = ∏_{i=1}^{N} maxent(label, w_{i−n+1:i−1}, w_i)
= ∏_{i=1}^{N} exp( θᵀ f(label, w_{i−n+1:i−1}, w_i) ) / Σ_{w'} exp( θᵀ f(label, w_{i−n+1:i−1}, w') )

  33. Language Model with Maxent n-grams
p_n(x | label) = ∏_{i=1}^{N} maxent(label, w_{i−n+1:i−1}, w_i)
= ∏_{i=1}^{N} exp( θᵀ f(label, w_{i−n+1:i−1}, w_i) ) / Z(label, w_{i−n+1:i−1})

  34. Language Model with Maxent n-grams
p_n(x | label) = ∏_{i=1}^{N} exp( θᵀ f(label, w_{i−n+1:i−1}, w_i) ) / Z(label, w_{i−n+1:i−1})
Q: Why is this Z a function of the context?
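A sketch of slides 31–34 in Python, assuming a tiny vocabulary and a made-up feature template. Note that the inner sum over the vocabulary is exactly Z(label, history): it must be recomputed for every context, which is the point of the question above.

```python
import math

VOCAB = ["colorless", "green", "ideas", "sleep", "furiously", "<EOS>"]

def feats(label, history, w):
    # Hypothetical feature template: one feature per (label, previous word, word).
    return {(label, history[-1], w): 1.0}

def maxent(theta, label, history, w):
    def score(word):
        return sum(theta.get(k, 0.0) * v for k, v in feats(label, history, word).items())
    Z = sum(math.exp(score(wp)) for wp in VOCAB)   # Z(label, history): one sum per context
    return math.exp(score(w)) / Z

def lm_prob(theta, label, words, n=2):
    padded = ["<BOS>"] + list(words) + ["<EOS>"]
    p = 1.0
    for i in range(1, len(padded)):
        history = padded[max(0, i - n + 1):i]      # the n-gram history w_{i-n+1:i-1}
        p *= maxent(theta, label, history, padded[i])
    return p

# With all-zero weights, every factor is uniform: 1/|VOCAB| per position.
```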

  35. Outline
Maximum Entropy models
  Defining the model
  Defining the objective
  Learning: Optimizing the objective
  Math: gradient derivation
Neural (language) models

  36. p_θ(u | v): probabilistic model
F(θ; u, v): objective

  37. Primary Objective: Likelihood
• Goal: maximize the score your model gives to the training data it observes
• This is called the likelihood of your data
• In classification, this is p(label | x)
• For language modeling, this is p(x | label)

  38. Objective = Full Likelihood? (in LM)
∏_i p_θ(w_i | h_i) ∝ ∏_i exp( θᵀ f(w_i, h_i) )
(assume h_i contains whatever context and n-gram history is necessary)
These values can have very small magnitude ➔ underflow.
Differentiating this product could be a pain.

  39. Logarithms
(0, 1] ➔ (−∞, 0]
Products ➔ Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
Inverse of exp: log(exp(x)) = x
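These identities are easy to sanity-check numerically:

```python
import math

a, b, x = 0.3, 0.7, 2.5
assert math.isclose(math.log(a * b), math.log(a) + math.log(b))   # products -> sums
assert math.isclose(math.log(a / b), math.log(a) - math.log(b))   # quotients -> differences
assert math.isclose(math.log(math.exp(x)), x)                     # inverse of exp
assert math.log(1.0) == 0.0 and math.log(1e-300) < -600           # (0, 1] -> (-inf, 0]
```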

  40. Log-Likelihood (n-gram LM)
log ∏_i p_θ(w_i | h_i) = Σ_i log p_θ(w_i | h_i)
Wide range of (negative) numbers; sums are more stable.
Products ➔ Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
Differentiating this becomes nicer (even though Z depends on θ).

  41. Log-Likelihood (n-gram LM)
log ∏_i p_θ(w_i | h_i) = Σ_i log p_θ(w_i | h_i)
= Σ_i [ θᵀ f(w_i, h_i) − log Z(h_i) ]    (inverse of exp: log(exp(x)) = x)
Differentiating this becomes nicer (even though Z depends on θ).

  42. Log-Likelihood (n-gram LM)
log ∏_i p_θ(w_i | h_i) = Σ_i log p_θ(w_i | h_i)
= Σ_i [ θᵀ f(w_i, h_i) − log Z(h_i) ]
= F(θ)
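The derivation above, as code: the log-likelihood is a sum of θᵀ f(w_i, h_i) − log Z(h_i) terms. The vocabulary, feature template, and data below are illustrative assumptions.

```python
import math

VOCAB = ["a", "b", "c"]

def feats(w, h):
    # Hypothetical template: one feature per (history, word) pair.
    return {(h, w): 1.0}

def log_prob(theta, w, h):
    def score(word):
        return sum(theta.get(k, 0.0) * v for k, v in feats(word, h).items())
    log_Z = math.log(sum(math.exp(score(wp)) for wp in VOCAB))
    return score(w) - log_Z          # theta^T f(w, h) - log Z(h)

def log_likelihood(theta, data):
    # Working in log space turns the underflow-prone product into a stable sum.
    return sum(log_prob(theta, w, h) for w, h in data)
```

With all-zero weights every term is −log |VOCAB|, so the log-likelihood of 4 observations is −4 log 3.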

  43. Outline
Maximum Entropy classifiers
  Defining the model
  Defining the objective
  Learning: Optimizing the objective
  Math: gradient derivation
Neural (language) models

  44. How will we optimize F(θ)? Calculus

  45.–47. [Figures: F(θ) plotted against θ; the optimum θ* is marked, along with F'(θ), the derivative of F with respect to θ.]

  48. What if you can't find the roots? Follow the derivative.

  49.–54. What if you can't find the roots? Follow the derivative.
Set t = 0
Pick a starting value θ_t
Until converged:
1. Get value z_t = F(θ_t)
2. Get derivative g_t = F'(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1
[Figures: successive iterates θ_0, θ_1, θ_2, θ_3 with values z_t and derivatives g_t, climbing toward θ*.]
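The loop above, run on a toy concave objective whose maximum is known; the objective F(θ) = −(θ − 2)² and the fixed step size are assumptions for illustration.

```python
# Toy objective F(theta) = -(theta - 2)^2, maximized at theta* = 2.
def F(theta):
    return -(theta - 2.0) ** 2

def F_prime(theta):
    return -2.0 * (theta - 2.0)

theta = 0.0   # starting value theta_0
rho = 0.1     # fixed scaling factor (step size)
for t in range(100):
    z = F(theta)              # 1. value z_t
    g = F_prime(theta)        # 2. derivative g_t
    theta = theta + rho * g   # 4. theta_{t+1} = theta_t + rho * g_t
```

After 100 steps theta is within floating-point noise of the maximizer 2.0.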

  55. Remember: Common Derivative Rules

  56. Gradient = Multi-variable derivative
K-dimensional input ➔ K-dimensional output

  57.–62. Gradient Ascent
[Figures: successive gradient-ascent steps on a surface; images not recoverable.]

  63. What if you can't find the roots? Follow the gradient.
Set t = 0
Pick a starting value θ_t
Until converged:
1. Get value z_t = F(θ_t)
2. Get gradient g_t = F'(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1
(θ_t and g_t are now K-dimensional vectors)
[Figure: iterates θ_0 … θ_3 approaching θ*.]
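The same loop with θ and g as K-dimensional vectors; the objective below, a concave quadratic F(θ) = −Σ_k (θ_k − c_k)² with known maximizer c, is an assumption for illustration.

```python
c = [1.0, -2.0, 3.0]   # known maximizer of F(theta) = -sum_k (theta_k - c_k)^2

def grad(theta):
    # Gradient of F: -2 * (theta - c), one component per dimension.
    return [-2.0 * (tk - ck) for tk, ck in zip(theta, c)]

theta = [0.0, 0.0, 0.0]   # K = 3 dimensional starting value theta_0
rho = 0.1                 # fixed scaling factor
for t in range(200):
    g = grad(theta)
    theta = [tk + rho * gk for tk, gk in zip(theta, g)]
```

Every component converges to the corresponding entry of c, just as the scalar loop converged to θ*.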

  64. Outline
Maximum Entropy classifiers
  Defining the model
  Defining the objective
  Learning: Optimizing the objective
  Math: gradient derivation
Neural (language) models

  65. Reminder: Expectation of a Random Variable
Number of pieces of candy: 1, 2, 3, 4, 5, 6 (uniform distribution)
𝔼[X] = Σ_x x p(x) = 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5

  66.–67. Reminder: Expectation of a Random Variable
Number of pieces of candy: 1, 2, 3, 4, 5, 6 (skewed distribution)
𝔼[X] = Σ_x x p(x) = 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5

  68. Expectations Depend on a Probability Distribution
Same values, different distribution: the uniform distribution gives 𝔼[X] = 3.5, the skewed one gives 𝔼[X] = 2.5.
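The two distributions above, computed exactly with rational arithmetic:

```python
from fractions import Fraction

def expectation(pmf):
    # E[X] = sum_x x * p(x); it depends on the distribution, not just the values.
    return sum(x * p for x, p in pmf.items())

uniform = {x: Fraction(1, 6) for x in range(1, 7)}
skewed = {1: Fraction(1, 2), **{x: Fraction(1, 10) for x in range(2, 7)}}

assert expectation(uniform) == Fraction(7, 2)   # 3.5
assert expectation(skewed) == Fraction(5, 2)    # 2.5
```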
