We need to score the different combinations.
Example document (label: ATTACK): “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Score and Combine Our Possibilities
score_1(fatally shot, ATTACK)
score_2(seriously wounded, ATTACK)
score_3(Shining Path, ATTACK)
…
score_k(department, ATTACK)
COMBINE ➔ posterior probability of ATTACK
Are all of these uncorrelated?
Score and Combine Our Possibilities
score_1(fatally shot, ATTACK)
score_2(seriously wounded, ATTACK)
score_3(Shining Path, ATTACK)
…
COMBINE ➔ posterior probability of ATTACK
Q: What are the score and combine functions for Naïve Bayes?
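For Naïve Bayes specifically, each score is a log conditional probability and COMBINE is just a sum (plus the log prior), followed by taking the argmax or normalizing. A minimal sketch (the toy probabilities and function names here are our own, not from the slides):

```python
import math
from collections import defaultdict

def nb_log_posterior(features, label, log_prior, log_likelihood):
    """Unnormalized log posterior: log p(label) + sum_i log p(feature_i | label)."""
    return log_prior[label] + sum(log_likelihood[(f, label)] for f in features)

# Toy numbers, purely for illustration; unseen (feature, label) pairs get a
# small smoothed probability.
log_prior = {"ATTACK": math.log(0.5), "OTHER": math.log(0.5)}
log_likelihood = defaultdict(lambda: math.log(1e-3))
log_likelihood[("fatally shot", "ATTACK")] = math.log(0.2)
log_likelihood[("Shining Path", "ATTACK")] = math.log(0.1)

features = ["fatally shot", "Shining Path"]
scores = {y: nb_log_posterior(features, y, log_prior, log_likelihood)
          for y in ("ATTACK", "OTHER")}
best = max(scores, key=scores.get)
```

So for NB, score_i(feature, label) = log p(feature | label), and COMBINE = exponentiate the sum (and normalize if a posterior probability is wanted).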
Scoring Our Possibilities
score(doc, ATTACK) =
  score_1(fatally shot, ATTACK)
  score_2(seriously wounded, ATTACK)
  score_3(Shining Path, ATTACK)
  …
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/BQCdH9 Lesson 1
Maxent Modeling
p(ATTACK | doc) ∝ SNAP( score(doc, ATTACK) )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
What function…
…operates on any real number?
…is never less than 0?
f(x) = exp(x)
Maxent Modeling
p(ATTACK | doc) ∝ exp( score(doc, ATTACK) )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Maxent Modeling
p(ATTACK | doc) ∝ exp( score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Maxent Modeling
p(ATTACK | doc) ∝ exp( score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Learn the scores (but we’ll declare what combinations should be looked at).
Maxent Modeling
p(ATTACK | doc) ∝ exp( weight_1 * occurs_1(fatally shot, ATTACK) + weight_2 * occurs_2(seriously wounded, ATTACK) + weight_3 * occurs_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Maxent Modeling: Feature Functions
p(ATTACK | doc) ∝ exp( weight_1 * occurs_1(fatally shot, ATTACK) + weight_2 * occurs_2(seriously wounded, ATTACK) + weight_3 * occurs_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued.
occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise
More on Feature Functions
Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued.
Templated, binary: occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise
Templated, real-valued: occurs_{target,type}(fatally shot, ATTACK) = log q(fatally shot | ATTACK) + log q(type | ATTACK) + log q(ATTACK | type)
Non-templated, real-valued: occurs(fatally shot, ATTACK) = log q(fatally shot | ATTACK)
Non-templated, count-valued: occurs(fatally shot, ATTACK) = count(fatally shot | ATTACK)
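A templated binary feature function can be sketched as a closure over the template slots (a minimal hypothetical version; the function names and the toy document string are ours):

```python
def make_binary_feature(trigger, label):
    """Templated binary feature: returns 1 iff `trigger` occurs in the doc
    and the candidate label matches `label`, else 0."""
    def occurs(doc, candidate_label):
        return 1 if (trigger in doc and candidate_label == label) else 0
    return occurs

# Instantiating the same template for two (trigger, label) pairs:
f1 = make_binary_feature("fatally shot", "ATTACK")
f2 = make_binary_feature("Shining Path", "ATTACK")

doc = "Three people have been fatally shot in a Shining Path attack"
```

Each instantiation of the template is one feature with its own learned weight; this is what “generally templated” means in practice.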
Maxent Modeling
p(ATTACK | doc) = (1/Z) exp( weight_1 * applies_1(fatally shot, ATTACK) + weight_2 * applies_2(seriously wounded, ATTACK) + weight_3 * applies_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Q: How do we define Z?
Normalization for Classification
Z = Σ_x exp( weight_1 * occurs_1(fatally shot, x) + weight_2 * occurs_2(seriously wounded, x) + weight_3 * occurs_3(Shining Path, x) + … )
(sum over all labels x)
p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go
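Computing Z by summing over labels can be sketched as follows (a minimal toy version; the weights, features, and document here are our own illustration, not from the slides):

```python
import math

def maxent_posterior(doc, labels, weighted_features):
    """p(x | y) = exp(sum_i w_i * f_i(doc, x)) / Z, with Z = sum over labels x."""
    scores = {x: sum(w * f(doc, x) for w, f in weighted_features)
              for x in labels}
    Z = sum(math.exp(s) for s in scores.values())          # normalizer
    return {x: math.exp(scores[x]) / Z for x in labels}

# Toy binary features that fire only for the ATTACK label.
weighted_features = [
    (1.5, lambda doc, x: 1 if "fatally shot" in doc and x == "ATTACK" else 0),
    (0.8, lambda doc, x: 1 if "Shining Path" in doc and x == "ATTACK" else 0),
]
doc = "Three people have been fatally shot in a Shining Path attack"
post = maxent_posterior(doc, ["ATTACK", "OTHER"], weighted_features)
```

Because Z sums the exponentiated score over every label, the resulting values are non-negative and sum to 1: a proper posterior distribution.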
Normalization for Language Model
General class-based (X) language model of doc y.
Can be significantly harder in the general case.
Simplifying assumption: maxent n-grams!
Understanding Conditioning
Is this a good language model? (no)
Understanding Conditioning Is this a good posterior classifier? (no)
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/BQCdH9 Lesson 11
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification
  Terminology: bag-of-words
  “Naïve” assumption
  Training & performance
  NB as a language model
Maximum Entropy classifiers
  Defining the model
  Defining the objective
  Learning: optimizing the objective
  Math: gradient derivation
Neural (language) models
p_θ(x | y): probabilistic model ➔ objective (given observations)
Objective = Full Likelihood?
Differentiating this product could be a pain.
These values can have very small magnitude ➔ underflow.
Logarithms: (0, 1] ➔ (−∞, 0]
Products ➔ Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
Inverse of exp: log(exp(x)) = x
Log-Likelihood
Wide range of (negative) numbers; sums are more stable.
Products ➔ Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
Inverse of exp: log(exp(x)) = x
Differentiating this becomes nicer (even though Z depends on θ).
Call this objective F(θ).
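The log-likelihood itself appears on the slides only as an image. Reconstructed in standard maxent notation, matching p(x | y) ∝ exp(θ · f(x, y)) from the classification slide, with i indexing training examples (the exact symbols on the original slide may differ):

```latex
F(\theta) \;=\; \sum_{i} \log p_\theta(x_i \mid y_i)
         \;=\; \sum_{i} \Bigl( \theta \cdot f(x_i, y_i) \;-\; \log Z(y_i; \theta) \Bigr)
```

The product of probabilities has become a sum of dot products minus log-normalizers, which is both numerically stable and convenient to differentiate.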
Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words “Naïve” assumption Training & performance NB as a language Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models
How will we optimize F( θ )? Calculus
[Figure: a concave curve F(θ) plotted against θ, peaking at the maximizer θ*; F′(θ) denotes the derivative of F with respect to θ.]
Example: F(x) = −(x − 2)²
Differentiate: F′(x) = −2x + 4
Solve F′(x) = 0: x = 2
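A quick numeric check of the worked example (our own illustration): the derivative vanishes at x = 2, and that point beats its neighbors.

```python
def F(x):
    """The example objective from the slide: F(x) = -(x - 2)^2."""
    return -(x - 2) ** 2

def F_prime(x):
    """Its derivative: F'(x) = -2x + 4."""
    return -2 * x + 4

# The root of F' is the maximizer: F'(2) = 0 and F(2) = 0 is the peak value.
```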
Common Derivative Rules
What if you can’t find the roots? Follow the derivative.
Set t = 0.
Pick a starting value θ_t.
Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F′(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t * g_t
  5. Set t += 1
[Figure: iterates θ_0, θ_1, θ_2, θ_3 climbing the curve F(θ) toward θ*, with values y_t and derivatives g_t marked at each step.]
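The loop above translates almost line for line into code. A minimal sketch (the fixed step size rho and the convergence test are our simplifications; the slides leave ρ_t as a general scaling factor):

```python
def gradient_ascent(F_prime, theta0, rho=0.1, tol=1e-8, max_iters=10_000):
    """Follow the derivative uphill until the update is negligibly small."""
    theta = theta0
    for _ in range(max_iters):
        g = F_prime(theta)           # step 2: derivative at the current point
        new_theta = theta + rho * g  # step 4: move in the derivative's direction
        if abs(new_theta - theta) < tol:
            break                    # "until converged"
        theta = new_theta
    return theta

# On the running example F(x) = -(x - 2)^2, with F'(x) = -2x + 4:
theta_star = gradient_ascent(lambda x: -2 * x + 4, theta0=0.0)
```

Starting from θ_0 = 0, the iterates climb toward the known maximizer x = 2 found analytically on the previous slide.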
Gradient = multi-variable derivative
K-dimensional input ➔ K-dimensional output
Gradient Ascent
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification
  Terminology: bag-of-words
  “Naïve” assumption
  Training & performance
  NB as a language model
Maximum Entropy classifiers
  Defining the model
  Defining the objective
  Learning: optimizing the objective
  Math: gradient derivation
Neural (language) models
Expectations
Number of pieces of candy: 1, 2, 3, 4, 5, 6 (uniform)
(1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6 = 3.5
Expectations
Number of pieces of candy: 1, 2, 3, 4, 5, 6 (skewed: p(1) = 1/2, p(2) = … = p(6) = 1/10)
(1/2)·1 + (1/10)·2 + (1/10)·3 + (1/10)·4 + (1/10)·5 + (1/10)·6 = 2.5
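Both candy examples are instances of the same probability-weighted sum. A minimal sketch (function name is ours), using exact fractions so the arithmetic matches the slides exactly:

```python
from fractions import Fraction

def expectation(dist):
    """E[X] = sum_x p(x) * x for a discrete distribution given as {x: p(x)}."""
    return sum(p * x for x, p in dist.items())

# Fair die: every outcome 1..6 has probability 1/6.
fair = {x: Fraction(1, 6) for x in range(1, 7)}

# Skewed die: outcome 1 has probability 1/2, the rest 1/10 each.
skewed = {1: Fraction(1, 2), **{x: Fraction(1, 10) for x in range(2, 7)}}
```

With the fair distribution the expectation is 3.5; shifting half the mass onto the outcome 1 drags it down to 2.5, matching the slides.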