Maxent Modeling

Example document x: "Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region."

p(ATTACK | x) ∝ exp( weight_1 * applies_1(x, ATTACK) +
                     weight_2 * applies_2(x, ATTACK) +
                     weight_3 * applies_3(x, ATTACK) + … )

K different weights for K different features, multiplied and then summed: the dot product of weight_vec and feature_vec(x, ATTACK), i.e.

p(ATTACK | x) ∝ exp(θᵀ f(x, ATTACK))

Normalizing turns this into a probability:

p(ATTACK | x) = (1/Z) exp(θᵀ f(x, ATTACK))

Q: How do we define Z?
Normalization for Classification

Z = Σ_{label y} exp(θᵀ f(x, y))

p(y | x) ∝ exp(θᵀ f(x, y)) — classify doc x with label y in one go
Normalization for Classification (long form)

Z = Σ_{label y} exp( weight_1 * applies_1(x, y) +
                     weight_2 * applies_2(x, y) +
                     weight_3 * applies_3(x, y) + … )

p(y | x) ∝ exp(θᵀ f(x, y)) — classify doc x with label y in one go
Core Aspects of a Maxent Classifier p(y | x)
• features f(x, y) between x and y that are meaningful;
• weights θ (one per feature) to say how important each feature is; and
• a way to form probabilities from f and θ:

p(y | x) = exp(θᵀ f(x, y)) / Σ_{y'} exp(θᵀ f(x, y'))
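To make the three ingredients concrete, here is a minimal sketch (mine, not the lecture's code) of the probability formula above; the feature function and label set are hypothetical stand-ins:

```python
# Minimal sketch of p(y | x) = exp(theta . f(x, y)) / sum_y' exp(theta . f(x, y')),
# assuming a hypothetical feature function that returns numpy arrays.
import numpy as np

def maxent_probs(theta, feature_fn, x, labels):
    """Return {label: p(label | x)} under a maxent model."""
    # One score per candidate label: the dot product theta . f(x, y).
    scores = np.array([theta @ feature_fn(x, y) for y in labels])
    scores -= scores.max()          # subtract max before exp, for stability
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum()   # divide by Z to normalize
    return dict(zip(labels, probs))

# Toy demo with a hypothetical 2-feature function.
feature_fn = lambda x, y: np.array([float("attack" in x), float(y == "ATTACK")])
theta = np.array([2.0, 1.0])
print(maxent_probs(theta, feature_fn, ["an", "attack", "occurred"], ["ATTACK", "OTHER"]))
```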
Outline

Maximum Entropy models
• Defining the model
    1. Defining appropriate features
    2. Understanding features in conditional models
• Defining the objective
• Learning: optimizing the objective
• Math: gradient derivation

Neural (language) models
Defining Appropriate Features in a Maxent Model

Feature functions help extract useful features (characteristics) of the data
• They turn data into numbers
• Features that are not 0 are said to have "fired"
• Generally templated
• Often binary-valued (0 or 1), but can be real-valued
Templated Features

Define a feature f_clue(x, label) for each clue you want to consider
• The feature f_clue fires if the clue applies to / can be found in the (x, label) pair
• A clue is often a target phrase (an n-gram) paired with a label

Q: For a classifier p(label | x), are clues that depend only on x useful?
Maxent Modeling: Templated Binary Feature Functions

p(ATTACK | x) ∝ exp( weight_1 * applies_1(x, ATTACK) +
                     weight_2 * applies_2(x, ATTACK) +
                     weight_3 * applies_3(x, ATTACK) + … )

where each feature comes from a template, e.g.

applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise    (binary)
Examples of Templated Binary Feature Functions

applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise

Instantiating the template with target = "hurt" and type = ATTACK:

applies_{hurt,ATTACK}(x, ATTACK) = 1 if "hurt" occurs in x and ATTACK == ATTACK; 0 otherwise

Q: What does this function check? (The label test ATTACK == ATTACK is trivially true here, so the function only checks whether "hurt" occurs in x.)

Q: If there are V vocab types and L label types:
1. How many features are defined if unigram targets are used? A1: V·L
2. How many features are defined if bigram targets are used? A2: V²·L
3. How many features are defined if unigram and bigram targets are used? A3: (V + V²)·L
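A short sketch of how such a template expands into V·L concrete features; this is my own illustration, with a toy vocabulary and label set rather than anything from the lecture:

```python
# Sketch: expand the (target, label) template into one binary feature
# per (word, label) pair, giving V * L features for unigram targets.
def make_unigram_features(vocab, labels):
    def make_fn(target, label):
        # Fires (returns 1) iff target occurs in x AND the candidate
        # label y matches this template instance's label.
        return lambda x, y: int(target in x and y == label)
    return {(w, l): make_fn(w, l) for w in vocab for l in labels}

vocab = {"hurt", "mayor", "attack"}
labels = {"ATTACK", "NOT_ATTACK"}
feats = make_unigram_features(vocab, labels)        # 3 * 2 = 6 features
x = "five people were hurt in the attack".split()
print(feats[("hurt", "ATTACK")](x, "ATTACK"))       # 1: clue present, label matches
print(feats[("hurt", "ATTACK")](x, "NOT_ATTACK"))   # 0: label mismatch
```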
More on Feature Functions

Binary:
applies_{target,type}(x, ATTACK) = 1 if target occurs in x and type == ATTACK; 0 otherwise

Non-templated, real-valued:
applies(x, ATTACK) = log p(x | ATTACK)

Templated, real-valued:
applies_{target,type}(x, ATTACK) = log p(x | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)
Understanding Conditioning

p(y | x) ∝ count(x)

Q: Is this a good model?
Understanding Conditioning

p(y | x) ∝ exp(θ · f(x))

Q: Is this a good model?
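A quick sketch of why the answer is no: if the score depends only on x, normalization over the labels cancels it, leaving a uniform distribution. The numbers below are arbitrary illustrative values:

```python
# Sketch: a score exp(theta . f(x)) that ignores the label y gives every
# label the same probability once we normalize over labels.
import numpy as np

theta = np.array([0.5, -1.2])
f_x = np.array([3.0, 1.0])       # features of x only; no dependence on y
labels = ["ATTACK", "NOT_ATTACK", "OTHER"]
scores = np.array([theta @ f_x for _ in labels])   # identical for every y
probs = np.exp(scores) / np.exp(scores).sum()
print(probs)   # [1/3, 1/3, 1/3] -- the model cannot prefer any label
```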
https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/ Lesson 11
Earlier, I said: Maxent Models as Featureful n-gram Language Models of text x

p(Colorless green ideas sleep furiously | Label) =
    p(Colorless | Label, <BOS>) * … * p(<EOS> | Label, furiously)

Model each n-gram term with a maxent model:

p(x_i | Label, x_{i−n+1:i−1}) = maxent(Label, x_{i−n+1:i−1}, x_i)

Q: What would this look like?
Language Model with Maxent n-grams

p_θ(x | label) = ∏_{i=1}^{N} maxent(label, x_{i−n+1:i−1}, x_i)    ← one maxent per n-gram

              = ∏_{i=1}^{N} exp(θᵀ f(label, x_{i−n+1:i−1}, x_i)) / Σ_{x'} exp(θᵀ f(label, x_{i−n+1:i−1}, x'))

              = ∏_{i=1}^{N} exp(θᵀ f(label, x_{i−n+1:i−1}, x_i)) / Z(label, x_{i−n+1:i−1})

Q: Why is this Z a function of the context?
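A sketch of the product above, with a hypothetical feature function f(label, history, word) of my own; note how Z is recomputed inside the loop for each (label, history) pair, which is exactly why it is a function of the context:

```python
# Sketch of a maxent n-gram LM; f, theta, and the toy vocab are assumptions.
import numpy as np

def ngram_prob(theta, f, label, history, word, vocab):
    """p(word | label, history) with a context-dependent normalizer."""
    num = np.exp(theta @ f(label, history, word))
    Z = sum(np.exp(theta @ f(label, history, w)) for w in vocab)  # Z(label, history)
    return num / Z

def sentence_prob(theta, f, label, tokens, vocab, n=2):
    """p(x | label) = product over i of maxent n-gram terms."""
    p = 1.0
    padded = ["<BOS>"] * (n - 1) + tokens + ["<EOS>"]
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        p *= ngram_prob(theta, f, label, history, padded[i], vocab)
    return p

# Toy demo with a hypothetical 2-feature function.
vocab = ["attack", "mayor", "<EOS>"]
f = lambda label, hist, w: np.array([float(w == "attack"), float(label == "ATTACK")])
theta = np.array([1.0, 0.5])
print(sentence_prob(theta, f, "ATTACK", ["attack", "mayor"], vocab))
```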
Outline

Maximum Entropy models
• Defining the model
• Defining the objective
• Learning: optimizing the objective
• Math: gradient derivation

Neural (language) models
p_θ(u | v): probabilistic model        F(θ; u, v): objective
Primary Objective: Likelihood
• Goal: maximize the score your model gives to the training data it observes
• This is called the likelihood of your data
• In classification, this is p(label | x)
• For language modeling, this is p(x | label)
Objective = Full Likelihood? (in LM)

∏_i p_θ(y_i | h_i) ∝ ∏_i exp(θᵀ f(y_i, h_i))

(assume h_i has whatever context and n-gram history is necessary)

• These values can have very small magnitude → underflow
• Differentiating this product could be a pain
Logarithms
• (0, 1] → (−∞, 0]
• Products → sums: log(ab) = log(a) + log(b); log(a/b) = log(a) − log(b)
• Inverse of exp: log(exp(x)) = x
Log-Likelihood (n-gram LM)

log ∏_i p_θ(y_i | h_i) = Σ_i log p_θ(y_i | h_i)
                       = Σ_i [θᵀ f(y_i, h_i) − log Z(h_i)]
                       = F(θ)

• Wide range of (negative) numbers; sums are more stable
• Products → sums: log(ab) = log(a) + log(b); log(a/b) = log(a) − log(b)
• Inverse of exp: log(exp(x)) = x
• Differentiating this becomes nicer (even though Z depends on θ)
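A sketch of F(θ) computed this way; the feature function f and the (word, history) data format are assumptions of mine, and log Z is computed with scipy's logsumexp so the small exponentials never underflow:

```python
# Sketch: F(theta) = sum_i [theta . f(y_i, h_i) - log Z(h_i)],
# computed stably in log space rather than as a product of tiny numbers.
import numpy as np
from scipy.special import logsumexp

def log_likelihood(theta, f, data, vocab):
    total = 0.0
    for y_i, h_i in data:                        # (word, history) pairs
        scores = np.array([theta @ f(w, h_i) for w in vocab])
        total += theta @ f(y_i, h_i) - logsumexp(scores)   # - log Z(h_i)
    return total
```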
Outline

Maximum Entropy classifiers
• Defining the model
• Defining the objective
• Learning: optimizing the objective
• Math: gradient derivation

Neural (language) models
How will we optimize F(θ)? Calculus
[Figure: a curve F(θ) plotted against θ, with its maximum at θ*; F′(θ) denotes the derivative of F with respect to θ.]
What if you can't find the roots? Follow the derivative

Set t = 0
Pick a starting value θ_t
Until converged:
    1. Get value z_t = F(θ_t)
    2. Get derivative g_t = F′(θ_t)
    3. Get scaling factor ρ_t
    4. Set θ_{t+1} = θ_t + ρ_t · g_t
    5. Set t += 1

[Figure: successive iterates (θ_0, z_0), (θ_1, z_1), (θ_2, z_2), … climbing the curve F(θ) toward the maximum at θ*.]
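The loop above, written out for a one-dimensional F(θ); this is my own sketch, with an illustrative concave F and a fixed scaling factor ρ:

```python
# Sketch of derivative-following on a 1-D objective F(theta).
def follow_derivative(F, F_prime, theta, rho=0.1, tol=1e-8):
    t = 0
    while True:
        z = F(theta)                  # 1. value z_t = F(theta_t)
        g = F_prime(theta)            # 2. derivative g_t = F'(theta_t)
        theta_new = theta + rho * g   # 3-4. step uphill by rho_t * g_t
        t += 1                        # 5. t += 1
        if abs(theta_new - theta) < tol:   # converged?
            return theta_new
        theta = theta_new

# Example: F(theta) = -(theta - 3)^2 has its maximum at theta* = 3.
print(follow_derivative(lambda th: -(th - 3)**2, lambda th: -2 * (th - 3), 0.0))
```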
Remember: Common Derivative Rules
Gradient = Multi-variable derivative

K-dimensional input (θ) → K-dimensional output (the gradient of F at θ)
Gradient Ascent

[Figure: animation of successive gradient ascent steps.]
What if you can't find the roots? Follow the gradient

Set t = 0
Pick a starting value θ_t
Until converged:
    1. Get value z_t = F(θ_t)
    2. Get gradient g_t = F′(θ_t)
    3. Get scaling factor ρ_t
    4. Set θ_{t+1} = θ_t + ρ_t · g_t
    5. Set t += 1

The same algorithm as before, but θ_t and g_t are now K-dimensional vectors.
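With numpy vectors, the K-dimensional version is nearly identical to the 1-D loop; only the derivative becomes a gradient. Again a sketch of mine, with an illustrative quadratic objective:

```python
# Sketch of gradient ascent with K-dimensional theta and gradient.
import numpy as np

def gradient_ascent(grad_F, theta0, rho=0.1, tol=1e-8, max_iters=10_000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_F(theta)                 # K-dimensional gradient g_t
        theta_new = theta + rho * g       # theta_{t+1} = theta_t + rho_t * g_t
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# F(theta) = -||theta - target||^2 is maximized at theta* = target.
target = np.array([1.0, -2.0, 0.5])
print(gradient_ascent(lambda th: -2 * (th - target), np.zeros(3)))
```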
Outline

Maximum Entropy classifiers
• Defining the model
• Defining the objective
• Learning: optimizing the objective
• Math: gradient derivation

Neural (language) models
Reminder: Expectation of a Random Variable

X = number of pieces of candy: 1, 2, 3, 4, 5, 6, each with probability 1/6

E[X] = Σ_x x · p(x) = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5
Reminder: Expectation of a Random Variable

Same values 1–6, but now with probabilities (1/2, 1/10, 1/10, 1/10, 1/10, 1/10):

E[X] = Σ_x x · p(x) = 1/2·1 + 1/10·2 + 1/10·3 + 1/10·4 + 1/10·5 + 1/10·6 = 2.5
Expectations Depend on a Probability Distribution

The same values give E[X] = 3.5 under the uniform distribution but E[X] = 2.5 under (1/2, 1/10, 1/10, 1/10, 1/10, 1/10).
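The two computations above, as a tiny sketch of my own; the point is that E[X] is a property of the distribution, not just of the values:

```python
# Same values, two distributions, two expectations E[X] = sum_x x * p(x).
values = [1, 2, 3, 4, 5, 6]
uniform = [1/6] * 6
skewed = [1/2] + [1/10] * 5

expect = lambda p: sum(x * px for x, px in zip(values, p))
print(expect(uniform))  # 3.5
print(expect(skewed))   # 2.5
```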