Maxent Models (III) & Neural Language Models. CMSC 473/673, UMBC. September 25th, 2017. Some slides adapted from 3SLP.
Recap from last time…
Maximum Entropy Models: a more general language model, argmax_Y p(X | Y) * p(Y), versus classifying in one go, argmax_Y p(Y | X).
Maximum Entropy Models: feature weights, also called natural parameters or distribution parameters; feature function(s), also called sufficient statistics or "strength" function(s).
What if you can't find the roots? Follow the derivative.
[Figure: F(θ) and its derivative F'(θ) plotted against θ, with iterates θ_0, θ_1, θ_2, θ_3 approaching θ*, and the values y_t and gradients g_t at each step.]
Set t = 0. Pick a starting value θ_t. Until converged:
1. Get value y_t = F(θ_t)
2. Get derivative g_t = F'(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1
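A minimal sketch of this follow-the-derivative loop in Python (the objective F, its derivative dF, and the fixed step size rho are illustrative stand-ins, not anything from the slides):

```python
import numpy as np

def gradient_ascent(F, dF, theta0, rho=0.1, tol=1e-6, max_iter=1000):
    """Follow the derivative of F uphill from theta0 until the updates stop changing."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        y_t = F(theta)                    # step 1: value (useful for monitoring)
        g_t = dF(theta)                   # step 2: derivative at the current point
        theta_next = theta + rho * g_t    # steps 3-4: scale and step uphill
        if np.linalg.norm(theta_next - theta) < tol:
            return theta_next             # "until converged"
        theta = theta_next                # step 5: t += 1 happens via the loop
    return theta

# Toy use: F(theta) = -(theta - 3)^2 is maximized at theta = 3.
print(gradient_ascent(lambda th: -(th - 3) ** 2,
                      lambda th: -2 * (th - 3),
                      theta0=[0.0]))
```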
Connections to Other Techniques. Log-Linear Models:
as statistical regression: (multinomial) logistic regression, softmax regression
based in information theory: Maximum Entropy models (MaxEnt)
a form of: Generalized Linear Models
viewed as: Discriminative Naïve Bayes
to be cool today :): very shallow (sigmoidal) neural nets
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo
Objective = Full Likelihood?
Objective = Full Likelihood? These values can have very small magnitude → underflow. Differentiating this product could be a pain.
Logarithms: (0, 1] → (-∞, 0]. Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b). Inverse of exp: log(exp(x)) = x.
Log-Likelihood: a wide range of (negative) numbers, and sums are more stable than products. Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b). Log is the inverse of exp: log(exp(x)) = x, which pairs with p(y | x) ∝ exp(θ · f(x, y)). Differentiating this becomes nicer (even though Z depends on θ).
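A tiny numeric illustration of why sums of logs are preferable to products of probabilities (the probability values here are made up for the example):

```python
import math

probs = [1e-50] * 20                       # tiny per-example probabilities

product = 1.0
for p in probs:
    product *= p                           # underflows to 0.0 in float64

log_sum = sum(math.log(p) for p in probs)  # stays a finite (negative) number

print(product)   # 0.0 (underflow)
print(log_sum)   # about -2302.6
```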
Expectations: number of pieces of candy 1, 2, 3, 4, 5, 6, each with probability 1/6: 1/6*1 + 1/6*2 + 1/6*3 + 1/6*4 + 1/6*5 + 1/6*6 = 3.5
Expectations: number of pieces of candy 1, 2, 3, 4, 5, 6, with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10: 1/2*1 + 1/10*2 + 1/10*3 + 1/10*4 + 1/10*5 + 1/10*6 = 2.5
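The same expectations, written as a short sketch (values and probabilities are exactly those from the candy example):

```python
def expectation(values, probs):
    """E[X] = sum over outcomes of value * probability."""
    return sum(v * p for v, p in zip(values, probs))

values = [1, 2, 3, 4, 5, 6]
print(expectation(values, [1/6] * 6))                            # 3.5
print(expectation(values, [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]))  # 2.5
```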
Log-Likelihood Gradient ("moment matching"): each component k is the difference between the total value of feature f_k in the training data and the total value the current model p_θ thinks it computes for feature f_k.
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 6
Log-Likelihood Gradient Derivation: note that the normalizer Z depends on θ.
Log-Likelihood Gradient Derivation: use the (calculus) chain rule, ∂/∂θ log g(h(θ)) = (g'(h(θ)) · ∂h(θ)/∂θ) / g(h(θ)); applied here, it yields a scalar, p(y' | x_i), times a vector of functions.
Log-Likelihood Derivative Derivation: ∂F/∂θ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_{y'} p(y' | x_i) f_k(x_i, y')
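A sketch of this gradient in Python for a generic log-linear classifier (the feature function feats(x, y) and the use of dense numpy feature vectors are assumptions for illustration, not the course's API):

```python
import numpy as np

def log_linear_probs(theta, x, labels, feats):
    """p(y | x) proportional to exp(theta . f(x, y)), normalized over the label set."""
    scores = np.array([theta @ feats(x, y) for y in labels])
    scores -= scores.max()                 # numerical stability before exp
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def log_likelihood_gradient(theta, data, labels, feats):
    """Observed feature totals minus the model's expected feature totals (moment matching)."""
    grad = np.zeros_like(theta)
    for x, y in data:
        grad += feats(x, y)                # total value of f_k in the training data
        p = log_linear_probs(theta, x, labels, feats)
        for j, y_prime in enumerate(labels):
            grad -= p[j] * feats(x, y_prime)   # expected value of f_k under p_theta
    return grad
```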
Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
Preventing Extreme Values: in Naïve Bayes, extreme values are 0 probabilities; in log-linear models, extreme values are large θ values, which we prevent with regularization.
(Squared) L2 Regularization
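One common way squared-L2 regularization is added to the log-likelihood objective, sketched below; the penalty weight lam and the (lam/2)*||theta||^2 form are the standard convention, not necessarily the slide's exact notation:

```python
import numpy as np

def regularized_objective(theta, log_likelihood, lam):
    """Log-likelihood minus a squared-L2 penalty that discourages large theta values."""
    return log_likelihood(theta) - 0.5 * lam * np.dot(theta, theta)

def regularized_gradient(theta, ll_gradient, lam):
    """The penalty contributes an extra -lam * theta term to the gradient."""
    return ll_gradient(theta) - lam * theta
```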
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 8
(More on) Connections to Other Machine Learning Techniques
Classification: Discriminative Naïve Bayes. [Figure: graphical models with a label/class node and observed feature nodes, shown for Naïve Bayes and for Maxent/Logistic Regression.]
Multinomial Logistic Regression
Multinomial Logistic Regression (in one dimension)
Understanding Conditioning Is this a good language model?
Understanding Conditioning Is this a good language model? (no)
Understanding Conditioning Is this a good posterior classifier? (no)
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 11
Connections to Other Techniques. Log-Linear Models:
as statistical regression: (multinomial) logistic regression, softmax regression
based in information theory: Maximum Entropy models (MaxEnt)
a form of: Generalized Linear Models
viewed as: Discriminative Naïve Bayes
to be cool today :): very shallow (sigmoidal) neural nets
Revisiting the SNAP Function: softmax
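A minimal, numerically stable softmax sketch (shifting by the max before exponentiating is the usual trick to avoid overflow):

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary real-valued scores into a probability distribution."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()        # stability: exp of large scores overflows
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

print(softmax([1.0, 2.0, 3.0]))  # ~[0.09, 0.245, 0.665], sums to 1
```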
N-gram Language Models: given some context, w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i), and predict the next word w_i.
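A count-based sketch of this estimate for a 4-gram model (the toy corpus and function names are made up for illustration):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat sat on the rug".split()

four_grams = Counter(tuple(corpus[i:i+4]) for i in range(len(corpus) - 3))
contexts = Counter(tuple(corpus[i:i+3]) for i in range(len(corpus) - 3))  # trigrams followed by a word

def p_next(w3, w2, w1, w):
    """p(w | w3 w2 w1) proportional to count(w3 w2 w1 w), normalized by the context count."""
    context_count = contexts[(w3, w2, w1)]
    return four_grams[(w3, w2, w1, w)] / context_count if context_count else 0.0

print(p_next("cat", "sat", "on", "the"))  # 1.0 in this toy corpus
```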
Maxent Language Models: given some context, w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i.
Neural Language Models: given some context, w_{i-3} w_{i-2} w_{i-1}, can we learn the feature function(s) for just the context, and can we learn word-specific weights (by type)? Create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1} for the context words, combine these representations with a matrix-vector product, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})), and predict the next word w_i.
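A rough numpy sketch of this kind of feed-forward neural language model: embedding lookups, a learned combination of the context, and a softmax over the vocabulary. All sizes, parameter names, and the tanh nonlinearity are illustrative assumptions, not details from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 50, 100          # vocab size, embedding size, hidden size

E = rng.normal(scale=0.1, size=(V, d))        # distributed representations e_w
C = rng.normal(scale=0.1, size=(h, 3 * d))    # combines the three context embeddings
Theta = rng.normal(scale=0.1, size=(V, h))    # word-specific output weights (by type)

def softmax(scores):
    shifted = scores - scores.max()
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def next_word_probs(w3, w2, w1):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) via embeddings -> combination -> softmax."""
    context = np.concatenate([E[w3], E[w2], E[w1]])   # look up and stack embeddings
    f = np.tanh(C @ context)                          # learned feature function of the context
    return softmax(Theta @ f)                         # one score per word in the vocabulary

probs = next_word_probs(5, 42, 7)    # arbitrary word ids
print(probs.shape, probs.sum())      # (1000,) 1.0
```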