Maxent Models (III) & Neural Language Models. CMSC 473/673, UMBC. September 25th, 2017. Some slides adapted from 3SLP.
Recap from last time…
Maximum Entropy Models: a more general language model, argmax_Y p(X | Y) * p(Y), versus classifying in one go, argmax_Y p(Y | X).
Maximum Entropy Models: feature weights, also called natural parameters or distribution parameters; feature function(s), also called sufficient statistics or "strength" function(s).
What if you can't find the roots? Follow the derivative.
[Figure: F(θ) and its derivative F'(θ) plotted against θ, with iterates θ_0, θ_1, θ_2, θ_3 approaching θ*, and the values y_t and gradients g_t at each step.]
Set t = 0. Pick a starting value θ_t. Until converged:
1. Get value y_t = F(θ_t)
2. Get derivative g_t = F'(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1
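A minimal sketch of this follow-the-derivative loop in Python (the objective F, its derivative dF, and the fixed step size rho are illustrative stand-ins, not anything from the slides):

```python
import numpy as np

def gradient_ascent(F, dF, theta0, rho=0.1, tol=1e-6, max_iter=1000):
    """Follow the derivative of F uphill from theta0 until the updates stop changing."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        y_t = F(theta)                    # step 1: value (useful for monitoring)
        g_t = dF(theta)                   # step 2: derivative at the current point
        theta_next = theta + rho * g_t    # steps 3-4: scale and step uphill
        if np.linalg.norm(theta_next - theta) < tol:
            return theta_next             # "until converged"
        theta = theta_next                # step 5: t += 1 happens via the loop
    return theta

# Toy use: F(theta) = -(theta - 3)^2 is maximized at theta = 3.
print(gradient_ascent(lambda th: -(th - 3) ** 2,
                      lambda th: -2 * (th - 3),
                      theta0=[0.0]))
```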
Connections to Other Techniques. Log-Linear Models:
as statistical regression: (multinomial) logistic regression, softmax regression
based in information theory: Maximum Entropy models (MaxEnt)
a form of: Generalized Linear Models
viewed as: Discriminative Naïve Bayes
to be cool today :): very shallow (sigmoidal) neural nets
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo
Objective = Full Likelihood?
Objective = Full Likelihood? These values can have very small magnitude → underflow. Differentiating this product could be a pain.
Logarithms: (0, 1] → (-∞, 0]. Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b). Inverse of exp: log(exp(x)) = x.
Log-Likelihood: a wide range of (negative) numbers, and sums are more stable than products. Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b). Log is the inverse of exp: log(exp(x)) = x, which pairs with p(y | x) ∝ exp(θ · f(x, y)). Differentiating this becomes nicer (even though Z depends on θ).
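A tiny numeric illustration of why sums of logs are preferable to products of probabilities (the probability values here are made up for the example):

```python
import math

probs = [1e-50] * 20                       # tiny per-example probabilities

product = 1.0
for p in probs:
    product *= p                           # underflows to 0.0 in float64

log_sum = sum(math.log(p) for p in probs)  # stays a finite (negative) number

print(product)   # 0.0 (underflow)
print(log_sum)   # about -2302.6
```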
Expectations: number of pieces of candy 1, 2, 3, 4, 5, 6, each with probability 1/6: 1/6*1 + 1/6*2 + 1/6*3 + 1/6*4 + 1/6*5 + 1/6*6 = 3.5
Expectations: number of pieces of candy 1, 2, 3, 4, 5, 6, with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10: 1/2*1 + 1/10*2 + 1/10*3 + 1/10*4 + 1/10*5 + 1/10*6 = 2.5
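The same expectations, written as a short sketch (values and probabilities are exactly those from the candy example):

```python
def expectation(values, probs):
    """E[X] = sum over outcomes of value * probability."""
    return sum(v * p for v, p in zip(values, probs))

values = [1, 2, 3, 4, 5, 6]
print(expectation(values, [1/6] * 6))                            # 3.5
print(expectation(values, [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]))  # 2.5
```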
Log-Likelihood Gradient ("moment matching"): each component k is the difference between the total value of feature f_k in the training data and the total value the current model p_θ thinks it computes for feature f_k.
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 6
Log-Likelihood Gradient Derivation: note that the normalizer Z depends on θ.
Log-Likelihood Gradient Derivation: use the (calculus) chain rule, ∂/∂θ log g(h(θ)) = (g'(h(θ)) · ∂h(θ)/∂θ) / g(h(θ)); applied here, it yields a scalar, p(y' | x_i), times a vector of functions.
Log-Likelihood Derivative Derivation: ∂F/∂θ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_{y'} p(y' | x_i) f_k(x_i, y')
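A sketch of this gradient in Python for a generic log-linear classifier (the feature function feats(x, y) and the use of dense numpy feature vectors are assumptions for illustration, not the course's API):

```python
import numpy as np

def log_linear_probs(theta, x, labels, feats):
    """p(y | x) proportional to exp(theta . f(x, y)), normalized over the label set."""
    scores = np.array([theta @ feats(x, y) for y in labels])
    scores -= scores.max()                 # numerical stability before exp
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def log_likelihood_gradient(theta, data, labels, feats):
    """Observed feature totals minus the model's expected feature totals (moment matching)."""
    grad = np.zeros_like(theta)
    for x, y in data:
        grad += feats(x, y)                # total value of f_k in the training data
        p = log_linear_probs(theta, x, labels, feats)
        for j, y_prime in enumerate(labels):
            grad -= p[j] * feats(x, y_prime)   # expected value of f_k under p_theta
    return grad
```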
Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
Preventing Extreme Values: in Naïve Bayes, extreme values are 0 probabilities; in log-linear models, extreme values are large θ values, which we prevent with regularization.
(Squared) L2 Regularization
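One common way squared-L2 regularization is added to the log-likelihood objective, sketched below; the penalty weight lam and the (lam/2)*||theta||^2 form are the standard convention, not necessarily the slide's exact notation:

```python
import numpy as np

def regularized_objective(theta, log_likelihood, lam):
    """Log-likelihood minus a squared-L2 penalty that discourages large theta values."""
    return log_likelihood(theta) - 0.5 * lam * np.dot(theta, theta)

def regularized_gradient(theta, ll_gradient, lam):
    """The penalty contributes an extra -lam * theta term to the gradient."""
    return ll_gradient(theta) - lam * theta
```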
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 8
(More on) Connections to Other Machine Learning Techniques
Classification: Discriminative Naïve Bayes. [Figure: graphical models with a label/class node and observed feature nodes, shown for Naïve Bayes and for Maxent/Logistic Regression.]
Multinomial Logistic Regression
Multinomial Logistic Regression (in one dimension)
Understanding Conditioning Is this a good language model?
Understanding Conditioning Is this a good language model? (no)
Understanding Conditioning Is this a good posterior classifier? (no)
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 11
Connections to Other Techniques. Log-Linear Models:
as statistical regression: (multinomial) logistic regression, softmax regression
based in information theory: Maximum Entropy models (MaxEnt)
a form of: Generalized Linear Models
viewed as: Discriminative Naïve Bayes
to be cool today :): very shallow (sigmoidal) neural nets
Revisiting the SNAP Function: softmax
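A minimal, numerically stable softmax sketch (shifting by the max before exponentiating is the usual trick to avoid overflow):

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary real-valued scores into a probability distribution."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()        # stability: exp of large scores overflows
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

print(softmax([1.0, 2.0, 3.0]))  # ~[0.09, 0.245, 0.665], sums to 1
```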
N-gram Language Models: given some context, w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i), and predict the next word w_i.
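A count-based sketch of this estimate for a 4-gram model (the toy corpus and function names are made up for illustration):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat sat on the rug".split()

four_grams = Counter(tuple(corpus[i:i+4]) for i in range(len(corpus) - 3))
contexts = Counter(tuple(corpus[i:i+3]) for i in range(len(corpus) - 3))  # trigrams followed by a word

def p_next(w3, w2, w1, w):
    """p(w | w3 w2 w1) proportional to count(w3 w2 w1 w), normalized by the context count."""
    context_count = contexts[(w3, w2, w1)]
    return four_grams[(w3, w2, w1, w)] / context_count if context_count else 0.0

print(p_next("cat", "sat", "on", "the"))  # 1.0 in this toy corpus
```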
Maxent Language Models: given some context, w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i.
Neural Language Models: given some context, w_{i-3} w_{i-2} w_{i-1}, can we learn the feature function(s) for just the context, and can we learn word-specific weights (by type)? Create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1} for the context words, combine these representations with a matrix-vector product, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})), and predict the next word w_i.
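A rough numpy sketch of this kind of feed-forward neural language model: embedding lookups, a learned combination of the context, and a softmax over the vocabulary. All sizes, parameter names, and the tanh nonlinearity are illustrative assumptions, not details from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 50, 100          # vocab size, embedding size, hidden size

E = rng.normal(scale=0.1, size=(V, d))        # distributed representations e_w
C = rng.normal(scale=0.1, size=(h, 3 * d))    # combines the three context embeddings
Theta = rng.normal(scale=0.1, size=(V, h))    # word-specific output weights (by type)

def softmax(scores):
    shifted = scores - scores.max()
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def next_word_probs(w3, w2, w1):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) via embeddings -> combination -> softmax."""
    context = np.concatenate([E[w3], E[w2], E[w1]])   # look up and stack embeddings
    f = np.tanh(C @ context)                          # learned feature function of the context
    return softmax(Theta @ f)                         # one score per word in the vocabulary

probs = next_word_probs(5, 42, 7)    # arbitrary word ids
print(probs.shape, probs.sum())      # (1000,) 1.0
```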