We need to score the different combinations.
Example document (label: ATTACK): “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Score and Combine Our Possibilities
score_1(fatally shot, ATTACK)
score_2(seriously wounded, ATTACK)
score_3(Shining Path, ATTACK)
…
score_k(department, ATTACK)
COMBINE ➔ posterior probability of ATTACK
Are all of these uncorrelated?
Score and Combine Our Possibilities
score_1(fatally shot, ATTACK)
score_2(seriously wounded, ATTACK)
score_3(Shining Path, ATTACK)
…
COMBINE ➔ posterior probability of ATTACK
Q: What are the score and combine functions for Naïve Bayes?
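For Naïve Bayes specifically, each score is a log conditional probability and COMBINE is just a sum (plus the log prior), followed by taking the argmax or normalizing. A minimal sketch (the toy probabilities and function names here are our own, not from the slides):

```python
import math
from collections import defaultdict

def nb_log_posterior(features, label, log_prior, log_likelihood):
    """Unnormalized log posterior: log p(label) + sum_i log p(feature_i | label)."""
    return log_prior[label] + sum(log_likelihood[(f, label)] for f in features)

# Toy numbers, purely for illustration; unseen (feature, label) pairs get a
# small smoothed probability.
log_prior = {"ATTACK": math.log(0.5), "OTHER": math.log(0.5)}
log_likelihood = defaultdict(lambda: math.log(1e-3))
log_likelihood[("fatally shot", "ATTACK")] = math.log(0.2)
log_likelihood[("Shining Path", "ATTACK")] = math.log(0.1)

features = ["fatally shot", "Shining Path"]
scores = {y: nb_log_posterior(features, y, log_prior, log_likelihood)
          for y in ("ATTACK", "OTHER")}
best = max(scores, key=scores.get)
```

So for NB, score_i(feature, label) = log p(feature | label), and COMBINE = exponentiate the sum (and normalize if a posterior probability is wanted).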
Scoring Our Possibilities
score(doc, ATTACK) =
  score_1(fatally shot, ATTACK)
  score_2(seriously wounded, ATTACK)
  score_3(Shining Path, ATTACK)
  …
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/BQCdH9 Lesson 1
Maxent Modeling
p(ATTACK | doc) ∝ SNAP( score(doc, ATTACK) )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
What function…
…operates on any real number?
…is never less than 0?
f(x) = exp(x)
Maxent Modeling
p(ATTACK | doc) ∝ exp( score(doc, ATTACK) )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Maxent Modeling
p(ATTACK | doc) ∝ exp( score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Maxent Modeling
p(ATTACK | doc) ∝ exp( score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Learn the scores (but we’ll declare what combinations should be looked at).
Maxent Modeling
p(ATTACK | doc) ∝ exp( weight_1 * occurs_1(fatally shot, ATTACK) + weight_2 * occurs_2(seriously wounded, ATTACK) + weight_3 * occurs_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Maxent Modeling: Feature Functions
p(ATTACK | doc) ∝ exp( weight_1 * occurs_1(fatally shot, ATTACK) + weight_2 * occurs_2(seriously wounded, ATTACK) + weight_3 * occurs_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued.
occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise
More on Feature Functions
Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued.
Templated, binary: occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise
Templated, real-valued: occurs_{target,type}(fatally shot, ATTACK) = log q(fatally shot | ATTACK) + log q(type | ATTACK) + log q(ATTACK | type)
Non-templated, real-valued: occurs(fatally shot, ATTACK) = log q(fatally shot | ATTACK)
Non-templated, count-valued: occurs(fatally shot, ATTACK) = count(fatally shot | ATTACK)
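A templated binary feature function can be sketched as a closure over the template slots (a minimal hypothetical version; the function names and the toy document string are ours):

```python
def make_binary_feature(trigger, label):
    """Templated binary feature: returns 1 iff `trigger` occurs in the doc
    and the candidate label matches `label`, else 0."""
    def occurs(doc, candidate_label):
        return 1 if (trigger in doc and candidate_label == label) else 0
    return occurs

# Instantiating the same template for two (trigger, label) pairs:
f1 = make_binary_feature("fatally shot", "ATTACK")
f2 = make_binary_feature("Shining Path", "ATTACK")

doc = "Three people have been fatally shot in a Shining Path attack"
```

Each instantiation of the template is one feature with its own learned weight; this is what “generally templated” means in practice.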
Maxent Modeling
p(ATTACK | doc) = (1/Z) exp( weight_1 * applies_1(fatally shot, ATTACK) + weight_2 * applies_2(seriously wounded, ATTACK) + weight_3 * applies_3(Shining Path, ATTACK) + … )
doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”
Q: How do we define Z?
Normalization for Classification
Z = Σ_x exp( weight_1 * occurs_1(fatally shot, x) + weight_2 * occurs_2(seriously wounded, x) + weight_3 * occurs_3(Shining Path, x) + … )
(sum over all labels x)
p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go
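Computing Z by summing over labels can be sketched as follows (a minimal toy version; the weights, features, and document here are our own illustration, not from the slides):

```python
import math

def maxent_posterior(doc, labels, weighted_features):
    """p(x | y) = exp(sum_i w_i * f_i(doc, x)) / Z, with Z = sum over labels x."""
    scores = {x: sum(w * f(doc, x) for w, f in weighted_features)
              for x in labels}
    Z = sum(math.exp(s) for s in scores.values())          # normalizer
    return {x: math.exp(scores[x]) / Z for x in labels}

# Toy binary features that fire only for the ATTACK label.
weighted_features = [
    (1.5, lambda doc, x: 1 if "fatally shot" in doc and x == "ATTACK" else 0),
    (0.8, lambda doc, x: 1 if "Shining Path" in doc and x == "ATTACK" else 0),
]
doc = "Three people have been fatally shot in a Shining Path attack"
post = maxent_posterior(doc, ["ATTACK", "OTHER"], weighted_features)
```

Because Z sums the exponentiated score over every label, the resulting values are non-negative and sum to 1: a proper posterior distribution.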
Normalization for Language Model
General class-based (X) language model of doc y.
Can be significantly harder in the general case.
Simplifying assumption: maxent n-grams!
Understanding Conditioning
Is this a good language model? (no)
Understanding Conditioning Is this a good posterior classifier? (no)
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/BQCdH9 Lesson 11
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification
  Terminology: bag-of-words
  “Naïve” assumption
  Training & performance
  NB as a language model
Maximum Entropy classifiers
  Defining the model
  Defining the objective
  Learning: optimizing the objective
  Math: gradient derivation
Neural (language) models
p_θ(x | y): probabilistic model ➔ objective (given observations)
Objective = Full Likelihood?
Differentiating this product could be a pain.
These values can have very small magnitude ➔ underflow.
Logarithms: (0, 1] ➔ (−∞, 0]
Products ➔ Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
Inverse of exp: log(exp(x)) = x
Log-Likelihood
Wide range of (negative) numbers; sums are more stable.
Products ➔ Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
Inverse of exp: log(exp(x)) = x
Differentiating this becomes nicer (even though Z depends on θ).
Call this objective F(θ).
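The log-likelihood itself appears on the slides only as an image. Reconstructed in standard maxent notation, matching p(x | y) ∝ exp(θ · f(x, y)) from the classification slide, with i indexing training examples (the exact symbols on the original slide may differ):

```latex
F(\theta) \;=\; \sum_{i} \log p_\theta(x_i \mid y_i)
         \;=\; \sum_{i} \Bigl( \theta \cdot f(x_i, y_i) \;-\; \log Z(y_i; \theta) \Bigr)
```

The product of probabilities has become a sum of dot products minus log-normalizers, which is both numerically stable and convenient to differentiate.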
Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words “Naïve” assumption Training & performance NB as a language Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models
How will we optimize F( θ )? Calculus
[Figure: a concave curve F(θ) plotted against θ, peaking at the maximizer θ*; F′(θ) denotes the derivative of F with respect to θ.]
Example: F(x) = −(x − 2)²
Differentiate: F′(x) = −2x + 4
Solve F′(x) = 0: x = 2
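A quick numeric check of the worked example (our own illustration): the derivative vanishes at x = 2, and that point beats its neighbors.

```python
def F(x):
    """The example objective from the slide: F(x) = -(x - 2)^2."""
    return -(x - 2) ** 2

def F_prime(x):
    """Its derivative: F'(x) = -2x + 4."""
    return -2 * x + 4

# The root of F' is the maximizer: F'(2) = 0 and F(2) = 0 is the peak value.
```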
Common Derivative Rules
What if you can’t find the roots? Follow the derivative.
Set t = 0.
Pick a starting value θ_t.
Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F′(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t * g_t
  5. Set t += 1
[Figure: iterates θ_0, θ_1, θ_2, θ_3 climbing the curve F(θ) toward θ*, with values y_t and derivatives g_t marked at each step.]
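The loop above translates almost line for line into code. A minimal sketch (the fixed step size rho and the convergence test are our simplifications; the slides leave ρ_t as a general scaling factor):

```python
def gradient_ascent(F_prime, theta0, rho=0.1, tol=1e-8, max_iters=10_000):
    """Follow the derivative uphill until the update is negligibly small."""
    theta = theta0
    for _ in range(max_iters):
        g = F_prime(theta)           # step 2: derivative at the current point
        new_theta = theta + rho * g  # step 4: move in the derivative's direction
        if abs(new_theta - theta) < tol:
            break                    # "until converged"
        theta = new_theta
    return theta

# On the running example F(x) = -(x - 2)^2, with F'(x) = -2x + 4:
theta_star = gradient_ascent(lambda x: -2 * x + 4, theta0=0.0)
```

Starting from θ_0 = 0, the iterates climb toward the known maximizer x = 2 found analytically on the previous slide.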
Gradient = multi-variable derivative
K-dimensional input ➔ K-dimensional output
Gradient Ascent
Outline
Recap: classification (MAP vs. noisy channel) & evaluation
Naïve Bayes (NB) classification
  Terminology: bag-of-words
  “Naïve” assumption
  Training & performance
  NB as a language model
Maximum Entropy classifiers
  Defining the model
  Defining the objective
  Learning: optimizing the objective
  Math: gradient derivation
Neural (language) models
Expectations
Number of pieces of candy: 1, 2, 3, 4, 5, 6 (uniform)
(1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6 = 3.5
Expectations
Number of pieces of candy: 1, 2, 3, 4, 5, 6 (skewed: p(1) = 1/2, p(2) = … = p(6) = 1/10)
(1/2)·1 + (1/10)·2 + (1/10)·3 + (1/10)·4 + (1/10)·5 + (1/10)·6 = 2.5
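Both candy examples are instances of the same probability-weighted sum. A minimal sketch (function name is ours), using exact fractions so the arithmetic matches the slides exactly:

```python
from fractions import Fraction

def expectation(dist):
    """E[X] = sum_x p(x) * x for a discrete distribution given as {x: p(x)}."""
    return sum(p * x for x, p in dist.items())

# Fair die: every outcome 1..6 has probability 1/6.
fair = {x: Fraction(1, 6) for x in range(1, 7)}

# Skewed die: outcome 1 has probability 1/2, the rest 1/10 each.
skewed = {1: Fraction(1, 2), **{x: Fraction(1, 10) for x in range(2, 7)}}
```

With the fair distribution the expectation is 3.5; shifting half the mass onto the outcome 1 drags it down to 2.5, matching the slides.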