Maxent Models (II) CMSC 473/673 UMBC September 20th, 2017


  1. Maxent Models (II) CMSC 473/673 UMBC September 20th, 2017

  2. Announcements: Assignment 1. Due 11:59 PM, Saturday 9/23 (~3.5 days). Use the submit utility with class id cs473_ferraro and assignment id a1. We must be able to run it on GL! Common pitfall #1: forgetting files. Common pitfall #2: incorrect paths to files. Common pitfall #3: 3rd-party libraries.

  3. Announcements: Question 6. Recursive form: $p^{(n)}(x_i \mid x_{i-n+1}:x_{i-1}) = \lambda_n f^{(n)}(x_{i-n+1}:x_i) + (1 - \lambda_n)\, p^{(n-1)}(x_i \mid x_{i-n+2}:x_{i-1})$. Unrolled form: $p^{(n)}(x_i \mid x_{i-n+1}:x_{i-1}) = \lambda_{n,n} f^{(n)}(x_{i-n+1}:x_i) + \lambda_{n,n-1} f^{(n-1)}(x_{i-n+2}:x_i) + \cdots + \lambda_{n,0} f^{(0)}(\cdot)$, with $\lambda_{n,0} = 1 - \sum_{m=0}^{n-1} \lambda_{n,n-m}$.
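
As a rough illustration of the recursive form in Question 6, the sketch below interpolates relative-frequency estimates of decreasing order. Everything named here is an assumption made for the example and not part of the assignment: the helper names `count_ngrams` and `interpolated_prob`, the toy corpus, the weight values, the uniform distribution used as the order-0 base case, and the choice to fall back entirely to the lower order when a history was never observed.

```python
from collections import Counter

def count_ngrams(tokens, max_n):
    """Count n-grams of order 1..max_n, plus how often each history was followed by a word."""
    ngrams, contexts = Counter(), Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def interpolated_prob(word, history, ngrams, contexts, lambdas, vocab_size):
    """p^(n)(word | history) with n = len(history) + 1, following the recursive form."""
    history = tuple(history)
    n = len(history) + 1
    if n == 1:
        lower = 1.0 / vocab_size  # assumed base case: uniform over the vocabulary
    else:
        lower = interpolated_prob(word, history[1:], ngrams, contexts, lambdas, vocab_size)
    seen = contexts[history]
    if seen == 0:
        return lower  # assumed fallback: unseen history, use the lower order only
    f_n = ngrams[history + (word,)] / seen  # relative frequency f^(n)
    return lambdas[n] * f_n + (1.0 - lambdas[n]) * lower

# Toy usage (corpus and weights are made up for illustration).
tokens = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(tokens))
ngrams, contexts = count_ngrams(tokens, max_n=2)
lambdas = {1: 0.9, 2: 0.7}
p_cat = interpolated_prob("cat", ("the",), ngrams, contexts, lambdas, len(vocab))
p_mat = interpolated_prob("mat", ("the",), ngrams, contexts, lambdas, len(vocab))
total = sum(interpolated_prob(w, ("the",), ngrams, contexts, lambdas, len(vocab)) for w in vocab)
```

With these choices the probabilities over the vocabulary sum to one for any observed history, which is a quick sanity check worth running on your own implementation.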

  4. Announcements: Course Project. The official handout will be out Friday 9/22; until then, focus on Assignment 1. Teams of 1-3; mixed undergrad/grad is encouraged but not required. Some novel aspect is needed. Ex 1: reimplement an existing technique and apply it to a new domain. Ex 2: reimplement an existing technique and apply it to a new (human) language. Ex 3: explore a novel technique on an existing problem.

  5. Recap from last time…

  6. Classify or Decode with Bayes Rule. How likely is label X overall? How well does text Y represent label X? For “simple” or “flat” labels: iterate through the labels; evaluate the score for each label, keeping only the best (n best); return the best (or n best) label and score.
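
The loop over flat labels can be sketched as follows. This is a minimal illustration, assuming log-space scores and a bag-of-words likelihood; the label `POLITICS`, the word probabilities, and the function names are hypothetical and chosen only to make the example runnable.

```python
import math

def classify(tokens, log_prior, log_likelihood):
    """Iterate through labels, score each with Bayes rule, and keep only the best."""
    best_label, best_score = None, -math.inf
    for label in log_prior:
        # log p(X) + log p(Y | X); p(Y) is the same for every label, so it is skipped.
        score = log_prior[label] + log_likelihood(tokens, label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Hypothetical toy parameters, not estimated from any data.
WORD_PROBS = {
    "ATTACK": {"shot": 0.4, "wounded": 0.4, "election": 0.2},
    "POLITICS": {"shot": 0.1, "wounded": 0.1, "election": 0.8},
}
LOG_PRIOR = {"ATTACK": math.log(0.5), "POLITICS": math.log(0.5)}

def bag_of_words_log_likelihood(tokens, label):
    """Sum per-word log probabilities, treating words as independent given the label."""
    return sum(math.log(WORD_PROBS[label][t]) for t in tokens)

label, score = classify(["shot", "wounded"], LOG_PRIOR, bag_of_words_log_likelihood)
```

Returning the n best labels instead would only require keeping a sorted list of (score, label) pairs rather than a single running maximum.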

  7. Classification Evaluation: the 2-by-2 contingency table

     | | Actually Correct | Actually Incorrect |
     |---|---|---|
     | Selected/Guessed | True Positive (TP) | False Positive (FP) |
     | Not selected/not guessed | False Negative (FN) | True Negative (TN) |

  8. Classification Evaluation: Accuracy, Precision, and Recall. Accuracy: % of items correct. Precision: % of selected items that are correct. Recall: % of correct items that are selected.

     | | Actually Correct | Actually Incorrect |
     |---|---|---|
     | Selected/Guessed | True Positive (TP) | False Positive (FP) |
     | Not selected/not guessed | False Negative (FN) | True Negative (TN) |
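
The three measures follow directly from the four cells of the contingency table. A small sketch, where the function names and the example counts are made up for illustration, and returning 0.0 on an empty denominator is an assumed convention:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all items that were handled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Fraction of selected items that are correct (0.0 if nothing was selected)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of correct items that were selected (0.0 if there are no correct items)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts for 100 items.
tp, fp, fn, tn = 8, 2, 4, 86
acc = accuracy(tp, fp, fn, tn)
prec = precision(tp, fp)
rec = recall(tp, fn)
```

In this toy case accuracy is high mostly because true negatives dominate, which is one reason precision and recall are reported alongside it.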

  9. Language Modeling as Naïve Bayes Classifier. Adopt the naïve bag-of-words representation $Y_i$. Assume position doesn’t matter. Assume the feature probabilities are independent given the class X.

  10. Naïve Bayes Summary. Potential Advantages: very fast, low storage requirements; robust to irrelevant features; very good in domains with many equally important features; optimal if the independence assumptions hold; dependable baseline for text classification (but often not the best). Potential Issues: Model the posterior in one go? Are the features really uncorrelated? Are plain counts always appropriate? Are there “better” (automated, more principled) ways of handling missing/noisy data?

  11. Naïve Bayes Summary. Potential Advantages: very fast, low storage requirements; robust to irrelevant features; very good in domains with many equally important features; optimal if the independence assumptions hold; dependable baseline for text classification (but often not the best). Potential Issues: Model the posterior in one go? (Relevant for classification…) Are the features really uncorrelated? Are plain counts always appropriate? Are there “better” (automated, more principled) ways of handling missing/noisy data?

  12. Naïve Bayes Summary. Potential Advantages: very fast, low storage requirements; robust to irrelevant features; very good in domains with many equally important features; optimal if the independence assumptions hold; dependable baseline for text classification (but often not the best). Potential Issues: Model the posterior in one go? (Relevant for classification… and language modeling.) Are the features really uncorrelated? Are plain counts always appropriate? Are there “better” (automated, more principled) ways of handling missing/noisy data?

  13. Maximum Entropy Models: a more general language model. $\arg\max_X \, p(Y \mid X) \cdot p(X)$

  14. Maximum Entropy Models: a more general language model. $\arg\max_X \, p(Y \mid X) \cdot p(X)$. Classify in one go: $\arg\max_X \, p(X \mid Y)$
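
For reference, Bayes rule connects the two decision rules on this slide. The worked step below uses X for the label and Y for the text, as in the recap. It shows only that both rules select the same label when they share one underlying distribution; the practical difference is whether the posterior is modeled directly or through a likelihood and a prior.

```latex
\begin{align*}
\arg\max_X \, p(X \mid Y)
  &= \arg\max_X \, \frac{p(Y \mid X)\, p(X)}{p(Y)} \\
  &= \arg\max_X \, p(Y \mid X)\, p(X)
  \qquad \text{since } p(Y) \text{ does not depend on } X
\end{align*}
```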

  15. Document Classification. Document: “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.” Label: ATTACK. We need to score the different combinations.

  16. Score and Combine Our Possibilities. $\text{score}_1(\text{fatally shot}, \text{ATTACK})$, $\text{score}_2(\text{seriously wounded}, \text{ATTACK})$, $\text{score}_3(\text{Shining Path}, \text{ATTACK})$, …, $\text{score}_k(\text{department}, \text{ATTACK})$, … are passed to COMBINE, which yields the posterior probability of ATTACK. Are all of these uncorrelated?

  17. Score and Combine Our Possibilities. $\text{score}_1(\text{fatally shot}, \text{ATTACK})$, $\text{score}_2(\text{seriously wounded}, \text{ATTACK})$, $\text{score}_3(\text{Shining Path}, \text{ATTACK})$, … are passed to COMBINE, which yields the posterior probability of ATTACK. Q: What are the score and combine functions for Naïve Bayes?

  18. Scoring Our Possibilities. $\text{score}(\text{doc}, \text{ATTACK}) = \text{score}_1(\text{fatally shot}, \text{ATTACK}) + \text{score}_2(\text{seriously wounded}, \text{ATTACK}) + \text{score}_3(\text{Shining Path}, \text{ATTACK}) + \cdots$, where doc is the Shining Path news passage from slide 15. Learn these scores… but how? What do we optimize?

  19. Maxent Modeling. $p(\text{ATTACK} \mid \text{doc}) \propto \text{SNAP}(\text{score}(\text{doc}, \text{ATTACK}))$, where doc is the Shining Path news passage from slide 15.

  20. Maxent Modeling. $p(\text{ATTACK} \mid \text{doc}) \propto \exp(\text{score}(\text{doc}, \text{ATTACK}))$, where doc is the Shining Path news passage from slide 15. exp gives a positive, unnormalized probability: $f(x) = \exp(x)$.

  21. Maxent Modeling. $p(\text{ATTACK} \mid \text{doc}) \propto \exp(\text{score}_1(\text{fatally shot}, \text{ATTACK}) + \text{score}_2(\text{seriously wounded}, \text{ATTACK}) + \text{score}_3(\text{Shining Path}, \text{ATTACK}) + \cdots)$, where doc is the Shining Path news passage from slide 15. Learn the scores (but we’ll declare what combinations should be looked at).

  22. Maxent Modeling. $p(\text{ATTACK} \mid \text{doc}) \propto \exp(\text{weight}_1 \cdot \text{applies}_1(\text{fatally shot}, \text{ATTACK}) + \text{weight}_2 \cdot \text{applies}_2(\text{seriously wounded}, \text{ATTACK}) + \text{weight}_3 \cdot \text{applies}_3(\text{Shining Path}, \text{ATTACK}) + \cdots)$, where doc is the Shining Path news passage from slide 15.
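
The weight-times-applies form can be sketched in a few lines. This is an illustration under stated assumptions: each "applies" function is taken to be a 0/1 indicator, the weight values are invented rather than learned, and `unnormalized_score` is a hypothetical helper name.

```python
import math

def unnormalized_score(weights, active_features):
    """exp(sum_k weight_k * applies_k), where applies_k is 1 for active features, else 0."""
    total = sum(weights.get(feature, 0.0) for feature in active_features)
    return math.exp(total)

# Invented weights for (phrase, label) combinations; real weights would be learned.
weights = {
    ("fatally shot", "ATTACK"): 1.5,
    ("seriously wounded", "ATTACK"): 1.0,
    ("Shining Path", "ATTACK"): 2.0,
}

# The combinations we declared and that apply to the example document.
doc_phrases = ["fatally shot", "seriously wounded", "Shining Path"]
active = [(phrase, "ATTACK") for phrase in doc_phrases]

score = unnormalized_score(weights, active)       # exp(1.5 + 1.0 + 2.0)
empty_score = unnormalized_score(weights, [])     # no features apply
```

The result is positive but not yet a probability, since nothing forces the scores of the different labels to sum to one.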

  23. Q: What if none of our features apply?

  24. Guiding Principle for Log-Linear Models. “[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.” Edwin T. Jaynes, 1957

  25. Guiding Principle for Log-Linear Models. “[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.” Edwin T. Jaynes, 1957. Annotation: $\exp(\theta \cdot f) \to \exp(\theta \cdot 0) = 1$
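
A short worked step may help tie this back to the question of what happens when no features apply. It assumes the unnormalized scores are turned into probabilities by dividing by their sum over all labels, and it writes L for the number of labels (a symbol introduced here, not taken from the slides).

```latex
\begin{align*}
f(\text{doc}, x) = \mathbf{0} \text{ for every label } x
  &\;\Rightarrow\; \exp(\theta \cdot \mathbf{0}) = \exp(0) = 1 \text{ for every label } x \\
  &\;\Rightarrow\; p(x \mid \text{doc}) = \frac{1}{\sum_{x'} 1} = \frac{1}{L}
\end{align*}
```

Under that assumption the model falls back to a uniform distribution over labels, which is what "maximally noncommittal" suggests.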

  26. Easier-to-write form. $\exp(\theta_1 \cdot f_1(\text{fatally shot}, \text{ATTACK}) + \theta_2 \cdot f_2(\text{seriously wounded}, \text{ATTACK}) + \theta_3 \cdot f_3(\text{Shining Path}, \text{ATTACK}) + \cdots)$

  27. Easier-to-write form. $\exp(\theta_1 \cdot f_1(\text{fatally shot}, \text{ATTACK}) + \theta_2 \cdot f_2(\text{seriously wounded}, \text{ATTACK}) + \theta_3 \cdot f_3(\text{Shining Path}, \text{ATTACK}) + \cdots)$, with K weights $\theta_k$ and K features $f_k$.

  28. Easier-to-write form. $\exp(\theta \cdot f(\text{doc}, \text{ATTACK}))$: a dot product of a K-dimensional weight vector $\theta$ and a K-dimensional feature vector $f(\text{doc}, \text{ATTACK})$.

  29. Log-Linear Models

  30. Log-Linear Models

  31. Log-Linear Models. Feature function(s); sufficient statistics; “strength” function(s).

  32. Log-Linear Models Feature Weights Natural parameters Distribution Parameters

  33. Log-Linear Models How do we normalize?
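
The slide leaves the question open. One common answer, offered here as an assumption and not as the lecture's own derivation, is to divide each exp score by the sum of the exp scores over all labels. The sketch below does that with plain Python lists as K-dimensional vectors; the label `OTHER`, the weight values, and the function names are hypothetical.

```python
import math

def dot(theta, f):
    """Dot product of a K-dimensional weight vector and feature vector."""
    return sum(t * v for t, v in zip(theta, f))

def posterior(theta, features_by_label):
    """Normalize exp(theta . f(doc, label)) over labels so the values sum to one."""
    scores = {label: dot(theta, f) for label, f in features_by_label.items()}
    shift = max(scores.values())  # subtracting the max avoids overflow in exp
    exps = {label: math.exp(s - shift) for label, s in scores.items()}
    z = sum(exps.values())        # the normalizer
    return {label: e / z for label, e in exps.items()}

# Invented weights and 0/1 feature vectors for one document under two labels.
theta = [1.5, 1.0, 2.0, -0.5]
features_by_label = {
    "ATTACK": [1, 1, 1, 0],
    "OTHER": [0, 0, 0, 1],
}
probs = posterior(theta, features_by_label)

# If no features apply for any label, every label gets the same probability.
uniform = posterior(theta, {"ATTACK": [0, 0, 0, 0], "OTHER": [0, 0, 0, 0]})
```

Subtracting the largest score before exponentiating does not change the result, because the same factor cancels in the numerator and the normalizer.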
