From Binary to Extreme Classification (Matt Gormley, Lecture 2, Aug. 28, 2019)



  1. 10-418 / 10-618 Machine Learning for Structured Data. Machine Learning Department, School of Computer Science, Carnegie Mellon University. From Binary to Extreme Classification. Matt Gormley, Lecture 2, Aug. 28, 2019.

  2. Q&A Q: How do I get into the online section? A: Sorry! I erroneously claimed we would automatically add you to the online section. Here’s the correct answer: To join the online section, email Dorothy Holland- Minkley at dfh@andrew.cmu.edu stating that you would like to join the online section. Why the extra step? We want to make sure you’ve seen the non-professional video recording and are okay with the quality. 2

  3. Q&A Q: Will I get off the waitlist? A: Don’t be on the waitlist. Just email Dorothy to join the online section instead! 3

  4. Q&A Q: Can I move between 10-418 and 10-618? A: Yes. Just email Dorothy Holland-Minkley at dfh@andrew.cmu.edu to do so. Q: When is the last possible moment I can move between 10-418 and 10-618? A: I’m not sure. We’ll announce on Piazza once I have an answer. 4

  5. Q&A Populating Wikipedia Infoboxes. Q: Why do interactions appear between variables in this example? A: Consider the test-time setting: – Author writes a new article (vector x) – Infobox is empty – ML system must populate all fields (vector y) at once – Interactions that were seen (i.e. in vector y) at training time are unobserved at test time – so we wish to model them.

  6. ROADMAP

  7. How do we get from Classification to Structured Prediction? 1. We start with the simplest decompositions (i.e. classification). 2. Then we formulate structured prediction as a search problem (decomposition into a sequence of decisions). 3. Finally, we formulate structured prediction in the framework of graphical models (decomposition into parts).

  8. Sampling from a Joint Distribution. A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.
      [Figure: chain-structured factor graph over tag variables X0 (<START>), X1, ..., X5 with factors ψ0, ..., ψ9, above the words "time flies like an arrow". Sampled tag sequences:]
      Sample 1: n v p d n
      Sample 2: n n v d n
      Sample 3: n v p d n
      Sample 4: v n p d n
      Sample 5: v n v d n
      Sample 6: n v p d n

  9. Sampling from a Joint Distribution. A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.
      [Figure: four samples drawn from a joint distribution over variables X1, ..., X7 connected by factors ψ1, ..., ψ12 in a non-chain factor graph; the sampled values shown on the slide are not recoverable from this transcript.]

  10. Sampling from a Joint Distribution. A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x. Here both the tags X1, ..., X5 and the words W1, ..., W5 are sampled:
      Sample 1: time flies like an arrow (tags n v p d n)
      Sample 2: time flies like an arrow (tags n n v d n)
      Sample 3: flies fly with their wings (tags n v p d n)
      Sample 4: with time you will see (tags p n n v v)
      [Figure: chain factor graph with <START>, tag variables X1, ..., X5, word variables W1, ..., W5, and factors ψ0, ..., ψ9.]
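
To make the "proportion of samples" claim concrete, here is a tiny self-contained sketch (not from the course materials): it enumerates every assignment of a two-tag toy chain, normalizes the product of made-up factor values by Z, and checks that the empirical fraction of draws approaches p(x). The tag set and potential values are illustrative only.

import collections
import itertools
import random

TAGS = ["n", "v"]                                # toy tag set (illustrative only)
psi = {("n", "n"): 1.0, ("n", "v"): 8.0,         # psi(previous tag, current tag); made-up values
       ("v", "n"): 6.0, ("v", "v"): 1.0}

def unnormalized_score(assignment):
    """Product of the chain's factor values for one assignment of tags."""
    s = 1.0
    for prev, cur in zip(assignment, assignment[1:]):
        s *= psi[(prev, cur)]
    return s

assignments = list(itertools.product(TAGS, repeat=3))          # all 2^3 assignments
Z = sum(unnormalized_score(a) for a in assignments)            # normalizer
probs = [unnormalized_score(a) / Z for a in assignments]       # p(x) for each x

draws = random.choices(assignments, weights=probs, k=100_000)  # sample from p(x)
counts = collections.Counter(draws)
for a, p in zip(assignments, probs):
    print(a, f"p(x) = {p:.3f}", f"sample fraction = {counts[a] / len(draws):.3f}")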

  11. Factors have local opinions (≥ 0). Each black box (factor) looks at some of the tags X_i and words W_i. Note: we chose to reuse the same factors at different positions in the sentence.
      Factor over adjacent tags (rows = previous tag, columns = current tag):
             v     n     p     d
        v    1     6     3     4
        n    8     4     2     0.1
        p    1     3     1     3
        d    0.1   8     0     0
      Factor over a tag and its word (rows = tag, columns = word; only the word columns shown on the slide):
             time  flies  like  ...
        v    3     5      3
        n    4     5      2
        p    0.1   0.1    3
        d    0.1   0.2    0.1
      [Figure: chain factor graph with <START>, tag variables X1, ..., X5, word variables W1, ..., W5, and factors ψ0, ..., ψ9.]

  12. Factors have local opinions (≥ 0). Each black box (factor) looks at some of the tags X_i and words W_i.
      p(n, v, p, d, n, time, flies, like, an, arrow) = ?
      [Same factor tables as in item 11 above; the figure shows the tag assignment (n, v, p, d, n) over the words "time flies like an arrow".]

  13. Global probability = product of local opinions. Each black box (factor) looks at some of the tags X_i and words W_i.
      p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * ...)
      Uh-oh! The probabilities of the various assignments sum up to Z > 1. So divide them all by Z.
      [Same factor tables as in item 11 above.]
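
To make "product of local opinions, divided by Z" concrete, here is a rough sketch in Python using the two factor tables above. It is an illustration, not code from the course: the <START> factor is ignored, the sentence is truncated to the three words whose columns appear on the slide, and Z here sums over tag sequences only for the fixed words (a conditional view; the joint MRF Z would also sum over word sequences). Its unnormalized score reproduces the leading factors 4 * 8 * 5 * 3 shown on the slide.

import itertools

TAGS = ["v", "n", "p", "d"]

# Tag-tag potentials from the slide (rows = previous tag, columns = current tag).
psi_trans = {
    "v": {"v": 1,   "n": 6, "p": 3, "d": 4},
    "n": {"v": 8,   "n": 4, "p": 2, "d": 0.1},
    "p": {"v": 1,   "n": 3, "p": 1, "d": 3},
    "d": {"v": 0.1, "n": 8, "p": 0, "d": 0},
}

# Tag-word potentials from the slide (only the three word columns shown there).
psi_emit = {
    "v": {"time": 3,   "flies": 5,   "like": 3},
    "n": {"time": 4,   "flies": 5,   "like": 2},
    "p": {"time": 0.1, "flies": 0.1, "like": 3},
    "d": {"time": 0.1, "flies": 0.2, "like": 0.1},
}

def score(tags, words):
    """Unnormalized product of the factor values along the chain."""
    s = 1.0
    for i, (tag, word) in enumerate(zip(tags, words)):
        if i > 0:
            s *= psi_trans[tags[i - 1]][tag]   # factor between adjacent tags
        s *= psi_emit[tag][word]               # factor between a tag and its word
    return s

words = ["time", "flies", "like"]   # truncated sentence: only these columns are on the slide
Z = sum(score(t, words) for t in itertools.product(TAGS, repeat=len(words)))
tags = ("n", "v", "p")
print("unnormalized score:", score(tags, words))        # 4 * 8 * 5 * 3 * 3 = 1440
print("normalized probability:", score(tags, words) / Z)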

  14. Markov Random Field (MRF). Joint distribution over tags X_i and words W_i. The individual factors aren't necessarily probabilities.
      p(n, v, p, d, n, time, flies, like, an, arrow) = (1/Z) (4 * 8 * 5 * 3 * ...)
      [Same factor tables as in item 11 above.]

  15. Hidden Markov Model. But sometimes we choose to make the factors probabilities: constrain each row of a factor to sum to one. Now Z = 1.
      p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * ...)
      Transition probabilities (rows = previous tag, columns = current tag; each row sums to one):
             v     n     p     d
        v    .1    .4    .2    .3
        n    .8    .1    .1    0
        p    .2    .3    .2    .3
        d    .2    .8    0     0
      Emission probabilities (rows = tag, columns = word; only the word columns shown on the slide):
             time  flies  like  ...
        v    .2    .5     .2
        n    .3    .4     .2
        p    .1    .1     .3
        d    .1    .2     .1
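
For reference, the factorization that this row normalization yields is the standard HMM one (stated here from general knowledge, not transcribed from the slide), with x_0 = <START>:

\[
p(x_1, \ldots, x_n, w_1, \ldots, w_n) \;=\; \prod_{i=1}^{n} p(x_i \mid x_{i-1}) \, p(w_i \mid x_i)
\]

Because every conditional distribution sums to one, summing the right-hand side over all tag and word sequences gives 1, which is why Z = 1 here.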

  16. Markov Random Field (MRF). Joint distribution over tags X_i and words W_i.
      p(n, v, p, d, n, time, flies, like, an, arrow) = (1/Z) (4 * 8 * 5 * 3 * ...)
      [Same factor tables as in item 11 above.]

  17. Conditional Random Field (CRF). Conditional distribution over tags X_i given words w_i. The factors and Z are now specific to the sentence w.
      p(n, v, p, d, n | time, flies, like, an, arrow) = (1/Z(w)) (4 * 8 * 5 * 3 * ...)
      [Same factor tables as in item 11 above.]
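
The difference between the last few slides is just where the normalization happens. In the usual notation (standard definitions, not transcribed from the slides), with ψ denoting the tag-pair factors and ψ' the tag-word factors above:

\[
\text{MRF:}\quad p(\mathbf{x}, \mathbf{w}) \;=\; \frac{1}{Z} \prod_i \psi_i(x_{i-1}, x_i)\, \psi_i'(x_i, w_i), \qquad Z \;=\; \sum_{\mathbf{x}', \mathbf{w}'} \prod_i \psi_i(x'_{i-1}, x'_i)\, \psi_i'(x'_i, w'_i)
\]
\[
\text{CRF:}\quad p(\mathbf{x} \mid \mathbf{w}) \;=\; \frac{1}{Z(\mathbf{w})} \prod_i \psi_i(x_{i-1}, x_i)\, \psi_i'(x_i, w_i), \qquad Z(\mathbf{w}) \;=\; \sum_{\mathbf{x}'} \prod_i \psi_i(x'_{i-1}, x'_i)\, \psi_i'(x'_i, w_i)
\]

The HMM is the special case where each factor is itself a conditional probability (each row sums to one), which forces Z = 1.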

  18. BACKGROUND: BINARY CLASSIFICATION

  19. Linear Models for Classification. Key idea: try to learn this hyperplane directly. There are lots of commonly used linear classifiers. These include:
      – Perceptron
      – (Binary) Logistic Regression
      – Naïve Bayes (under certain conditions)
      – (Binary) Support Vector Machines
      Directly modeling the hyperplane would use a decision function h(x) = sign(θ^T x) for y ∈ {−1, +1}.

  20. (Online) Perceptron Algorithm.
      Data: inputs are continuous vectors of length M; outputs are discrete.
      Prediction: output determined by a hyperplane:
      ŷ = h_θ(x) = sign(θ^T x),  where sign(a) = 1 if a ≥ 0 and −1 otherwise.
      Learning: iterative procedure:
      • initialize parameters to the vector of all zeroes
      • while not converged:
        • receive next example (x^(i), y^(i))
        • predict y' = h(x^(i))
        • if positive mistake: add x^(i) to the parameters
        • if negative mistake: subtract x^(i) from the parameters
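
A minimal sketch of this procedure (assumed details: NumPy, labels in {−1, +1}, and a fixed number of passes standing in for the convergence check):

# Minimal online perceptron sketch (assumes labels in {-1, +1}).
import numpy as np

def perceptron_train(X, y, epochs=10):
    """X: (N, M) array of inputs; y: (N,) array of +1/-1 labels."""
    theta = np.zeros(X.shape[1])          # initialize parameters to all zeroes
    for _ in range(epochs):               # "while not converged" simplified to a fixed pass count
        for x_i, y_i in zip(X, y):
            y_hat = 1 if theta @ x_i >= 0 else -1   # sign(theta^T x)
            if y_hat != y_i:
                theta += y_i * x_i        # add x on a positive mistake, subtract on a negative one
    return theta

def perceptron_predict(theta, x):
    return 1 if theta @ x >= 0 else -1

In practice a constant feature of 1 is usually appended to each x so that the learned hyperplane need not pass through the origin.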

  21. (Binary) Logistic Regression.
      Data: inputs are continuous vectors of length M; outputs are discrete.
      Model: logistic function applied to the dot product of the parameters with the input vector:
      p_θ(y = 1 | x) = 1 / (1 + exp(−θ^T x))
      Learning: find the parameters that minimize some objective function:
      θ* = argmin_θ J(θ)
      Prediction: output is the most probable class:
      ŷ = argmax_{y ∈ {0, 1}} p_θ(y | x)
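
A small sketch of this model with one common choice of objective, assumed here to be the average negative log-likelihood J(θ), minimized by batch gradient descent; labels are in {0, 1} and the step size and iteration count are arbitrary illustrative choices:

# Logistic regression sketch: sigmoid model, negative log-likelihood objective,
# batch gradient descent. Assumes labels in {0, 1}.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """p_theta(y = 1 | x) for each row of X."""
    return sigmoid(X @ theta)

def fit(X, y, lr=0.1, iters=1000):
    """Minimize the average negative log-likelihood J(theta) by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = predict_proba(theta, X)
        grad = X.T @ (p - y) / len(y)     # gradient of the average NLL
        theta -= lr * grad
    return theta

def predict(theta, X):
    """Most probable class: argmax over y in {0, 1}."""
    return (predict_proba(theta, X) >= 0.5).astype(int)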

  22. Support Vector Machines (SVMs). Four formulations: hard-margin SVM (primal), hard-margin SVM (Lagrangian dual), soft-margin SVM (primal), soft-margin SVM (Lagrangian dual).
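
The slide's equations were not captured in this transcript; for reference, the two primal formulations are standardly written as below (textbook forms, not taken from the slide), and the duals re-express them in terms of Lagrange multipliers α_i:

\[
\text{Hard-margin (primal):}\quad \min_{\mathbf{w}, b}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^2 \quad \text{s.t.}\ \ y^{(i)}\big(\mathbf{w}^\top \mathbf{x}^{(i)} + b\big) \ge 1 \ \ \forall i
\]
\[
\text{Soft-margin (primal):}\quad \min_{\mathbf{w}, b, \boldsymbol{\xi}}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^2 + C \sum_{i} \xi_i \quad \text{s.t.}\ \ y^{(i)}\big(\mathbf{w}^\top \mathbf{x}^{(i)} + b\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \ \ \forall i
\]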

  23. Decision Trees. Figure from Tom Mitchell.

  24. Binary and Multiclass Classification. Definitions of the supervised learning setup, binary classification, and multiclass classification. (The formal definitions on the slide were not captured in this transcript; see the note below.)
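
The standard definitions, which are what slides of this kind typically state (an assumption, since the originals were not captured), are:

\[
\text{Supervised learning: given } \mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N} \text{ with } \mathbf{x}^{(i)} \in \mathbb{R}^M, \text{ learn a predictor } h: \mathcal{X} \to \mathcal{Y}.
\]
\[
\text{Binary classification: } y^{(i)} \in \{+1, -1\}. \qquad \text{Multiclass classification: } y^{(i)} \in \{1, 2, \ldots, K\},\ K > 2.
\]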

  25. Outline.
      Reductions (Binary → Multiclass):
      1. one-vs-all (OVA) (see the sketch after this list)
      2. all-vs-all (AVA)
      3. classification tree
      4. error correcting output codes (ECOC)
      Settings:
      A. Multiclass Classification
      B. Hierarchical Classification
      C. Extreme Classification
      Why? – multiclass is the simplest structured prediction setting – key insights in the simple reductions are analogous to later (less simple) concepts
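
To preview reduction 1, here is a minimal one-vs-all sketch (illustrative only: the binary learner is a simple perceptron, labels are assumed to be integers 0..K−1, and NumPy is assumed):

# One-vs-all (OVA) reduction sketch: train one binary classifier per class
# (class k vs. the rest) and predict the class whose classifier scores highest.
# The binary learner here is a simple perceptron; any binary classifier works.
import numpy as np

def train_binary_perceptron(X, y, epochs=10):
    """y must be in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (theta @ x_i) <= 0:   # mistake
                theta += y_i * x_i
    return theta

def ova_train(X, y, num_classes, epochs=10):
    """Train one theta_k per class on the relabeled problem 'class k vs. rest'."""
    return [train_binary_perceptron(X, np.where(y == k, 1, -1), epochs)
            for k in range(num_classes)]

def ova_predict(thetas, x):
    """Predict the class whose binary classifier gives the largest raw score."""
    return int(np.argmax([theta @ x for theta in thetas]))

All-vs-all instead trains one binary classifier per pair of classes and predicts by voting, trading K classifiers for K(K−1)/2 smaller ones.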
