

  1. Semi-Supervised Learning. Maria-Florina Balcan, 03/30/2015. Readings:
     • Semi-Supervised Learning. Encyclopedia of Machine Learning. Jerry Zhu, 2010.
     • Combining Labeled and Unlabeled Data with Co-Training. Avrim Blum, Tom Mitchell. COLT 1998.

  2. Fully Supervised Learning. Data distribution D on X. An expert/oracle labels examples drawn from the source, producing labeled examples (x_1, c*(x_1)), …, (x_m, c*(x_m)), where c*: X → Y is the target. The learning algorithm outputs a hypothesis h: X → Y. [Figure: an example output rule, a small decision tree splitting on x_1 > 5 and x_6 > 2.]

  3. Fully Supervised Learning. Data distribution D on X. The algorithm receives a labeled sample S_l = {(x_1, y_1), …, (x_{m_l}, y_{m_l})}, where the x_i are drawn i.i.d. from D and y_i = c*(x_i) for the target c*: X → Y, and outputs h: X → Y. Goal: h has small error over D, err_D(h) = Pr_{x∼D}[h(x) ≠ c*(x)].
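To make the error definition concrete, here is a minimal sketch (not from the slides) that estimates err_D(h) on an i.i.d. sample; the distribution, target c*, and hypothesis h below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_error(h, X, y):
    """Fraction of i.i.d. examples on which h disagrees with the target labels."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Hypothetical example: D is uniform on [0, 10]^2, the target is
# c*(x) = sign(x[0] - 5), and the hypothesis thresholds at 4 instead.
X = rng.uniform(0, 10, size=(1000, 2))
y = np.sign(X[:, 0] - 5)
h = lambda x: np.sign(x[0] - 4)

print(empirical_error(h, X, y))  # ≈ 0.1, the true err_D(h) in this toy setup
```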

  4. Two Core Aspects of Supervised Learning.
     • Computation, algorithm design: how to optimize? Automatically generate rules that do well on the observed data. E.g., Naïve Bayes, logistic regression, SVM, AdaBoost, etc.
     • Confidence bounds, generalization: confidence that a rule learned from (labeled) data will be effective on future data. E.g., VC-dimension, Rademacher complexity, margin-based bounds, etc.

  5. Classic Paradigm Insufficient Nowadays. Modern applications have massive amounts of raw data, of which only a tiny fraction can be annotated by human experts. Examples: protein sequences, billions of webpages, images.

  6. Modern ML: New Learning Approaches. Modern applications have massive amounts of raw data, so we want techniques that make the best use of that data while minimizing the need for expert/human intervention. Paradigms where there has been great progress: semi-supervised learning and (inter)active learning.

  7. Semi-Supervised Learning. The algorithm now receives both labeled and unlabeled examples from the data source: a labeled sample S_l = {(x_1, y_1), …, (x_{m_l}, y_{m_l})}, with x_i drawn i.i.d. from D and y_i = c*(x_i), and an unlabeled sample S_u = {x_1, …, x_{m_u}} drawn i.i.d. from D. The algorithm outputs a classifier h. Goal: h has small error over D, err_D(h) = Pr_{x∼D}[h(x) ≠ c*(x)].

  8. Semi-supervised Learning. A major topic of research in ML. Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
     • Transductive SVM [Joachims ’99]
     • Co-training [Blum & Mitchell ’98]   (test-of-time awards at ICML!)
     • Graph-based methods [B&C01], [ZGL03]
     Workshops: ICML ’03, ICML ’05, … Books: Semi-Supervised Learning, O. Chapelle, B. Schölkopf and A. Zien (eds), MIT Press, 2006; Introduction to Semi-Supervised Learning, Zhu & Goldberg, Morgan & Claypool, 2009.

  9. Semi-supervised Learning (continued). These methods (Transductive SVM [Joachims ’99], Co-training [Blum & Mitchell ’98], graph-based methods [B&C01], [ZGL03]) have both widespread applications and solid foundational understanding!

  10. Semi-supervised Learning (continued). Today we discuss these methods; each exploits unlabeled data in a different and creative way.

  11. Semi-supervised learning: no querying, just lots of additional unlabeled data. This is a bit puzzling at first: unlabeled data is missing the most important information (the labels), so how can it help in substantial ways? Key insight: unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.

  12. Semi-supervised SVM [Joachims ’99]

  13. Margin-based regularity: the target goes through low-density regions of the distribution (large margin).
     • Assume we are looking for a linear separator.
     • Belief: there should exist one with large separation.
     [Figure: the same labeled points separated by an SVM using labeled data only vs. by a Transductive SVM that also respects the unlabeled data.]

  14. Transductive Support Vector Machines. Optimize for the separator with large margin w.r.t. both labeled and unlabeled data [Joachims ’99].
     Input: S_l = {(x_1, y_1), …, (x_{m_l}, y_{m_l})}, S_u = {x_1, …, x_{m_u}}.
     argmin_w ||w||^2
     s.t.:
     • y_i (w ⋅ x_i) ≥ 1, for all i ∈ {1, …, m_l}
     • y_u (w ⋅ x_u) ≥ 1, for all u ∈ {1, …, m_u}
     • y_u ∈ {−1, 1}, for all u ∈ {1, …, m_u}
     Find a labeling of the unlabeled sample and a w such that w separates both labeled and unlabeled data with maximum margin. [Figure: labeled and unlabeled points, the separator w, and the margin hyperplanes w ⋅ x = −1 and w ⋅ x = 1.]

  15. Transductive Support Vector Machines. Optimize for the separator with large margin w.r.t. both labeled and unlabeled data [Joachims ’99].
     Input: S_l = {(x_1, y_1), …, (x_{m_l}, y_{m_l})}, S_u = {x_1, …, x_{m_u}}.
     argmin_w ||w||^2 + C Σ_i ξ_i + C Σ_u ξ_u
     s.t.:
     • y_i (w ⋅ x_i) ≥ 1 − ξ_i, for all i ∈ {1, …, m_l}
     • y_u (w ⋅ x_u) ≥ 1 − ξ_u, for all u ∈ {1, …, m_u}
     • y_u ∈ {−1, 1}, for all u ∈ {1, …, m_u}
     Find a labeling of the unlabeled sample and a w such that w separates both labeled and unlabeled data with maximum (soft) margin. [Figure: the same picture as before, now with slack variables ξ allowing margin violations.]

  16. Transductive Support Vector Machines. Optimize for the separator with large margin w.r.t. both labeled and unlabeled data.
     Input: S_l = {(x_1, y_1), …, (x_{m_l}, y_{m_l})}, S_u = {x_1, …, x_{m_u}}.
     argmin_w ||w||^2 + C Σ_i ξ_i + C Σ_u ξ_u
     s.t.:
     • y_i (w ⋅ x_i) ≥ 1 − ξ_i, for all i ∈ {1, …, m_l}
     • y_u (w ⋅ x_u) ≥ 1 − ξ_u, for all u ∈ {1, …, m_u}
     • y_u ∈ {−1, 1}, for all u ∈ {1, …, m_u}
     NP-hard… The problem is convex only after you have guessed the labels of the unlabeled points, and there are too many possible guesses.
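To see why guessing is the hard part, here is a minimal sketch (assuming scikit-learn is available; this is an illustration of the exponential blow-up, not Joachims' algorithm) that solves the problem exactly for a tiny unlabeled set by enumerating every labeling and solving an ordinary convex soft-margin SVM for each guess:

```python
from itertools import product

import numpy as np
from sklearn.svm import SVC

def brute_force_tsvm(X_l, y_l, X_u, C=10.0):
    """Enumerate all 2^{m_u} labelings of the unlabeled points; each guess
    turns the problem into an ordinary (convex) soft-margin SVM. Keep the
    guess with the smallest objective ||w||^2 + C*(sum of slacks)."""
    X_all = np.vstack([X_l, X_u])
    best = None
    for guess in product([-1, 1], repeat=len(X_u)):
        y_all = np.concatenate([y_l, guess])
        if len(np.unique(y_all)) < 2:          # SVC needs both classes present
            continue
        clf = SVC(kernel="linear", C=C).fit(X_all, y_all)
        w, b = clf.coef_[0], clf.intercept_[0]
        slacks = np.maximum(0, 1 - y_all * (X_all @ w + b))
        obj = w @ w + C * slacks.sum()
        if best is None or obj < best[0]:
            best = (obj, guess, clf)
    return best

# Toy run: 4 labeled points, 3 unlabeled points -> only 2^3 = 8 guesses here,
# but the count doubles with every additional unlabeled point.
X_l = np.array([[0.0, 1.0], [0.0, 2.0], [5.0, 1.0], [5.0, 2.0]])
y_l = np.array([-1, -1, 1, 1])
X_u = np.array([[0.5, 1.5], [4.5, 1.5], [2.5, 1.5]])
obj, labels, clf = brute_force_tsvm(X_l, y_l, X_u)
print(obj, labels)
```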

  17. Transductive Support Vector Machines. Optimize for the separator with large margin w.r.t. labeled and unlabeled data. High-level idea of Joachims’ heuristic (see the sketch below):
     • First maximize the margin over the labeled points.
     • Use the resulting separator to give initial labels to the unlabeled points.
     • Try flipping labels of unlabeled points to see if doing so can increase the margin; keep going until no more improvements.
     This finds a locally optimal solution.
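A minimal sketch of that local-search idea, assuming scikit-learn's SVC as the base SVM; the greedy single-label flip rule below is a simplification for illustration, not Joachims' exact SVM-light procedure.

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_flip_heuristic(X_l, y_l, X_u, C=1.0, max_passes=20):
    """Greedy local search over labelings of the unlabeled points."""
    X_all = np.vstack([X_l, X_u])

    def fit_and_objective(y_u):
        y_all = np.concatenate([y_l, y_u])
        clf = SVC(kernel="linear", C=C).fit(X_all, y_all)
        w, b = clf.coef_[0], clf.intercept_[0]
        slacks = np.maximum(0, 1 - y_all * (X_all @ w + b))
        return clf, w @ w + C * slacks.sum()

    # Step 1: maximize the margin over the labeled points only.
    init = SVC(kernel="linear", C=C).fit(X_l, y_l)
    # Step 2: use that separator to give initial labels to the unlabeled points.
    y_u = init.predict(X_u)
    clf, obj = fit_and_objective(y_u)

    # Step 3: flip labels of unlabeled points whenever doing so improves the
    # soft-margin objective; stop when no single flip helps any more.
    for _ in range(max_passes):
        improved = False
        for j in range(len(y_u)):
            y_try = y_u.copy()
            y_try[j] = -y_try[j]
            if len(np.unique(np.concatenate([y_l, y_try]))) < 2:
                continue                      # keep both classes present
            clf_try, obj_try = fit_and_objective(y_try)
            if obj_try < obj - 1e-9:
                y_u, clf, obj, improved = y_try, clf_try, obj_try, True
        if not improved:
            break
    return clf, y_u                           # locally optimal separator + labels
```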

  18. Experiments [Joachims99]

  19. Transductive Support Vector Machines: when does the margin belief help? [Figure: a helpful distribution, where the target is highly compatible (large margin) with the unlabeled data, vs. non-helpful distributions: one where the margin assumption is not satisfied, and one with roughly 1/γ clusters where the margin is satisfied but all partitions of the clusters are separable by a large margin, so the unlabeled data cannot single out the target.]

  20. Co-training [Blum & Mitchell ’98] Different type of underlying regularity assumption: Consistency or Agreement Between Parts

  21. Co-training: Self-consistency. Agreement between two parts: co-training [Blum & Mitchell ’98].
     • Examples contain two sufficient sets of features, x = ⟨x_1, x_2⟩.
     • Belief: the parts are consistent, i.e. ∃ c_1, c_2 s.t. c_1(x_1) = c_2(x_2) = c*(x).
     For example, classifying web pages x = ⟨x_1, x_2⟩ as faculty member homepage or not: x_1 is the text info on the page, x_2 is the link info, and x combines both. [Figure: a page containing “Prof. Avrim Blum” and a hyperlink pointing to it anchored “My Advisor”.]

  22. Iterative Co-Training. Idea: use the small labeled sample to learn initial rules.
     • E.g., “my advisor” pointing to a page is a good indicator that it is a faculty home page.
     • E.g., “I am teaching” on a page is a good indicator that it is a faculty home page.
     Idea: use unlabeled data to propagate the learned information.

  23. Iterative Co-Training (continued). Use unlabeled data to propagate learned information: look for unlabeled examples ⟨x_1, x_2⟩ where one rule is confident and the other is not, and have the confident rule label the example for the other. We are training two classifiers, one on each type of info, using each to help train the other.

  24. Iterative Co-Training. Works by using unlabeled data to propagate learned information (a code sketch follows below).
     • Have learning algorithms A_1, A_2, one for each of the two views X_1, X_2.
     • Use the labeled data to learn two initial hypotheses h_1, h_2.
     • Repeat: look through the unlabeled data for examples where one of the h_i is confident but the other is not; have the confident h_i label the example for algorithm A_{3−i}.
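A minimal sketch of this loop, assuming two scikit-learn-style base learners with fit/predict_proba; the confidence threshold and fixed round count are illustrative choices (the original Blum-Mitchell procedure instead adds a fixed number of most-confident positives and negatives per round).

```python
import numpy as np

def co_train(A1, A2, X1, X2, y, labeled_idx, rounds=10, threshold=0.95):
    """Two views X1, X2 of the same examples; y holds labels but only the
    entries in labeled_idx are treated as known. Each round, a hypothesis
    that is confident on an unlabeled example (while the other is not)
    labels that example for the other view's learner."""
    labels = {i: y[i] for i in labeled_idx}
    unlabeled = [i for i in range(len(X1)) if i not in labels]

    for _ in range(rounds):
        idx = sorted(labels)
        y_lab = np.array([labels[i] for i in idx])
        h1 = A1.fit(X1[idx], y_lab)            # learner on view 1
        h2 = A2.fit(X2[idx], y_lab)            # learner on view 2
        if not unlabeled:
            break

        newly = set()
        for h, X_own, h_other, X_other in [(h1, X1, h2, X2), (h2, X2, h1, X1)]:
            proba = h.predict_proba(X_own[unlabeled])
            own_conf = proba.max(axis=1)
            other_conf = h_other.predict_proba(X_other[unlabeled]).max(axis=1)
            for j, i in enumerate(unlabeled):
                # one view confident, the other not: label it for the other
                if own_conf[j] >= threshold and other_conf[j] < threshold:
                    labels[i] = h.classes_[proba[j].argmax()]
                    newly.add(i)
        if not newly:
            break
        unlabeled = [i for i in unlabeled if i not in labels]
    return h1, h2
```

For the webpage task, A1 might be a naive Bayes model over page words and A2 a naive Bayes model over the anchor text of incoming links.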

  25. Original Application: Webpage classification 12 labeled examples, 1000 unlabeled (sample run)

  26. Iterative Co-Training, a simple example: learning intervals. The targets c_1 and c_2 are intervals, one per view. Use the labeled data to learn initial intervals h_1^1 and h_2^1 (tight around the labeled positives), then use the unlabeled data to bootstrap: each view’s confident positives expand the other view’s interval, giving h_1^2 and h_2^2, and so on. [Figure: labeled and unlabeled examples on the two views, with the learned intervals growing toward c_1 and c_2.]
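A small simulation of this picture, with made-up target intervals and a made-up distribution, just to show the bootstrapping step expanding the learned intervals:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical targets: an example x = (x1, x2) is positive iff x1 lies in the
# interval c1, and (by the consistency belief) equivalently x2 lies in c2.
c1 = (2.0, 6.0)
c2 = (10.0, 14.0)

def sample_unlabeled(n):
    """Positives fall inside both target intervals, negatives outside both."""
    pos = rng.random(n) < 0.5
    x1 = np.where(pos, rng.uniform(c1[0], c1[1], n), rng.uniform(6.5, 9.0, n))
    x2 = np.where(pos, rng.uniform(c2[0], c2[1], n), rng.uniform(14.5, 17.0, n))
    return x1, x2

# Tiny labeled sample (two positives, two negatives) and lots of unlabeled data.
x1_l = np.array([3.0, 3.5, 7.0, 8.0])
x2_l = np.array([11.0, 11.5, 15.0, 16.0])
y_l = np.array([1, 1, -1, -1])
x1_u, x2_u = sample_unlabeled(500)

# h_1^1, h_2^1: tightest intervals around the labeled positives in each view.
h1 = [x1_l[y_l == 1].min(), x1_l[y_l == 1].max()]
h2 = [x2_l[y_l == 1].min(), x2_l[y_l == 1].max()]

# Bootstrap: a point confidently positive under one view's interval is treated
# as positive for the other view too, so both intervals expand.
for _ in range(10):
    conf_pos = ((x1_u >= h1[0]) & (x1_u <= h1[1])) | \
               ((x2_u >= h2[0]) & (x2_u <= h2[1]))
    if conf_pos.any():
        h1 = [min(h1[0], x1_u[conf_pos].min()), max(h1[1], x1_u[conf_pos].max())]
        h2 = [min(h2[0], x2_u[conf_pos].min()), max(h2[1], x2_u[conf_pos].max())]

print(h1, h2)  # both intervals grow toward c1 = (2, 6) and c2 = (10, 14)
```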

  27. Expansion, Examples: Learning Intervals. Consistency: zero probability mass in the regions where c_1 and c_2 disagree. [Figure: an expanding (helpful) distribution D^+ over the positive region vs. a non-expanding (non-helpful) one; expansion is what lets the confident sets S_1 and S_2 of the two views keep growing.]
