
Active Learning for Supervised Classification - Maria-Florina Balcan



  1. Active Learning for Supervised Classification. Maria-Florina Balcan, Carnegie Mellon University.

  2. Active Learning of Linear Separators. Maria-Florina Balcan, Carnegie Mellon University.

  3. Two Minute Version. Modern applications involve massive amounts of raw data (protein sequences, billions of webpages, images), but only a tiny fraction can be annotated by human experts. Active learning: utilize the data while minimizing expert intervention.

  4. Two Minute Version. Active learning: a technique for best utilizing data while minimizing the need for human intervention. This talk: active learning for classification, with label-efficient, noise-tolerant, polynomial-time algorithms for learning linear separators [Balcan-Long COLT'13], [Awasthi-Balcan-Long STOC'14], [Awasthi-Balcan-Haghtalab-Urner COLT'15].
  • Much better noise tolerance than previously known for classic passive learning via poly-time algorithms [KKMS'05], [KLS'09].
  • Solve an adaptive sequence of convex optimization problems on smaller and smaller bands around the current guess for the target.
  • Exploit structural properties of log-concave distributions.

  5. Passive and Active Learning

  6. Supervised Learning.
  • E.g., classify which emails are spam and which are important.
  • E.g., classify objects as chairs vs. non-chairs.

  7. Statistical / PAC learning model.
  • Data source: distribution D on X; an expert/oracle labels examples with the target concept c*: X → {0,1}.
  • The learning algorithm sees labeled examples (x_1, c*(x_1)), ..., (x_m, c*(x_m)), with the x_i drawn i.i.d. from D.
  • It optimizes over the labeled sample S and outputs a hypothesis h: X → {0,1}, h ∈ C.
  • Goal: h has small error, err(h) = Pr_{x ~ D}[h(x) ≠ c*(x)].
  • If c* ∈ C: realizable case; otherwise: agnostic case.
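
  As a concrete illustration of the error definition above (not from the slides), here is a small Monte Carlo estimate of err(h) for a hypothetical 1D threshold target; the target c*, the hypothesis h, and the choice of D are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def c_star(x):                 # hypothetical target concept (assumed for illustration)
        return (x >= 0.30).astype(int)

    def h(x):                      # learner's hypothesis
        return (x >= 0.35).astype(int)

    # err(h) = Pr_{x ~ D}[h(x) != c*(x)], estimated on a fresh i.i.d. sample from D.
    x = rng.uniform(0.0, 1.0, size=100_000)      # D = uniform on [0, 1] (illustrative)
    print("estimated err(h):", np.mean(h(x) != c_star(x)))   # about 0.05 = Pr[0.30 <= x < 0.35]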

  8. Two Main Aspects in Classic Machine Learning.
  • Algorithm design: how to optimize? Automatically generate rules that do well on observed data, e.g., Boosting, SVM, etc.
  • Generalization guarantees / sample complexity: confidence that the rule is effective on future data. O((1/ε)(VCdim(C) log(1/ε) + log(1/δ))) labeled examples suffice; for C = linear separators in R^d this is O((1/ε)(d log(1/ε) + log(1/δ))).

  9. Two Main Aspects in Classic Machine Learning.
  • Algorithm design: how to optimize? Automatically generate rules that do well on observed data. Running time: poly(d, 1/ε, 1/δ).
  • Generalization guarantees / sample complexity: confidence that the rule is effective on future data. O((1/ε)(VCdim(C) log(1/ε) + log(1/δ))) labeled examples suffice; for C = linear separators in R^d this is O((1/ε)(d log(1/ε) + log(1/δ))).
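
  As a rough numerical illustration of the passive sample-complexity bound above (a sketch with all constants omitted, so only the scaling is meaningful):

    import math

    def passive_label_bound(d, eps, delta):
        """Order-of-magnitude passive bound for linear separators in R^d:
        (1/eps) * (d * log(1/eps) + log(1/delta)); constants omitted."""
        return (1.0 / eps) * (d * math.log(1.0 / eps) + math.log(1.0 / delta))

    for eps in (0.1, 0.01, 0.001):
        print(eps, round(passive_label_bound(d=100, eps=eps, delta=0.05)))
    # The 1/eps factor dominates: shrinking eps by 10x costs roughly 10x more labels,
    # which is exactly the cost active learning tries to avoid.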

  10. Modern ML: New Learning Approaches. Modern applications involve massive amounts of raw data (protein sequences, billions of webpages, images); only a tiny fraction can be annotated by human experts.

  11. Active Learning. The learner draws unlabeled examples from the data source and repeatedly requests the label of a chosen example from the expert, receiving a label for that example each time; at the end the algorithm outputs a classifier.
  • The learner can choose which specific examples get labeled.
  • Goal: use fewer labeled examples by picking informative examples to label.
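
  A minimal pool-based sketch of this interaction loop, assuming a query_label oracle standing in for the expert and a simple uncertainty-based selection rule; the classifier, names, and parameters are illustrative and are not the algorithms analyzed later in the talk.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def active_learning_loop(X_pool, query_label, budget, seed_size=5, seed=0):
        """Generic pool-based active learning: repeatedly query the label of the
        unlabeled example the current model is least certain about.
        X_pool: 2D numpy array; assumes the random seed sample contains both classes."""
        rng = np.random.default_rng(seed)
        labeled = {int(i): query_label(X_pool[i])
                   for i in rng.choice(len(X_pool), size=seed_size, replace=False)}
        clf = LogisticRegression()
        for _ in range(budget):
            idx = list(labeled)
            clf.fit(X_pool[idx], [labeled[i] for i in idx])
            probs = clf.predict_proba(X_pool)[:, 1]
            unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
            # least-confident point = the one closest to the decision boundary
            i_star = min(unlabeled, key=lambda i: abs(probs[i] - 0.5))
            labeled[i_star] = query_label(X_pool[i_star])    # ask the expert
        idx = list(labeled)
        clf.fit(X_pool[idx], [labeled[i] for i in idx])      # final fit on all queried labels
        return clf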

  12. Active Learning in Practice.
  • Text classification: active SVM (Tong & Koller, ICML 2000); e.g., request the label of the example closest to the current separator.
  • Video segmentation (Fathi-Balcan-Ren-Rehg, BMVC 2011).

  13. Can adaptive querying help? [CAL'92, Dasgupta'04]
  • Threshold functions on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ R}.
  • Active algorithm: get N = O(1/ε) unlabeled examples. How can we recover the correct labels with ≪ N queries?
  • Do binary search! Just O(log N) label queries are needed; output a classifier consistent with the N inferred labels.
  • Passive supervised learning needs Ω(1/ε) labels to find an ε-accurate threshold; active learning needs only O(log(1/ε)) labels. Exponential improvement.
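
  A small runnable sketch of the binary-search idea for 1D thresholds, assuming access to a label oracle; the helper names are illustrative.

    import numpy as np

    def binary_search_threshold(xs, query_label):
        """Active learning of a 1D threshold h_w(x) = 1(x >= w).
        On the sorted unlabeled pool the labels look like 0...0 1...1, so binary
        search over positions finds the boundary with O(log N) label queries."""
        xs = np.sort(xs)
        lo, hi = 0, len(xs)            # invariant: the first positive index lies in [lo, hi]
        queries = 0
        while lo < hi:
            mid = (lo + hi) // 2
            queries += 1
            if query_label(xs[mid]) == 1:
                hi = mid                # first positive is at or before mid
            else:
                lo = mid + 1
        w_hat = xs[lo] if lo < len(xs) else np.inf   # all-negative pool -> +inf
        return w_hat, queries

    # Usage: N = O(1/eps) unlabeled draws, about log2(N) label queries.
    rng = np.random.default_rng(0)
    xs = rng.uniform(0, 1, size=1000)
    w_hat, q = binary_search_threshold(xs, lambda x: int(x >= 0.42))
    print(w_hat, q)   # threshold near 0.42 recovered with roughly 10 label queries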

  14. Active learning, provable guarantees. Lots of exciting results on sample complexity. E.g., "disagreement-based" algorithms: pick a few points at random from the current region of disagreement (uncertainty), query their labels, and throw out a hypothesis once you are statistically confident it is suboptimal. [Balcan-Beygelzimer-Langford'06, Hanneke'07, Dasgupta-Hsu-Monteleoni'07, Wang'09, Fridman'09, Koltchinskii'10, BHW'08, Beygelzimer-Hsu-Langford-Zhang'10, Hsu'10, Ailon'12, ...]
  • Generic (works for any class), handles adversarial label noise.
  • But suboptimal in label complexity and computationally prohibitive.
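
  A toy sketch of the disagreement-based idea in the realizable case, using an explicit finite pool of candidate classifiers to stand in for the version space (the cited algorithms maintain it implicitly and use confidence bounds to handle noise); everything here is illustrative.

    import numpy as np

    def cal_active_learner(hypotheses, X_stream, query_label):
        """Disagreement-based (CAL-style) active learning, realizable toy version.
        hypotheses: list of candidate classifiers h(x) -> {0, 1} (the surviving set).
        Query a point only if surviving hypotheses disagree on it; prune the inconsistent ones."""
        surviving = list(hypotheses)
        num_queries = 0
        for x in X_stream:
            preds = {h(x) for h in surviving}
            if len(preds) > 1:                      # x lies in the region of disagreement
                y = query_label(x)
                num_queries += 1
                surviving = [h for h in surviving if h(x) == y]
            # else: the label of x is inferred for free, nothing to prune
        return surviving, num_queries

    # Usage: thresholds on [0, 1] as the hypothesis class, hidden target threshold 0.42.
    grid = np.linspace(0, 1, 201)
    hypotheses = [lambda x, w=w: int(x >= w) for w in grid]
    X_stream = np.random.default_rng(0).uniform(0, 1, size=500)
    surviving, q = cal_active_learner(hypotheses, X_stream, lambda x: int(x >= 0.42))
    print(len(surviving), q)   # few surviving thresholds, far fewer than 500 label queries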

  15. Poly-Time, Noise-Tolerant/Agnostic, Label-Optimal Active Learning Algorithms.

  16. Margin-Based Active Learning. A margin-based algorithm for learning linear separators.
  • Realizable case: exponential improvement, only O(d log(1/ε)) labels to find w with error ε when D is log-concave. [Balcan-Long COLT 2013]
  • Agnostic & malicious noise: a poly-time active learning algorithm outputs w with err(w) = O(η), where η = err(best linear separator). [Awasthi-Balcan-Long STOC 2014]
  • First poly-time active learning algorithm in noisy scenarios; first for malicious noise [Val'85] (where features are corrupted too).
  • Improves on the noise tolerance of the previous best passive-learning algorithms too! [KKMS'05], [KLS'09]

  17. Margin-Based Active Learning, Realizable Case.
  Draw m_1 unlabeled examples, label them, add them to W(1).
  Iterate k = 2, ..., s:
  • find a hypothesis w_{k-1} consistent with W(k-1);
  • set W(k) = W(k-1);
  • sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1};
  • label them and add them to W(k).
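
  A runnable toy sketch of this margin-based scheme for the realizable case, with an SVM-based consistent-fit step, a fixed gamma-halving schedule, and rejection sampling for the band; these choices are illustrative stand-ins rather than the exact procedure of [Balcan-Long COLT'13].

    import numpy as np
    from sklearn.svm import LinearSVC

    def margin_based_active(sample_unlabeled, query_label, rounds=8, m_per_round=200):
        """Margin-based active learning sketch (realizable case). Each round: fit a
        separator (near-)consistent with the working set W, then only request labels
        for new points inside a shrinking band |w . x| <= gamma."""
        X = sample_unlabeled(m_per_round)                 # round 1: label everything
        y = np.array([query_label(x) for x in X])
        W_X, W_y = X, y
        gamma, w = 1.0, None
        for k in range(2, rounds + 1):
            clf = LinearSVC(C=1e6, max_iter=5000).fit(W_X, W_y)
            w = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
            gamma /= 2                                    # illustrative schedule gamma_k ~ 2**(-k)
            band_X = []
            while len(band_X) < m_per_round:              # rejection-sample points in the band
                cand = sample_unlabeled(m_per_round)
                band_X.extend(c for c in cand if abs(np.dot(w, c)) <= gamma)
            band_X = np.array(band_X[:m_per_round])
            band_y = np.array([query_label(x) for x in band_X])
            W_X = np.vstack([W_X, band_X])
            W_y = np.concatenate([W_y, band_y])
        return w

    # Usage with a hidden target w*, D = standard Gaussian (log-concave).
    rng = np.random.default_rng(0)
    d = 10
    w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)
    w_hat = margin_based_active(lambda m: rng.normal(size=(m, d)),
                                lambda x: int(np.dot(w_star, x) >= 0))
    print(np.dot(w_hat, w_star))   # close to 1: small angle to the target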

  18. Margin-Based Active Learning, Realizable Case.
  • Log-concave distributions: the log of the density function is concave. A wide class: the uniform distribution over any convex set, Gaussians, etc.
  • Theorem: let D be log-concave in R^d. If the band widths γ_k and sample sizes m_k are chosen appropriately, then err(w_s) ≤ ε after s = O(log(1/ε)) rounds using O(d) labels per round.
  • Active learning: O(d log(1/ε)) label requests in total; passive learning: Ω(d/ε) label requests; both use only poly(d, 1/ε) unlabeled examples.
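
  To make the exponential improvement concrete, a short worked comparison (constants and lower-order log factors suppressed; the plugged-in numbers are illustrative):

    \[
    \text{passive: } \Theta\!\left(\frac{d}{\epsilon}\right) \text{ label requests,}
    \qquad
    \text{active: } O\!\left(d \log \frac{1}{\epsilon}\right) \text{ label requests.}
    \]
    \[
    \text{E.g., } d = 100,\ \epsilon = 10^{-3}:\quad
    \frac{d}{\epsilon} = 10^{5}
    \quad\text{vs.}\quad
    d \log_2 \frac{1}{\epsilon} \approx 100 \cdot 10 = 10^{3}.
    \]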

  19. Analysis: Aggressive Localization. Inductive invariant: every w consistent with W(k) has err(w) ≤ 1/2^k.

  20. Analysis: Aggressive Localization (continued). Inductive invariant: every w consistent with W(k) has err(w) ≤ 1/2^k. Consider the band {x : |w_{k-1} · x| ≤ γ_{k-1}} around the current hypothesis w_{k-1}.

  21. Analysis: Aggressive Localization (continued). Inductive invariant: every w consistent with W(k) has err(w) ≤ 1/2^k. Outside the band {x : |w_{k-1} · x| ≤ γ_{k-1}}, such a w disagrees with w* on a region of probability mass at most 1/2^{k+1}.

  22. Analysis: Aggressive Localization (continued). Inductive invariant: every w consistent with W(k) has err(w) ≤ 1/2^k. The disagreement with w* outside the band contributes at most 1/2^{k+1}; it is enough to ensure that the error of w conditioned on the band is a small constant, which needs only O(d) labels in round k, independent of ε. Key point: localize aggressively, while maintaining correctness.
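
  The inductive step can be summarized by the following error decomposition, a sketch consistent with the bounds quoted on these slides (constants in the choice of γ_{k-1} are suppressed):

    \[
    \operatorname{err}(w) \;\le\;
    \Pr_{x \sim D}\!\big[\operatorname{sign}(w \cdot x) \neq \operatorname{sign}(w^{*} \cdot x),\; |w_{k-1} \cdot x| > \gamma_{k-1}\big]
    \;+\;
    \Pr_{x \sim D}\!\big[|w_{k-1} \cdot x| \le \gamma_{k-1}\big] \cdot
    \Pr\!\big[\operatorname{sign}(w \cdot x) \neq \operatorname{sign}(w^{*} \cdot x) \;\big|\; |w_{k-1} \cdot x| \le \gamma_{k-1}\big].
    \]
    For log-concave D and \(\gamma_{k-1} \propto 2^{-k}\), the first term is at most \(2^{-(k+1)}\) because w and w* both make a small angle with w_{k-1}; the band mass is \(O(2^{-k})\) and the conditional error inside the band is kept below a constant by the m_k freshly labeled band points, so the second term is also at most \(2^{-(k+1)}\), giving \(\operatorname{err}(w) \le 2^{-k}\).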

  23. Margin-Based Active Learning, Agnostic Case.
  Draw m_1 unlabeled examples, label them, add them to W.
  Iterate k = 2, ..., s:
  • find w_k in the ball B(w_{k-1}, r_{k-1}) of small τ_{k-1}-hinge loss w.r.t. W (localization in concept space);
  • clear the working set W;
  • sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1}, label them, and add them to W (localization in instance space).
  Analysis, key idea:
  • Pick τ_k ≈ γ_k.
  • Localization and a variance analysis control the gap between the hinge loss and the 0/1 loss (it is only a constant factor).
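
  A sketch of the per-round optimization in the agnostic case: minimize the τ-rescaled hinge loss over the labeled band sample subject to staying in a ball around the previous iterate. The projected-subgradient solver and the parameter schedules in the comments are illustrative, not the exact choices of [Awasthi-Balcan-Long STOC'14].

    import numpy as np

    def agnostic_round(w_prev, X_band, y_band, tau, r, steps=500, lr=0.05):
        """One round of the agnostic margin-based scheme (sketch).
        Minimize the tau-rescaled hinge loss mean(max(0, 1 - y*(w.x)/tau)) over the
        labeled band sample, constrained to the ball ||w - w_prev|| <= r
        (localization in concept space), via projected subgradient descent.
        y_band must be in {-1, +1}; only the direction of the returned w matters."""
        w = w_prev.copy()
        n = len(y_band)
        for _ in range(steps):
            margins = y_band * (X_band @ w) / tau
            active = margins < 1.0                        # points contributing hinge loss
            grad = -(y_band[active, None] * X_band[active]).sum(axis=0) / (tau * n)
            w = w - lr * grad
            diff = w - w_prev
            norm = np.linalg.norm(diff)
            if norm > r:                                   # project back onto B(w_prev, r)
                w = w_prev + diff * (r / norm)
        return w

    # Illustrative per-round parameter choices (all shrink geometrically):
    # gamma_k ~ 2**(-k), tau_k ~ gamma_k, r_k ~ 2**(-k); X_band holds points with
    # |w_prev . x| <= gamma_{k-1}, labeled by the (noisy) oracle.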

  24. Improves over Passive Learning too!
  Noise model | Prior work (passive) | Our work
  Malicious | err(w) = O(η^{1/3} log^{2/3}(d/η)) [KLS'09] | err(w) = O(η log^2(1/η)) [Awasthi-Balcan-Long'14]
  Agnostic | err(w) = O(η^{1/3} log^{1/3}(1/η)) [KLS'09] | err(w) = O(η log^2(1/η)) [Awasthi-Balcan-Long'14]
  Bounded noise (|P(Y=1|x) − P(Y=−1|x)| bounded below by a constant) | NA | err(w) = η + ε [Awasthi-Balcan-Haghtalab-Urner'15]
  Active learning | NA | same guarantees as above (agnostic/malicious/bounded) [Awasthi-Balcan-Long'14]

  25. Improves over Passive Learning too! Slightly better results for the uniform distribution case:
  Noise model | Prior work (passive) | Our work
  Malicious | err(w) = O(η d^{1/4}) [KKMS'05], err(w) = O(η log(d/η)) [KLS'09] | err(w) = O(η), information-theoretically optimal [Awasthi-Balcan-Long'14]
  Agnostic | err(w) = O(η log(1/η)) [KKMS'05] | err(w) = O(η) [Awasthi-Balcan-Long'14]
  Bounded noise (|P(Y=1|x) − P(Y=−1|x)| bounded below by a constant) | NA | err(w) = η + ε [Awasthi-Balcan-Haghtalab-Urner'15]
  Active learning | NA | same guarantees as above (agnostic/malicious/bounded), information-theoretically optimal [Awasthi-Balcan-Long'14]

  26. Localization is both an algorithmic tool and an analysis tool! Useful for active and passive learning!

  27. Discussion, Open Directions.
  • Active learning is an important modern learning paradigm.
  • First poly-time, label-efficient active learning algorithms for agnostic learning in high-dimensional cases.
  • They also lead to much better noise-tolerant algorithms for passive learning of linear separators!
  Open directions:
  • More general distributions, other concept spaces.
  • Exploit localization insights in other settings (e.g., online convex optimization with adversarial noise).
