  1. Announcements - Homework • Homework 1 is graded, please collect at the end of lecture • Homework 2 due today • Homework 3 out soon (watch email) • Question 1 – midterm review

  2. HW1 score distribution [Histogram: HW1 total score on the x-axis in bins of 10 from 0~10 up to 100~110; number of students on the y-axis from 0 to 40]

  3. Announcements - Midterm • When: Wednesday, 10/20 • Where: In class • What: You, your pencil, your textbook, your notes, course slides, your calculator, your good mood :) • What NOT: No computers, iPhones, or anything else that has an internet connection. • Material: Everything from the beginning of the semester, up to and including SVMs and the kernel trick

  4. Recitation Tomorrow! • Boosting, SVM (convex optimization), midterm review! • Strongly recommended!! • Place: NSH 3305 (Note: change from last time) • Time: 5-6 pm (Rob)

  5. Support Vector Machines • Aarti Singh • Machine Learning 10-701/15-781 • Oct 13, 2010

  6. At the Pittsburgh G-20 summit …

  7. Linear classifiers – which line is better?

  8. Pick the one with the largest margin!

  9. Parameterizing the decision boundary • w . x = Σ_j w(j) x(j) • The hyperplane w . x + b = 0 separates the region w . x + b < 0 from the region w . x + b > 0 • Data: labeled examples (x_i, y_i), i = 1, 2, …, n, with y_i ∈ {-1, +1}

  10. Parameterizing the decision boundary • Points with w . x + b > 0 fall on one side of the hyperplane, points with w . x + b < 0 on the other

  11. Maximizing the margin • Regions w . x + b > 0 and w . x + b < 0 • Margin γ = distance of the closest examples from the line/hyperplane = 2a/‖w‖

  12. Maximizing the margin • Margin γ = distance of the closest examples from the line/hyperplane = 2a/‖w‖ • max_{w,b} γ = 2a/‖w‖ s.t. (w . x_j + b) y_j ≥ a ∀ j • Note: ‘a’ is arbitrary (can normalize the equations by a); a short derivation of the 2a/‖w‖ expression is sketched below
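
A brief sketch of where the 2a/‖w‖ margin expression comes from (a standard derivation added here for completeness, not text from the original slide):

```latex
% Distance of a point x_0 to the hyperplane w \cdot x + b = 0:
\[
  d(x_0) \;=\; \frac{|\,w \cdot x_0 + b\,|}{\|w\|}
\]
% If the closest positive and negative examples sit on the planes
% w \cdot x + b = a and w \cdot x + b = -a, each is at distance a/\|w\|
% from the decision boundary, so the total gap between the classes is
\[
  \gamma \;=\; \frac{a}{\|w\|} + \frac{a}{\|w\|} \;=\; \frac{2a}{\|w\|}.
\]
```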

  13. Support Vector Machines • Setting a = 1: min_{w,b} w . w s.t. (w . x_j + b) y_j ≥ 1 ∀ j • Solve efficiently by quadratic programming (QP): well-studied solution algorithms (see the QP sketch below) • Linear hyperplane defined by “support vectors”
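
As a concrete (hypothetical) instance of this QP, the sketch below solves the hard-margin problem with the cvxpy modeling library on a tiny made-up dataset; the data X, y and all names are illustrative assumptions, not course material:

```python
import cvxpy as cp
import numpy as np

# Hypothetical, linearly separable toy data: rows of X are examples, y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Hard-margin SVM: minimize w . w subject to (w . x_j + b) y_j >= 1 for all j
objective = cp.Minimize(cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w.value))
```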

  14. Support Vectors • Linear hyperplane defined by “support vectors” • Moving other points a little doesn’t affect the decision boundary • Only need to store the support vectors to predict labels of new points (see the example below) • How many support vectors in the linearly separable case? ≤ m+1
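
To see the “defined by support vectors” point in code, a linear SVM from scikit-learn (used here only for illustration; not referenced in the slides) exposes exactly those points. A very large C approximates the hard-margin case; the data and names are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_, " b =", clf.intercept_)
# Nudging a non-support point slightly leaves w, b (and hence the boundary) unchanged,
# and only the support vectors are needed to predict labels of new points.
```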

  15. What if data is not linearly separable? • Use features of features of features of features…: x_1^2, x_2^2, x_1 x_2, …, exp(x_1) • But run the risk of overfitting! (a feature-expansion sketch follows)
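
A hedged illustration of the feature-expansion idea using scikit-learn’s polynomial features (the exp(x_1) feature from the slide is appended by hand); all names and data are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [0.5, -1.0]])  # hypothetical 2-D inputs (x1, x2)

# Degree-2 expansion produces: 1, x1, x2, x1^2, x1*x2, x2^2
Phi = PolynomialFeatures(degree=2).fit_transform(X)
print(Phi)

# Hand-crafted extras such as exp(x1) can be appended explicitly
Phi_plus = np.hstack([Phi, np.exp(X[:, [0]])])
print(Phi_plus.shape)  # more features => more expressive, but higher risk of overfitting
```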

  16. What if data is still not linearly separable? • Allow “error” in classification: min_{w,b} w . w + C (#mistakes) s.t. (w . x_j + b) y_j ≥ 1 ∀ j • Maximize the margin and minimize the # of mistakes on training data • C: tradeoff parameter • Not a QP • 0/1 loss doesn’t distinguish between a near miss and a bad mistake

  17. What if data is still not linearly separable? • Allow “error” in classification: min_{w,b} w . w + C Σ_j ξ_j s.t. (w . x_j + b) y_j ≥ 1 - ξ_j ∀ j, ξ_j ≥ 0 ∀ j • ξ_j: “slack” variables (> 1 if x_j is misclassified); pay a linear penalty for each mistake • C: tradeoff parameter (chosen by cross-validation) • Soft margin approach • Still a QP (a sketch of this QP appears below)
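
Continuing the earlier cvxpy sketch (again with hypothetical data and names, not from the slides), the soft-margin QP with slack variables can be written as:

```python
import cvxpy as cp
import numpy as np

# Toy data with one deliberately mislabeled point so that some slack is needed
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
n, d = X.shape
C = 1.0  # tradeoff parameter (in practice chosen by cross-validation)

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n, nonneg=True)  # slack variables, xi_j >= 0

# min w . w + C * sum_j xi_j   s.t.   (w . x_j + b) y_j >= 1 - xi_j for all j
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("slacks =", np.round(xi.value, 3))  # > 1 exactly where a point is misclassified
```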

  18. Slack variables – Hinge loss • Complexity penalization: min_{w,b} w . w + C Σ_j ξ_j s.t. (w . x_j + b) y_j ≥ 1 - ξ_j ∀ j, ξ_j ≥ 0 ∀ j • [Plot: hinge loss vs. 0-1 loss as a function of (w . x + b) y]

  19. SVM vs. Logistic Regression • SVM: hinge loss • Logistic regression: log loss (negative log conditional likelihood) • [Plot: log loss, hinge loss, and 0-1 loss as a function of (w . x + b) y] (a numeric comparison follows)
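
A small numeric comparison of the three losses as a function of the margin value z = (w . x + b) y (an illustrative sketch, not from the slides):

```python
import numpy as np

z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])  # z = (w . x + b) * y

zero_one = (z <= 0).astype(float)     # 0-1 loss: 1 if misclassified, else 0
hinge = np.maximum(0.0, 1.0 - z)      # hinge loss (SVM)
log_loss = np.log1p(np.exp(-z))       # log loss (logistic regression, natural log)

for zi, l0, lh, ll in zip(z, zero_one, hinge, log_loss):
    print(f"z={zi:+.1f}  0-1={l0:.0f}  hinge={lh:.2f}  log={ll:.2f}")
```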

  20. What about multiple classes?

  21. One against all • Learn 3 classifiers separately: class k vs. the rest, (w_k, b_k) for k = 1, 2, 3 • y = arg max_k w_k . x + b_k • But the w_k’s may not be on the same scale. Note: (a w) . x + (a b) is also a solution (see the sketch below)
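
A minimal one-against-all sketch with linear SVMs from scikit-learn (hypothetical 3-class data and names, only to make the arg max concrete):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 3-class toy data
X = np.array([[0, 2], [1, 2], [4, 0], [5, 1], [-3, -2], [-4, -1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
classes = np.unique(y)

# One binary classifier per class: class k vs. the rest
clfs = [LinearSVC(C=1.0).fit(X, (y == k).astype(int)) for k in classes]

# Predict by arg max_k (w_k . x + b_k); note the scores of separately trained
# classifiers need not share a common scale
scores = np.column_stack([clf.decision_function(X) for clf in clfs])
y_hat = classes[np.argmax(scores, axis=1)]
print(y_hat)
```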

  22. Learn 1 classifier: Multi-class SVM • Simultaneously learn 3 sets of weights • Margin: gap between the correct class and the nearest other class • y = arg max_k w^(k) . x + b^(k)

  23. Learn 1 classifier: Multi-class SVM • Simultaneously learn 3 sets of weights • y = arg max_k w^(k) . x + b^(k) • Joint optimization: the w^(k)’s have the same scale (a joint-QP sketch follows)
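
One way to make the joint optimization concrete is a Crammer-Singer-style multiclass QP written directly in cvxpy: all class weights appear in a single problem, and the correct class must beat every other class by a margin of at least 1 - ξ_j. The data and names are hypothetical; this is a sketch of the idea, not the slide’s exact formulation:

```python
import cvxpy as cp
import numpy as np

# Hypothetical 3-class toy data
X = np.array([[0, 2], [1, 2], [4, 0], [5, 1], [-3, -2], [-4, -1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
n, d = X.shape
K, C = 3, 1.0

W = cp.Variable((K, d))           # one weight vector per class
b = cp.Variable(K)
xi = cp.Variable(n, nonneg=True)  # one slack per example

# Correct-class score must exceed every other class's score by >= 1 - xi_j
constraints = []
for j in range(n):
    for k in range(K):
        if k != y[j]:
            constraints.append(W[y[j]] @ X[j] + b[y[j]] >= W[k] @ X[j] + b[k] + 1 - xi[j])

cp.Problem(cp.Minimize(cp.sum_squares(W) + C * cp.sum(xi)), constraints).solve()

scores = X @ W.value.T + b.value
print(np.argmax(scores, axis=1))  # y_hat = arg max_k w^(k) . x + b^(k), now on a shared scale
```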

  24. What you need to know • Maximizing the margin • Derivation of the SVM formulation • Slack variables and hinge loss • Relationship between SVMs and logistic regression – 0/1 loss – Hinge loss – Log loss • Tackling multiple classes – One against all – Multiclass SVMs

  25. SVMs reminder • Regularization + hinge loss: min_{w,b} w . w + C Σ_j ξ_j s.t. (w . x_j + b) y_j ≥ 1 - ξ_j ∀ j, ξ_j ≥ 0 ∀ j • Soft margin approach

  26. Today’s Lecture • Learn one of the most interesting and exciting recent advances in machine learning – The “kernel trick” – High-dimensional feature spaces at no extra cost! • But first, a detour – Constrained optimization!

  27. Constrained Optimization

  28. Lagrange Multiplier – Dual Variables • Moving the constraint to the objective function • Lagrangian: [equation not recovered from the slide] • Solve for the stationary point: [equation not recovered] • Constraint is tight when α > 0 (an illustrative example follows)
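
The Lagrangian on this slide did not survive extraction, so here is a standard one-dimensional illustration of the same idea (min x^2 subject to x ≥ b), consistent with the b-positive / b-negative case analysis two slides later; it is an added example, not the slide’s own equations:

```latex
% Primal: minimize x^2 subject to x >= b
\[
  \min_{x} \; x^2 \quad \text{s.t.} \quad x \ge b
\]
% Move the constraint into the objective with a multiplier \alpha >= 0:
\[
  L(x, \alpha) \;=\; x^2 - \alpha\,(x - b), \qquad \alpha \ge 0
\]
% Stationarity: \partial L / \partial x = 2x - \alpha = 0, so x = \alpha / 2.
% If b <= 0 the constraint is inactive: \alpha = 0 and x = 0.
% If b > 0 the constraint is tight (\alpha > 0): x = b and \alpha = 2b.
```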

  29. Duality • Primal problem: [equation not recovered] • Dual problem: [equation not recovered] • Weak duality: holds for all feasible points • Strong duality: holds under the KKT conditions (a generic statement of both follows)
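
A generic statement of the primal/dual relationship referenced on this slide (standard form, added for readability):

```latex
% Primal problem:
\[
  p^* \;=\; \min_{x} \; f(x) \quad \text{s.t.} \quad g(x) \le 0
\]
% Dual problem, with Lagrangian L(x, \alpha) = f(x) + \alpha\, g(x):
\[
  d^* \;=\; \max_{\alpha \ge 0} \; \min_{x} \; L(x, \alpha)
\]
% Weak duality: d^* <= p^* for all feasible points.
% Strong duality: d^* = p^* (holds under the KKT conditions).
```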

  30. Lagrange Multiplier – Dual Variables • Case analysis: b positive vs. b negative • Solving: when α > 0, the constraint is tight
