Introduction to Machine Learning
5. Support Vector Classification

  1. Introduction to Machine Learning, 5. Support Vector Classification. Alex Smola, Carnegie Mellon University (10-701). http://alex.smola.org/teaching/cmu2013-10-701

  2. Outline
      • Support vector classification: large margin separation, optimization problem
      • Properties: support vectors, kernel expansion
      • Soft margin classifier: dual problem, robustness

  3. Support Vector Machines http://maktoons.blogspot.com/2009/03/support-vector-machine.html

  4.-10. Linear Separator (figure sequence: separating Ham from Spam with a linear boundary)

  11. Large Margin Classifier. Linear function $f(x) = \langle w, x \rangle + b$; classify by requiring $\langle w, x \rangle + b \geq 1$ on one side and $\langle w, x \rangle + b \leq -1$ on the other.

  12. Large Margin Classifier. The margin between the hyperplanes $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$ is $\frac{\langle x_+ - x_-, w \rangle}{\|w\|} = \frac{1}{\|w\|}\left[ [\langle x_+, w \rangle + b] - [\langle x_-, w \rangle + b] \right] = \frac{2}{\|w\|}$.

  13. Large Margin Classifier. Optimization problem: $\max_{w,b} \frac{1}{\|w\|}$ subject to $y_i [\langle x_i, w \rangle + b] \geq 1$.

  14. Large Margin Classifier. Equivalent convex formulation: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i [\langle x_i, w \rangle + b] \geq 1$.
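
A quick numerical illustration (not from the slides; the weight vector, offset, and test point below are made up): evaluating the linear function and the margin width $2/\|w\|$.

```python
import numpy as np

# Hypothetical separator for illustration only; w and b are assumptions, not slide content.
w = np.array([3.0, 4.0])   # weight vector
b = -2.0                   # offset

def f(x):
    """Linear function f(x) = <w, x> + b; classify via sign(f(x))."""
    return np.dot(w, x) + b

# The margin between the hyperplanes <w,x>+b = +1 and <w,x>+b = -1 is 2 / ||w||.
margin = 2.0 / np.linalg.norm(w)

print(f([1.0, 0.5]))  # signed score of a point
print(margin)         # 2 / 5 = 0.4 for this particular w
```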

  15. Dual Problem
      • Primal optimization problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to the constraints $y_i [\langle x_i, w \rangle + b] \geq 1$
      • Lagrange function: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i [\langle x_i, w \rangle + b] - 1 \right]$
      • Optimality in $(w, b)$ is at a saddle point with $\alpha$; the derivatives in $w$ and $b$ need to vanish

  16. Dual Problem
      • Lagrange function: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i [\langle x_i, w \rangle + b] - 1 \right]$
      • Derivatives in $w, b$ need to vanish: $\partial_w L(w, b, \alpha) = w - \sum_i \alpha_i y_i x_i = 0$ and $\partial_b L(w, b, \alpha) = \sum_i \alpha_i y_i = 0$
      • Plugging these back into $L$ yields the dual: $\max_\alpha \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$

  17. Support Vector Machines. Primal: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i [\langle x_i, w \rangle + b] \geq 1$, with solution $w = \sum_i y_i \alpha_i x_i$. Dual: $\max_\alpha \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$.
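
A minimal sketch of solving this hard-margin dual numerically, assuming linearly separable data in an $(n, d)$ float array `X` with labels `y` in $\{-1, +1\}$; the function name is made up, and CVXOPT (one of the off-the-shelf solvers mentioned later, on slide 69) is just one possible QP solver.

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_dual(X, y):
    """Solve max_a sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
    s.t. sum_i a_i y_i = 0 and a_i >= 0, via CVXOPT's qp (which minimizes 1/2 a'Pa + q'a)."""
    n = X.shape[0]
    K = X @ X.T                                  # Gram matrix of inner products <x_i, x_j>
    P = matrix(np.outer(y, y) * K)               # P_ij = y_i y_j <x_i, x_j>
    q = matrix(-np.ones(n))                      # sign flip: maximization -> minimization
    G = matrix(-np.eye(n))                       # -a_i <= 0 encodes a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))   # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False     # keep the solver quiet
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                          # w = sum_i y_i a_i x_i
    return alpha, w
```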

  18. Support Vectors. As before, $w = \sum_i y_i \alpha_i x_i$. Karush-Kuhn-Tucker optimality condition: $\alpha_i \left[ y_i [\langle w, x_i \rangle + b] - 1 \right] = 0$, so either $\alpha_i = 0$, or $\alpha_i > 0 \Rightarrow y_i [\langle w, x_i \rangle + b] = 1$ (the point sits exactly on the margin).
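
Continuing the sketch above (same assumed names; the tolerance is only an arbitrary numerical cutoff), the KKT condition gives a way to read off the support vectors and recover the offset $b$: for any margin point, $y_i[\langle w, x_i \rangle + b] = 1$ with $y_i \in \{-1, +1\}$ means $b = y_i - \langle w, x_i \rangle$.

```python
import numpy as np

def support_vectors_and_offset(alpha, X, y, w, tol=1e-6):
    """Points with alpha_i > 0 lie exactly on the margin (KKT), so b = y_i - <w, x_i> there."""
    sv = alpha > tol                       # numerical stand-in for alpha_i > 0
    # Average over margin points for numerical stability; any single one would do exactly.
    b = np.mean(y[sv] - X[sv] @ w)
    return np.flatnonzero(sv), b
```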

  19. Properties. $w = \sum_i y_i \alpha_i x_i$
      • Weight vector $w$ is a weighted linear combination of instances
      • Only points on the margin matter (ignore the rest and get the same solution)
      • Only inner products matter
      • Quadratic program
      • We can replace the inner product by a kernel
      • Keeps instances away from the margin

  20.-21. Example (figures only)

  22. Why large margins?
      • Maximum robustness relative to uncertainty
      • Symmetry breaking
      • Independent of correctly classified instances
      • Easy to find for easy problems

  23. Support Vector Classifiers (a.k.a. Support Vector Machines)

  24.-26. Large Margin Classifier. Same setup as before, with linear function $f(x) = \langle w, x \rangle + b$ and margins at $\langle w, x \rangle + b \geq 1$ and $\langle w, x \rangle + b \leq -1$, but now a linear separator is impossible: the data is not linearly separable.

  27.-29. Large Margin Classifier. Theorem (Minsky & Papert): finding the minimum error separating hyperplane is NP hard, so insisting on a minimum error separator is computationally out of reach in general.

  30.-32. Adding slack variables. Relax the constraints to $\langle w, x \rangle + b \geq 1 - \xi$ and $\langle w, x \rangle + b \leq -1 + \xi$ and minimize the amount of slack; the result is a convex optimization problem.

  33. Intermezzo: Convex Programs for Dummies
      • Primal optimization problem: $\min_x f(x)$ subject to $c_i(x) \leq 0$
      • Lagrange function: $L(x, \alpha) = f(x) + \sum_i \alpha_i c_i(x)$
      • First order optimality conditions in $x$: $\partial_x L(x, \alpha) = \partial_x f(x) + \sum_i \alpha_i \partial_x c_i(x) = 0$
      • Solve for $x$ and plug it back into $L$: $\max_\alpha L(x(\alpha), \alpha)$ (keep the explicit constraints on $\alpha$)
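
A worked toy example of this recipe (my own, not from the slides): minimize $f(x) = x^2$ subject to $c(x) = 1 - x \leq 0$.

```latex
% Toy example: minimize x^2 subject to 1 - x <= 0 (i.e. x >= 1).
\begin{align*}
L(x, \alpha) &= x^2 + \alpha (1 - x), \qquad \alpha \geq 0 \\
\partial_x L(x, \alpha) &= 2x - \alpha = 0
  \;\Rightarrow\; x(\alpha) = \tfrac{\alpha}{2} \\
L(x(\alpha), \alpha) &= \tfrac{\alpha^2}{4} + \alpha\left(1 - \tfrac{\alpha}{2}\right)
  = \alpha - \tfrac{\alpha^2}{4} \\
\max_{\alpha \geq 0}\; \alpha - \tfrac{\alpha^2}{4}
  &\;\Rightarrow\; \alpha^\star = 2, \quad x^\star = \tfrac{\alpha^\star}{2} = 1,
  \quad \text{optimal value } 1 .
\end{align*}
```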

  34.-36. Adding slack variables (repeat of slides 30-32: relax the constraints to $\langle w, x \rangle + b \geq 1 - \xi$ and $\langle w, x \rangle + b \leq -1 + \xi$ and minimize the amount of slack; a convex optimization problem).

  37. Adding slack variables
      • Hard margin problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i [\langle w, x_i \rangle + b] \geq 1$
      • With slack variables: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$
      • The problem is always feasible. Proof: $w = 0$, $b = 0$, $\xi_i = 1$ satisfies all constraints (and also yields an upper bound on the optimal objective).
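
A small sketch (names assumed, not from the slides) that evaluates this objective for a candidate $(w, b)$: once $w$ and $b$ are fixed, the smallest feasible slack is $\xi_i = \max(0, 1 - y_i[\langle w, x_i \rangle + b])$, which also makes the feasibility argument above concrete.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Evaluate 1/2 ||w||^2 + C * sum_i xi_i, using the slack that is optimal once (w, b)
    is fixed: the constraints force xi_i >= max(0, 1 - y_i(<w, x_i> + b)) and xi_i >= 0,
    and the objective pushes xi_i down to exactly that value."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(xi)

# Feasibility check from the slide: w = 0, b = 0 gives xi_i = 1 for every point,
# so the objective C * n upper-bounds the optimum:
#   soft_margin_objective(np.zeros(X.shape[1]), 0.0, X, y, C) == C * len(y)
```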

  38. Dual Problem
      • Primal optimization problem: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$
      • Lagrange function: $L(w, b, \xi, \alpha, \eta) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i [\langle x_i, w \rangle + b] + \xi_i - 1 \right] - \sum_i \eta_i \xi_i$
      • Optimality in $(w, b, \xi)$ is at a saddle point with $(\alpha, \eta)$; the derivatives in $w$, $b$, $\xi$ need to vanish

  39. Dual Problem
      • Lagrange function: $L(w, b, \xi, \alpha, \eta) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i [\langle x_i, w \rangle + b] + \xi_i - 1 \right] - \sum_i \eta_i \xi_i$
      • Derivatives in $w$, $b$, $\xi$ need to vanish: $\partial_w L = w - \sum_i \alpha_i y_i x_i = 0$, $\partial_b L = \sum_i \alpha_i y_i = 0$, $\partial_{\xi_i} L = C - \alpha_i - \eta_i = 0$
      • Plugging these back into $L$ yields the dual: $\max_\alpha \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$; the upper bound $C$ limits the influence of any single point
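
Spelling out the step from $\partial_{\xi_i} L = 0$ to the box constraint, and why the slack terms vanish from $L$:

```latex
% The stationarity condition gives eta_i = C - alpha_i; with alpha_i >= 0 and eta_i >= 0
% this yields the box constraint, and the slack terms cancel out of the Lagrangian:
\begin{align*}
C \sum_i \xi_i - \sum_i \alpha_i \xi_i - \sum_i \eta_i \xi_i
  &= \sum_i (C - \alpha_i - \eta_i)\, \xi_i = 0 , \\
\eta_i = C - \alpha_i \ \text{with} \ \alpha_i \geq 0,\ \eta_i \geq 0
  &\;\Longrightarrow\; 0 \leq \alpha_i \leq C .
\end{align*}
```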

  40. Karush-Kuhn-Tucker Conditions
      • Dual: $\max_\alpha \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$, with $w = \sum_i y_i \alpha_i x_i$
      • Complementary slackness: $\alpha_i \left[ y_i [\langle w, x_i \rangle + b] + \xi_i - 1 \right] = 0$ and $\eta_i \xi_i = 0$, hence
      • $\alpha_i = 0 \Rightarrow y_i [\langle w, x_i \rangle + b] \geq 1$
      • $0 < \alpha_i < C \Rightarrow y_i [\langle w, x_i \rangle + b] = 1$
      • $\alpha_i = C \Rightarrow y_i [\langle w, x_i \rangle + b] \leq 1$
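
In code, these three cases give a simple way to categorize the training points from a dual solution (a sketch with assumed names; the tolerance is only there to cope with floating-point arithmetic):

```python
import numpy as np

def kkt_categories(alpha, C, tol=1e-6):
    """Split points according to the KKT cases on slide 40:
    alpha_i = 0      -> not a support vector, y_i f(x_i) >= 1,
    0 < alpha_i < C  -> support vector exactly on the margin, y_i f(x_i) = 1,
    alpha_i = C      -> support vector with y_i f(x_i) <= 1 (margin violation allowed)."""
    non_sv    = alpha <= tol
    margin_sv = (alpha > tol) & (alpha < C - tol)
    bound_sv  = alpha >= C - tol
    return non_sv, margin_sv, bound_sv
```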

  41.-68. (Figures only: soft margin decision boundaries for C = 1, 2, 5, 10, 20, 50, 100, repeated over several example datasets.)

  69. Solving the optimization problem
      • Dual problem: $\max_\alpha \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$
      • If the problem is small enough (1000s of variables) we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO)
      • For larger problems, use the fact that only SVs matter and solve in blocks (active set method)
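
For the small-problem route, the CVXOPT sketches above apply directly. As a practical aside that goes slightly beyond the slide's explicit list, dedicated decomposition solvers such as LIBSVM (wrapped by scikit-learn) work in this block-wise spirit; the toy data below is purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative two-blob dataset; labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=10.0)   # same dual problem, specialized decomposition solver
clf.fit(X, y)
print(clf.support_)                  # indices of the support vectors: only these matter
```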

  70. Nonlinear Separation

  71. The Kernel Trick
      • Linear soft margin problem: $\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$
      • Dual problem: $\max_\alpha \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$
      • Support vector expansion: $f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$

  72. The Kernel Trick
      • Soft margin problem in feature space: $\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, \phi(x_i) \rangle + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$
      • Dual problem: $\max_\alpha \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$
      • Support vector expansion: $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$
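
A hedged sketch of this kernelized dual, extending the earlier CVXOPT code with the box constraint $\alpha_i \in [0, C]$ and the support vector expansion; the RBF kernel, gamma value, function names, and tolerances are all illustrative assumptions rather than anything prescribed by the slides.

```python
import numpy as np
from cvxopt import matrix, solvers

def rbf_kernel(A, B, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2); an illustrative kernel choice."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_svm_dual(X, y, C, kernel=rbf_kernel):
    """Soft margin dual with <x_i, x_j> replaced by k(x_i, x_j):
    max_a sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j k(x_i, x_j),
    s.t. sum_i a_i y_i = 0 and 0 <= a_i <= C."""
    n = X.shape[0]
    K = kernel(X, X)
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))     # encodes 0 <= a_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, matrix(0.0))['x'])

    # Offset from the margin support vectors (assumes at least one 0 < a_i < C exists).
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)
    b = np.mean(y[sv] - (alpha * y) @ kernel(X, X[sv]))
    predict = lambda Xnew: np.sign((alpha * y) @ kernel(X, Xnew) + b)
    return alpha, b, predict
```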

  73.-100. (Figures only: kernel SVM decision boundaries for C = 1, 2, 5, 10, 20, 50, 100, repeated over several example datasets.)
