
Statistical Machine Learning Lecture 11: Support Vector Machines



  1. Statistical Machine Learning, Lecture 11: Support Vector Machines
     Kristian Kersting, TU Darmstadt, Summer Term 2020
     Based on slides from J. Peters

  2. Today's Objectives
     Covered topics:
     - Linear Support Vector Classification
     - Features and Kernels
     - Non-Linear Support Vector Classification
     - Outlook on Applications, Relevance Vector Machines and Support Vector Regression

  3. Outline
     1. From Structural Risk Minimization to Linear SVMs
     2. Nonlinear SVMs
     3. Applications
     4. Wrap-Up

  4. Outline (starting Section 1)
     1. From Structural Risk Minimization to Linear SVMs
     2. Nonlinear SVMs
     3. Applications
     4. Wrap-Up

  5. From Structural Risk Minimization to Linear SVMs: Structural Risk Minimization
     How can we implement structural risk minimization?
         $R(\mathbf{w}) \leq R_{\text{emp}}(\mathbf{w}) + \epsilon(N, p^*, h)$
     where $N$ is the number of training examples, $p^*$ is the probability that the bound holds, and $h$ is the VC dimension.
     Classical machine learning algorithms: keep $\epsilon(N, p^*, h)$ constant and minimize $R_{\text{emp}}(\mathbf{w})$; $\epsilon(N, p^*, h)$ is fixed by keeping some model parameters fixed, e.g. the number of hidden neurons in a neural network (see later).
     Support Vector Machines (SVMs): keep $R_{\text{emp}}(\mathbf{w})$ constant and minimize $\epsilon(N, p^*, h)$. In practice $R_{\text{emp}}(\mathbf{w}) = 0$ for separable data, and $\epsilon(N, p^*, h)$ is controlled by changing the VC dimension ("capacity control").

  6. From Structural Risk Minimization to Linear SVMs: Support Vector Machines
     Linear classifiers (generalized later)
     Approximate implementation of the structural risk minimization principle
     If the data is linearly separable, the empirical risk of SVM classifiers will be zero, and the risk bound will be approximately minimized
     SVMs have built-in "guaranteed" generalization abilities

  7. From Structural Risk Minimization to Linear SVMs: Support Vector Machines
     For now, assume linearly separable data: $N$ training data points $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$ with $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$
     Hyperplane that separates the data:
         $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$
     [Figure: a separating hyperplane $y(\mathbf{x}) = 0$ in the $(x_1, x_2)$ plane, with decision regions $R_1$ ($y > 0$) and $R_2$ ($y < 0$), the normal vector $\mathbf{w}$, the distance $y(\mathbf{x})/\|\mathbf{w}\|$ of a point to the plane, and the offset $-w_0/\|\mathbf{w}\|$ of the plane from the origin]
     Which hyperplane shall we use? How can we minimize the VC dimension?
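
To make the decision rule concrete, here is a minimal NumPy sketch of evaluating $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$ and classifying by its sign; the weight vector, bias, and sample points are made-up values for illustration, not taken from the slides.

```python
import numpy as np

# Made-up hyperplane parameters (illustration only)
w = np.array([2.0, -1.0])   # normal vector of the hyperplane
b = -0.5                    # bias term (w_0 in the figure)

# A few made-up 2-D inputs, one per row
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]])

# Decision function y(x) = w^T x + b for every row of X
scores = X @ w + b

# Predicted labels in {-1, +1}: points with y(x) > 0 fall into region R_1
labels = np.where(scores > 0, 1, -1)

# Signed distance of each point to the hyperplane: y(x) / ||w||
distances = scores / np.linalg.norm(w)

print(scores, labels, distances)
```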

  8. From Structural Risk Minimization to Linear SVMs: Support Vector Machines
     Intuitively: we should find the hyperplane with the maximum "distance" to the data

  9. From Structural Risk Minimization to Linear SVMs: Support Vector Machines
     Maximizing the margin: why does that make sense? Why does it minimize the VC dimension?
     Key result (from Vapnik): if the data points lie in a sphere of radius $R$, i.e. $\|\mathbf{x}_i\| < R$, and the margin of the linear classifier in $d$ dimensions is $\gamma$, then
         $h \leq \min\left( d, \frac{4R^2}{\gamma^2} \right)$
     Maximizing the margin lowers a bound on the VC-dimension!
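
To get a feel for the bound, here is a toy computation with made-up values of $R$, $\gamma$, and $d$ (not from the slides):

```python
# Made-up values for illustration: data radius, margin, and input dimension
R, gamma, d = 1.0, 0.25, 100

# Vapnik's bound on the VC dimension of a large-margin linear classifier
h_bound = min(d, 4 * R**2 / gamma**2)
print(h_bound)  # 64.0 -- smaller than d = 100, so a larger margin tightens the bound
```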

  10. From Structural Risk Minimization to Linear SVMs: Support Vector Machines
     Find a hyperplane so that the data is linearly separated:
         $y_i (\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \forall i$
     Enforce $y_i (\mathbf{w}^\top \mathbf{x}_i + b) = 1$ for at least one data point

  11. From Structural Risk Minimization to Linear SVMs: Support Vector Machines
     [Figure: the separating hyperplane as before, with regions $R_1$ ($y > 0$) and $R_2$ ($y < 0$), the normal vector $\mathbf{w}$, the distance $y(\mathbf{x})/\|\mathbf{w}\|$, and the offset $-w_0/\|\mathbf{w}\|$]
     We can easily express the margin: the distance of a point $\mathbf{x}_i$ to the hyperplane is
         $\frac{y(\mathbf{x}_i)}{\|\mathbf{w}\|} = \frac{\mathbf{w}^\top \mathbf{x}_i + b}{\|\mathbf{w}\|}$
     (note that in the figure $b = w_0$). Hence the margin is $\frac{1}{\|\mathbf{w}\|}$.
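
As a small check of this formula, the sketch below takes a canonically scaled hyperplane (one where $\min_i y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$) and verifies that $1/\|\mathbf{w}\|$ equals the smallest distance of any point to the hyperplane; the hyperplane and data are made up for illustration.

```python
import numpy as np

# Made-up, canonically scaled hyperplane: min_i y_i (w^T x_i + b) = 1
w = np.array([1.0, 1.0])
b = -3.0

# Made-up separable data with labels consistent with the hyperplane
X = np.array([[1.0, 1.0], [0.0, 1.0], [3.0, 1.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# Canonical constraints y_i (w^T x_i + b) >= 1, with equality on the margin
functional_margins = y * (X @ w + b)
assert np.isclose(functional_margins.min(), 1.0)

# Geometric margin of the classifier: 1 / ||w||
margin = 1.0 / np.linalg.norm(w)

# It equals the smallest distance of any point to the hyperplane
min_distance = (functional_margins / np.linalg.norm(w)).min()
print(margin, min_distance)   # both are 1/sqrt(2) here
```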

  12. From Structural Risk Minimization to Linear SVMs: Support Vector Machines
     [Figure: the decision boundary $y = 0$ and the margin hyperplanes $y = -1$ and $y = 1$]
     Support vectors: all points that lie on the margin, i.e., $y_i (\mathbf{w}^\top \mathbf{x}_i + b) = 1$

  13. From Structural Risk Minimization to Linear SVMs: Support Vector Machines
     Maximizing the margin $1/\|\mathbf{w}\|$ is equivalent to minimizing $\|\mathbf{w}\|^2$
     Formulate this as a constrained optimization problem:
         $\arg\min_{\mathbf{w}, b} \; \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i (\mathbf{w}^\top \mathbf{x}_i + b) - 1 \geq 0 \quad \forall i$
     Lagrangian formulation:
         $L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i (\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right)$
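
The primal problem above can be handed directly to a generic convex solver. Below is a minimal sketch using the cvxpy modeling library (assumed to be installed); the toy data is made up, and in practice a dedicated SVM solver would be used instead.

```python
import cvxpy as cp
import numpy as np

# Made-up linearly separable toy data
X = np.array([[1.0, 1.0], [0.0, 1.0], [3.0, 1.0], [2.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Primal variables: weight vector w and bias b
w = cp.Variable(2)
b = cp.Variable()

# Hard-margin primal: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # maximum-margin hyperplane for the toy data
```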

  14. From Structural Risk Minimization to Linear SVMs: Support Vector Machines
     Minimize the Lagrangian
         $L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i (\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right)$
     Setting the derivatives to zero gives
         $\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0$
         $\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial \mathbf{w}} = 0 \;\Longrightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$
     The separating hyperplane is a linear combination of the input data. But what are the $\alpha_i$?

  15. From Structural Risk Minimization to Linear SVMs: Sparsity
     [Figure: the decision boundary $y = 0$ and the margin hyperplanes $y = -1$ and $y = 1$]
     Important property: almost all of the $\alpha_i$ are zero, so there are only a few support vectors
     But the hyperplane was written as
         $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$
     SVMs are sparse learning machines: the classifier only depends on a few data points
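
To illustrate this sparsity, here is a sketch with scikit-learn (assumed to be available). A very large C approximates the hard-margin case from the slides; the fitted model's `dual_coef_` stores $y_i \alpha_i$ for the support vectors only, so $\mathbf{w}$ can be reconstructed from just those points.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D toy data: two well-separated Gaussian blobs, 100 points in total
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Linear SVM; a very large C approximates the hard-margin formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Sparsity: only a handful of the 100 points end up as support vectors
print(clf.n_support_)             # number of support vectors per class

# dual_coef_ holds y_i * alpha_i for the support vectors, so
# w = sum_i alpha_i y_i x_i is a combination of the support vectors alone
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))  # matches the fitted weight vector: True
```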

  16. From Structural Risk Minimization to Linear SVMs: Dual Form
     Let us rewrite the Lagrangian:
         $L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i (\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right)$
         $= \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i y_i \mathbf{w}^\top \mathbf{x}_i - \sum_{i=1}^{N} \alpha_i y_i b + \sum_{i=1}^{N} \alpha_i$
     We know that $\sum_{i=1}^{N} \alpha_i y_i = 0$, hence
         $\hat{L}(\mathbf{w}, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i y_i \mathbf{w}^\top \mathbf{x}_i + \sum_{i=1}^{N} \alpha_i$

  17. From Structural Risk Minimization to Linear SVMs: Dual Form
         $\hat{L}(\mathbf{w}, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i y_i \mathbf{w}^\top \mathbf{x}_i + \sum_{i=1}^{N} \alpha_i$
     Use the constraint $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$:
         $\hat{L}(\mathbf{w}, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i y_i \sum_{j=1}^{N} \alpha_j y_j \mathbf{x}_j^\top \mathbf{x}_i + \sum_{i=1}^{N} \alpha_i$
         $= \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_j^\top \mathbf{x}_i + \sum_{i=1}^{N} \alpha_i$

  18. From Structural Risk Minimization to Linear SVMs: Dual Form
     We also have
         $\frac{1}{2} \|\mathbf{w}\|^2 = \frac{1}{2} \mathbf{w}^\top \mathbf{w} = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_j^\top \mathbf{x}_i$
     Finally, we obtain the Wolfe dual formulation
         $\tilde{L}(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_j^\top \mathbf{x}_i$
     We can now solve the original problem by maximizing the dual function $\tilde{L}$

  19. From Structural Risk Minimization to Linear SVMs: Support Vector Machines - Dual Form
         $\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_j^\top \mathbf{x}_i$
         $\text{s.t.} \quad \alpha_i \geq 0 \;\; \forall i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$
     The separating hyperplane is given by the $N_S$ support vectors:
         $\mathbf{w} = \sum_{i=1}^{N_S} \alpha_i y_i \mathbf{x}_i$
     $b$ can also be computed, but we skip the derivation
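
To see the dual in action, here is a minimal sketch that solves the Wolfe dual numerically with SciPy's SLSQP solver (by minimizing the negative dual); the toy data is made up, and a dedicated QP or SMO solver would be used in practice.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable toy data
X = np.array([[1.0, 1.0], [0.0, 1.0], [3.0, 1.0], [2.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
N = len(y)

# Matrix of y_i y_j x_j^T x_i appearing in the dual objective
K = (y[:, None] * y[None, :]) * (X @ X.T)

# Negative Wolfe dual: minimizing it maximizes the dual
def neg_dual(alpha):
    return 0.5 * alpha @ K @ alpha - alpha.sum()

# Constraints: alpha_i >= 0 (bounds) and sum_i alpha_i y_i = 0 (equality)
res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])

alpha = res.x
w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
support = alpha > 1e-6                     # support vectors have alpha_i > 0
b = np.mean(y[support] - X[support] @ w)   # b from the margin conditions
print(alpha.round(3), w, b)
```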

  20. From Structural Risk Minimization to Linear SVMs: Support Vector Machines so far
     Both the original (primal) SVM formulation and the derived dual formulation are quadratic programming problems (quadratic cost, linear constraints), which have unique solutions that can be computed efficiently
     Why did we bother to derive the dual form? To go beyond linear classifiers!

  21. Outline (starting Section 2)
     1. From Structural Risk Minimization to Linear SVMs
     2. Nonlinear SVMs
     3. Applications
     4. Wrap-Up

  22. Nonlinear SVMs
     Nonlinear transformation $\phi$ of the data (features), with $\mathbf{x} \in \mathbb{R}^d$:
         $\phi : \mathbb{R}^d \rightarrow \mathcal{H}$
     Hyperplane in $\mathcal{H}$ (a linear classifier in $\mathcal{H}$):
         $\mathbf{w}^\top \phi(\mathbf{x}) + b = 0$
     This yields a nonlinear classifier in $\mathbb{R}^d$
     Same trick as in least-squares regression. So what is so special here?
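
The slides only specify an abstract feature map $\phi$; as one concrete illustration (my choice, not from the slides), the sketch below uses a degree-2 polynomial feature map with scikit-learn, so that a linear classifier in feature space becomes a nonlinear classifier in the original $\mathbb{R}^2$.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Made-up data that is NOT linearly separable in R^2: an inner and an outer ring
rng = np.random.default_rng(0)
radius = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
angle = rng.uniform(0.0, 2.0 * np.pi, 200)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)]
y = np.array([-1] * 100 + [1] * 100)

# Explicit nonlinear feature map phi: R^2 -> H (here: degree-2 polynomial features)
phi = PolynomialFeatures(degree=2, include_bias=False)
Z = phi.fit_transform(X)        # features x1, x2, x1^2, x1*x2, x2^2

# A linear classifier w^T phi(x) + b in the feature space H ...
clf = SVC(kernel="linear").fit(Z, y)

# ... acts as a nonlinear (here: quadratic) decision boundary in R^2
print(clf.score(Z, y))          # close to 1.0: the rings are separated
```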
