Pattern Recognition 2018: Support Vector Machines (Ad Feelders) - PowerPoint presentation transcript



  1. Pattern Recognition 2018: Support Vector Machines. Ad Feelders, Universiteit Utrecht.

  2. Support Vector Machines

  3. Overview: 1. Separable Case; 2. Kernel Functions; 3. Allowing Errors (Soft Margin); 4. SVMs in R.

  4. Linear Classifier for two classes. Linear model y(x) = w⊤φ(x) + b (7.1), with t_n ∈ {−1, +1}. Predict t_0 = +1 if y(x_0) ≥ 0 and t_0 = −1 otherwise. The decision boundary is given by y(x) = 0. This is a linear classifier in feature space φ(x).
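
  As a quick aside (not on the slides), the decision rule fits in a few lines of R; the weight vector w, offset b and identity feature map below are made-up values, purely for illustration.

    phi <- function(x) x                      # identity feature map for this sketch
    w <- c(2, -1)                             # hypothetical weight vector
    b <- 0.5                                  # hypothetical offset
    y <- function(x) sum(w * phi(x)) + b      # y(x) = w'phi(x) + b
    predict_class <- function(x) if (y(x) >= 0) +1 else -1
    predict_class(c(1, 1))                    # y = 2 - 1 + 0.5 = 1.5 >= 0, so predicts +1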

  5. Mapping (figure). The map φ sends x into a higher-dimensional space where the data is linearly separable; the decision boundary there is y(x) = w⊤φ(x) + b = 0.

  6. Data linearly separable. Assume the training data is linearly separable in feature space, so there is at least one choice of w, b such that: (1) y(x_n) > 0 for t_n = +1; (2) y(x_n) < 0 for t_n = −1; that is, all training points are classified correctly. Putting (1) and (2) together: t_n y(x_n) > 0 for n = 1, ..., N.
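
  A one-line separability check in R, again on made-up data with a made-up w, b (not on the slides):

    X  <- rbind(c(2, 2), c(3, 3), c(0, 0), c(1, 0))   # toy inputs, one row per point
    tn <- c(+1, +1, -1, -1)                           # class labels t_n
    w  <- c(1, 1); b <- -2.5                          # hypothetical separating line
    all(tn * (X %*% w + b) > 0)                       # TRUE: t_n y(x_n) > 0 for all n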

  7. Maximum Margin. There may be many solutions that separate the classes exactly. Which one gives the smallest prediction error? The SVM chooses the line with maximal margin, where the margin is the distance between the line and the closest data point. In this way it avoids “low-confidence” classifications.

  8. Two-class training data (figure).

  9. Many Linear Separators (figure).

  10. SVM Decision Boundary (figure).

  11. Maximize Margin (figure).

  12. Support Vectors (figure).

  13. Weight vector is orthogonal to the decision boundary. Consider two points x_A and x_B, both of which lie on the decision surface. Because y(x_A) = y(x_B) = 0, we have (w⊤x_A + b) − (w⊤x_B + b) = w⊤(x_A − x_B) = 0, and so the vector w is orthogonal to the decision surface.
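
  A quick numeric illustration in R with a made-up w and b (not on the slides): any two points on the surface give w⊤(x_A − x_B) = 0.

    w <- c(2, 1); b <- -4        # hypothetical line 2*x1 + x2 - 4 = 0
    xA <- c(2, 0)                # on the line: 2*2 + 0 - 4 = 0
    xB <- c(0, 4)                # on the line: 0 + 4 - 4 = 0
    sum(w * (xA - xB))           # 0: w is orthogonal to directions within the surface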

  14. Distance of a point to a line (figure: a point x, its orthogonal projection x⊥ onto the line y(x) = w⊤x + b = 0, and the signed distance r measured along w, drawn in the (x_1, x_2) plane).

  15. Distance to decision surface (φ(x) = x). We have x = x⊥ + r·w/‖w‖ (4.6), where w/‖w‖ is the unit vector in the direction of w, x⊥ is the orthogonal projection of x onto the line y(x) = 0, and r is the (signed) distance of x to the line. Multiplying (4.6) on the left by w⊤ and adding b gives w⊤x + b = w⊤x⊥ + b + r·w⊤w/‖w‖, where the left-hand side equals y(x) and w⊤x⊥ + b = 0. Since w⊤w/‖w‖ = ‖w‖, we get r = y(x)‖w‖/‖w‖² = y(x)/‖w‖ (4.7).
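
  A numeric check of (4.7) in R, with made-up w, b and x (not on the slides):

    w <- c(3, 4); b <- -5                  # hypothetical line, ||w|| = 5
    x <- c(3, 4)                           # hypothetical point
    y <- sum(w * x) + b                    # y(x) = 9 + 16 - 5 = 20
    r <- y / sqrt(sum(w^2))                # signed distance r = 20 / 5 = 4
    x_perp <- x - r * w / sqrt(sum(w^2))   # step back by r along w/||w||
    sum(w * x_perp) + b                    # 0: x_perp indeed lies on the line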

  16. Distance of a point to a line. The signed distance of x_n to the decision boundary is r = y(x_n)/‖w‖. For lines that separate the data perfectly, we have t_n y(x_n) = |y(x_n)|, so that the distance is given by t_n y(x_n)/‖w‖ = t_n(w⊤φ(x_n) + b)/‖w‖ (7.2).

  17. Maximum margin solution. Now we are ready to define the optimization problem: arg max_{w,b} { min_n [ t_n(w⊤φ(x_n) + b)/‖w‖ ] }. Since 1/‖w‖ does not depend on n, it can be moved outside of the minimization: arg max_{w,b} { (1/‖w‖) min_n [ t_n(w⊤φ(x_n) + b) ] } (7.3). Direct solution of this problem would be rather complex. A more convenient representation is possible.

  18. Canonical Representation. The hyperplane (decision boundary) is defined by w⊤φ(x) + b = 0. Then also κ(w⊤φ(x) + b) = κw⊤φ(x) + κb = 0, so rescaling w → κw and b → κb gives just another representation of the same decision boundary. To resolve this ambiguity, we choose the scaling factor such that t_i(w⊤φ(x_i) + b) = 1 (7.4) for the points x_i closest to the decision boundary.
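
  The rescaling in R on made-up values (not on the slides), showing how κ is chosen so that the closest point satisfies (7.4):

    w <- c(2, 1); b <- -2                                  # hypothetical boundary
    x_closest <- c(2, 1); t_closest <- +1                  # hypothetical closest point
    kappa <- 1 / (t_closest * (sum(w * x_closest) + b))    # here kappa = 1/3
    w_c <- kappa * w; b_c <- kappa * b                     # canonical representation
    t_closest * (sum(w_c * x_closest) + b_c)               # exactly 1, as required by (7.4)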

  19. Canonical Representation (figure; squares have t = +1, circles t = −1; the lines y(x) = 1, y(x) = 0 and y(x) = −1 are shown).

  20. Canonical Representation. In this case we have t_n(w⊤φ(x_n) + b) ≥ 1, n = 1, ..., N (7.5). Quadratic program: arg min_{w,b} (1/2)‖w‖² (7.6) subject to the constraints (7.5). This optimization problem has a unique global minimum.

  21. Lagrangian Function. Introduce Lagrange multipliers a_n ≥ 0 to get the Lagrangian function L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^N a_n { t_n(w⊤φ(x_n) + b) − 1 } (7.7), with ∂L(w, b, a)/∂w = w − Σ_{n=1}^N a_n t_n φ(x_n).

  22. Lagrangian Function. And for b: ∂L(w, b, a)/∂b = − Σ_{n=1}^N a_n t_n. Equating the derivatives to zero yields the conditions w = Σ_{n=1}^N a_n t_n φ(x_n) (7.8) and Σ_{n=1}^N a_n t_n = 0 (7.9).

  23. Dual Representation. Eliminating w and b from L(w, b, a) gives the dual representation:
  L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^N a_n { t_n(w⊤φ(x_n) + b) − 1 }
  = (1/2)‖w‖² − Σ_{n=1}^N a_n t_n w⊤φ(x_n) − b Σ_{n=1}^N a_n t_n + Σ_{n=1}^N a_n
  = (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)⊤φ(x_m) − Σ_{n=1}^N Σ_{m=1}^N a_n t_n a_m t_m φ(x_n)⊤φ(x_m) + Σ_{n=1}^N a_n
  = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)⊤φ(x_m),
  where the third line substitutes (7.8) for w and uses (7.9) to drop the b term.

  24. Dual Representation. Maximize L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n a_m t_n t_m φ(x_n)⊤φ(x_m) (7.10) with respect to a, subject to the constraints a_n ≥ 0, n = 1, ..., N (7.11), and Σ_{n=1}^N a_n t_n = 0 (7.12).
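
  The dual (7.10)-(7.12) is an ordinary quadratic program, so for a small toy data set it can be handed to a general QP solver. A minimal sketch (not on the slides) using R's quadprog package with a linear kernel on made-up data; a small ridge is added because solve.QP requires a strictly positive definite matrix.

    library(quadprog)
    X  <- rbind(c(2, 2), c(3, 3), c(0, 0), c(1, 0))    # made-up, linearly separable data
    tn <- c(+1, +1, -1, -1)                            # class labels t_n
    N  <- nrow(X)
    K  <- X %*% t(X)                                   # linear kernel matrix k(x_n, x_m)
    Q  <- outer(tn, tn) * K                            # Q[n, m] = t_n t_m k(x_n, x_m)
    # solve.QP minimizes (1/2) a'Q a - d'a, so maximizing (7.10) means d = (1, ..., 1),
    # with the equality constraint sum_n a_n t_n = 0 and the inequalities a_n >= 0.
    sol <- solve.QP(Dmat = Q + 1e-6 * diag(N), dvec = rep(1, N),
                    Amat = cbind(tn, diag(N)), bvec = rep(0, N + 1), meq = 1)
    round(sol$solution, 3)                             # nonzero a_n mark the support vectors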

  25. Kernel Function. We map x to a high-dimensional space φ(x) in which the data is linearly separable. Performing computations in this high-dimensional space may be very expensive. Use a kernel function k that computes the dot product in this space without explicitly performing the mapping (the “kernel trick”): k(x, x′) = φ(x)⊤φ(x′).

  26. Example: polynomial kernel. Suppose x ∈ ℝ³ and φ(x) ∈ ℝ¹⁰ with φ(x) = (1, √2·x_1, √2·x_2, √2·x_3, x_1², x_2², x_3², √2·x_1x_2, √2·x_1x_3, √2·x_2x_3). Then φ(x)⊤φ(z) = 1 + 2x_1z_1 + 2x_2z_2 + 2x_3z_3 + x_1²z_1² + x_2²z_2² + x_3²z_3² + 2x_1x_2z_1z_2 + 2x_1x_3z_1z_3 + 2x_2x_3z_2z_3. But this can be written as (1 + x⊤z)² = (1 + x_1z_1 + x_2z_2 + x_3z_3)², which takes far fewer operations to compute.

  27. Polynomial kernel: numeric example. Suppose x = (3, 2, 6) and z = (4, 1, 5). Then φ(x) = (1, 3√2, 2√2, 6√2, 9, 4, 36, 6√2, 18√2, 12√2) and φ(z) = (1, 4√2, √2, 5√2, 16, 1, 25, 4√2, 20√2, 5√2). Then φ(x)⊤φ(z) = 1 + 24 + 4 + 60 + 144 + 4 + 900 + 48 + 720 + 120 = 2025. But (1 + x⊤z)² = (1 + (3)(4) + (2)(1) + (6)(5))² = 45² = 2025 is a more efficient way to compute this dot product.
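
  The same numbers reproduced in R (not on the slides), computing the dot product both via the explicit feature map and via the kernel trick:

    phi <- function(x) c(1, sqrt(2) * x, x^2,
                         sqrt(2) * c(x[1] * x[2], x[1] * x[3], x[2] * x[3]))
    x <- c(3, 2, 6); z <- c(4, 1, 5)
    sum(phi(x) * phi(z))        # 2025, via the 10-dimensional feature space
    (1 + sum(x * z))^2          # 2025, via the kernel trick (1 + x'z)^2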

  28. Kernels. Linear kernel: k(x, x′) = x⊤x′. Two popular non-linear kernels are the polynomial kernel (of degree M): k(x, x′) = (x⊤x′ + c)^M, and the Gaussian (or radial) kernel: k(x, x′) = exp(−‖x − x′‖²/(2σ²)) (6.23), or k(x, x′) = exp(−γ‖x − x′‖²), where γ = 1/(2σ²).
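
  The three kernels as one-liners in R (not on the slides), transcribing the formulas above; c, M and gamma are the kernel parameters, with made-up default values:

    linear_kernel <- function(x, z) sum(x * z)                                  # x'z
    poly_kernel   <- function(x, z, c = 1, M = 2) (sum(x * z) + c)^M            # (x'z + c)^M
    gauss_kernel  <- function(x, z, gamma = 0.5) exp(-gamma * sum((x - z)^2))   # gamma = 1/(2 sigma^2)
    poly_kernel(c(3, 2, 6), c(4, 1, 5))                                         # 2025, as on the previous slide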

  29. Dual Representation with kernels. Using k(x, x′) = φ(x)⊤φ(x′) we get the dual representation: Maximize L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n a_m t_n t_m k(x_n, x_m) (7.10) with respect to a, subject to the constraints a_n ≥ 0, n = 1, ..., N (7.11), and Σ_{n=1}^N a_n t_n = 0 (7.12). Is this dual “easier” than the original problem?

  30. Prediction. Recall that y(x) = w⊤φ(x) + b (7.1). Substituting w = Σ_{n=1}^N a_n t_n φ(x_n) (7.8) into (7.1), we get y(x) = b + Σ_{n=1}^N a_n t_n k(x, x_n) (7.13).
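
  Equation (7.13) translates directly into an R function (a minimal sketch, not on the slides); the coefficients a, labels tn, training inputs X and offset b are assumed to come from solving the dual, and kernel can be any of the kernel functions above.

    svm_predict <- function(x_new, X, tn, a, b, kernel) {
      k_vals <- apply(X, 1, function(x_n) kernel(x_new, x_n))   # k(x_new, x_n) for all n
      b + sum(a * tn * k_vals)                                   # y(x) = b + sum_n a_n t_n k(x, x_n)
    }
    # Classify by the sign of the result; only the support vectors (a_n > 0) contribute.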
