
Non-Parametric Methods and Support Vector Machines

Shan-Hung Wu (shwu@cs.nthu.edu.tw)
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


Parzen Windows and Kernels

Binary K-NN classifier:
$f(x) = \mathrm{sign}\big(\sum_{i:\, x^{(i)} \in \mathrm{KNN}(x)} y^{(i)}\big)$
The "radius" of the voter boundary depends on the input $x$.
We can instead use a Parzen window with a fixed radius:
$f(x) = \mathrm{sign}\big(\sum_i y^{(i)}\, \mathbf{1}(x^{(i)};\ \|x^{(i)} - x\| \le R)\big)$
Parzen windows can also replace the hard boundary with a soft one:
$f(x) = \mathrm{sign}\big(\sum_i y^{(i)}\, k(x^{(i)}, x)\big)$
where $k(x^{(i)}, x)$ is a radial basis function (RBF) kernel whose value decreases with distance from $x$.
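To make the contrast concrete, here is a minimal NumPy sketch of both voting rules; the function names, the default $k$, and the radius $R$ are illustrative choices, and labels are assumed to be in $\{-1, +1\}$:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Binary K-NN: sign of the summed labels of the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                 # voter "radius" adapts to x
    return np.sign(y_train[nearest].sum())

def parzen_predict(X_train, y_train, x, R=1.0):
    """Fixed-radius Parzen window: every example within distance R gets a vote."""
    dists = np.linalg.norm(X_train - x, axis=1)
    votes = y_train * (dists <= R)                  # indicator 1(x_i; ||x_i - x|| <= R)
    return np.sign(votes.sum())
```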

Common RBF Kernels

How to act like a soft K-NN?
Gaussian RBF kernel: $k(x^{(i)}, x) = \mathcal{N}(x^{(i)} - x;\ \mathbf{0}, \sigma^2 I)$
Or simply
$k(x^{(i)}, x) = \exp\big(-\gamma \|x^{(i)} - x\|^2\big)$
$\gamma \ge 0$ (or $\sigma^2$) is a hyperparameter controlling the smoothness of $f$.
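A sketch of the soft Parzen window with the Gaussian RBF kernel; the default $\gamma$ is an arbitrary illustration (larger $\gamma$ means a narrower window and a bumpier $f$):

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def soft_parzen_predict(X_train, y_train, x, gamma=0.5):
    """Soft Parzen window: every example votes, weighted by k(x_i, x)."""
    weights = np.array([rbf_kernel(xi, x, gamma) for xi in X_train])
    return np.sign(weights @ y_train)
```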

Outline

1. Non-Parametric Methods: K-NN, Parzen Windows, Local Models
2. Support Vector Machines: SVC, Slacks, Nonlinear SVC, Dual Problem, Kernel Trick

Locally Weighted Linear Regression

In addition to majority voting and averaging, we can define local models for lazy predictions.
E.g., in (eager) linear regression, we find the $w \in \mathbb{R}^{D+1}$ that minimizes the SSE:
$\arg\min_w \sum_i \big(y^{(i)} - w^\top x^{(i)}\big)^2$
Local model: find the $w$ minimizing the SSE local to the point $x$ we want to predict:
$\arg\min_w \sum_i k(x^{(i)}, x)\,\big(y^{(i)} - w^\top x^{(i)}\big)^2$
where $k(\cdot,\cdot) \in \mathbb{R}$ is an RBF kernel.
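A possible lazy implementation is weighted least squares solved at prediction time. This sketch assumes illustrative names and adds a tiny ridge term purely for numeric stability:

```python
import numpy as np

def lwlr_predict(X_train, y_train, x_query, gamma=0.5):
    """Locally weighted linear regression: fit w lazily, at prediction time,
    weighting each squared error by an RBF kernel centered at x_query."""
    Xb = np.hstack([np.ones((len(X_train), 1)), X_train])   # prepend bias => w in R^(D+1)
    xq = np.concatenate([[1.0], x_query])
    k = np.exp(-gamma * np.sum((X_train - x_query) ** 2, axis=1))  # RBF weights k(x_i, x)
    W = np.diag(k)
    # Weighted least squares; the small ridge term keeps the system well-conditioned.
    A = Xb.T @ W @ Xb + 1e-8 * np.eye(Xb.shape[1])
    w = np.linalg.solve(A, Xb.T @ W @ y_train)
    return xq @ w
```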


Kernel Machines

Kernel machines:
$f(x) = \sum_{i=1}^{N} c_i\, k(x^{(i)}, x) + c_0$
For example:
Parzen windows: $c_i = y^{(i)}$ and $c_0 = 0$
Locally weighted linear regression: $c_i = \big(y^{(i)} - w^\top x^{(i)}\big)^2$ and $c_0 = 0$
The variable $c \in \mathbb{R}^N$ can be learned in either an eager or a lazy manner.
Pros: complex, but highly accurate if regularized well.
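A generic evaluation routine might look like the following sketch (the names are illustrative); Parzen windows correspond to $c_i = y^{(i)}$, $c_0 = 0$:

```python
import numpy as np

def kernel_machine(x, X_train, c, c0, kernel):
    """Generic kernel machine f(x) = sum_i c_i * k(x_i, x) + c_0."""
    return sum(ci * kernel(xi, x) for ci, xi in zip(c, X_train)) + c0

# Parzen windows as a special case: c_i = y_i and c_0 = 0.
gaussian = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
# f_x = kernel_machine(x, X_train, y_train, 0.0, gaussian)
```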

Sparse Kernel Machines

To make a prediction, we need to store all examples.
This may be infeasible due to:
A large dataset ($N$)
Time limits
Space limits
Can we make $c$ sparse? I.e., make $c_i \ne 0$ for only a small fraction of examples, called support vectors.
How?


Separating Hyperplane I

Model: $\mathcal{F} = \{ f : f(x; w, b) = w^\top x + b \}$, a collection of hyperplanes
Prediction: $\hat{y} = \mathrm{sign}(f(x))$
Training: find $w$ and $b$ such that
$w^\top x^{(i)} + b \ge 0$, if $y^{(i)} = 1$
$w^\top x^{(i)} + b \le 0$, if $y^{(i)} = -1$
or simply $y^{(i)}(w^\top x^{(i)} + b) \ge 0$

Separating Hyperplane II

There are many feasible $w$'s and $b$'s when the classes are linearly separable.
Which hyperplane is the best?

Support Vector Classification

The support vector classifier (SVC) picks the hyperplane with the largest margin:
$y^{(i)}(w^\top x^{(i)} + b) \ge a$ for all $i$
Margin: $2a / \|w\|$ [Homework]
Without loss of generality, we let $a = 1$ and solve the problem:
$\arg\min_{w,b} \tfrac{1}{2}\|w\|^2$ subject to $y^{(i)}(w^\top x^{(i)} + b) \ge 1,\ \forall i$
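For a quick numeric check, scikit-learn's SVC with a linear kernel and a very large C approximates this hard-margin problem; the toy dataset and the C value below are purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable blobs; a very large C approximates the hard-margin problem.
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_))
```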


Overlapping Classes

In practice, classes may overlap, due to, e.g., noise or outliers.
In this case the problem
$\arg\min_{w,b} \tfrac{1}{2}\|w\|^2$ subject to $y^{(i)}(w^\top x^{(i)} + b) \ge 1,\ \forall i$
has no solution. How can we fix this?

Slacks

SVC tolerates slacks, points that fall outside of the regions they ought to be in.
Problem:
$\arg\min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \xi_i$ subject to $y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i$ and $\xi_i \ge 0,\ \forall i$
This favors a large margin but also fewer slacks.

Hyperparameter C

$\arg\min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \xi_i$
The hyperparameter $C$ controls the tradeoff between
Maximizing the margin
Minimizing the number of slacks
It also provides a geometric explanation of weight decay.
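A quick way to see the tradeoff, assuming scikit-learn and an illustrative overlapping dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs: a small C tolerates more slacks (wider margin),
# a large C penalizes slacks heavily (narrower margin, closer to hard-margin behavior).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.8, random_state=0)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<7}  margin={margin:.3f}  #SVs={len(clf.support_)}")
```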


Nonlinearly Separable Classes

In practice, classes may be nonlinearly separable.
SVC (with slacks) gives "bad" hyperplanes due to underfitting.
How can we make it nonlinear?

Feature Augmentation

Recall that in polynomial regression, we augment data features to make a linear regressor nonlinear.
We can define a function $\Phi(\cdot)$ that maps each data point to a higher dimensional space:
$\arg\min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $y^{(i)}\big(w^\top \Phi(x^{(i)}) + b\big) \ge 1 - \xi_i$ and $\xi_i \ge 0,\ \forall i$
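One way to try this, sketched with scikit-learn's PolynomialFeatures playing the role of an explicit $\Phi$; the dataset and degree are illustrative:

```python
from sklearn.datasets import make_circles
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# Concentric circles are not linearly separable in the original 2-D space ...
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

# ... but become (nearly) separable after an explicit degree-2 feature map Phi(x).
Phi = PolynomialFeatures(degree=2, include_bias=False)
X_aug = Phi.fit_transform(X)

clf = LinearSVC(C=1.0).fit(X_aug, y)
print("training accuracy:", clf.score(X_aug, y))
```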


Time Complexity

Nonlinear SVC:
$\arg\min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $y^{(i)}\big(w^\top \Phi(x^{(i)}) + b\big) \ge 1 - \xi_i$ and $\xi_i \ge 0,\ \forall i$
The higher the augmented feature dimension, the more variables in $w$ we must solve for.
Can we solve for $w$ with a time complexity that is independent of the mapped dimension?

Dual Problem

Primal problem:
$\arg\min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $y^{(i)}\big(w^\top \Phi(x^{(i)}) + b\big) \ge 1 - \xi_i$ and $\xi_i \ge 0,\ \forall i$
Dual problem:
$\arg\max_{\alpha,\beta}\ \min_{w,b,\xi} L(w,b,\xi,\alpha,\beta)$ subject to $\alpha \ge 0,\ \beta \ge 0$
where
$L(w,b,\xi,\alpha,\beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$
The primal problem is convex, so strong duality holds.

Solving Dual Problem I

$L(w,b,\xi,\alpha,\beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$
The inner problem
$\min_{w,b,\xi} L(w,b,\xi,\alpha,\beta)$
is convex in terms of $w$, $b$, and $\xi$. Let's solve it analytically:
$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y^{(i)} \Phi(x^{(i)}) = 0 \;\Rightarrow\; w = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})$
$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y^{(i)} = 0 \;\Rightarrow\; \sum_i \alpha_i y^{(i)} = 0$
$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; \beta_i = C - \alpha_i$

Solving Dual Problem II

$L(w,b,\xi,\alpha,\beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$
Substituting $w = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})$ and $\beta_i = C - \alpha_i$ into $L(w,b,\xi,\alpha,\beta)$:
$L(w,b,\xi,\alpha,\beta) = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)}) - b\sum_i \alpha_i y^{(i)}$,
so
$\min_{w,b,\xi} L(w,b,\xi,\alpha,\beta) = \begin{cases} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)}), & \text{if } \sum_i \alpha_i y^{(i)} = 0,\\ -\infty, & \text{otherwise.} \end{cases}$
Outer maximization problem:
$\arg\max_{\alpha}\ \mathbf{1}^\top \alpha - \tfrac{1}{2}\alpha^\top K \alpha$ subject to $0 \le \alpha \le C\mathbf{1}$ and $y^\top \alpha = 0$
where $K_{i,j} = y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)})$
$\beta_i = C - \alpha_i \ge 0$ implies $\alpha_i \le C$.

Solving Dual Problem II

Dual minimization problem of SVC:
$\arg\min_{\alpha}\ \tfrac{1}{2}\alpha^\top K \alpha - \mathbf{1}^\top \alpha$ subject to $0 \le \alpha \le C\mathbf{1}$ and $y^\top \alpha = 0$
Number of variables to solve for? $N$, instead of the augmented feature dimension.
In practice, this problem is solved by specialized solvers such as sequential minimal optimization (SMO) [3], as $K$ is usually ill-conditioned.
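To make the dual concrete, the following sketch solves it with a generic constrained solver (SciPy's SLSQP) on a toy problem; real implementations use SMO as noted above, and the dataset and hyperparameters are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, y01 = make_blobs(n_samples=40, centers=2, cluster_std=1.2, random_state=1)
y = np.where(y01 == 1, 1.0, -1.0)
C, gamma, N = 1.0, 0.5, len(y01)

# K_ij = y_i y_j k(x_i, x_j), here with a Gaussian RBF kernel.
K = (y[:, None] * y[None, :]) * rbf_kernel(X, X, gamma=gamma)

obj  = lambda a: 0.5 * a @ K @ a - a.sum()          # (1/2) alpha^T K alpha - 1^T alpha
grad = lambda a: K @ a - np.ones(N)
res = minimize(obj, x0=np.zeros(N), jac=grad, method="SLSQP",
               bounds=[(0.0, C)] * N,                               # 0 <= alpha <= C 1
               constraints={"type": "eq", "fun": lambda a: y @ a})  # y^T alpha = 0
alpha = res.x
print("support vectors (alpha_i > 0):", int(np.sum(alpha > 1e-6)), "of", N)
```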

Making Predictions

Prediction: $\hat{y} = \mathrm{sign}(f(x)) = \mathrm{sign}\big(w^\top \Phi(x) + b\big)$
We have $w = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})$. How do we obtain $b$?
By the complementary slackness of the KKT conditions, we have:
$\alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) = 0$ and $\beta_i(-\xi_i) = 0$
For any $x^{(i)}$ with $0 < \alpha_i < C$, we have
$\beta_i = C - \alpha_i > 0 \;\Rightarrow\; \xi_i = 0$, and
$1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i = 0 \;\Rightarrow\; b = y^{(i)} - w^\top \Phi(x^{(i)})$
In practice, we usually average over all $x^{(i)}$'s with $0 < \alpha_i < C$ to reduce numeric error.
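A sketch of recovering $b$ from the free support vectors of a fitted scikit-learn SVC (whose dual_coef_ stores $\alpha_i y^{(i)}$ for the support vectors); the dataset and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=60, centers=2, cluster_std=1.2, random_state=1)
y = np.where(y01 == 1, 1.0, -1.0)
C, gamma = 1.0, 0.5
clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)

sv = clf.support_                        # indices of the support vectors
ay = clf.dual_coef_[0]                   # alpha_i * y_i for the support vectors
alpha = np.abs(ay)

# b = y_i - w^T Phi(x_i), averaged over the free SVs (0 < alpha_i < C).
K_sv = rbf_kernel(X[sv], X[sv], gamma=gamma)
free = alpha < C - 1e-8
b = np.mean(y[sv][free] - K_sv[free] @ ay)
print("recovered b:", b, " sklearn intercept_:", clf.intercept_[0])
```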


Kernel as Inner Product

We need to evaluate $\Phi(x^{(i)})^\top \Phi(x^{(j)})$ when
Solving the dual problem of SVC, where $K_{i,j} = y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)})$
Making a prediction, where $f(x) = w^\top \Phi(x) + b = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})^\top \Phi(x) + b$
Time complexity? If we choose $\Phi$ carefully, we can evaluate $\Phi(x^{(i)})^\top \Phi(x) = k(x^{(i)}, x)$ efficiently.
Polynomial kernel: $k(a, b) = (a^\top b / \alpha + \beta)^\gamma$
E.g., let $\alpha = 1$, $\beta = 1$, $\gamma = 2$, and $a \in \mathbb{R}^2$; then
$\Phi(a) = [1,\ \sqrt{2}a_1,\ \sqrt{2}a_2,\ a_1^2,\ a_2^2,\ \sqrt{2}a_1 a_2]^\top \in \mathbb{R}^6$
Gaussian RBF kernel: $k(a, b) = \exp(-\gamma\|a - b\|^2)$, $\gamma \ge 0$
$k(a, b) = \exp(-\gamma\|a\|^2 + 2\gamma a^\top b - \gamma\|b\|^2) = \exp(-\gamma\|a\|^2 - \gamma\|b\|^2)\Big(1 + \frac{2\gamma a^\top b}{1!} + \frac{(2\gamma a^\top b)^2}{2!} + \cdots\Big)$
Let $a \in \mathbb{R}^2$; then
$\Phi(a) = \exp(-\gamma\|a\|^2)\Big[1,\ \sqrt{\tfrac{2\gamma}{1!}}a_1,\ \sqrt{\tfrac{2\gamma}{1!}}a_2,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}a_1^2,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}a_2^2,\ \sqrt{\tfrac{2(2\gamma)^2}{2!}}a_1 a_2,\ \cdots\Big]^\top \in \mathbb{R}^\infty$
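A quick numeric check that the degree-2 polynomial kernel equals the inner product under this explicit feature map; the test vectors are arbitrary:

```python
import numpy as np

def poly_kernel(a, b, alpha=1.0, beta=1.0, gamma=2):
    """Polynomial kernel k(a, b) = (a^T b / alpha + beta)^gamma."""
    return (a @ b / alpha + beta) ** gamma

def phi(a):
    """Explicit feature map for alpha = beta = 1, gamma = 2, a in R^2."""
    a1, a2 = a
    return np.array([1.0, np.sqrt(2) * a1, np.sqrt(2) * a2,
                     a1 ** 2, a2 ** 2, np.sqrt(2) * a1 * a2])

a, b = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(poly_kernel(a, b))   # same value ...
print(phi(a) @ phi(b))     # ... as the inner product in the 6-dimensional mapped space
```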

Kernel Trick

If we choose a $\Phi$ induced by the polynomial or Gaussian RBF kernel, then
$K_{i,j} = y^{(i)} y^{(j)} k(x^{(i)}, x^{(j)})$ takes only $O(D)$ time to evaluate, and
$f(x) = \sum_i \alpha_i y^{(i)} k(x^{(i)}, x) + b$ takes $O(ND)$ time,
independent of the augmented feature dimension.
$\alpha$, $\beta$, and $\gamma$ are new hyperparameters.

Sparse Kernel Machines

SVC is a kernel machine:
$f(x) = \sum_i \alpha_i y^{(i)} k(x^{(i)}, x) + b$
It is perhaps surprising that SVC works like K-NN in some sense.
However, SVC is a sparse kernel machine: only a small fraction of examples, those on or violating the margin, become support vectors ($\alpha_i > 0$).

KKT Conditions and Types of SVs

By the KKT conditions, we have:
Primal feasibility: $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$
Complementary slackness: $\alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) = 0$ and $\beta_i(-\xi_i) = 0$

Depending on the value of $\alpha_i$, each example $x^{(i)}$ can be:
Non-SV ($\alpha_i = 0$): $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1$ (usually strict), since $1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i \le 0$ and, because $\beta_i = C - \alpha_i \ne 0$, we have $\xi_i = 0$
Free SV ($0 < \alpha_i < C$): $y^{(i)}(w^\top \Phi(x^{(i)}) + b) = 1$, since $1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i = 0$ and, because $\beta_i = C - \alpha_i \ne 0$, we have $\xi_i = 0$
Bounded SV ($\alpha_i = C$): $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \le 1$ (usually strict), since $1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i = 0$ and, because $\beta_i = 0$, we have $\xi_i \ge 0$
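A sketch that counts the three types of examples for a fitted scikit-learn SVC, recovering $\alpha_i$ from dual_coef_ (which stores $\alpha_i y^{(i)}$ for the support vectors); the dataset and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
y = np.where(y01 == 1, 1, -1)
C = 1.0
clf = SVC(kernel="rbf", C=C, gamma=0.5).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])             # alpha_i for the support vectors only
free    = int(np.sum(alpha < C - 1e-8))       # 0 < alpha_i < C: on the margin
bounded = int(np.sum(alpha >= C - 1e-8))      # alpha_i = C: on the wrong side of the margin
non_sv  = len(X) - len(clf.support_)          # alpha_i = 0: not needed at prediction time
print(f"non-SVs: {non_sv}, free SVs: {free}, bounded SVs: {bounded}")
```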

Remarks I

Pros of SVC:
Global optimality (a convex problem)
