Parzen Windows and Kernels

Binary KNN classifier: $f(\mathbf{x}) = \mathrm{sign}\big(\sum_{i:\,\mathbf{x}^{(i)} \in \mathrm{KNN}(\mathbf{x})} y^{(i)}\big)$
The "radius" of the voting region (the neighbors that get to vote) depends on the input $\mathbf{x}$.
We can instead use a Parzen window with a fixed radius: $f(\mathbf{x}) = \mathrm{sign}\big(\sum_i y^{(i)}\,\mathbb{1}(\mathbf{x}^{(i)};\ \|\mathbf{x}^{(i)} - \mathbf{x}\| \le R)\big)$
Parzen windows can also replace the hard boundary with a soft one: $f(\mathbf{x}) = \mathrm{sign}\big(\sum_i y^{(i)}\,k(\mathbf{x}^{(i)}, \mathbf{x})\big)$
Here $k(\mathbf{x}^{(i)}, \mathbf{x})$ is a radial basis function (RBF) kernel whose value decreases as $\mathbf{x}^{(i)}$ moves away from $\mathbf{x}$ in space.
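As a minimal sketch (not from the slides), the fixed-radius Parzen window classifier and its soft, kernel-weighted variant might look as follows; the function names and the use of NumPy are assumptions.

```python
import numpy as np

def parzen_hard(X_train, y_train, x, R):
    """Hard Parzen window: each training point within radius R casts a +/-1 vote."""
    dist = np.linalg.norm(X_train - x, axis=1)
    votes = y_train[dist <= R]
    return np.sign(votes.sum()) if votes.size else 0  # 0 if no point falls inside the window

def parzen_soft(X_train, y_train, x, kernel):
    """Soft Parzen window: every training point votes, weighted by an RBF kernel."""
    weights = np.array([kernel(xi, x) for xi in X_train])
    return np.sign((y_train * weights).sum())
```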
Common RBF Kernels

How to act like soft $K$-NN?
Gaussian RBF kernel: $k(\mathbf{x}^{(i)}, \mathbf{x}) = \mathcal{N}(\mathbf{x}^{(i)} - \mathbf{x};\,\mathbf{0},\,\sigma^2\mathbf{I})$
Or simply: $k(\mathbf{x}^{(i)}, \mathbf{x}) = \exp\big(-\gamma\,\|\mathbf{x}^{(i)} - \mathbf{x}\|^2\big)$
$\gamma \ge 0$ (or $\sigma^2$) is a hyperparameter controlling the smoothness of $f$.
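A small sketch of the Gaussian RBF kernel above, which could serve as the weighting function in the soft Parzen window sketch earlier; the function name and the default value of γ are assumptions.

```python
import numpy as np

def gaussian_rbf(a, b, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2); larger gamma -> narrower window, less smooth f."""
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

# e.g., f_hat = parzen_soft(X_train, y_train, x, lambda a, b: gaussian_rbf(a, b, gamma=0.5))
```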
Outline

1. Non-Parametric Methods: K-NN, Parzen Windows, Local Models
2. Support Vector Machines: SVC, Slacks, Nonlinear SVC, Dual Problem, Kernel Trick
Locally Weighted Linear Regression

In addition to majority voting and averaging, we can define local models for lazy predictions.
E.g., in (eager) linear regression, we find $\mathbf{w} \in \mathbb{R}^{D+1}$ that minimizes the SSE: $\arg\min_{\mathbf{w}} \sum_i \big(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\big)^2$
Local model: find the $\mathbf{w}$ minimizing the SSE local to the point $\mathbf{x}$ we want to predict: $\arg\min_{\mathbf{w}} \sum_i k(\mathbf{x}^{(i)}, \mathbf{x})\,\big(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\big)^2$
where $k(\cdot,\cdot) \in \mathbb{R}$ is an RBF kernel.
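Under the formulation above, each query point gets its own kernel-weighted least-squares fit. A sketch, assuming the standard weighted normal equations and hypothetical helper names:

```python
import numpy as np

def locally_weighted_lr(X, y, x_query, kernel):
    """Fit w minimizing sum_i k(x_i, x_query) * (y_i - w^T x_i)^2, then predict at x_query."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend bias term -> w in R^{D+1}
    xq = np.concatenate([[1.0], x_query])
    k = np.array([kernel(xi, x_query) for xi in X])    # local weights around the query
    W = np.diag(k)
    # Weighted normal equations: (Xb^T W Xb) w = Xb^T W y
    w = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xq @ w
```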
Outline Non-Parametric Methods 1 K -NN Parzen Windows Local Models Support Vector Machines 2 SVC Slacks Nonlinear SVC Dual Problem Kernel Trick Shan-Hung Wu (CS, NTHU) Non-Parametric Methods & SVM Machine Learning 14 / 42
Kernel Machines

Kernel machines: $f(\mathbf{x}) = \sum_{i=1}^{N} c_i\,k(\mathbf{x}^{(i)}, \mathbf{x}) + c_0$
For example:
- Parzen windows: $c_i = y^{(i)}$ and $c_0 = 0$
- Locally weighted linear regression: $c_i = (y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)})^2$ and $c_0 = 0$
The variable $\mathbf{c} \in \mathbb{R}^N$ can be learned in either an eager or a lazy manner.
Pros: complex, but highly accurate if regularized well.
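In code, the kernel-machine form above amounts to the following generic predictor (a sketch; the function name is an assumption):

```python
def kernel_machine_predict(X_train, c, c0, kernel, x):
    """f(x) = sum_i c_i * k(x_i, x) + c_0; Parzen windows use c_i = y_i and c_0 = 0."""
    return sum(c_i * kernel(x_i, x) for c_i, x_i in zip(c, X_train)) + c0
```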
Sparse Kernel Machines

To make a prediction, we need to store all training examples.
This may be infeasible due to a large dataset ($N$), time limits, or space limits.
Can we make $\mathbf{c}$ sparse? I.e., make $c_i \ne 0$ for only a small fraction of examples, called support vectors. How?
Separating Hyperplane I

Model: $\mathcal{F} = \{f : f(\mathbf{x}; \mathbf{w}, b) = \mathbf{w}^\top\mathbf{x} + b\}$, a collection of hyperplanes.
Prediction: $\hat{y} = \mathrm{sign}(f(\mathbf{x}))$
Training: find $\mathbf{w}$ and $b$ such that $\mathbf{w}^\top\mathbf{x}^{(i)} + b \ge 0$ if $y^{(i)} = 1$, and $\mathbf{w}^\top\mathbf{x}^{(i)} + b \le 0$ if $y^{(i)} = -1$,
or simply $y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 0$.
Separating Hyperplane II

There are many feasible $\mathbf{w}$'s and $b$'s when the classes are linearly separable. Which hyperplane is the best?
Support Vector Classification

The support vector classifier (SVC) picks the hyperplane with the largest margin: $y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge a$ for all $i$
Margin: $2a / \|\mathbf{w}\|$ [Homework]
Without loss of generality, we let $a = 1$ and solve the problem:
$\arg\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2$ subject to $y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 1,\ \forall i$
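As an illustrative sketch only (not part of the slides), the hard-margin problem above can be handed to a generic convex solver such as CVXPY; the toy data, variable names, and solver choice are assumptions.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (assumption, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))            # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]               # y_i (w^T x_i + b) >= 1 for all i
prob = cp.Problem(objective, constraints)
prob.solve()
print(w.value, b.value)
```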
Overlapping Classes

In practice, classes may overlap, due to, e.g., noise or outliers.
The problem $\arg\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2$ subject to $y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 1,\ \forall i$ has no solution in this case. How do we fix this?
Slacks

SVC tolerates slack points that fall outside the regions they ought to be in.
Problem: $\arg\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$ subject to $y^{(i)}\big(\mathbf{w}^\top\mathbf{x}^{(i)} + b\big) \ge 1 - \xi_i$ and $\xi_i \ge 0,\ \forall i$
This favors a large margin but also fewer slacks.
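Extending the previous CVXPY sketch with slack variables gives the soft-margin problem above (again only a sketch, assuming the same toy `X`, `y` as before):

```python
import cvxpy as cp

C = 1.0                                        # slack penalty (hyperparameter)
w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(len(y), nonneg=True)          # slack variables, xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()
```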
Hyperparameter C

$\arg\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$
The hyperparameter $C$ controls the tradeoff between maximizing the margin and minimizing the number of slacks.
It also provides a geometric interpretation of weight decay.
Nonlinearly Separable Classes

In practice, classes may be nonlinearly separable.
SVC (with slacks) gives "bad" hyperplanes due to underfitting. How to make it nonlinear?
Feature Augmentation

Recall that in polynomial regression, we augment data features to make a linear regressor nonlinear.
Similarly, we can define a function $\Phi(\cdot)$ that maps each data point to a high-dimensional space:
$\arg\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$ subject to $y^{(i)}\big(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b\big) \ge 1 - \xi_i$ and $\xi_i \ge 0,\ \forall i$
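For concreteness, here is a sketch of one possible degree-2 feature map Φ for 2-D inputs (it matches the polynomial-kernel map shown later in the kernel trick part; the function name is an assumption):

```python
import numpy as np

def phi_poly2(x):
    """Map x = (x1, x2) to a degree-2 polynomial feature space in R^6."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])
```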
Time Complexity

Nonlinear SVC: $\arg\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$ subject to $y^{(i)}\big(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b\big) \ge 1 - \xi_i$ and $\xi_i \ge 0,\ \forall i$
The higher the augmented feature dimension, the more variables in $\mathbf{w}$ to solve for.
Can we solve for $\mathbf{w}$ with a time complexity that is independent of the mapped dimension?
Dual Problem

Primal problem: $\arg\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$ subject to $y^{(i)}\big(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b\big) \ge 1 - \xi_i$ and $\xi_i \ge 0,\ \forall i$
Dual problem: $\arg\max_{\boldsymbol{\alpha},\boldsymbol{\beta}}\ \min_{\mathbf{w},b,\boldsymbol{\xi}}\ L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta})$ subject to $\boldsymbol{\alpha} \ge \mathbf{0}$ and $\boldsymbol{\beta} \ge \mathbf{0}$, where
$L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$
The primal problem is convex, so strong duality holds.
Solving Dual Problem I

$L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$
The inner problem $\min_{\mathbf{w},b,\boldsymbol{\xi}} L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta})$ is convex in $\mathbf{w}$, $b$, and $\boldsymbol{\xi}$. Let's solve it analytically:
$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y^{(i)}\Phi(\mathbf{x}^{(i)}) = \mathbf{0} \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y^{(i)}\Phi(\mathbf{x}^{(i)})$
$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y^{(i)} = 0 \;\Rightarrow\; \sum_i \alpha_i y^{(i)} = 0$
$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; \beta_i = C - \alpha_i$
Solving Dual Problem II

$L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$
Substituting $\mathbf{w} = \sum_i \alpha_i y^{(i)}\Phi(\mathbf{x}^{(i)})$ and $\beta_i = C - \alpha_i$ into $L$:
$L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta}) = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y^{(i)}y^{(j)}\Phi(\mathbf{x}^{(i)})^\top\Phi(\mathbf{x}^{(j)}) - b\sum_i \alpha_i y^{(i)}$
Hence $\min_{\mathbf{w},b,\boldsymbol{\xi}} L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta}) = \begin{cases}\sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y^{(i)}y^{(j)}\Phi(\mathbf{x}^{(i)})^\top\Phi(\mathbf{x}^{(j)}), & \text{if } \sum_i \alpha_i y^{(i)} = 0,\\ -\infty, & \text{otherwise.}\end{cases}$
Outer maximization problem: $\arg\max_{\boldsymbol{\alpha}}\ \mathbf{1}^\top\boldsymbol{\alpha} - \tfrac{1}{2}\boldsymbol{\alpha}^\top\mathbf{K}\boldsymbol{\alpha}$ subject to $\mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{1}$ and $\mathbf{y}^\top\boldsymbol{\alpha} = 0$, where $K_{i,j} = y^{(i)}y^{(j)}\Phi(\mathbf{x}^{(i)})^\top\Phi(\mathbf{x}^{(j)})$
Note that $\beta_i = C - \alpha_i \ge 0$ implies $\alpha_i \le C$.
Solving Dual Problem II

Dual minimization problem of SVC: $\arg\min_{\boldsymbol{\alpha}}\ \tfrac{1}{2}\boldsymbol{\alpha}^\top\mathbf{K}\boldsymbol{\alpha} - \mathbf{1}^\top\boldsymbol{\alpha}$ subject to $\mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{1}$ and $\mathbf{y}^\top\boldsymbol{\alpha} = 0$
Number of variables to solve? $N$, instead of the augmented feature dimension.
In practice, this problem is solved by specialized solvers such as sequential minimal optimization (SMO) [3], as $\mathbf{K}$ is usually ill-conditioned.
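In practice one rarely solves this QP by hand. For example, scikit-learn's SVC wraps LIBSVM, which uses an SMO-style solver; the dataset and hyperparameter values below are assumptions, shown only as a sketch.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)   # solves the dual QP internally via an SMO-type solver
clf.fit(X, y)
print(clf.score(X, y))
```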
Making Predictions

Prediction: $\hat{y} = \mathrm{sign}(f(\mathbf{x})) = \mathrm{sign}\big(\mathbf{w}^\top\Phi(\mathbf{x}) + b\big)$
We have $\mathbf{w} = \sum_i \alpha_i y^{(i)}\Phi(\mathbf{x}^{(i)})$. How do we obtain $b$?
By the complementary slackness of the KKT conditions, we have $\alpha_i\big(1 - y^{(i)}(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b) - \xi_i\big) = 0$ and $\beta_i(-\xi_i) = 0$.
For any $\mathbf{x}^{(i)}$ with $0 < \alpha_i < C$: $\beta_i = C - \alpha_i > 0 \Rightarrow \xi_i = 0$, so $1 - y^{(i)}(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b) - \xi_i = 0 \Rightarrow b = y^{(i)} - \mathbf{w}^\top\Phi(\mathbf{x}^{(i)})$.
In practice, we usually take the average over all $\mathbf{x}^{(i)}$'s with $0 < \alpha_i < C$ to avoid numeric error.
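A sketch of recovering b by averaging over the free support vectors, assuming the dual variables α have already been obtained from some solver; the function and parameter names are hypothetical.

```python
import numpy as np

def intercept_from_alphas(alpha, X, y, C, kernel, tol=1e-6):
    """b = mean over free SVs (0 < alpha_i < C) of y_i - sum_j alpha_j y_j k(x_j, x_i)."""
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    bs = []
    for i in free:
        wx = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(len(y)))
        bs.append(y[i] - wx)
    return np.mean(bs)
```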
Kernel as Inner Product

We need to evaluate $\Phi(\mathbf{x}^{(i)})^\top\Phi(\mathbf{x}^{(j)})$ when:
- solving the dual problem of SVC, where $K_{i,j} = y^{(i)}y^{(j)}\Phi(\mathbf{x}^{(i)})^\top\Phi(\mathbf{x}^{(j)})$
- making a prediction, where $f(\mathbf{x}) = \mathbf{w}^\top\Phi(\mathbf{x}) + b = \sum_i \alpha_i y^{(i)}\Phi(\mathbf{x}^{(i)})^\top\Phi(\mathbf{x}) + b$
Time complexity? If we choose $\Phi$ carefully, we can evaluate $\Phi(\mathbf{x}^{(i)})^\top\Phi(\mathbf{x}) = k(\mathbf{x}^{(i)}, \mathbf{x})$ efficiently.
Polynomial kernel: $k(\mathbf{a}, \mathbf{b}) = (\mathbf{a}^\top\mathbf{b}/\alpha + \beta)^\gamma$
E.g., let $\alpha = 1$, $\beta = 1$, $\gamma = 2$, and $\mathbf{a} \in \mathbb{R}^2$; then $\Phi(\mathbf{a}) = [1,\ \sqrt{2}a_1,\ \sqrt{2}a_2,\ a_1^2,\ a_2^2,\ \sqrt{2}a_1a_2]^\top \in \mathbb{R}^6$
Gaussian RBF kernel: $k(\mathbf{a}, \mathbf{b}) = \exp(-\gamma\|\mathbf{a} - \mathbf{b}\|^2)$, $\gamma \ge 0$
$k(\mathbf{a}, \mathbf{b}) = \exp(-\gamma\|\mathbf{a}\|^2 + 2\gamma\,\mathbf{a}^\top\mathbf{b} - \gamma\|\mathbf{b}\|^2) = \exp(-\gamma\|\mathbf{a}\|^2 - \gamma\|\mathbf{b}\|^2)\Big(1 + \frac{2\gamma\,\mathbf{a}^\top\mathbf{b}}{1!} + \frac{(2\gamma\,\mathbf{a}^\top\mathbf{b})^2}{2!} + \cdots\Big)$
Let $\mathbf{a} \in \mathbb{R}^2$; then $\Phi(\mathbf{a}) = \exp(-\gamma\|\mathbf{a}\|^2)\Big[1,\ \sqrt{\tfrac{2\gamma}{1!}}\,a_1,\ \sqrt{\tfrac{2\gamma}{1!}}\,a_2,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}\,a_1^2,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}\,a_2^2,\ \sqrt{\tfrac{2(2\gamma)^2}{2!}}\,a_1a_2,\ \cdots\Big]^\top \in \mathbb{R}^\infty$
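A quick numerical check (a sketch; the feature map matches the degree-2 example above) that the polynomial kernel equals the inner product of the explicit feature map:

```python
import numpy as np

def phi_poly2(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

a, b = np.array([0.3, -1.2]), np.array([2.0, 0.5])
k_ab = (a @ b + 1.0) ** 2                               # polynomial kernel with alpha=1, beta=1, gamma=2
print(np.isclose(k_ab, phi_poly2(a) @ phi_poly2(b)))    # True
```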
Kernel Trick

If we choose the $\Phi$ induced by the polynomial or Gaussian RBF kernel, then $K_{i,j} = y^{(i)}y^{(j)}k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ takes only $O(D)$ time to evaluate, and $f(\mathbf{x}) = \sum_i \alpha_i y^{(i)}k(\mathbf{x}^{(i)}, \mathbf{x}) + b$ takes $O(ND)$ time.
Both are independent of the augmented feature dimension.
The kernel parameters $\alpha$, $\beta$, and $\gamma$ are new hyperparameters.
Sparse Kernel Machines

SVC is a kernel machine: $f(\mathbf{x}) = \sum_i \alpha_i y^{(i)}k(\mathbf{x}^{(i)}, \mathbf{x}) + b$
It is perhaps surprising that SVC works like $K$-NN in some sense.
However, SVC is a sparse kernel machine: only the examples on or violating the margin become support vectors ($\alpha_i > 0$).
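A sketch showing this sparsity in practice with scikit-learn's SVC: only the support vectors (and their dual coefficients) are needed at prediction time; the dataset and hyperparameters are assumptions.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X), "examples")
# Predictions use only clf.support_vectors_, clf.dual_coef_ (= alpha_i * y_i), and clf.intercept_
```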
KKT Conditions and Types of SVs

By the KKT conditions, we have:
Primal feasibility: $y^{(i)}\big(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b\big) \ge 1 - \xi_i$ and $\xi_i \ge 0$
Complementary slackness: $\alpha_i\big(1 - y^{(i)}(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b) - \xi_i\big) = 0$ and $\beta_i(-\xi_i) = 0$
Depending on the value of $\alpha_i$, each example $\mathbf{x}^{(i)}$ can be:
- Non-SV ($\alpha_i = 0$): $y^{(i)}\big(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b\big) \ge 1$ (usually strict)
  Here $1 - y^{(i)}(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b) - \xi_i \le 0$; since $\beta_i = C - \alpha_i \ne 0$, we have $\xi_i = 0$
- Free SV ($0 < \alpha_i < C$): $y^{(i)}\big(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b\big) = 1$
  Here $1 - y^{(i)}(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b) - \xi_i = 0$; since $\beta_i = C - \alpha_i \ne 0$, we have $\xi_i = 0$
- Bounded SV ($\alpha_i = C$): $y^{(i)}\big(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b\big) \le 1$ (usually strict)
  Here $1 - y^{(i)}(\mathbf{w}^\top\Phi(\mathbf{x}^{(i)}) + b) - \xi_i = 0$; since $\beta_i = 0$, we have $\xi_i \ge 0$
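A small sketch (hypothetical helper name) that buckets examples into the three cases above from the dual variables and C:

```python
import numpy as np

def sv_types(alpha, C, tol=1e-6):
    """Return 'non-SV', 'free SV', or 'bounded SV' for each example based on alpha_i."""
    types = np.empty(len(alpha), dtype=object)
    types[alpha <= tol] = "non-SV"                        # alpha_i = 0: outside the margin, xi_i = 0
    types[(alpha > tol) & (alpha < C - tol)] = "free SV"  # 0 < alpha_i < C: exactly on the margin
    types[alpha >= C - tol] = "bounded SV"                # alpha_i = C: inside the margin or misclassified
    return types
```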
Remarks I

Pros of SVC:
Global optimality (convex problem)