SVM Kernels
COMPSCI 371D — Machine Learning
Outline
1 Linear Separability and Feature Augmentation
2 Sample Complexity
3 Computational Complexity
4 Kernels and Nonlinear SVMs
5 Mercer’s Conditions
6 Gaussian Kernels and Support Vectors
Data Representations
• Linear separability is a property of the data in a given representation
• Example (figure): a set that is not linearly separable in $(x_1, x_2)$; the class boundary is the parabola $x_2 = x_1^2$
Feature Transformations
• $x = (x_1, x_2) \to z = (z_1, z_2) = (x_1^2, x_2)$
• Now it is linearly separable! The boundary becomes the line $z_2 = z_1$
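As a concrete illustration (a minimal sketch, not part of the slides; the data, constants, and the use of scikit-learn's LinearSVC are my own choices), the code below labels points by the parabola $x_2 = x_1^2$ and compares a linear classifier on the raw coordinates with one on the transformed coordinates $z = (x_1^2, x_2)$:

```python
# Minimal sketch (not from the slides): points labeled by the parabola x2 = x1^2
# are not linearly separable in (x1, x2), but become (essentially) separable
# after the transformation z = (x1^2, x2), where the boundary is the line z2 = z1.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = (X[:, 1] > X[:, 0] ** 2).astype(int)       # class 1 above the parabola, 0 below

Z = np.column_stack([X[:, 0] ** 2, X[:, 1]])   # transformed features z = (x1^2, x2)

acc_raw = LinearSVC(C=10.0, max_iter=20000).fit(X, y).score(X, y)
acc_transformed = LinearSVC(C=10.0, max_iter=20000).fit(Z, y).score(Z, y)
print(acc_raw, acc_transformed)                # the transformed fit is near-perfect; the raw one is not
```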
Feature Augmentation
• Feature transformation: $x = (x_1, x_2) \to z = (z_1, z_2) = (x_1^2, x_2)$
• Problem: We don’t know the boundary!
• We cannot guess the correct transformation
• Feature augmentation: $x = (x_1, x_2) \to z = (z_1, z_2, z_3) = (x_1, x_2, x_1^2)$
• Why is this better? We keep the original features, so we lose nothing if the added ones turn out not to help
• Add many features in the hope that some combination will help
Not Really Just a Hope!
• Add all monomials of $x_1, x_2$ up to some degree $k$
• Example: $k = 3 \Rightarrow d' = \binom{d+k}{d} = \binom{2+3}{2} = 10$ monomials:
  $z = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$
• From Taylor’s theorem, we know that with $k$ high enough we can approximate any hypersurface by a linear combination of the features in $z$
• Issue 1: Sample complexity. More dimensions, more training data (remember the curse)
• Issue 2: Computational complexity. More features, more work
• With SVMs, we can address both issues
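A small sketch of the monomial count (not in the slides; scikit-learn's PolynomialFeatures is just one convenient way to build the augmented vector $z$):

```python
# Minimal sketch (not from the slides): d' = binom(d + k, d) monomials of degree
# up to k, generated here with scikit-learn's PolynomialFeatures.
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

d, k = 2, 3
print(comb(d + k, d))                          # binom(5, 2) = 10

x = np.array([[2.0, 3.0]])                     # a single sample x = (x1, x2)
z = PolynomialFeatures(degree=k).fit_transform(x)
print(z.shape)                                 # (1, 10): 1, x1, x2, x1^2, x1*x2, x2^2, x1^3, ...
```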
A Detour into Sample Complexity
• The more training samples we have, the better we generalize
• With a larger $N$, the set $T$ represents the model $p(x, y)$ better
• How do we formalize this notion?
• Introduce a number $\epsilon$ that measures how far from optimal a classifier is
• The smaller we want $\epsilon$ to be, the bigger $N$ needs to be
• Easier to think about: the bigger $1/\epsilon$ (“exactitude”), the bigger $N$
• The rate of growth of $N(1/\epsilon)$ is the sample complexity, more or less
• Removing “more or less” requires care
Various Risks Involved
• We train a classifier on set $T$ by picking the best $h \in \mathcal{H}$:
  $\hat{h} = \mathrm{ERM}_T(\mathcal{H}) \in \arg\min_{h \in \mathcal{H}} L_T(h)$
• Empirical risk actually achieved by $\hat{h}$: $L_T(\hat{h}) = L_T(\mathcal{H}) = \min_{h \in \mathcal{H}} L_T(h)$
• When we deploy $\hat{h}$ we want its statistical risk to be small: $L_p(\hat{h}) = E_p[\ell(y, \hat{h}(x))]$
• We can get some idea of $L_p(\hat{h})$ by testing $\hat{h}$
• Typically, $L_p(\hat{h}) > L_T(\hat{h})$
• More importantly: How small can $L_p(\hat{h})$ conceivably be?
• $L_p(\hat{h})$ is typically bigger than $L_p(\mathcal{H}) = \min_{h \in \mathcal{H}} L_p(h)$
Risk Summary
• Empirical training risk $L_T(\hat{h})$ is just a means to an end
• That’s what we minimize for training. Ignore it from here on
• Statistical risk achieved by $\hat{h}$: $L_p(\hat{h})$
• Smallest statistical risk over all $h \in \mathcal{H}$: $L_p(\mathcal{H}) = \min_{h \in \mathcal{H}} L_p(h)$
• Obviously $L_p(\hat{h}) \geq L_p(\mathcal{H})$ (by definition of the latter)
• Typically, $L_p(\hat{h}) > L_p(\mathcal{H})$. Why? Because $T$ is a poor proxy for $p(x, y)$
• Also, often $L_p(\mathcal{H}) > 0$. Why? Because $\mathcal{H}$ may not contain a perfect $h$
• Example: A linear classifier for a problem that is not linearly separable
Sample Complexity
• Typically, $L_p(\hat{h}) > L_p(\mathcal{H}) \geq 0$
• The best we can do is $L_p(\hat{h}) = L_p(\mathcal{H}) + \epsilon$ with small $\epsilon > 0$
• High performance (large $1/\epsilon$) requires lots of data (large $N$)
• Sample complexity measures how fast $N$ needs to grow as $1/\epsilon$ grows
• It is the rate of growth of $N(1/\epsilon)$
• Problem: $T$ is random, so even a huge $N$ might give poor performance once in a while if we have bad luck (a “statistical fluke”)
• We cannot guarantee that a large $N$ yields a small $\epsilon$
• We can guarantee that this happens with high probability
Sample Complexity, Cont’d
• We can only give a probabilistic guarantee:
• Given a probability $0 < \delta < 1$ (think of it as “small”), we can guarantee that if $N$ is large enough, then the probability that $L_p(\hat{h}) \geq L_p(\mathcal{H}) + \epsilon$ is at most $\delta$:
  $P[L_p(\hat{h}) \geq L_p(\mathcal{H}) + \epsilon] \leq \delta$
• The sample complexity for hypothesis space $\mathcal{H}$ is the function $N_{\mathcal{H}}(\epsilon, \delta)$ that gives the smallest $N$ for which this bound holds, regardless of the model $p(x, y)$
• Tall order: Typically, we can only give asymptotic bounds for $N_{\mathcal{H}}(\epsilon, \delta)$
Sample Complexity for Linear Classifiers and SVMs
• For a binary linear classifier, the sample complexity is $\Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$
• It grows linearly with $d$, the dimensionality of $X$, and with $1/\epsilon$
• Not too bad; this is why linear classifiers are so successful
• SVMs with a bounded data space $X$ do even better
• “Bounded”: contained in a hypersphere of finite radius
• For SVMs with bounded $X$, the sample complexity is independent of $d$. No curse!
• We can augment features to our heart’s content
What About Computational Complexity?
• Remember our plan: Go from $x = (x_1, x_2)$ to
  $z = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$
  in order to make the data separable
• Can we do this without paying the computational cost?
• Yes, with SVMs
SVMs and the Representer Theorem
• Recall the formulation of SVM training: Minimize
  $f(w, \xi) = \frac{1}{2}\|w\|^2 + \gamma \sum_{n=1}^{N} \xi_n$
  with constraints
  $y_n(w^T x_n + b) - 1 + \xi_n \geq 0, \qquad \xi_n \geq 0$
• Representer theorem: $w = \sum_{n \in \mathcal{A}(w,b)} \alpha_n y_n x_n$, so that
  $\|w\|^2 = w^T w = \sum_{m \in \mathcal{A}(w,b)} \sum_{n \in \mathcal{A}(w,b)} \alpha_m \alpha_n y_m y_n\, x_m^T x_n$
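The representer theorem can be checked numerically. The sketch below is my own illustration and assumes scikit-learn's SVC, whose dual_coef_ attribute stores the products $\alpha_n y_n$ for the support vectors:

```python
# Minimal sketch (not from the slides): for a linear SVM, the learned w is a
# weighted sum of support vectors, w = sum_n alpha_n y_n x_n (representer theorem).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w_from_sv = clf.dual_coef_ @ clf.support_vectors_   # dual_coef_ holds alpha_n * y_n
print(np.allclose(w_from_sv, clf.coef_))            # True: w lies in the span of the support vectors
```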
Using the Representer Theorem
• Representer theorem: $w = \sum_{n \in \mathcal{A}(w,b)} \alpha_n y_n x_n$
• In the constraint $y_n(w^T x_n + b) - 1 + \xi_n \geq 0$ we have
  $w^T x_n = \sum_{m \in \mathcal{A}(w,b)} \alpha_m y_m\, x_m^T x_n$
• Summary: $x$ appears in an inner product, never alone:
  $\min_{w, b, \xi} \ \frac{1}{2} \sum_{m \in \mathcal{A}(u)} \sum_{n \in \mathcal{A}(u)} \alpha_m \alpha_n y_m y_n\, x_m^T x_n + C \sum_{n=1}^{N} \xi_n$
  subject to the constraints
  $y_n\left(\sum_{m \in \mathcal{A}(u)} \alpha_m y_m\, x_m^T x_n + b\right) - 1 + \xi_n \geq 0, \qquad \xi_n \geq 0$
The Kernel
• Augment $x \in \mathbb{R}^d$ to $\varphi(x) \in \mathbb{R}^{d'}$, with $d' \gg d$ (typically):
  $\min_{w, b, \xi} \ \frac{1}{2} \sum_{m \in \mathcal{A}(u)} \sum_{n \in \mathcal{A}(u)} \alpha_m \alpha_n y_m y_n\, \varphi(x_m)^T \varphi(x_n) + C \sum_{n=1}^{N} \xi_n$
  subject to the constraints
  $y_n\left(\sum_{m \in \mathcal{A}(u)} \alpha_m y_m\, \varphi(x_m)^T \varphi(x_n) + b\right) - 1 + \xi_n \geq 0, \qquad \xi_n \geq 0$
• The value $K(x_m, x_n) \stackrel{\text{def}}{=} \varphi(x_m)^T \varphi(x_n)$ is a number
• The optimization algorithm needs to know only $K(x_m, x_n)$, not $\varphi(x_n)$. $K$ is called a kernel
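The sketch below (not from the slides; the data set and the degree-3 polynomial kernel are arbitrary illustrative choices) shows that training needs only the kernel values: an SVM fit from a precomputed Gram matrix makes the same decisions as one that evaluates the same kernel internally.

```python
# Minimal sketch (not from the slides): the optimizer only needs the numbers
# K(x_m, x_n), never phi(x) itself. A precomputed Gram matrix and the built-in
# polynomial kernel give the same classifier.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
K = (X @ X.T + 1.0) ** 3                             # K(x, z) = (x^T z + 1)^3

clf_gram = SVC(kernel="precomputed", C=1.0).fit(K, y)
clf_poly = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

agree = np.mean(clf_gram.predict(K) == clf_poly.predict(X))
print(agree)                                         # expected: 1.0 (identical decisions on the training set)
```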
Decision Rule
• The same holds for the decision rule:
  $\hat{y} = h(x) = \mathrm{sign}(w^T x + b)$
  becomes
  $\hat{y} = h(x) = \mathrm{sign}\left(\sum_{m \in \mathcal{A}(w,b)} \alpha_m y_m\, x_m^T x + b\right)$
  because of the representer theorem $w = \sum_{n \in \mathcal{A}(w,b)} \alpha_n y_n x_n$, and therefore, after feature augmentation,
  $\hat{y} = h(x) = \mathrm{sign}\left(\sum_{m \in \mathcal{A}(w,b)} \alpha_m y_m\, \varphi(x_m)^T \varphi(x) + b\right)$
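A sketch of this kernelized decision rule reconstructed from a fitted model (my own illustration; the query point and kernel parameters are made up): the score $\sum_{m} \alpha_m y_m K(x_m, x) + b$ is rebuilt from dual_coef_, support_vectors_, and intercept_ and compared with scikit-learn's decision_function.

```python
# Minimal sketch (not from the slides): rebuild sign(sum_m alpha_m y_m K(x_m, x) + b)
# from the support vectors and dual coefficients of a fitted kernel SVM.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

x_new = np.array([[0.5, 0.25]])                           # an arbitrary query point
K_new = (clf.support_vectors_ @ x_new.T + 1.0) ** 3       # K(x_m, x) for every support vector x_m
score = clf.dual_coef_ @ K_new + clf.intercept_           # sum_m alpha_m y_m K(x_m, x) + b

print(np.allclose(score.ravel(), clf.decision_function(x_new)))  # True
print(int(score.item() > 0) == clf.predict(x_new)[0])            # positive score <=> class 1
```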
Kernel Idea 1
• Start with some $\varphi(x)$ and use the kernel to save computation
• Example: $\varphi(x) = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$
• Don’t know how to simplify. Try this instead:
  $\varphi(x) = (1,\ \sqrt{3}\,x_1,\ \sqrt{3}\,x_2,\ \sqrt{3}\,x_1^2,\ \sqrt{6}\,x_1 x_2,\ \sqrt{3}\,x_2^2,\ x_1^3,\ \sqrt{3}\,x_1^2 x_2,\ \sqrt{3}\,x_1 x_2^2,\ x_2^3)$
• Can show (see notes) that $K(x, z) = \varphi(x)^T \varphi(z) = (x^T z + 1)^3$
• Something similar works for any $d$ and $k$
• 4 products and 2 sums instead of 10 products and 9 sums
• Meager savings here, but the savings grow exponentially with $d$ and $k$, as we know
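A quick numerical check (not in the slides) that this rescaled $\varphi$ reproduces the kernel $(x^T z + 1)^3$:

```python
# Minimal sketch (not from the slides): verify phi(x)^T phi(z) = (x^T z + 1)^3
# for the degree-3 feature map with the square-root rescalings above.
import numpy as np

def phi(x):
    x1, x2 = x
    s3, s6 = np.sqrt(3.0), np.sqrt(6.0)
    return np.array([1.0, s3 * x1, s3 * x2,
                     s3 * x1**2, s6 * x1 * x2, s3 * x2**2,
                     x1**3, s3 * x1**2 * x2, s3 * x1 * x2**2, x2**3])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(z), (x @ z + 1.0) ** 3))    # True
```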
Much Better: Kernel Idea 2
• Just come up with $K(x, z)$ without knowing the corresponding $\varphi(x)$
• Not just any $K$: it must behave like an inner product
• For instance, $x^T z = z^T x$ and $(x^T z)^2 \leq \|x\|^2 \|z\|^2$ (symmetry and Cauchy-Schwarz), so we need at least
  $K(x, z) = K(z, x) \qquad \text{and} \qquad K^2(x, z) \leq K(x, x)\, K(z, z)$
• These conditions are necessary, but they are not sufficient
• Fortunately, there is a theory for this
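As a preview of the theory that the next section formalizes (this check is my own illustration, not a claim from the slides), one practical necessary condition is that a kernel must produce symmetric, positive semidefinite Gram matrices on any finite sample, which is easy to test numerically:

```python
# Minimal sketch (not from the slides): a kernel that behaves like an inner
# product must yield symmetric, positive semidefinite Gram matrices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

K = (X @ X.T + 1.0) ** 3                      # degree-3 polynomial kernel on 50 samples
print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-8)   # no significantly negative eigenvalues
```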