SVM Kernels
COMPSCI 371D — Machine Learning
Outline
1 Linear Separability and Feature Augmentation
2 Sample Complexity
3 Computational Complexity
4 Kernels and Nonlinear SVMs
5 Mercer’s Conditions
6 Gaussian Kernels and Support Vectors
Data Representations
• Linear separability is a property of the data in a given representation
• Example (figure): a set that is not linearly separable in $(x_1, x_2)$; the class boundary is the parabola $x_2 = x_1^2$
Feature Transformations
• $x = (x_1, x_2) \to z = (z_1, z_2) = (x_1^2, x_2)$
• Now it is linearly separable! The boundary becomes the line $z_2 = z_1$
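As a concrete illustration (a minimal sketch, not part of the slides; the data, constants, and the use of scikit-learn's LinearSVC are my own choices), the code below labels points by the parabola $x_2 = x_1^2$ and compares a linear classifier on the raw coordinates with one on the transformed coordinates $z = (x_1^2, x_2)$:

```python
# Minimal sketch (not from the slides): points labeled by the parabola x2 = x1^2
# are not linearly separable in (x1, x2), but become (essentially) separable
# after the transformation z = (x1^2, x2), where the boundary is the line z2 = z1.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = (X[:, 1] > X[:, 0] ** 2).astype(int)       # class 1 above the parabola, 0 below

Z = np.column_stack([X[:, 0] ** 2, X[:, 1]])   # transformed features z = (x1^2, x2)

acc_raw = LinearSVC(C=10.0, max_iter=20000).fit(X, y).score(X, y)
acc_transformed = LinearSVC(C=10.0, max_iter=20000).fit(Z, y).score(Z, y)
print(acc_raw, acc_transformed)                # the transformed fit is near-perfect; the raw one is not
```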
Feature Augmentation
• Feature transformation: $x = (x_1, x_2) \to z = (z_1, z_2) = (x_1^2, x_2)$
• Problem: We don’t know the boundary!
• We cannot guess the correct transformation
• Feature augmentation: $x = (x_1, x_2) \to z = (z_1, z_2, z_3) = (x_1, x_2, x_1^2)$
• Why is this better? We keep the original features, so we lose nothing if the added ones turn out not to help
• Add many features in the hope that some combination will help
Not Really Just a Hope!
• Add all monomials of $x_1, x_2$ up to some degree $k$
• Example: $k = 3 \Rightarrow d' = \binom{d+k}{d} = \binom{2+3}{2} = 10$ monomials:
  $z = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$
• From Taylor’s theorem, we know that with $k$ high enough we can approximate any hypersurface by a linear combination of the features in $z$
• Issue 1: Sample complexity. More dimensions, more training data (remember the curse)
• Issue 2: Computational complexity. More features, more work
• With SVMs, we can address both issues
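A small sketch of the monomial count (not in the slides; scikit-learn's PolynomialFeatures is just one convenient way to build the augmented vector $z$):

```python
# Minimal sketch (not from the slides): d' = binom(d + k, d) monomials of degree
# up to k, generated here with scikit-learn's PolynomialFeatures.
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

d, k = 2, 3
print(comb(d + k, d))                          # binom(5, 2) = 10

x = np.array([[2.0, 3.0]])                     # a single sample x = (x1, x2)
z = PolynomialFeatures(degree=k).fit_transform(x)
print(z.shape)                                 # (1, 10): 1, x1, x2, x1^2, x1*x2, x2^2, x1^3, ...
```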
A Detour into Sample Complexity
• The more training samples we have, the better we generalize
• With a larger $N$, the set $T$ represents the model $p(x, y)$ better
• How do we formalize this notion?
• Introduce a number $\epsilon$ that measures how far from optimal a classifier is
• The smaller we want $\epsilon$ to be, the bigger $N$ needs to be
• Easier to think about: the bigger $1/\epsilon$ (“exactitude”), the bigger $N$
• The rate of growth of $N(1/\epsilon)$ is the sample complexity, more or less
• Removing “more or less” requires care
Various Risks Involved
• We train a classifier on set $T$ by picking the best $h \in \mathcal{H}$:
  $\hat{h} = \mathrm{ERM}_T(\mathcal{H}) \in \arg\min_{h \in \mathcal{H}} L_T(h)$
• Empirical risk actually achieved by $\hat{h}$: $L_T(\hat{h}) = L_T(\mathcal{H}) = \min_{h \in \mathcal{H}} L_T(h)$
• When we deploy $\hat{h}$ we want its statistical risk to be small: $L_p(\hat{h}) = E_p[\ell(y, \hat{h}(x))]$
• We can get some idea of $L_p(\hat{h})$ by testing $\hat{h}$
• Typically, $L_p(\hat{h}) > L_T(\hat{h})$
• More importantly: How small can $L_p(\hat{h})$ conceivably be?
• $L_p(\hat{h})$ is typically bigger than $L_p(\mathcal{H}) = \min_{h \in \mathcal{H}} L_p(h)$
Risk Summary
• Empirical training risk $L_T(\hat{h})$ is just a means to an end
• That’s what we minimize for training. Ignore it from here on
• Statistical risk achieved by $\hat{h}$: $L_p(\hat{h})$
• Smallest statistical risk over all $h \in \mathcal{H}$: $L_p(\mathcal{H}) = \min_{h \in \mathcal{H}} L_p(h)$
• Obviously $L_p(\hat{h}) \geq L_p(\mathcal{H})$ (by definition of the latter)
• Typically, $L_p(\hat{h}) > L_p(\mathcal{H})$. Why? Because $T$ is a poor proxy for $p(x, y)$
• Also, often $L_p(\mathcal{H}) > 0$. Why? Because $\mathcal{H}$ may not contain a perfect $h$
• Example: A linear classifier for a problem that is not linearly separable
Sample Complexity
• Typically, $L_p(\hat{h}) > L_p(\mathcal{H}) \geq 0$
• The best we can do is $L_p(\hat{h}) = L_p(\mathcal{H}) + \epsilon$ with small $\epsilon > 0$
• High performance (large $1/\epsilon$) requires lots of data (large $N$)
• Sample complexity measures how fast $N$ needs to grow as $1/\epsilon$ grows
• It is the rate of growth of $N(1/\epsilon)$
• Problem: $T$ is random, so even a huge $N$ might give poor performance once in a while if we have bad luck (a “statistical fluke”)
• We cannot guarantee that a large $N$ yields a small $\epsilon$
• We can guarantee that this happens with high probability
Sample Complexity, Cont’d
• We can only give a probabilistic guarantee:
• Given a probability $0 < \delta < 1$ (think of it as “small”), we can guarantee that if $N$ is large enough, then the probability that $L_p(\hat{h}) \geq L_p(\mathcal{H}) + \epsilon$ is at most $\delta$:
  $P[L_p(\hat{h}) \geq L_p(\mathcal{H}) + \epsilon] \leq \delta$
• The sample complexity for hypothesis space $\mathcal{H}$ is the function $N_{\mathcal{H}}(\epsilon, \delta)$ that gives the smallest $N$ for which this bound holds, regardless of the model $p(x, y)$
• Tall order: Typically, we can only give asymptotic bounds for $N_{\mathcal{H}}(\epsilon, \delta)$
Sample Complexity for Linear Classifiers and SVMs
• For a binary linear classifier, the sample complexity is $\Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$
• It grows linearly with $d$, the dimensionality of $X$, and with $1/\epsilon$
• Not too bad; this is why linear classifiers are so successful
• SVMs with a bounded data space $X$ do even better
• “Bounded”: contained in a hypersphere of finite radius
• For SVMs with bounded $X$, the sample complexity is independent of $d$. No curse!
• We can augment features to our heart’s content
What About Computational Complexity?
• Remember our plan: Go from $x = (x_1, x_2)$ to
  $z = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$
  in order to make the data separable
• Can we do this without paying the computational cost?
• Yes, with SVMs
SVMs and the Representer Theorem
• Recall the formulation of SVM training: Minimize
  $f(w, \xi) = \frac{1}{2}\|w\|^2 + \gamma \sum_{n=1}^{N} \xi_n$
  with constraints
  $y_n(w^T x_n + b) - 1 + \xi_n \geq 0, \qquad \xi_n \geq 0$
• Representer theorem: $w = \sum_{n \in \mathcal{A}(w,b)} \alpha_n y_n x_n$, so that
  $\|w\|^2 = w^T w = \sum_{m \in \mathcal{A}(w,b)} \sum_{n \in \mathcal{A}(w,b)} \alpha_m \alpha_n y_m y_n\, x_m^T x_n$
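The representer theorem can be checked numerically. The sketch below is my own illustration and assumes scikit-learn's SVC, whose dual_coef_ attribute stores the products $\alpha_n y_n$ for the support vectors:

```python
# Minimal sketch (not from the slides): for a linear SVM, the learned w is a
# weighted sum of support vectors, w = sum_n alpha_n y_n x_n (representer theorem).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w_from_sv = clf.dual_coef_ @ clf.support_vectors_   # dual_coef_ holds alpha_n * y_n
print(np.allclose(w_from_sv, clf.coef_))            # True: w lies in the span of the support vectors
```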
Using the Representer Theorem
• Representer theorem: $w = \sum_{n \in \mathcal{A}(w,b)} \alpha_n y_n x_n$
• In the constraint $y_n(w^T x_n + b) - 1 + \xi_n \geq 0$ we have
  $w^T x_n = \sum_{m \in \mathcal{A}(w,b)} \alpha_m y_m\, x_m^T x_n$
• Summary: $x$ appears in an inner product, never alone:
  $\min_{w, b, \xi} \ \frac{1}{2} \sum_{m \in \mathcal{A}(u)} \sum_{n \in \mathcal{A}(u)} \alpha_m \alpha_n y_m y_n\, x_m^T x_n + C \sum_{n=1}^{N} \xi_n$
  subject to the constraints
  $y_n\left(\sum_{m \in \mathcal{A}(u)} \alpha_m y_m\, x_m^T x_n + b\right) - 1 + \xi_n \geq 0, \qquad \xi_n \geq 0$
The Kernel
• Augment $x \in \mathbb{R}^d$ to $\varphi(x) \in \mathbb{R}^{d'}$, with $d' \gg d$ (typically):
  $\min_{w, b, \xi} \ \frac{1}{2} \sum_{m \in \mathcal{A}(u)} \sum_{n \in \mathcal{A}(u)} \alpha_m \alpha_n y_m y_n\, \varphi(x_m)^T \varphi(x_n) + C \sum_{n=1}^{N} \xi_n$
  subject to the constraints
  $y_n\left(\sum_{m \in \mathcal{A}(u)} \alpha_m y_m\, \varphi(x_m)^T \varphi(x_n) + b\right) - 1 + \xi_n \geq 0, \qquad \xi_n \geq 0$
• The value $K(x_m, x_n) \stackrel{\text{def}}{=} \varphi(x_m)^T \varphi(x_n)$ is a number
• The optimization algorithm needs to know only $K(x_m, x_n)$, not $\varphi(x_n)$. $K$ is called a kernel
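The sketch below (not from the slides; the data set and the degree-3 polynomial kernel are arbitrary illustrative choices) shows that training needs only the kernel values: an SVM fit from a precomputed Gram matrix makes the same decisions as one that evaluates the same kernel internally.

```python
# Minimal sketch (not from the slides): the optimizer only needs the numbers
# K(x_m, x_n), never phi(x) itself. A precomputed Gram matrix and the built-in
# polynomial kernel give the same classifier.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
K = (X @ X.T + 1.0) ** 3                             # K(x, z) = (x^T z + 1)^3

clf_gram = SVC(kernel="precomputed", C=1.0).fit(K, y)
clf_poly = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

agree = np.mean(clf_gram.predict(K) == clf_poly.predict(X))
print(agree)                                         # expected: 1.0 (identical decisions on the training set)
```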
Decision Rule
• The same holds for the decision rule:
  $\hat{y} = h(x) = \mathrm{sign}(w^T x + b)$
  becomes
  $\hat{y} = h(x) = \mathrm{sign}\left(\sum_{m \in \mathcal{A}(w,b)} \alpha_m y_m\, x_m^T x + b\right)$
  because of the representer theorem $w = \sum_{n \in \mathcal{A}(w,b)} \alpha_n y_n x_n$, and therefore, after feature augmentation,
  $\hat{y} = h(x) = \mathrm{sign}\left(\sum_{m \in \mathcal{A}(w,b)} \alpha_m y_m\, \varphi(x_m)^T \varphi(x) + b\right)$
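A sketch of this kernelized decision rule reconstructed from a fitted model (my own illustration; the query point and kernel parameters are made up): the score $\sum_{m} \alpha_m y_m K(x_m, x) + b$ is rebuilt from dual_coef_, support_vectors_, and intercept_ and compared with scikit-learn's decision_function.

```python
# Minimal sketch (not from the slides): rebuild sign(sum_m alpha_m y_m K(x_m, x) + b)
# from the support vectors and dual coefficients of a fitted kernel SVM.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

x_new = np.array([[0.5, 0.25]])                           # an arbitrary query point
K_new = (clf.support_vectors_ @ x_new.T + 1.0) ** 3       # K(x_m, x) for every support vector x_m
score = clf.dual_coef_ @ K_new + clf.intercept_           # sum_m alpha_m y_m K(x_m, x) + b

print(np.allclose(score.ravel(), clf.decision_function(x_new)))  # True
print(int(score.item() > 0) == clf.predict(x_new)[0])            # positive score <=> class 1
```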
Kernel Idea 1
• Start with some $\varphi(x)$ and use the kernel to save computation
• Example: $\varphi(x) = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$
• Don’t know how to simplify. Try this instead:
  $\varphi(x) = (1,\ \sqrt{3}\,x_1,\ \sqrt{3}\,x_2,\ \sqrt{3}\,x_1^2,\ \sqrt{6}\,x_1 x_2,\ \sqrt{3}\,x_2^2,\ x_1^3,\ \sqrt{3}\,x_1^2 x_2,\ \sqrt{3}\,x_1 x_2^2,\ x_2^3)$
• Can show (see notes) that $K(x, z) = \varphi(x)^T \varphi(z) = (x^T z + 1)^3$
• Something similar works for any $d$ and $k$
• 4 products and 2 sums instead of 10 products and 9 sums
• Meager savings here, but the savings grow exponentially with $d$ and $k$, as we know
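A quick numerical check (not in the slides) that this rescaled $\varphi$ reproduces the kernel $(x^T z + 1)^3$:

```python
# Minimal sketch (not from the slides): verify phi(x)^T phi(z) = (x^T z + 1)^3
# for the degree-3 feature map with the square-root rescalings above.
import numpy as np

def phi(x):
    x1, x2 = x
    s3, s6 = np.sqrt(3.0), np.sqrt(6.0)
    return np.array([1.0, s3 * x1, s3 * x2,
                     s3 * x1**2, s6 * x1 * x2, s3 * x2**2,
                     x1**3, s3 * x1**2 * x2, s3 * x1 * x2**2, x2**3])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(z), (x @ z + 1.0) ** 3))    # True
```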
Much Better: Kernel Idea 2
• Just come up with $K(x, z)$ without knowing the corresponding $\varphi(x)$
• Not just any $K$: it must behave like an inner product
• For instance, $x^T z = z^T x$ and $(x^T z)^2 \leq \|x\|^2 \|z\|^2$ (symmetry and Cauchy-Schwarz), so we need at least
  $K(x, z) = K(z, x) \qquad \text{and} \qquad K^2(x, z) \leq K(x, x)\, K(z, z)$
• These conditions are necessary, but they are not sufficient
• Fortunately, there is a theory for this
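As a preview of the theory that the next section formalizes (this check is my own illustration, not a claim from the slides), one practical necessary condition is that a kernel must produce symmetric, positive semidefinite Gram matrices on any finite sample, which is easy to test numerically:

```python
# Minimal sketch (not from the slides): a kernel that behaves like an inner
# product must yield symmetric, positive semidefinite Gram matrices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

K = (X @ X.T + 1.0) ** 3                      # degree-3 polynomial kernel on 50 samples
print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-8)   # no significantly negative eigenvalues
```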