E9 205 Machine Learning for Signal Processing: Support Vector Machines

  1. E9 205 Machine Learning for Signal Processing. Support Vector Machines. 9-10-2019

  2. Linear Classifiers. f(x, w, b) = sign(w x + b). [Figure: a 2-D scatter of points labelled +1 and -1 with one candidate separating line; the regions w x + b > 0 and w x + b < 0 lie on either side.] How would you classify this data? (“SVM and applications”, Mingyue Tan, Univ. of British Columbia)

  3. Linear Classifiers. [Figure: the same scatter with a different candidate boundary, the line w x + b = 0, between the regions w x + b > 0 and w x + b < 0.] How would you classify this data? (“SVM and applications”, Mingyue Tan, Univ. of British Columbia)

  4. Linear Classifiers. [Figure: the same scatter with yet another candidate separating line.] How would you classify this data? (“SVM and applications”, Mingyue Tan, Univ. of British Columbia)

  5. Linear Classifiers. [Figure: the same scatter with another candidate separating line.] How would you classify this data? (“SVM and applications”, Mingyue Tan, Univ. of British Columbia)

  6. Linear Classifiers. [Figure: several candidate separating lines drawn through the same data.] Any of these would be fine... but which is best? (“SVM and applications”, Mingyue Tan, Univ. of British Columbia)

  7. Linear Classifiers. [Figure: a separating line that passes very close to the data, so a point near the boundary is misclassified to the +1 class.] How would you classify this data? (“SVM and applications”, Mingyue Tan, Univ. of British Columbia)

  8. Linear Classifiers. Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

  9. Maximum Margin. f(x, w, b) = sign(w x + b). The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM): the linear SVM. Support vectors are those data points that the margin pushes up against. 1. Maximizing the margin is good according to intuition. 2. It implies that only the support vectors are important; the other training examples are ignorable. 3. Empirically it works very, very well.
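As a concrete illustration (not part of the original slides), here is a minimal sketch of a maximum-margin linear classifier using scikit-learn's SVC with a linear kernel on synthetic data; a very large C approximates the hard-margin LSVM described above, and the margin width 2/‖w‖ and the support vectors can be read off the fitted model.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs labelled +1 / -1 (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], size=(50, 2)), rng.normal([-2, -2], size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard-margin LSVM

w = clf.coef_[0]                              # weight vector w
b = clf.intercept_[0]                         # bias b
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```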

  10. Non-linear SVMs ■ Datasets that are linearly separable with some noise work out great. [Figure: 1-D data on the x axis, separable by a threshold at x = 0.] ■ But what are we going to do if the dataset is just too hard? [Figure: 1-D data where no single threshold on x separates the classes.] ■ How about mapping the data to a higher-dimensional space? [Figure: the same data plotted with x on the horizontal axis and x² on the vertical axis, where a horizontal line now separates the classes.]

  11. Non-linear SVMs: Feature spaces ■ General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
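A small numpy illustration of the idea on slides 10-11 (the toy 1-D points and the threshold x² < 2 are made up purely for illustration): data that no single threshold on x can separate becomes separable after the mapping x → (x, x²).

```python
import numpy as np

# 1-D points: class +1 sits near the origin, class -1 further out,
# so no single threshold on x separates the two classes.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.6, 3.1])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Map each point to the 2-D feature space phi(x) = (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the feature space the horizontal line x^2 = 2 separates the classes.
pred = np.where(phi[:, 1] < 2.0, 1, -1)
print(np.array_equal(pred, y))   # True
```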

  12. The “Kernel Trick” ■ The linear classifier relies on the dot product between vectors: k(x_i, x_j) = x_i^T x_j. ■ If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes k(x_i, x_j) = φ(x_i)^T φ(x_j). ■ A kernel function is a function that corresponds to an inner product in some expanded feature space. ■ Example: for 2-dimensional vectors x = [x_1, x_2], let k(x_i, x_j) = (1 + x_i^T x_j)². We need to show that k(x_i, x_j) = φ(x_i)^T φ(x_j): k(x_i, x_j) = (1 + x_i^T x_j)² = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2 = [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]^T [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2] = φ(x_i)^T φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2].
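The algebra above can be checked numerically; the sketch below (plain numpy, with arbitrary example vectors) confirms that (1 + x_i^T x_j)² equals φ(x_i)^T φ(x_j) for the explicit map φ given on the slide.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x_i^T x_j)^2."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def k(xi, xj):
    """The same kernel evaluated directly in the 2-D input space."""
    return (1.0 + xi @ xj) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(k(xi, xj), phi(xi) @ phi(xj))   # both print 42.25
```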

  13. What Functions are Kernels? ■ For many functions k(x_i, x_j), checking that k(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome. ■ Mercer’s theorem: every positive semi-definite symmetric function is a kernel. ■ Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
      K = [ k(x_1, x_1)  k(x_1, x_2)  k(x_1, x_3)  …  k(x_1, x_N)
            k(x_2, x_1)  k(x_2, x_2)  k(x_2, x_3)  …  k(x_2, x_N)
            …            …            …            …  …
            k(x_N, x_1)  k(x_N, x_2)  k(x_N, x_3)  …  k(x_N, x_N) ]
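A quick numerical way to see the Mercer condition in action (not from the slides; the Gaussian kernel and the random sample are illustrative choices): build the Gram matrix of a kernel on some sample points and check that its eigenvalues are non-negative.

```python
import numpy as np

def gram_matrix(kernel, X):
    """K[n, m] = kernel(x_n, x_m) for all pairs of rows of X."""
    return np.array([[kernel(xn, xm) for xm in X] for xn in X])

# Gaussian kernel with sigma = 1 (an arbitrary illustrative choice).
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

X = np.random.default_rng(1).normal(size=(20, 3))   # 20 random sample points
K = gram_matrix(rbf, X)

eigvals = np.linalg.eigvalsh(K)                     # K is symmetric
print("smallest eigenvalue:", eigvals.min())        # >= 0 up to round-off
```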

  14. Examples of Kernel Functions ■ Linear: k(x_i, x_j) = x_i^T x_j ■ Polynomial of power p: k(x_i, x_j) = (1 + x_i^T x_j)^p ■ Gaussian (radial-basis function network): k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)) ■ Sigmoid: k(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1)
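For reference, a minimal numpy sketch of these four kernels; the parameter values (p, σ, β_0, β_1) are arbitrary defaults chosen for illustration.

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def poly_kernel(xi, xj, p=3):
    return (1.0 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=-1.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 0.5]), np.array([0.2, -1.0])
for f in (linear_kernel, poly_kernel, rbf_kernel, sigmoid_kernel):
    print(f.__name__, f(xi, xj))
```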

  15. SVM Formulation ❖ Goal: 1) correctly classify all training data: y_n (w^T x_n + b) ≥ 1 for every training point; 2) define the margin: the margin width is 2 / ‖w‖; 3) maximize the margin. ❖ Equivalently written as: minimize (1/2) ‖w‖² such that y_n (w^T x_n + b) ≥ 1 for all n.
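A sketch of this primal problem as a small quadratic program in CVXPY (CVXPY is an assumption, not something the slides use; the synthetic blobs are only for illustration): minimize (1/2)‖w‖² subject to y_n (w^T x_n + b) ≥ 1.

```python
import cvxpy as cp
import numpy as np

# Synthetic, linearly separable blobs labelled +1 / -1 (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], size=(20, 2)), rng.normal([-2, -2], size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # maximizing the margin 2/||w||
constraints = [cp.multiply(y, X @ w + b) >= 1]          # classify every point correctly
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value, "margin =", 2.0 / np.linalg.norm(w.value))
```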

  16. Solving the Optimization Problem ■ We need to optimize a quadratic function subject to linear constraints. ■ Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. ■ The solution involves constructing a dual problem in which a Lagrange multiplier α_n is associated with every constraint in the primal problem. ■ The dual problem in this case: find α_1, …, α_N such that Σ_n α_n − (1/2) Σ_n Σ_m α_n α_m y_n y_m x_n^T x_m is maximized, subject to Σ_n α_n y_n = 0 and α_n ≥ 0 for all n.
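The dual can be sketched the same way (again with CVXPY and synthetic data as assumptions, not part of the slides): maximize Σ_n α_n − (1/2)‖Σ_n α_n y_n x_n‖², which equals the double sum above, subject to α_n ≥ 0 and Σ_n α_n y_n = 0.

```python
import cvxpy as cp
import numpy as np

# Same kind of synthetic separable data as before.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], size=(20, 2)), rng.normal([-2, -2], size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
N = len(y)

a = cp.Variable(N)                         # Lagrange multipliers alpha_n
# sum_n a_n - 1/2 ||sum_n a_n y_n x_n||^2 (same value as the double sum over n, m)
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(X.T @ cp.multiply(a, y)))
constraints = [a >= 0, cp.sum(cp.multiply(a, y)) == 0]
cp.Problem(objective, constraints).solve()

alpha = a.value
print("non-zero alphas (support vectors):", int((alpha > 1e-6).sum()))
```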

  17. Solving the Optimization Problem ■ The solution has the form w = Σ_n α_n y_n x_n. ■ Each non-zero α_n indicates that the corresponding x_n is a support vector. Let S denote the set of support vectors. ■ The classifying function then has the form y(x) = sign( Σ_{n ∈ S} α_n y_n x_n^T x + b ), where b can be obtained from any support vector x_k via y_k (w^T x_k + b) = 1.
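scikit-learn happens to expose the quantities in this solution: dual_coef_ stores α_n y_n for the support vectors and support_vectors_ stores the corresponding x_n, so w and b can be reconstructed explicitly. A sketch (synthetic data and a large C to approximate the separable case are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], size=(20, 2)), rng.normal([-2, -2], size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ separable (hard-margin) case

alpha_y = clf.dual_coef_.ravel()              # alpha_n * y_n, support vectors only
S = clf.support_vectors_                      # the x_n with non-zero alpha_n

w = alpha_y @ S                               # w = sum_{n in S} alpha_n y_n x_n
b = np.mean(y[clf.support_] - S @ w)          # from y_k (w . x_k + b) = 1, averaged over S

x_new = np.array([1.5, 1.0])
print(np.sign(w @ x_new + b), clf.predict([x_new])[0])   # the two agree
```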

  18. Solving the Optimization Problem

  19. Visualizing Gaussian Kernel SVM
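No figure survives for this slide; one possible way to reproduce such a visualization with scikit-learn and matplotlib (the two-moons dataset and the kernel width are arbitrary choices, not from the slides) is sketched below. The contours at −1, 0, +1 show the margin and decision boundary of an RBF-kernel SVM.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two-moons data: not linearly separable in the input space.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

# Evaluate the decision function on a grid and draw margin / boundary contours.
xx, yy = np.meshgrid(np.linspace(-1.5, 2.5, 300), np.linspace(-1.0, 1.5, 300))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z > 0, alpha=0.2)
plt.contour(xx, yy, Z, levels=[-1, 0, 1], colors="k", linestyles=["--", "-", "--"])
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=80, facecolors="none", edgecolors="r", label="support vectors")
plt.legend()
plt.show()
```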

  20. Overlapping class boundaries ■ When the classes are not linearly separable, we introduce slack variables ξ_n, one per training point. ■ Slack variables are non-negative: ξ_n ≥ 0. ■ They are defined using the relaxed constraints y_n (w^T x_n + b) ≥ 1 − ξ_n. ■ Σ_n ξ_n is an upper bound on the number of misclassifications. ■ The cost function to be optimized in this case is C Σ_n ξ_n + (1/2) ‖w‖².
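A sketch of these quantities on overlapping synthetic data (the data and C value are illustrative assumptions): after fitting a soft-margin linear SVC, the slacks ξ_n = max(0, 1 − y_n (w^T x_n + b)) can be computed directly, and Σ_n ξ_n upper-bounds the number of training errors.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes: the two blobs are close enough that some points cross over.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1], size=(50, 2)), rng.normal([-1, -1], size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Slack xi_n = max(0, 1 - y_n (w . x_n + b)); xi_n > 1 means x_n is misclassified.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print("sum of slacks (upper bound on errors):", xi.sum())
print("actual training errors:", int((np.sign(X @ w + b) != y).sum()))
print("soft-margin cost C*sum(xi) + 0.5*||w||^2:", C * xi.sum() + 0.5 * (w @ w))
```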

  21. SVM Formulation - overlapping classes ■ The formulation is very similar to the previous case except for the additional constraints 0 ≤ α_n ≤ C. ■ Solved using the dual formulation, e.g. with the sequential minimal optimization (SMO) algorithm. ■ The final classifier is based on the sign of y(x) = Σ_n α_n y_n k(x, x_n) + b.
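scikit-learn's SVC is built on libsvm, which solves this dual with an SMO-style algorithm. The sketch below (overlapping synthetic data and parameter values are illustrative assumptions) fits an RBF-kernel SVC and checks that evaluating sign( Σ_n α_n y_n k(x, x_n) + b ) by hand, via dual_coef_, support_vectors_, and intercept_, matches predict.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([1, 1], size=(50, 2)), rng.normal([-1, -1], size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Manual evaluation of sign( sum_n alpha_n y_n k(x, x_n) + b ):
# dual_coef_ holds alpha_n * y_n for the support vectors, intercept_ holds b.
x_new = np.array([[0.5, 0.8]])
k_vals = rbf_kernel(x_new, clf.support_vectors_, gamma=gamma)   # shape (1, n_SV)
manual = np.sign(k_vals @ clf.dual_coef_.ravel() + clf.intercept_[0])
print(manual[0], clf.predict(x_new)[0])                         # the two agree
```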

  22. Overlapping class boundaries
