Data Mining: Support Vector Machines
Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar

Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
• One possible solution

Support Vector Machines
• Another possible solution
Support Vector Machines
• Other possible solutions

Support Vector Machines
• Which one is better? B1 or B2?
• How do you define better?
Support Vector Machines
• Find the hyperplane that maximizes the margin ⇒ B1 is better than B2

Support Vector Machines
• Decision boundary: $\mathbf{w} \cdot \mathbf{x} + b = 0$
• Margin hyperplanes: $\mathbf{w} \cdot \mathbf{x} + b = +1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$
• Classifier:
  $f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}$
• Margin $= \dfrac{2}{\lVert \mathbf{w} \rVert}$
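A one-line derivation of the margin expression (basic geometry, not shown on the slide): the distance from a point $\mathbf{x}_0$ to the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ is $|\mathbf{w} \cdot \mathbf{x}_0 + b| / \lVert \mathbf{w} \rVert$; a support vector lying on $\mathbf{w} \cdot \mathbf{x} + b = +1$ is therefore at distance $1/\lVert \mathbf{w} \rVert$ from the decision boundary, as is one lying on $\mathbf{w} \cdot \mathbf{x} + b = -1$, so the total width of the margin is $2/\lVert \mathbf{w} \rVert$.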
Linear SVM
• Linear model:
  $f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}$
• Learning the model is equivalent to determining the values of w and b
  – How to find w and b from the training data?

Learning Linear SVM
• Objective is to maximize: $\text{Margin} = \dfrac{2}{\lVert \mathbf{w} \rVert}$
  – Which is equivalent to minimizing: $L(\mathbf{w}) = \dfrac{\lVert \mathbf{w} \rVert^2}{2}$
  – Subject to the following constraints:
    $y_i = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases}$
    or equivalently $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, N$
• This is a constrained optimization problem
  – Solve it using the Lagrange multiplier method
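A minimal numpy sketch of the quantities in this optimization problem: the objective $L(\mathbf{w}) = \lVert \mathbf{w} \rVert^2 / 2$ and the constraints $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$. The toy data and the candidate (w, b) below are made up for illustration, not taken from the slides.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0],      # class -1
              [4.0, 4.0], [5.0, 5.0]])     # class +1
y = np.array([-1, -1, 1, 1])

w = np.array([0.5, 0.5])   # a candidate weight vector
b = -3.0                   # a candidate bias

objective = 0.5 * np.dot(w, w)        # L(w) = ||w||^2 / 2
constraints = y * (X @ w + b)         # each entry should be >= 1 if (w, b) is feasible
print("L(w) =", objective)
print("y_i (w.x_i + b) =", constraints)
print("feasible:", np.all(constraints >= 1))
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
```

Among all feasible (w, b), the SVM picks the one with the smallest L(w), i.e. the widest margin.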
Example of Linear SVM
• The two points with a nonzero Lagrange multiplier λ are the support vectors:

  x1      x2      y    λ
  0.3858  0.4687   1   65.5261
  0.4871  0.6110  -1   65.5261
  0.9218  0.4103  -1    0
  0.7382  0.8936  -1    0
  0.1763  0.0579   1    0
  0.4057  0.3529   1    0
  0.9355  0.8132  -1    0
  0.2146  0.0099   1    0

Learning Linear SVM
• Decision boundary depends only on the support vectors
  – If you have a data set with the same support vectors, the decision boundary will not change
  – How to classify using SVM once w and b are found? Given a test record $\mathbf{x}_i$:
    $f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 0 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b < 0 \end{cases}$
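A sketch (assuming scikit-learn is available) that fits a linear SVM to the eight training points in the table above; with a very large cost C the fit approximates the hard-margin problem, so the reported support vectors should be the two rows with nonzero λ, and a new record is classified by the sign of w·x + b. The test record is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.3858, 0.4687], [0.4871, 0.6110], [0.9218, 0.4103],
              [0.7382, 0.8936], [0.1763, 0.0579], [0.4057, 0.3529],
              [0.9355, 0.8132], [0.2146, 0.0099]])
y = np.array([1, -1, -1, -1, 1, 1, -1, 1])

model = SVC(kernel='linear', C=1e6).fit(X, y)   # huge C ~ hard margin
w, b = model.coef_[0], model.intercept_[0]
print("w =", w, " b =", b)
print("support vectors:\n", model.support_vectors_)   # expect the two rows with nonzero lambda

x_test = np.array([0.5, 0.2])                    # a hypothetical test record
print("f(x_test) =", 1 if w @ x_test + b >= 0 else -1)
print("sklearn prediction:", model.predict([x_test])[0])
```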
Support Vector Machines
• What if the problem is not linearly separable?

Support Vector Machines
• What if the problem is not linearly separable?
  – Introduce slack variables
    Need to minimize:
    $L(\mathbf{w}) = \dfrac{\lVert \mathbf{w} \rVert^2}{2} + C \sum_{i=1}^{N} \xi_i^{\,k}$
    Subject to:
    $y_i = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 + \xi_i \end{cases}$
  – If k is 1 or 2, this leads to a similar objective function as the linear SVM but with different constraints (see textbook)
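A small numpy sketch of the soft-margin objective for k = 1: the slack $\xi_i = \max(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$ measures how far a point falls on the wrong side of its margin hyperplane, and C trades margin width against total slack. The data and the candidate (w, b, C) are made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0], [3.2, 3.0],   # one -1 point crosses over
              [4.0, 4.0], [5.0, 5.0], [2.8, 3.1]])  # one +1 point crosses over
y = np.array([-1, -1, -1, 1, 1, 1])

w, b, C = np.array([0.5, 0.5]), -3.0, 10.0

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))       # xi_i, zero for well-separated points
objective = 0.5 * np.dot(w, w) + C * np.sum(slack)   # L(w) with k = 1
print("slack:", slack)
print("L(w) =", objective)
```

A larger C penalizes slack more heavily (fewer training errors, narrower margin); a smaller C tolerates more slack in exchange for a wider margin.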
Support Vector Machines
• Find the hyperplane that optimizes both factors (margin width and training errors)

Nonlinear Support Vector Machines
• What if the decision boundary is not linear?
Nonlinear Support Vector Machines
• Transform the data into a higher-dimensional space
• Decision boundary: $\mathbf{w} \cdot \Phi(\mathbf{x}) + b = 0$

Learning Nonlinear SVM
• Optimization problem:
  minimize $\dfrac{\lVert \mathbf{w} \rVert^2}{2}$ subject to $y_i(\mathbf{w} \cdot \Phi(\mathbf{x}_i) + b) \ge 1, \quad i = 1, 2, \ldots, N$
• Which leads to the same set of equations as before (but involving Φ(x) instead of x)
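A sketch (assuming scikit-learn) of fitting a linear SVM after an explicit, hand-chosen mapping Φ: points inside a circle versus outside it are not linearly separable in 2-D, but become separable once the radial feature x1² + x2² is added. The data and the choice of Φ are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)   # labels defined by a circular boundary

def phi(X):
    # Phi(x1, x2) = (x1, x2, x1^2 + x2^2): one extra, hand-chosen feature
    return np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])

linear_in_phi = SVC(kernel='linear', C=1.0).fit(phi(X), y)
print("training accuracy in Phi-space:", linear_in_phi.score(phi(X), y))
```

In the 3-D Φ-space the circular boundary corresponds to the plane x1² + x2² = 0.5, so a linear decision boundary suffices there.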
Learning Nonlinear SVM
• Issues:
  – What type of mapping function Φ should be used?
  – How to do the computation in the high-dimensional space?
    Most computations involve the dot product $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$
    Curse of dimensionality?

Learning Nonlinear SVM
• Kernel Trick:
  – $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$
  – $K(\mathbf{x}_i, \mathbf{x}_j)$ is a kernel function (expressed in terms of the coordinates in the original space)
    Examples:
    $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p$
    $K(\mathbf{x}, \mathbf{y}) = e^{-\lVert \mathbf{x} - \mathbf{y} \rVert^2 / (2\sigma^2)}$
    $K(\mathbf{x}, \mathbf{y}) = \tanh(\kappa\, \mathbf{x} \cdot \mathbf{y} - \delta)$
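A small numerical check of the kernel trick for the degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^2$ in 2-D: it equals $\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$ for $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, so the dot product in the mapped space can be computed without ever forming Φ. The test vectors are arbitrary.

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for a 2-D input
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, z):
    # degree-2 polynomial kernel evaluated in the original 2-D space
    return np.dot(x, z) ** 2

x = np.array([0.7, -1.3])
z = np.array([2.0, 0.5])
print("Phi(x).Phi(z) =", np.dot(phi(x), phi(z)))   # the two values should match
print("K(x, z)       =", K(x, z))
```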
Example of Nonlinear SVM
• SVM with polynomial degree 2 kernel

Learning Nonlinear SVM
• Advantages of using a kernel:
  – Don't have to know the mapping function Φ
  – Computing the dot product $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ in the original space avoids the curse of dimensionality
• Not all functions can be kernels
  – Must make sure there is a corresponding Φ in some high-dimensional space
  – Mercer's theorem (see textbook)
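A sketch (assuming scikit-learn) of the kernelized classifier: a degree-2 polynomial kernel and a Gaussian (RBF) kernel both handle a circular class boundary directly in the original 2-D space, with no explicit mapping Φ. The synthetic data and parameter choices are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

poly2 = SVC(kernel='poly', degree=2, coef0=1, C=1.0).fit(X, y)
rbf   = SVC(kernel='rbf', gamma='scale', C=1.0).fit(X, y)
print("degree-2 polynomial kernel accuracy:", poly2.score(X, y))
print("RBF (Gaussian) kernel accuracy:     ", rbf.score(X, y))
```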
Characteristics of SVM
• The learning problem is formulated as a convex optimization problem
  – Efficient algorithms are available to find the global minimum
  – Many other methods use greedy approaches and find locally optimal solutions
  – High computational complexity for building the model
• Robust to noise
• Overfitting is handled by maximizing the margin of the decision boundary
• SVM can handle irrelevant and redundant attributes better than many other techniques
• The user needs to provide the type of kernel function and the cost function
• Difficult to handle missing values
• What about categorical variables?
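A sketch (assuming scikit-learn) of choosing the user-supplied pieces noted above, the kernel type and the cost parameter C, by cross-validated grid search rather than by hand. The dataset, candidate grid, and fold count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

param_grid = {"kernel": ["linear", "poly", "rbf"], "C": [0.1, 1, 10, 100]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print("best kernel and C:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```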