CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017
Announcements
• Homework 1
  • Due at the end of the day this Thursday (11:59pm)
• Reminder of the late submission policy
  • original score *
  • E.g., if you are t = 12 hours late, you can obtain at most half of the score; if you are 24 hours late, you will receive a score of 0.
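Methods to Learn: Last Lecture

|                         | Vector Data                                               | Set Data           | Sequence Data   | Text Data            |
|-------------------------|-----------------------------------------------------------|--------------------|-----------------|----------------------|
| Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN          |                    |                 | Naïve Bayes for Text |
| Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models  |                    |                 | PLSA                 |
| Prediction              | Linear Regression; GLM*                                   |                    |                 |                      |
| Frequent Pattern Mining |                                                           | Apriori; FP growth | GSP; PrefixSpan |                      |
| Similarity Search       |                                                           |                    | DTW             |                      |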
Support Vector Machine
• Introduction
• Linear SVM
• Non-linear SVM
• Scalability Issues*
• Summary
Math Review
• Vector
  • $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
  • Subtracting two vectors: $\mathbf{x} = \mathbf{b} - \mathbf{a}$
• Dot product
  • $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$
  • Geometric interpretation: projection
  • If $\mathbf{a}$ and $\mathbf{b}$ are orthogonal, $\mathbf{a} \cdot \mathbf{b} = 0$
Math Review (Cont.)
• Plane/Hyperplane
  • $a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = c$
  • Line (n=2), plane (n=3), hyperplane (higher dimensions)
• Normal of a plane
  • $\mathbf{n} = (a_1, a_2, \ldots, a_n)$
  • a vector which is perpendicular to the surface
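Math Review (Cont.)
• Define a plane using normal $\mathbf{n} = (a, b, c)$ and a point $(x_0, y_0, z_0)$ in the plane:
  • $(a, b, c) \cdot (x_0 - x,\, y_0 - y,\, z_0 - z) = 0 \Rightarrow ax + by + cz = ax_0 + by_0 + cz_0 \;(= d)$
• Distance from a point $(x_0, y_0, z_0)$ to a plane $ax + by + cz = d$, where $(x, y, z)$ is any point in the plane:
  • $(x_0 - x,\, y_0 - y,\, z_0 - z) \cdot \frac{(a, b, c)}{\|(a, b, c)\|} = \frac{ax_0 + by_0 + cz_0 - d}{\sqrt{a^2 + b^2 + c^2}}$
  • This value is signed; its absolute value is the distance.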
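The point-to-plane distance formula can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `signed_distance_to_plane` and the example plane are illustrative choices, not part of the lecture.

```python
import numpy as np

def signed_distance_to_plane(point, normal, d):
    """Signed distance from `point` to the plane normal . x = d."""
    point = np.asarray(point, dtype=float)
    normal = np.asarray(normal, dtype=float)
    # (a*x0 + b*y0 + c*z0 - d) / sqrt(a^2 + b^2 + c^2)
    return (normal @ point - d) / np.linalg.norm(normal)

# Hypothetical example: plane x + 2y + 2z = 3, whose normal (1, 2, 2) has length 3.
dist_off_plane = signed_distance_to_plane([3, 3, 3], [1, 2, 2], 3)
dist_on_plane = signed_distance_to_plane([1, 1, 0], [1, 2, 2], 3)
```

The sign tells which side of the plane the point lies on, which is the quantity a linear classifier thresholds.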
Linear Classifier
• Given a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
• A separating hyperplane can be written as a linear combination of attributes: $\mathbf{W} \cdot \mathbf{X} + b = 0$, where $\mathbf{W} = (w_1, w_2, \ldots, w_n)$ is a weight vector and $b$ is a scalar (bias)
• For 2-D it can be written as $w_0 + w_1 x_1 + w_2 x_2 = 0$
• Classification:
  • $w_0 + w_1 x_1 + w_2 x_2 > 0 \Rightarrow y_i = +1$
  • $w_0 + w_1 x_1 + w_2 x_2 \le 0 \Rightarrow y_i = -1$
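The classification rule above can be sketched in a few lines. This is an illustrative example, assuming NumPy; the function name `predict` and the weights are hypothetical values chosen for the demonstration.

```python
import numpy as np

def predict(X, w, b):
    """Label each row of X as +1 if w . x + b > 0, otherwise -1."""
    scores = np.asarray(X, dtype=float) @ np.asarray(w, dtype=float) + b
    return np.where(scores > 0, 1, -1)

# Hypothetical hyperplane x1 + x2 - 1 = 0 (w = (1, 1), b = -1).
w = np.array([1.0, 1.0])
b = -1.0
X = [[2.0, 2.0], [0.0, 0.0], [0.5, 0.5]]  # scores: 3, -1, 0
labels = predict(X, w, b)
```

A point lying exactly on the hyperplane (score 0) is assigned to the −1 class, matching the "≤ 0" branch of the rule.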
Recall
• Is the decision boundary for logistic regression linear?
• Is the decision boundary for a decision tree linear?
Simple Linear Classifier: Perceptron
• Loss function: $\max\{0,\, -y_i \cdot \mathbf{w}^T \mathbf{x}_i\}$
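One common way to minimize this loss is the classic perceptron update, which takes a subgradient step on each misclassified point. The sketch below assumes NumPy, absorbs the bias into the weight vector by prepending a constant feature, and treats a score of exactly zero as a mistake; the function name, learning rate, and toy data are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    """Perceptron updates for the loss max{0, -y_i * w^T x_i}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # constant 1 carries the bias w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:      # misclassified (or on the boundary)
                w += lr * yi * xi       # step along the negative subgradient
                mistakes += 1
        if mistakes == 0:               # every point is on the correct side
            break
    return w

# Hypothetical linearly separable toy data.
X_toy = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y_toy = np.array([1, 1, -1, -1])
w_hat = train_perceptron(X_toy, y_toy)
pred = np.where(np.hstack([np.ones((4, 1)), X_toy]) @ w_hat > 0, 1, -1)
```

The perceptron stops at the first hyperplane that separates the data, which motivates the question of which separating hyperplane is best.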
More on Sign Function
Example
Can we do better?
• Which hyperplane to choose?
SVM — Margins and Support Vectors
• [Figure: small margin vs. large margin, with the support vectors marked]
SVM — When Data Is Linearly Separable
• Let data D be $(\mathbf{X}_1, y_1), \ldots, (\mathbf{X}_{|D|}, y_{|D|})$, where $\mathbf{X}_i$ is a training tuple and $y_i$ is its associated class label
• There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
• SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
SVM — Linearly Separable
• A separating hyperplane can be written as $\mathbf{W} \cdot \mathbf{X} + b = 0$
• The hyperplanes defining the sides of the margin, e.g.:
  • $H_1$: $w_0 + w_1 x_1 + w_2 x_2 \ge 1$ for $y_i = +1$, and
  • $H_2$: $w_0 + w_1 x_1 + w_2 x_2 \le -1$ for $y_i = -1$
• Any training tuples that fall on hyperplanes $H_1$ or $H_2$ (i.e., the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem:
  • Quadratic objective function and linear constraints
  • Quadratic Programming (QP)
  • Lagrangian multipliers
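Maximum Margin Calculation
• $\mathbf{w}$: decision hyperplane normal vector
• $\mathbf{x}_i$: data point i
• $y_i$: class of data point i (+1 or -1)
• [Figure: hyperplanes $\mathbf{w}^T \mathbf{x} + b = 1$ (through $\mathbf{x}_a$), $\mathbf{w}^T \mathbf{x} + b = 0$, and $\mathbf{w}^T \mathbf{x} + b = -1$ (through $\mathbf{x}_b$), separated by margin $\rho$]
• Margin: $\rho = \frac{2}{\|\mathbf{w}\|}$
• Hint: what is the distance between $\mathbf{x}_a$ and $\mathbf{w}^T \mathbf{x} + b = -1$?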
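Following the hint, the margin can be worked out with the point-to-plane distance formula from the math review. This is a short derivation sketch in which $\mathbf{x}_a$ is any point on the hyperplane $\mathbf{w}^T \mathbf{x} + b = 1$:

```latex
\begin{align*}
\rho &= \text{distance from } \mathbf{x}_a \text{ to the hyperplane } \mathbf{w}^T \mathbf{x} + b = -1 \\
     &= \frac{\left| \mathbf{w}^T \mathbf{x}_a + b - (-1) \right|}{\|\mathbf{w}\|}
      = \frac{|1 + 1|}{\|\mathbf{w}\|}
      = \frac{2}{\|\mathbf{w}\|}
\end{align*}
```

The margin depends only on $\|\mathbf{w}\|$, so maximizing it amounts to making $\|\mathbf{w}\|$ as small as the constraints allow.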
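SVM as a Quadratic Programming Problem
• QP
  • Objective: Find $\mathbf{w}$ and $b$ such that $\rho = \frac{2}{\|\mathbf{w}\|}$ is maximized;
  • Constraints: For all $\{(\mathbf{x}_i, y_i)\}$: $\mathbf{w}^T \mathbf{x}_i + b \ge 1$ if $y_i = 1$; $\mathbf{w}^T \mathbf{x}_i + b \le -1$ if $y_i = -1$
• A better form
  • Objective: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized;
  • Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$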
Solve QP
• This is now optimizing a quadratic function subject to linear constraints
• Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs)
• The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with every constraint in the primal problem
Lagrange Formulation
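The slide body did not survive extraction. As a sketch of the standard derivation for the hard-margin problem (minimize $\frac{1}{2} \mathbf{w}^T \mathbf{w}$ subject to $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$), one multiplier $\alpha_i \ge 0$ is attached to each constraint:

```latex
\begin{align*}
L(\mathbf{w}, b, \boldsymbol{\alpha})
  &= \tfrac{1}{2} \mathbf{w}^T \mathbf{w}
     - \sum_i \alpha_i \left[ y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right],
     \qquad \alpha_i \ge 0 \\
\frac{\partial L}{\partial \mathbf{w}} = 0
  &\;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i \\
\frac{\partial L}{\partial b} = 0
  &\;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
\text{Substituting back:}\quad
Q(\boldsymbol{\alpha})
  &= \sum_i \alpha_i
     - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j
\end{align*}
```

Maximizing $Q(\boldsymbol{\alpha})$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$ is the dual problem.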
Primal Form and Dual Form
• Primal
  • Objective: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized;
  • Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$
• Equivalent under some conditions: KKT conditions
• Dual
  • Objective: Find $\alpha_1 \ldots \alpha_n$ such that $Q(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized
  • Constraints: (1) $\sum_i \alpha_i y_i = 0$; (2) $\alpha_i \ge 0$ for all $\alpha_i$
• More derivations: http://cs229.stanford.edu/notes/cs229-notes3.pdf
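To make the dual objective concrete, the sketch below evaluates $Q(\boldsymbol{\alpha})$ through the Gram matrix of inner products. It assumes NumPy; the function name `dual_objective` and the two-point dataset are hypothetical, chosen so that the optimum can be checked by hand (with $\alpha_1 = \alpha_2 = \alpha$, $Q = 2\alpha - 4\alpha^2$, maximized at $\alpha = 1/4$).

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Q(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    alpha = np.asarray(alpha, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    gram = X @ X.T            # all pairwise inner products x_i^T x_j
    ay = alpha * y
    return alpha.sum() - 0.5 * (ay @ gram @ ay)

# Hypothetical toy problem: one point per class.
X_toy = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_toy = np.array([1.0, -1.0])
q_opt = dual_objective([0.25, 0.25], X_toy, y_toy)    # at the optimum
q_other = dual_objective([0.10, 0.10], X_toy, y_toy)  # feasible, but not optimal
```

Both candidate vectors satisfy $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$; only the data's inner products enter the computation.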
The Optimization Problem Solution
• The solution has the form:
  • $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
  • $b = y_k - \mathbf{w}^T \mathbf{x}_k$ for any $\mathbf{x}_k$ such that $\alpha_k \ne 0$
• Each non-zero $\alpha_i$ indicates that the corresponding $\mathbf{x}_i$ is a support vector.
• Then the classifying function will have the form: $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
• Notice that it relies on an inner product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$
  • We will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products $\mathbf{x}_i^T \mathbf{x}_j$ between all pairs of training points.
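The recovery of $\mathbf{w}$, $b$, and $f(\mathbf{x})$ from the multipliers can be illustrated on a two-point problem. This is a minimal sketch assuming NumPy; the data and the multiplier values $\alpha = (0.25, 0.25)$ are a hand-solved toy case, not output from a QP solver.

```python
import numpy as np

# Hypothetical toy problem: one point per class, both are support vectors.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])   # assumed dual solution, derived by hand

w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i
k = int(np.argmax(alpha > 0))    # index of any support vector (alpha_k != 0)
b = y[k] - w @ X[k]              # b = y_k - w^T x_k

def f(x):
    """Decision value using only inner products with the training points."""
    x = np.asarray(x, dtype=float)
    return np.sum(alpha * y * (X @ x)) + b

margin = 2.0 / np.linalg.norm(w)  # equals the distance between the two points
```

Evaluating `f` never forms $\mathbf{w}$ explicitly, yet it agrees with $\mathbf{w}^T \mathbf{x} + b$ for any test point.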
Soft Margin Classification
• If the training data is not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
• Allow some errors
  • Let some points be moved to where they belong, at a cost
• Still, try to minimize training set errors, and to place the hyperplane "far" from each class (large margin)
• [Figure: two points on the wrong side of the margin, with slacks $\xi_i$ and $\xi_j$]
Soft Margin Classification Mathematically
• The old formulation: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$
• The new formulation incorporating slack variables: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_i \xi_i$ is minimized and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all i
• Parameter C can be viewed as a way to control overfitting
  • A regularization term (L1 regularization)
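For a fixed $\mathbf{w}$ and $b$, the smallest slack that satisfies both constraints is $\xi_i = \max(0,\, 1 - y_i (\mathbf{w}^T \mathbf{x}_i + b))$, so the new objective can be evaluated directly. The sketch below assumes NumPy; the function name, the data, and the choice $C = 1$ are illustrative.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (1/2 w^T w + C * sum_i xi_i, xi) with the smallest feasible slacks."""
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    functional_margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - functional_margins)  # xi_i >= 0 by construction
    return 0.5 * (w @ w) + C * slack.sum(), slack

# Hypothetical data: the second point sits inside the margin, the fourth is misclassified.
X_toy = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [1.0, 0.0]])
y_toy = np.array([1, 1, -1, -1])
objective, slack = soft_margin_objective([1.0, 0.0], 0.0, X_toy, y_toy, C=1.0)
```

A larger C makes each unit of slack more expensive, pushing the solution toward fewer margin violations at the price of a smaller margin.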
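Soft Margin Classification – Solution
• The dual problem for soft margin classification: Find $\alpha_1 \ldots \alpha_N$ such that $Q(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and
  • (1) $\sum_i \alpha_i y_i = 0$
  • (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
• Neither slack variables $\xi_i$ nor their Lagrange multipliers appear in the dual problem!
• Again, $\mathbf{x}_i$ with non-zero $\alpha_i$ will be support vectors.
  • If $0 < \alpha_i < C$, $\xi_i = 0$
  • If $\alpha_i = C$, $\xi_i \ge 0$ (the point may lie inside the margin or be misclassified)
• Solution to the problem is:
  • $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
  • $b = y_k - \mathbf{w}^T \mathbf{x}_k$ for any $\mathbf{x}_k$ such that $0 < \alpha_k < C$
  • $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$ ($\mathbf{w}$ is not needed explicitly for classification!)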
Classification with SVMs
• Given a new point $\mathbf{x}$, we can score its projection onto the hyperplane normal:
  • I.e., compute score: $\mathbf{w}^T \mathbf{x} + b = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
  • Decide class based on whether < or > 0
• Can set confidence threshold t.
  • Score > t: yes
  • Score < -t: no
  • Else: don't know
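The thresholded decision rule can be written as a small function. This is an illustrative sketch; the function name and the threshold value used in the examples are assumptions.

```python
def classify_with_threshold(score, t):
    """Map an SVM score to 'yes', 'no', or "don't know" using threshold t >= 0."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"

# Hypothetical scores with threshold t = 1.
decisions = [classify_with_threshold(s, 1.0) for s in (2.5, -3.0, 0.4, -1.0)]
```

Setting t = 0 recovers the plain sign rule, except that a score of exactly 0 is reported as "don't know".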
Linear SVMs: Summary
• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points $\mathbf{x}_i$ are support vectors with non-zero Lagrangian multipliers $\alpha_i$.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
  • Find $\alpha_1 \ldots \alpha_N$ such that $Q(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$; (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
  • $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
Non-linear SVMs
• Datasets that are linearly separable (with some noise) work out great: [Figure: points along the x axis]
• But what are we going to do if the dataset is just too hard? [Figure: points along the x axis]
• How about … mapping data to a higher-dimensional space: [Figure: points plotted with axes x and $x^2$]
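A small numeric example shows how such a mapping can help. The sketch below assumes NumPy and uses the map $x \mapsto (x, x^2)$ suggested by the axis labels of the lost figure; the data and the hand-picked hyperplane are hypothetical, and the hyperplane is not claimed to be the maximum-margin one.

```python
import numpy as np

# Hypothetical 1-D data: outer points are positive, inner points are negative,
# so no single threshold on x separates the classes.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.where(np.abs(x) > 1.5, 1, -1)

# Map each point to the 2-D feature space (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the mapped space the hyperplane 0*x + 1*x^2 - 2.25 = 0 separates the classes.
w = np.array([0.0, 1.0])
b = -2.25
pred = np.where(phi @ w + b > 0, 1, -1)
```

A linear boundary in the mapped space corresponds to a non-linear boundary ($x^2 = 2.25$, i.e., $|x| = 1.5$) in the original space.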