Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info
Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)
Chapter 21: Support Vector Machines
Hyperplanes
Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ be a classification dataset, with $n$ points in a $d$-dimensional space. We assume that there are only two class labels, that is, $y_i \in \{+1, -1\}$, denoting the positive and negative classes.
A hyperplane in $d$ dimensions is given as the set of all points $x \in \mathbb{R}^d$ that satisfy the equation $h(x) = 0$, where $h(x)$ is the hyperplane function:
$$h(x) = w^T x + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
Here, $w$ is a $d$-dimensional weight vector and $b$ is a scalar, called the bias. For points that lie on the hyperplane, we have
$$h(x) = w^T x + b = 0$$
The weight vector $w$ specifies the direction that is orthogonal or normal to the hyperplane, which fixes the orientation of the hyperplane, whereas the bias $b$ fixes the offset of the hyperplane in the $d$-dimensional space, i.e., where the hyperplane intersects each of the axes:
$$w_i x_i = -b \quad\text{or}\quad x_i = \frac{-b}{w_i}$$
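As a minimal sketch of evaluating the hyperplane function, the snippet below (Python, with illustrative values for $w$, $b$, and $x$ that are not taken from the slides) computes $h(x)$ and the axis intercepts $-b/w_i$:

```python
import numpy as np

# Illustrative hyperplane in d = 2 dimensions (assumed values, not from the slides).
w = np.array([5.0, 2.0])   # weight vector, normal to the hyperplane
b = -20.0                  # bias (offset)

def h(x, w=w, b=b):
    """Hyperplane function h(x) = w^T x + b."""
    return w @ x + b

x = np.array([1.0, 1.0])
print("h(x) =", h(x))              # negative: x lies in the half-space h(x) < 0
print("axis intercepts:", -b / w)  # x_i = -b / w_i on each coordinate axis
```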
Separating Hyperplane
A hyperplane splits the $d$-dimensional data space into two half-spaces. A dataset is said to be linearly separable if each half-space has points only from a single class. If the input dataset is linearly separable, then we can find a separating hyperplane $h(x) = 0$, such that for all points labeled $y_i = -1$, we have $h(x_i) < 0$, and for all points labeled $y_i = +1$, we have $h(x_i) > 0$.
The hyperplane function $h(x)$ thus serves as a linear classifier or a linear discriminant, which predicts the class $y$ for any given point $x$ according to the decision rule:
$$y = \begin{cases} +1 & \text{if } h(x) > 0 \\ -1 & \text{if } h(x) < 0 \end{cases}$$
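A hedged sketch of this decision rule, again with made-up weights and points: the sign of $h(x)$ determines the predicted class.

```python
import numpy as np

# Illustrative weights and two test points (assumed values).
w = np.array([5.0, 2.0])
b = -20.0

def predict(X, w=w, b=b):
    """Linear decision rule: +1 if h(x) > 0, -1 if h(x) < 0."""
    return np.where(X @ w + b > 0, 1, -1)

X = np.array([[4.0, 4.0],    # h(x) =  8  -> predicted +1
              [1.0, 2.0]])   # h(x) = -11 -> predicted -1
print(predict(X))            # [ 1 -1]
```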
Geometry of a Hyperplane: Distance
Consider a point $x \in \mathbb{R}^d$ that does not lie on the hyperplane. Let $x_p$ be the orthogonal projection of $x$ on the hyperplane, and let $\mathbf{r} = x - x_p$. Then we can write $x$ as
$$x = x_p + \mathbf{r} = x_p + r\,\frac{w}{\|w\|}$$
where $r$ is the directed distance of the point $x$ from $x_p$, measured along the unit normal $\frac{w}{\|w\|}$. To obtain an expression for $r$, consider the value $h(x)$; since $h(x_p) = 0$, we have:
$$h(x) = h\left(x_p + r\,\frac{w}{\|w\|}\right) = w^T\left(x_p + r\,\frac{w}{\|w\|}\right) + b = h(x_p) + r\,\frac{w^T w}{\|w\|} = r\,\|w\|$$
The directed distance $r$ of point $x$ to the hyperplane is thus:
$$r = \frac{h(x)}{\|w\|}$$
To obtain a distance, which must be non-negative, we multiply $r$ by the class label $y_i$ of the point $x_i$, because when $h(x_i) < 0$ the class is $-1$, and when $h(x_i) > 0$ the class is $+1$:
$$\delta_i = \frac{y_i\,h(x_i)}{\|w\|}$$
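The distance computation can be checked numerically; the following sketch uses assumed values for the hyperplane and one labeled point.

```python
import numpy as np

# Illustrative hyperplane and labeled point (assumed values, not from the text).
w = np.array([5.0, 2.0])
b = -20.0

def directed_distance(x, w=w, b=b):
    """r = h(x) / ||w||: signed distance of x from the hyperplane."""
    return (w @ x + b) / np.linalg.norm(w)

x, y = np.array([4.0, 4.0]), 1
r = directed_distance(x)
delta = y * r                # non-negative distance: delta_i = y_i h(x_i) / ||w||
print(r, delta)
```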
Geometry of a Hyperplane in 2D
[Figure: a hyperplane $h(x) = 0$ in 2D separating the half-spaces $h(x) < 0$ and $h(x) > 0$, showing a point $x$, its orthogonal projection $x_p$ onto the hyperplane, the vector $\mathbf{r} = r\,\frac{w}{\|w\|}$ from $x_p$ to $x$ along the unit normal $\frac{w}{\|w\|}$, and the offset $\frac{b}{\|w\|}$ of the hyperplane from the origin.]
Margin and Support Vectors
The distance of a point $x$ from the hyperplane $h(x) = 0$ is thus given as
$$\delta = y\,r = \frac{y\,h(x)}{\|w\|}$$
The margin is the minimum distance of a point from the separating hyperplane:
$$\delta^* = \min_{x_i}\left\{\frac{y_i(w^T x_i + b)}{\|w\|}\right\}$$
All the points (or vectors) that achieve the minimum distance are called support vectors for the hyperplane. They satisfy the condition:
$$\delta^* = \frac{y^*(w^T x^* + b)}{\|w\|}$$
where $y^*$ is the class label for the support vector $x^*$.
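As an illustration, using a toy dataset and an assumed separating hyperplane (not necessarily the maximum-margin one, and not the book's example), the margin and the support vectors of that hyperplane can be computed as follows:

```python
import numpy as np

# Toy linearly separable data and an assumed separating hyperplane (illustrative values).
X = np.array([[4.0, 4.0], [5.0, 3.0], [6.0, 6.0],
              [1.0, 2.0], [2.0, 1.0], [0.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = np.array([1.0, 1.0]), -6.0

dist = y * (X @ w + b) / np.linalg.norm(w)   # delta_i for every point
margin = dist.min()                          # delta* = min_i delta_i
support = np.isclose(dist, margin)           # points achieving the minimum distance
print(margin)
print(X[support])                            # support vectors of this hyperplane
```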
Canonical Hyperplane
Multiplying the hyperplane equation on both sides by some scalar $s$ yields an equivalent hyperplane:
$$s\,h(x) = s\,w^T x + s\,b = (s\,w)^T x + (s\,b) = 0$$
To obtain the unique or canonical hyperplane, we choose the scalar
$$s = \frac{1}{y^*(w^T x^* + b)}$$
so that, after rescaling, a support vector satisfies $y^*(w^T x^* + b) = 1$, i.e., the margin of the canonical hyperplane is
$$\delta^* = \frac{y^*(w^T x^* + b)}{\|w\|} = \frac{1}{\|w\|}$$
For the canonical hyperplane, each support vector $x_i^*$ (with label $y_i^*$) satisfies $y_i^*\,h(x_i^*) = 1$, and any point that is not a support vector satisfies $y_i\,h(x_i) > 1$. Over all points, we therefore have
$$y_i(w^T x_i + b) \ge 1, \quad \text{for all points } x_i \in D$$
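Continuing the same toy setup, rescaling by $s = 1/(y^*(w^T x^* + b))$ produces the canonical hyperplane; the sketch below verifies that the support vectors then satisfy $y_i\,h(x_i) = 1$ and that the margin equals $1/\|w\|$.

```python
import numpy as np

# Rescale an assumed separating hyperplane (w, b) to canonical form (illustrative values).
X = np.array([[4.0, 4.0], [5.0, 3.0], [6.0, 6.0],
              [1.0, 2.0], [2.0, 1.0], [0.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = np.array([1.0, 1.0]), -6.0

margins = y * (X @ w + b)
s = 1.0 / margins.min()              # s = 1 / (y* (w^T x* + b)) for a support vector x*
w_c, b_c = s * w, s * b              # canonical hyperplane

print((y * (X @ w_c + b_c)).min())   # = 1 for the support vectors
print(1.0 / np.linalg.norm(w_c))     # margin delta* = 1 / ||w_c||
```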
Separating Hyperplane: Margin and Support Vectors
[Figure: a 2D dataset with the canonical separating hyperplane $h(x) = \frac{5}{6}x_1 + \frac{2}{6}x_2 - \frac{20}{6} \approx 0.833\,x_1 + 0.333\,x_2 - 3.33 = 0$. The shaded points are the support vectors, each at distance $\frac{1}{\|w\|}$ from the hyperplane.]
SVM: Linear and Separable Case
Assume that the points are linearly separable, that is, there exists a separating hyperplane that perfectly classifies each point. The goal of SVMs is to choose the canonical hyperplane, $h^*$, that yields the maximum margin among all possible separating hyperplanes:
$$h^* = \arg\max_{w,b}\left\{\frac{1}{\|w\|}\right\}$$
We can obtain an equivalent minimization formulation:
Objective Function: $\displaystyle \min_{w,b}\left\{\frac{\|w\|^2}{2}\right\}$
Linear Constraints: $y_i(w^T x_i + b) \ge 1,\; \forall x_i \in D$
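This primal quadratic program can be handed to a generic convex/QP solver. The sketch below is illustrative only: it assumes the cvxpy package is available and uses the same toy separable dataset as above, not the book's data.

```python
import numpy as np
import cvxpy as cp   # assumes cvxpy (and a QP-capable solver) is installed

# Toy linearly separable data (illustrative values).
X = np.array([[4.0, 4.0], [5.0, 3.0], [6.0, 6.0],
              [1.0, 2.0], [2.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))      # min ||w||^2 / 2
constraints = [cp.multiply(y, X @ w + b) >= 1]        # y_i (w^T x_i + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                               # maximum-margin hyperplane
```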
SVM: Linear and Separable Case
We turn the constrained SVM optimization into an unconstrained one by introducing a Lagrange multiplier $\alpha_i$ for each constraint. The new objective function, called the Lagrangian, then becomes
$$\min\; L = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n}\alpha_i\left(y_i(w^T x_i + b) - 1\right)$$
$L$ should be minimized with respect to $w$ and $b$, and maximized with respect to $\alpha_i$. Taking the derivative of $L$ with respect to $w$ and $b$, and setting each to zero, we obtain
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n}\alpha_i y_i x_i = 0 \quad\text{or}\quad w = \sum_{i=1}^{n}\alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n}\alpha_i y_i = 0$$
We can see that $w$ can be expressed as a linear combination of the data points $x_i$, with the signed Lagrange multipliers $\alpha_i y_i$ serving as the coefficients. Further, the sum of the signed Lagrange multipliers $\alpha_i y_i$ must be zero.
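As a sketch of the step that the next slide summarizes, substituting $w = \sum_{i=1}^{n}\alpha_i y_i x_i$ and $\sum_{i=1}^{n}\alpha_i y_i = 0$ back into $L$ eliminates $w$ and $b$:
$$\begin{aligned}
L &= \frac{1}{2}w^T w - \sum_{i=1}^{n}\alpha_i y_i\,w^T x_i - b\sum_{i=1}^{n}\alpha_i y_i + \sum_{i=1}^{n}\alpha_i \\
  &= \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,x_i^T x_j - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,x_i^T x_j + \sum_{i=1}^{n}\alpha_i \\
  &= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,x_i^T x_j
\end{aligned}$$
which is exactly the dual objective $L_{dual}$ stated on the next slide.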
SVM: Linear and Separable Case
Incorporating $w = \sum_{i=1}^{n}\alpha_i y_i x_i$ and $\sum_{i=1}^{n}\alpha_i y_i = 0$ into the Lagrangian, we obtain the new dual Lagrangian objective function, which is specified purely in terms of the Lagrange multipliers:
Objective Function: $\displaystyle \max_{\alpha}\; L_{dual} = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,x_i^T x_j$
Linear Constraints: $\alpha_i \ge 0$ for all $i = 1,\ldots,n$, and $\displaystyle\sum_{i=1}^{n}\alpha_i y_i = 0$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)^T$ is the vector comprising the Lagrange multipliers. $L_{dual}$ is a convex quadratic programming problem (note the $\alpha_i\alpha_j$ terms), which admits a unique optimal solution.
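For illustration only (production SVM solvers use specialized methods such as SMO rather than a general-purpose optimizer), the dual can be solved numerically with SciPy's SLSQP on the toy dataset:

```python
import numpy as np
from scipy.optimize import minimize   # generic constrained optimizer, used here only for illustration

# Toy linearly separable data (illustrative values).
X = np.array([[4.0, 4.0], [5.0, 3.0], [6.0, 6.0],
              [1.0, 2.0], [2.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):
    # Negative of L_dual, since we minimize instead of maximize.
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,                                 # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])   # sum_i alpha_i y_i = 0
alpha = res.x
print(alpha)
```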
SVM: Linear and Separable Case
Once we have obtained the $\alpha_i$ values for $i = 1,\ldots,n$, we can solve for the weight vector $w$ and the bias $b$. Each of the Lagrange multipliers $\alpha_i$ satisfies the KKT conditions at the optimal solution:
$$\alpha_i\left(y_i(w^T x_i + b) - 1\right) = 0$$
which gives rise to two cases:
(1) $\alpha_i = 0$, or
(2) $y_i(w^T x_i + b) - 1 = 0$, which implies $y_i(w^T x_i + b) = 1$
This is a very important result because if $\alpha_i > 0$, then $y_i(w^T x_i + b) = 1$, and thus the point $x_i$ must be a support vector. On the other hand, if $y_i(w^T x_i + b) > 1$, then $\alpha_i = 0$, that is, if a point is not a support vector, then $\alpha_i = 0$.
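The complementary slackness condition can be checked numerically. The values of $\alpha$, $w$, and $b$ below are hand-worked for the toy dataset used in the earlier sketches and are assumptions for illustration only:

```python
import numpy as np

# Toy data from the earlier sketches, with hand-worked alpha, w, b (illustrative values).
X = np.array([[4.0, 4.0], [5.0, 3.0], [6.0, 6.0],
              [1.0, 2.0], [2.0, 1.0], [0.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])
alpha = np.array([0.12, 0.04, 0.0, 0.04, 0.12, 0.0])
w, b = np.array([0.4, 0.4]), -2.2

slack = y * (X @ w + b) - 1              # y_i (w^T x_i + b) - 1 >= 0 for every point
print(np.allclose(alpha * slack, 0))     # KKT: alpha_i * slack_i = 0 for every i
print(np.where(alpha > 1e-8)[0])         # support vectors are the points with alpha_i > 0
```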
Linear and Separable Case: Weight Vector and Bias
Once we know $\alpha_i$ for all points, we can compute the weight vector $w$ by summing only over the support vectors:
$$w = \sum_{i,\,\alpha_i > 0} \alpha_i y_i x_i$$
Only the support vectors determine $w$, since $\alpha_i = 0$ for the other points. To compute the bias $b$, we first compute one solution $b_i$ per support vector, as follows:
$$y_i(w^T x_i + b) = 1, \quad\text{which implies}\quad b_i = \frac{1}{y_i} - w^T x_i = y_i - w^T x_i$$
The bias $b$ is taken as the average value:
$$b = \operatorname{avg}_{\alpha_i > 0}\{b_i\}$$
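A sketch of recovering $w$ and $b$ from the multipliers, reusing the same hand-worked $\alpha$ (an assumed solution for the toy data, not the book's example):

```python
import numpy as np

X = np.array([[4.0, 4.0], [5.0, 3.0], [6.0, 6.0],
              [1.0, 2.0], [2.0, 1.0], [0.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])
alpha = np.array([0.12, 0.04, 0.0, 0.04, 0.12, 0.0])   # assumed dual solution

sv = alpha > 1e-8                        # support vectors have alpha_i > 0
w = (alpha[sv] * y[sv]) @ X[sv]          # w = sum over support vectors of alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)           # b = avg over support vectors of (y_i - w^T x_i)
print(w, b)                              # -> [0.4 0.4] -2.2
```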
SVM Classifier
Given the optimal hyperplane function $h(x) = w^T x + b$, for any new point $z$ we predict its class as
$$\hat{y} = \operatorname{sign}(h(z)) = \operatorname{sign}(w^T z + b)$$
where the $\operatorname{sign}(\cdot)$ function returns $+1$ if its argument is positive, and $-1$ if its argument is negative.
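As a usage cross-check, an off-the-shelf solver such as scikit-learn's SVC with a linear kernel and a very large C (to approximate the hard-margin, linearly separable case) should recover essentially the same hyperplane on the toy data; this sketch assumes scikit-learn is installed.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

X = np.array([[4.0, 4.0], [5.0, 3.0], [6.0, 6.0],
              [1.0, 2.0], [2.0, 1.0], [0.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin (linearly separable) SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

z = np.array([3.0, 3.0])
print(np.sign(w @ z + b))     # prediction y_hat = sign(w^T z + b)
print(clf.predict([z]))       # the same decision from the fitted model
print(clf.support_)           # indices of the support vectors
```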