Pattern Recognition 2018: Support Vector Machines (Ad Feelders) - PowerPoint presentation transcript



  1. Pattern Recognition 2018: Support Vector Machines. Ad Feelders, Universiteit Utrecht.

  2. Support Vector Machines

  3. Overview: 1. Separable Case; 2. Kernel Functions; 3. Allowing Errors (Soft Margin); 4. SVMs in R.

  4. Linear Classifier for two classes. Linear model y(x) = w⊤φ(x) + b (7.1), with t_n ∈ {−1, +1}. Predict t_0 = +1 if y(x_0) ≥ 0 and t_0 = −1 otherwise. The decision boundary is given by y(x) = 0. This is a linear classifier in feature space φ(x).
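
  As a quick aside (not on the slides), the decision rule fits in a few lines of R; the weight vector w, offset b and identity feature map below are made-up values, purely for illustration.

    phi <- function(x) x                      # identity feature map for this sketch
    w <- c(2, -1)                             # hypothetical weight vector
    b <- 0.5                                  # hypothetical offset
    y <- function(x) sum(w * phi(x)) + b      # y(x) = w'phi(x) + b
    predict_class <- function(x) if (y(x) >= 0) +1 else -1
    predict_class(c(1, 1))                    # y = 2 - 1 + 0.5 = 1.5 >= 0, so predicts +1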

  5. Mapping (figure). The map φ sends x into a higher-dimensional space where the data is linearly separable; the decision boundary there is y(x) = w⊤φ(x) + b = 0.

  6. Data linearly separable. Assume the training data is linearly separable in feature space, so there is at least one choice of w, b such that: (1) y(x_n) > 0 for t_n = +1; (2) y(x_n) < 0 for t_n = −1; that is, all training points are classified correctly. Putting (1) and (2) together: t_n y(x_n) > 0 for n = 1, ..., N.
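
  A one-line separability check in R, again on made-up data with a made-up w, b (not on the slides):

    X  <- rbind(c(2, 2), c(3, 3), c(0, 0), c(1, 0))   # toy inputs, one row per point
    tn <- c(+1, +1, -1, -1)                           # class labels t_n
    w  <- c(1, 1); b <- -2.5                          # hypothetical separating line
    all(tn * (X %*% w + b) > 0)                       # TRUE: t_n y(x_n) > 0 for all n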

  7. Maximum Margin. There may be many solutions that separate the classes exactly. Which one gives the smallest prediction error? The SVM chooses the line with maximal margin, where the margin is the distance between the line and the closest data point. In this way it avoids “low-confidence” classifications.

  8. Two-class training data (figure).

  9. Many Linear Separators (figure).

  10. SVM Decision Boundary (figure).

  11. Maximize Margin (figure).

  12. Support Vectors (figure).

  13. Weight vector is orthogonal to the decision boundary. Consider two points x_A and x_B, both of which lie on the decision surface. Because y(x_A) = y(x_B) = 0, we have (w⊤x_A + b) − (w⊤x_B + b) = w⊤(x_A − x_B) = 0, and so the vector w is orthogonal to the decision surface.
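
  A quick numeric illustration in R with a made-up w and b (not on the slides): any two points on the surface give w⊤(x_A − x_B) = 0.

    w <- c(2, 1); b <- -4        # hypothetical line 2*x1 + x2 - 4 = 0
    xA <- c(2, 0)                # on the line: 2*2 + 0 - 4 = 0
    xB <- c(0, 4)                # on the line: 0 + 4 - 4 = 0
    sum(w * (xA - xB))           # 0: w is orthogonal to directions within the surface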

  14. Distance of a point to a line (figure: a point x, its orthogonal projection x⊥ onto the line y(x) = w⊤x + b = 0, and the signed distance r measured along w, drawn in the (x_1, x_2) plane).

  15. Distance to decision surface (φ(x) = x). We have x = x⊥ + r·w/‖w‖ (4.6), where w/‖w‖ is the unit vector in the direction of w, x⊥ is the orthogonal projection of x onto the line y(x) = 0, and r is the (signed) distance of x to the line. Multiplying (4.6) on the left by w⊤ and adding b gives w⊤x + b = w⊤x⊥ + b + r·w⊤w/‖w‖, where the left-hand side equals y(x) and w⊤x⊥ + b = 0. Since w⊤w/‖w‖ = ‖w‖, we get r = y(x)‖w‖/‖w‖² = y(x)/‖w‖ (4.7).
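
  A numeric check of (4.7) in R, with made-up w, b and x (not on the slides):

    w <- c(3, 4); b <- -5                  # hypothetical line, ||w|| = 5
    x <- c(3, 4)                           # hypothetical point
    y <- sum(w * x) + b                    # y(x) = 9 + 16 - 5 = 20
    r <- y / sqrt(sum(w^2))                # signed distance r = 20 / 5 = 4
    x_perp <- x - r * w / sqrt(sum(w^2))   # step back by r along w/||w||
    sum(w * x_perp) + b                    # 0: x_perp indeed lies on the line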

  16. Distance of a point to a line. The signed distance of x_n to the decision boundary is r = y(x_n)/‖w‖. For lines that separate the data perfectly, we have t_n y(x_n) = |y(x_n)|, so that the distance is given by t_n y(x_n)/‖w‖ = t_n(w⊤φ(x_n) + b)/‖w‖ (7.2).

  17. Maximum margin solution. Now we are ready to define the optimization problem: arg max_{w,b} { min_n [ t_n(w⊤φ(x_n) + b)/‖w‖ ] }. Since 1/‖w‖ does not depend on n, it can be moved outside of the minimization: arg max_{w,b} { (1/‖w‖) min_n [ t_n(w⊤φ(x_n) + b) ] } (7.3). Direct solution of this problem would be rather complex. A more convenient representation is possible.

  18. Canonical Representation. The hyperplane (decision boundary) is defined by w⊤φ(x) + b = 0. Then also κ(w⊤φ(x) + b) = κw⊤φ(x) + κb = 0, so rescaling w → κw and b → κb gives just another representation of the same decision boundary. To resolve this ambiguity, we choose the scaling factor such that t_i(w⊤φ(x_i) + b) = 1 (7.4) for the points x_i closest to the decision boundary.
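
  The rescaling in R on made-up values (not on the slides), showing how κ is chosen so that the closest point satisfies (7.4):

    w <- c(2, 1); b <- -2                                  # hypothetical boundary
    x_closest <- c(2, 1); t_closest <- +1                  # hypothetical closest point
    kappa <- 1 / (t_closest * (sum(w * x_closest) + b))    # here kappa = 1/3
    w_c <- kappa * w; b_c <- kappa * b                     # canonical representation
    t_closest * (sum(w_c * x_closest) + b_c)               # exactly 1, as required by (7.4)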

  19. Canonical Representation (figure; squares have t = +1, circles t = −1; the lines y(x) = 1, y(x) = 0 and y(x) = −1 are shown).

  20. Canonical Representation. In this case we have t_n(w⊤φ(x_n) + b) ≥ 1, n = 1, ..., N (7.5). Quadratic program: arg min_{w,b} (1/2)‖w‖² (7.6) subject to the constraints (7.5). This optimization problem has a unique global minimum.

  21. Lagrangian Function. Introduce Lagrange multipliers a_n ≥ 0 to get the Lagrangian function L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^N a_n { t_n(w⊤φ(x_n) + b) − 1 } (7.7), with ∂L(w, b, a)/∂w = w − Σ_{n=1}^N a_n t_n φ(x_n).

  22. Lagrangian Function. And for b: ∂L(w, b, a)/∂b = − Σ_{n=1}^N a_n t_n. Equating the derivatives to zero yields the conditions w = Σ_{n=1}^N a_n t_n φ(x_n) (7.8) and Σ_{n=1}^N a_n t_n = 0 (7.9).

  23. Dual Representation. Eliminating w and b from L(w, b, a) gives the dual representation:
  L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^N a_n { t_n(w⊤φ(x_n) + b) − 1 }
  = (1/2)‖w‖² − Σ_{n=1}^N a_n t_n w⊤φ(x_n) − b Σ_{n=1}^N a_n t_n + Σ_{n=1}^N a_n
  = (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)⊤φ(x_m) − Σ_{n=1}^N Σ_{m=1}^N a_n t_n a_m t_m φ(x_n)⊤φ(x_m) + Σ_{n=1}^N a_n
  = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)⊤φ(x_m),
  where the third line substitutes (7.8) for w and uses (7.9) to drop the b term.

  24. Dual Representation. Maximize L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n a_m t_n t_m φ(x_n)⊤φ(x_m) (7.10) with respect to a, subject to the constraints a_n ≥ 0, n = 1, ..., N (7.11), and Σ_{n=1}^N a_n t_n = 0 (7.12).
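
  The dual (7.10)-(7.12) is an ordinary quadratic program, so for a small toy data set it can be handed to a general QP solver. A minimal sketch (not on the slides) using R's quadprog package with a linear kernel on made-up data; a small ridge is added because solve.QP requires a strictly positive definite matrix.

    library(quadprog)
    X  <- rbind(c(2, 2), c(3, 3), c(0, 0), c(1, 0))    # made-up, linearly separable data
    tn <- c(+1, +1, -1, -1)                            # class labels t_n
    N  <- nrow(X)
    K  <- X %*% t(X)                                   # linear kernel matrix k(x_n, x_m)
    Q  <- outer(tn, tn) * K                            # Q[n, m] = t_n t_m k(x_n, x_m)
    # solve.QP minimizes (1/2) a'Q a - d'a, so maximizing (7.10) means d = (1, ..., 1),
    # with the equality constraint sum_n a_n t_n = 0 and the inequalities a_n >= 0.
    sol <- solve.QP(Dmat = Q + 1e-6 * diag(N), dvec = rep(1, N),
                    Amat = cbind(tn, diag(N)), bvec = rep(0, N + 1), meq = 1)
    round(sol$solution, 3)                             # nonzero a_n mark the support vectors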

  25. Kernel Function. We map x to a high-dimensional space φ(x) in which the data is linearly separable. Performing computations in this high-dimensional space may be very expensive. Use a kernel function k that computes the dot product in this space without explicitly performing the mapping (the “kernel trick”): k(x, x′) = φ(x)⊤φ(x′).

  26. Example: polynomial kernel. Suppose x ∈ ℝ³ and φ(x) ∈ ℝ¹⁰ with φ(x) = (1, √2·x_1, √2·x_2, √2·x_3, x_1², x_2², x_3², √2·x_1x_2, √2·x_1x_3, √2·x_2x_3). Then φ(x)⊤φ(z) = 1 + 2x_1z_1 + 2x_2z_2 + 2x_3z_3 + x_1²z_1² + x_2²z_2² + x_3²z_3² + 2x_1x_2z_1z_2 + 2x_1x_3z_1z_3 + 2x_2x_3z_2z_3. But this can be written as (1 + x⊤z)² = (1 + x_1z_1 + x_2z_2 + x_3z_3)², which takes far fewer operations to compute.

  27. Polynomial kernel: numeric example. Suppose x = (3, 2, 6) and z = (4, 1, 5). Then φ(x) = (1, 3√2, 2√2, 6√2, 9, 4, 36, 6√2, 18√2, 12√2) and φ(z) = (1, 4√2, √2, 5√2, 16, 1, 25, 4√2, 20√2, 5√2). Then φ(x)⊤φ(z) = 1 + 24 + 4 + 60 + 144 + 4 + 900 + 48 + 720 + 120 = 2025. But (1 + x⊤z)² = (1 + (3)(4) + (2)(1) + (6)(5))² = 45² = 2025 is a more efficient way to compute this dot product.
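
  The same numbers reproduced in R (not on the slides), computing the dot product both via the explicit feature map and via the kernel trick:

    phi <- function(x) c(1, sqrt(2) * x, x^2,
                         sqrt(2) * c(x[1] * x[2], x[1] * x[3], x[2] * x[3]))
    x <- c(3, 2, 6); z <- c(4, 1, 5)
    sum(phi(x) * phi(z))        # 2025, via the 10-dimensional feature space
    (1 + sum(x * z))^2          # 2025, via the kernel trick (1 + x'z)^2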

  28. Kernels. Linear kernel: k(x, x′) = x⊤x′. Two popular non-linear kernels are the polynomial kernel (of degree M): k(x, x′) = (x⊤x′ + c)^M, and the Gaussian (or radial) kernel: k(x, x′) = exp(−‖x − x′‖²/(2σ²)) (6.23), or k(x, x′) = exp(−γ‖x − x′‖²), where γ = 1/(2σ²).
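
  The three kernels as one-liners in R (not on the slides), transcribing the formulas above; c, M and gamma are the kernel parameters, with made-up default values:

    linear_kernel <- function(x, z) sum(x * z)                                  # x'z
    poly_kernel   <- function(x, z, c = 1, M = 2) (sum(x * z) + c)^M            # (x'z + c)^M
    gauss_kernel  <- function(x, z, gamma = 0.5) exp(-gamma * sum((x - z)^2))   # gamma = 1/(2 sigma^2)
    poly_kernel(c(3, 2, 6), c(4, 1, 5))                                         # 2025, as on the previous slide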

  29. Dual Representation with kernels. Using k(x, x′) = φ(x)⊤φ(x′) we get the dual representation: Maximize L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n a_m t_n t_m k(x_n, x_m) (7.10) with respect to a, subject to the constraints a_n ≥ 0, n = 1, ..., N (7.11), and Σ_{n=1}^N a_n t_n = 0 (7.12). Is this dual “easier” than the original problem?

  30. Prediction. Recall that y(x) = w⊤φ(x) + b (7.1). Substituting w = Σ_{n=1}^N a_n t_n φ(x_n) (7.8) into (7.1), we get y(x) = b + Σ_{n=1}^N a_n t_n k(x, x_n) (7.13).
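
  Equation (7.13) translates directly into an R function (a minimal sketch, not on the slides); the coefficients a, labels tn, training inputs X and offset b are assumed to come from solving the dual, and kernel can be any of the kernel functions above.

    svm_predict <- function(x_new, X, tn, a, b, kernel) {
      k_vals <- apply(X, 1, function(x_n) kernel(x_new, x_n))   # k(x_new, x_n) for all n
      b + sum(a * tn * k_vals)                                   # y(x) = b + sum_n a_n t_n k(x, x_n)
    }
    # Classify by the sign of the result; only the support vectors (a_n > 0) contribute.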
