SVMs and Kernel Methods Lecture 3 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin
Today’s lecture • Dual form of soft-margin SVM • Feature mappings & kernels • Convexity, Mercer’s theorem • (Time permitting) Extensions: imbalanced data, multi-class, other loss functions, L1 regularization
Recap of dual SVM derivation Can solve for optimal w, b as a function of α: setting ∂L/∂w = w − Σ_j α_j y_j x_j = 0 gives w = Σ_j α_j y_j x_j. Substituting these values back in (and simplifying), we obtain the dual problem: max_α Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j · x_k), subject to α_j ≥ 0 and Σ_j α_j y_j = 0. So, in the dual formulation we will solve for α directly! • w and b are computed from α (if needed)
Solving for the offset “b” Lagrangian: L(w, b, α) = ½‖w‖² − Σ_j α_j [ y_j (w · x_j + b) − 1 ]. α_j > 0 for some j implies the corresponding constraint is tight: y_j (w · x_j + b) = 1. We use this to obtain b: (1) pick any j with α_j > 0; (2) the tight constraint gives b = y_j − w · x_j; (3) substituting w = Σ_k α_k y_k x_k yields b = y_j − Σ_k α_k y_k (x_k · x_j)
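For concreteness, here is a minimal NumPy sketch (not from the slides; the toy data and the α values are made up for illustration) of recovering w and b from a dual solution:

```python
import numpy as np

# Hypothetical 2-D toy data and a dual solution alpha obtained from any QP solver.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])   # placeholder values for illustration

# w = sum_j alpha_j y_j x_j
w = (alpha * y) @ X

# Any j with alpha_j > 0 has a tight constraint y_j (w . x_j + b) = 1, so b = y_j - w . x_j
sv = np.argmax(alpha > 0)                  # index of one support vector
b = y[sv] - X[sv] @ w

print(w, b)
```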
Dual formulation only depends on dot-products of the features! First, we introduce a feature mapping φ(x). Next, replace the dot product x_j · x_k with an equivalent kernel function K(x_j, x_k) = φ(x_j) · φ(x_k); the dual constraints α ≥ 0 are unchanged. Do kernels need to be symmetric?
Classification rule using dual solution Using the dual solution, predict ŷ = sign( Σ_j α_j y_j (x_j · x) + b ), i.e. a dot product of the feature vector of the new example with the support vectors. Using a kernel function, predict with ŷ = sign( Σ_j α_j y_j K(x_j, x) + b )
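A possible Python rendering of this prediction rule, assuming the support vectors, their labels, the α's, and b are already available (all names here are hypothetical):

```python
import numpy as np

def predict(x_new, sv_X, sv_y, sv_alpha, b, kernel=np.dot):
    """Predict sign( sum_j alpha_j y_j K(x_j, x_new) + b ), summing over support vectors only."""
    score = sum(a * yj * kernel(xj, x_new)
                for a, yj, xj in zip(sv_alpha, sv_y, sv_X))
    return np.sign(score + b)
```

With the default kernel=np.dot this is the plain dot-product rule; swapping in any other kernel function leaves the code unchanged.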
Dual SVM interpretation: Sparsity [Figure: separating hyperplane w · x + b = 0 with margin boundaries w · x + b = +1 and w · x + b = −1] Final solution tends to be sparse • α_j = 0 for most j • don’t need to store these points to compute w or make predictions Non-support vectors: • α_j = 0 • moving them will not change w Support vectors: • α_j ≥ 0
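To see the sparsity in practice, one option (not from the slides) is to fit scikit-learn's SVC and check how few training points end up as support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors (alpha_j > 0) are stored; typically a small fraction of the data.
print(clf.n_support_.sum(), "support vectors out of", len(X))
# clf.dual_coef_ holds y_j * alpha_j for those points; every other point has alpha_j = 0.
```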
Soft-margin SVM Primal: min_{w,b,ξ} ½‖w‖² + C Σ_i ξ_i subject to y_i (w · x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0. Solve for w, b, α. Dual: max_α Σ_i α_i − ½ Σ_{i,k} α_i α_k y_i y_k (x_i · x_k) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0. What changed? • Added upper bound of C on α_i! • Intuitive explanation: • Without slack, α_i → ∞ when constraints are violated (points misclassified) • Upper bound of C limits the α_i, so misclassifications are allowed
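As a rough sketch (not from the slides), this dual can be handed to an off-the-shelf QP solver. Here is one way to set it up with cvxopt, assuming a linear kernel and cvxopt's standard min ½xᵀPx + qᵀx form; the helper name soft_margin_dual and the toy data are made up:

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_dual(X, y, C=1.0):
    """Solve max_a sum(a) - 1/2 a'(yy' * K)a  s.t.  0 <= a_i <= C,  sum_i a_i y_i = 0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    K = X @ X.T                                      # linear-kernel Gram matrix
    P = matrix(np.outer(y, y) * K)                   # quadratic term
    q = matrix(-np.ones(n))                          # maximizing sum(a) = minimizing -sum(a)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # encodes -a_i <= 0 and a_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1))                     # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    return np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

# usage sketch on two synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
alpha = soft_margin_dual(X, y, C=1.0)
```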
Common kernels • Polynomials of degree exactly d: K(u, v) = (u · v)^d • Polynomials of degree up to d: K(u, v) = (u · v + 1)^d • Gaussian kernels: K(u, v) = exp(−‖u − v‖² / 2σ²) • Sigmoid: K(u, v) = tanh(η u · v + ν) • And many others: very active area of research!
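These kernels can be written, for example, as the following NumPy one-liners (σ, η, ν are the usual kernel hyperparameters; the function names are made up):

```python
import numpy as np

def poly_kernel_exact(u, v, d):          # polynomials of degree exactly d
    return (u @ v) ** d

def poly_kernel_up_to(u, v, d):          # polynomials of degree up to d
    return (u @ v + 1) ** d

def gaussian_kernel(u, v, sigma=1.0):    # a.k.a. RBF kernel
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(u, v, eta=1.0, nu=0.0):
    return np.tanh(eta * (u @ v) + nu)
```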
Polynomial kernel d = 1: φ(u) = (u_1, u_2), so φ(u) · φ(v) = u_1 v_1 + u_2 v_2 = u · v. d = 2: φ(u) = (u_1², √2 u_1 u_2, u_2²), so φ(u) · φ(v) = u_1² v_1² + 2 u_1 v_1 u_2 v_2 + u_2² v_2² = (u_1 v_1 + u_2 v_2)² = (u · v)². For any d (we will skip the proof): φ(u) · φ(v) = (u · v)^d, i.e. polynomials of degree exactly d
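A quick numerical sanity check of the d = 2 identity above (a throwaway sketch, with the explicit map phi2 written out by hand):

```python
import numpy as np

def phi2(x):
    # explicit feature map for the degree-2 polynomial kernel in 2-D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi2(u) @ phi2(v), (u @ v) ** 2)   # both equal (u.v)^2 = 1
```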
Gaussian kernel [Figure: decision surface with level sets w · φ(x) = r for some r; support vectors highlighted] [Cynthia Rudin] [mblondel.org]
Kernel algebra Q: How would you prove that the “Gaussian kernel” K(u, v) = exp(−‖u − v‖² / 2σ²) is a valid kernel? A: Expand the Euclidean norm as follows: exp(−‖u − v‖² / 2σ²) = exp(−‖u‖² / 2σ²) · exp(u · v / σ²) · exp(−‖v‖² / 2σ²), then apply (e) from above. To see that the middle factor is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c). The feature mapping is infinite dimensional! [Justin Domke]
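One way to check Mercer's condition empirically (not a proof, just a sketch): build a Gaussian Gram matrix on random data and verify its eigenvalues are non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma = 1.0

# Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Mercer's condition: K must be symmetric positive semi-definite.
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True (up to numerical error)
```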
Overfitting? • Huge feature space with kernels: should we worry about overfitting? – SVM objective seeks a solution with large margin • Theory says that large margin leads to good generalization (we will see this in a couple of lectures) – But everything overfits sometimes!!! – Can control by: • Setting C • Choosing a better Kernel • Varying parameters of the Kernel (width of Gaussian, etc.)
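In practice these knobs are usually tuned by cross-validation; for instance, a scikit-learn sketch searching over C and the Gaussian kernel width (in scikit-learn's parameterization, gamma = 1/(2σ²)):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# Tune C and gamma jointly by 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```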
How to deal with imbalanced data? • In many practical applications we may have imbalanced data sets • We may want errors to be equally distributed between the positive and negative classes • A slight modification to the SVM objective does the trick! Class-specific weighting of the slack variables: min ½‖w‖² + C₊ Σ_{i: y_i = +1} ξ_i + C₋ Σ_{i: y_i = −1} ξ_i
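scikit-learn exposes exactly this kind of class-specific weighting through SVC's class_weight argument (the effective C for class k becomes C · class_weight[k]); a small sketch on made-up imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data: roughly 95% negatives, 5% positives.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# "balanced" sets the per-class weights inversely proportional to class frequencies;
# an explicit dict such as {0: 1.0, 1: 10.0} works as well.
clf = SVC(kernel="linear", C=1.0, class_weight="balanced").fit(X, y)
```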
How do we do multi-class classification?
One versus all classification Learn 3 classifiers: • − vs {o,+}, weights w_− • + vs {o,−}, weights w_+ • o vs {+,−}, weights w_o Predict label using ŷ = arg max_y (w_y · x + b_y). Any problems? Could we learn this (1-D) dataset? [Figure: 1-D dataset with the three classes placed at −1, 0, and 1]
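A rough one-vs-all sketch with scikit-learn's LinearSVC, where prediction takes the class whose classifier scores highest (the dataset and the number of classes are made up):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# One binary classifier per class: class k vs the rest.
classifiers = [LinearSVC(C=1.0).fit(X, (y == k).astype(int)) for k in range(3)]

# Predict with the class whose classifier gives the highest score w_k . x + b_k.
scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
y_hat = np.argmax(scores, axis=1)
```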
Multi-class SVM Simultaneously learn 3 sets of weights (w_−, w_+, w_o): • How do we guarantee the correct labels? • Need new constraints! The “score” of the correct class must be better than the “score” of wrong classes: w^{(y_j)} · x_j + b^{(y_j)} ≥ w^{(y')} · x_j + b^{(y')} + 1 for all y' ≠ y_j
Multi-class SVM As for the binary SVM, we introduce slack variables and maximize the margin: min ½ Σ_y ‖w^{(y)}‖² + C Σ_j ξ_j subject to w^{(y_j)} · x_j + b^{(y_j)} ≥ w^{(y')} · x_j + b^{(y')} + 1 − ξ_j for all y' ≠ y_j, with ξ_j ≥ 0. To predict, we use ŷ = arg max_y (w^{(y)} · x + b^{(y)}). Now can we learn it? [Figure: the 1-D three-class dataset at −1, 0, 1; e.g. b_+ = −0.5]
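This joint formulation is essentially the Crammer-Singer multi-class SVM (up to details such as the bias term), which scikit-learn's LinearSVC can fit directly; a brief sketch on made-up data:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Jointly learns one weight vector per class, with the constraint that the correct class's
# score beats every other class's score by a margin (minus slack).
clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.coef_.shape)   # (3, 2): one weight vector per class
```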