Support Vector Machines & Kernels: Lecture 5 (David Sontag, New York University)


  1. Support Vector Machines & Kernels, Lecture 5. David Sontag, New York University. Slides adapted from Luke Zettlemoyer and Carlos Guestrin.

  2. Support Vector Machines. QP form vs. the more “natural” form (a regularization term plus an empirical loss term); the two are equivalent for an appropriate choice of constants.
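The two formulations on this slide were rendered as images and did not survive extraction. A reconstruction of the standard forms (the exact constants on the original slide may differ):

```latex
% Soft-margin SVM, QP form (slack variables \xi_j):
\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{j=1}^{n}\xi_j
\quad\text{s.t. } y_j(w\cdot x_j + b) \ge 1-\xi_j,\;\; \xi_j \ge 0

% More "natural" unconstrained form: regularization + empirical (hinge) loss
\min_{w,b}\ \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{j=1}^{n}\max\bigl(0,\,1-y_j(w\cdot x_j+b)\bigr)

% Equivalent if \lambda = 1/(nC), up to an overall scaling of the objective.
```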

  3. Subgradient method

  4. Subgradient method: the update rule and the choice of step size.
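The update itself was an image on the slide; a plausible reconstruction of the batch subgradient step for the regularized hinge-loss objective above, with the 1/(λt) step size used later by Pegasos:

```latex
% Subgradient of f(w) = (\lambda/2)\|w\|^2 + (1/n)\sum_j \max(0,\, 1 - y_j\, w\cdot x_j):
g_t = \lambda w_t - \frac{1}{n}\sum_{j:\; y_j (w_t\cdot x_j) < 1} y_j x_j \;\in\; \partial f(w_t)

% Update with step size \eta_t:
w_{t+1} = w_t - \eta_t\, g_t, \qquad \eta_t = \frac{1}{\lambda t}
```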

  5. Stochastic subgradient: estimate the subgradient from a single randomly drawn example rather than from the full training set.
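A likely reading of the stochastic version (my reconstruction): replace the average over all examples with a single randomly drawn example i_t, giving an unbiased estimate of the subgradient:

```latex
g_t = \lambda w_t - \mathbb{1}\bigl[\,y_{i_t}(w_t\cdot x_{i_t}) < 1\,\bigr]\; y_{i_t} x_{i_t},
\qquad w_{t+1} = w_t - \eta_t\, g_t
```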

  6. PEGASOS: at each iteration, take a subgradient step computed on a random subset A_t of the training set S, optionally followed by a projection step. Choosing A_t = S recovers the (batch) subgradient method; |A_t| = 1 gives the stochastic subgradient method.
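A minimal sketch of the Pegasos update in Python, with |A_t| = 1 and the optional projection onto the ball of radius 1/√λ. The function name, the bias-free formulation, and all variable names are my assumptions, not taken from the slides:

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=10000, project=True, seed=0):
    """Train a linear SVM (no bias term) with the Pegasos algorithm.

    X: (n_examples, n_features) array; y: labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)              # |A_t| = 1: pick one random example
        eta = 1.0 / (lam * t)            # step size 1/(lambda * t)
        if y[i] * X[i].dot(w) < 1:       # hinge loss active: subgradient has a data term
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                            # only the regularizer contributes
            w = (1 - eta * lam) * w
        if project:                      # optional projection onto the ball of radius 1/sqrt(lambda)
            norm = np.linalg.norm(w)
            if norm > 0:
                w *= min(1.0, 1.0 / (np.sqrt(lam) * norm))
    return w

# Usage sketch: w = pegasos(X_train, y_train); predictions = np.sign(X_test @ w)
```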

  7. Run-Time of Pegasos
      • Choosing |A_t| = 1: run-time required for Pegasos to find an ε-accurate solution w.p. ≥ 1 − δ (n = # of features).
      • Run-time does not depend on the # of examples.
      • Depends on the “difficulty” of the problem (λ and ε).
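The bound itself was an image. The form consistent with these bullets (depends on n, λ, ε but not on the number of examples) is, up to log factors; the exact expression on the slide, including its dependence on δ, may differ:

```latex
\text{run-time} \;=\; \tilde{O}\!\left(\frac{n}{\lambda\,\epsilon}\right)
```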

  8. Experiments
      • 3 datasets (provided by Joachims):
        – Reuters CCAT (800K examples, 47k features)
        – Physics ArXiv (62k examples, 100k features)
        – Covertype (581k examples, 54 features)
      Training time (in seconds):
                        Pegasos   SVM-Perf   SVM-Light
        Reuters             2         77        20,075
        Covertype           6         85        25,514
        Astro-Physics       2          5            80

  9. What’s Next! • Learn one of the most interesting and exciting recent advancements in machine learning – The “kernel trick” – High dimensional feature spaces at no extra cost • But first, a detour – Constrained optimization!

  10. Constrained optimization. Example: minimize x^2 under different constraints. With no constraint, x* = 0; with the constraint x ≥ -1, still x* = 0 (the constraint is inactive); with the constraint x ≥ 1, x* = 1 (the constraint is active). How do we solve with constraints? Lagrange multipliers!

  11. Lagrange multipliers – dual variables
      • Start from the constraint x ≥ b, rewrite it as x − b ≥ 0, and add a Lagrange multiplier α ≥ 0.
      • Introduce the Lagrangian (objective): L(x, α) = x^2 − α(x − b).
      • We will solve: min_x max_{α ≥ 0} L(x, α).
      Why is this equivalent? min is fighting max!
      • If x < b, then (x − b) < 0, so max_{α ≥ 0} −α(x − b) = ∞; the outer min won't let this happen.
      • If x > b, then with the constraint α ≥ 0 we have (x − b) > 0, so max_α −α(x − b) = 0 with α* = 0, and L(x, α) = x^2 (the original objective).
      • If x = b, then α can be anything, and L(x, α) = x^2 (the original objective).
      The min on the outside forces max to behave, so the constraint will be satisfied.
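As a concrete check of this machinery, here is the b = 1 case from the previous slide (minimize x^2 subject to x ≥ 1) worked out via the dual; this worked example is my illustration, not a slide from the deck:

```latex
% Example: minimize x^2 subject to x >= 1  (i.e. b = 1)
L(x,\alpha) = x^2 - \alpha(x - 1), \qquad \min_x \max_{\alpha \ge 0} L(x,\alpha)

% Inner minimization over x (after swapping min and max, as on the following slides):
%   dL/dx = 2x - \alpha = 0   =>   x = \alpha/2
g(\alpha) = L(\alpha/2,\alpha) = \alpha^2/4 - \alpha(\alpha/2 - 1) = -\alpha^2/4 + \alpha

% Maximize over \alpha >= 0:  g'(\alpha) = -\alpha/2 + 1 = 0  =>  \alpha^* = 2,\ x^* = 1
% This recovers the constrained optimum x^* = 1.
```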

  12. Dual SVM derivation (1) – the linearly separable case (hard-margin SVM). Start from the original optimization problem, rewrite the per-example constraints, add one Lagrange multiplier per constraint to form the Lagrangian, and state the min-max problem we now want to solve.
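The formulas on this slide were images; the standard hard-margin derivation they refer to is:

```latex
% Original (primal) problem:
\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad\text{s.t. } y_j(w\cdot x_j + b) \ge 1 \;\;\forall j

% One Lagrange multiplier \alpha_j \ge 0 per rewritten constraint 1 - y_j(w\cdot x_j + b) \le 0:
L(w,b,\alpha) = \tfrac{1}{2}\|w\|^2 - \sum_j \alpha_j\bigl[y_j(w\cdot x_j + b) - 1\bigr]

% Our goal now is to solve:
\min_{w,b}\ \max_{\alpha \ge 0}\ L(w,b,\alpha)
```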

  13. Dual SVM derivation (2) – the linearly separable case (hard margin SVM) (Primal) Swap min and max (Dual) Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!
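In symbols, the swap referred to here (equality of the two problems is guaranteed by Slater's condition, since the problem is convex and strictly feasible):

```latex
\underbrace{\min_{w,b}\ \max_{\alpha\ge 0}\ L(w,b,\alpha)}_{\text{primal}}
\;=\;
\underbrace{\max_{\alpha\ge 0}\ \min_{w,b}\ L(w,b,\alpha)}_{\text{dual}}
```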

  14. Dual SVM derivation (3) – the linearly separable case (hard margin SVM). We can solve for the optimal w, b as a function of α: setting ∂L/∂w = w − Σ_j α_j y_j x_j to zero gives w = Σ_j α_j y_j x_j. Substituting these values back in (and simplifying), we obtain the dual problem, in which the sums run over all training examples, the α's and y's are scalars, and the data enter only through dot products.
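The resulting dual problem (reconstructed here, since it appeared only as an image on the slide):

```latex
% \partial L/\partial w = 0 gives w = \sum_j \alpha_j y_j x_j;
% \partial L/\partial b = 0 gives the constraint \sum_j \alpha_j y_j = 0.
\max_{\alpha}\ \sum_i \alpha_i
 - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j\, y_i y_j\,(x_i\cdot x_j)
\quad\text{s.t. } \alpha_i \ge 0,\;\; \sum_i \alpha_i y_i = 0
```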

  15. Reminder: What if the data is not linearly separable? Use features of features of features of features….
      φ(x) = ( x^(1), …, x^(n), x^(1)x^(2), x^(1)x^(3), …, e^{x^(1)}, … )
      Feature space can get really large really quickly!

  16. Higher order polynomials. [Figure: number of monomial terms vs. number of input dimensions, one curve per degree d = 2, 3, 4; m = input features, d = degree of polynomial.] The number of terms grows fast! For d = 6 and m = 100, there are about 1.6 billion terms.
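The count behind “grows fast”: the number of monomials of degree exactly d in m variables is given by the binomial coefficient below (the formula is not on the slide, but it reproduces the 1.6 billion figure):

```latex
\#\{\text{monomials of degree } d \text{ in } m \text{ variables}\} = \binom{m+d-1}{d},
\qquad \binom{105}{6} \approx 1.6\times 10^{9} \ \text{ for } d=6,\ m=100
```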

  17. Dual formulation only depends on dot products of the features! First, we introduce a feature mapping φ(x). Next, we replace each dot product φ(x_i)·φ(x_j) with an equivalent kernel function K(x_i, x_j).
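Spelled out (a reconstruction of the standard kernelized dual; the slide's exact notation may differ): every dot product x_i·x_j in the dual is replaced by K(x_i, x_j) = φ(x_i)·φ(x_j), and prediction also needs only the kernel:

```latex
\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\,y_i y_j\,K(x_i,x_j)
\quad\text{s.t. } \alpha_i \ge 0,\;\; \sum_i \alpha_i y_i = 0

f(x) = \operatorname{sign}\Bigl(\textstyle\sum_i \alpha_i y_i\,K(x_i,x) + b\Bigr)
```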

  18. Polynomial kernel
      d = 1:  φ(u)·φ(v) = (u_1, u_2)·(v_1, v_2) = u_1 v_1 + u_2 v_2 = u·v
      d = 2:  φ(u)·φ(v) = (u_1^2, √2 u_1 u_2, u_2^2)·(v_1^2, √2 v_1 v_2, v_2^2)
                        = u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2
                        = (u_1 v_1 + u_2 v_2)^2
                        = (u·v)^2
      For any d (we will skip the proof): φ(u)·φ(v) = (u·v)^d, i.e. polynomials of degree exactly d.
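A quick numerical check of the d = 2 identity above. The explicit feature map follows the slide; the function name and test vectors are my own illustration:

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

u = np.array([0.5, -2.0])
v = np.array([1.5, 3.0])

explicit = phi2(u).dot(phi2(v))   # dot product in the expanded feature space
kernel = u.dot(v) ** 2            # polynomial kernel of degree exactly 2

print(explicit, kernel)           # both print the same value: (u.v)^2
assert np.isclose(explicit, kernel)
```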

  19. Common kernels • Polynomials of degree exactly d • Polynomials of degree up to d • Gaussian kernels • Sigmoid • And many others: very active area of research!
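The kernel formulas were omitted in the extraction; the standard definitions these bullets refer to are given below (the Gaussian and sigmoid parameterizations are the usual ones and may differ slightly from the slide):

```latex
K(u,v) = (u\cdot v)^d                                % polynomial, degree exactly d
K(u,v) = (u\cdot v + 1)^d                            % polynomial, degree up to d
K(u,v) = \exp\!\bigl(-\|u-v\|^2 / 2\sigma^2\bigr)    % Gaussian (RBF)
K(u,v) = \tanh(\eta\, u\cdot v + \nu)                % sigmoid
```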
