Introduction to Machine Learning
5. Support Vector Classification
Alex Smola, Carnegie Mellon University
10-701, http://alex.smola.org/teaching/cmu2013-10-701
Outline
• Support Vector Classification: large margin separation, optimization problem
• Properties: support vectors, kernel expansion
• Soft margin classifier: dual problem, robustness
Support Vector Machines
(cartoon: http://maktoons.blogspot.com/2009/03/support-vector-machine.html)
Linear Separator
[Figure sequence: candidate linear separators between "Ham" and "Spam" points]
Large Margin Classifier
linear function $f(x) = \langle w, x \rangle + b$
classification constraints: $\langle w, x \rangle + b \geq 1$ for one class, $\langle w, x \rangle + b \leq -1$ for the other
Large Margin Classifier
margin hyperplanes: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$
margin:
$\left\langle x_+ - x_-, \tfrac{w}{\|w\|} \right\rangle = \tfrac{1}{\|w\|}\big[ [\langle x_+, w \rangle + b] - [\langle x_-, w \rangle + b] \big] = \tfrac{2}{\|w\|}$
Large Margin Classifier
optimization problem:
$\underset{w,b}{\text{maximize}} \; \tfrac{1}{\|w\|} \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$
Large Margin Classifier
equivalent optimization problem:
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$
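To make the optimization problem concrete, here is a minimal sketch that solves this hard-margin primal on synthetic data. It assumes the cvxpy package and a toy two-cluster data set; neither is part of the lecture.

```python
import numpy as np
import cvxpy as cp

# Toy 2-D data: two well separated clusters, labels y in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

w, b = cp.Variable(2), cp.Variable()
# minimize (1/2)||w||^2  subject to  y_i [<x_i, w> + b] >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print("w =", w.value, " b =", b.value, " margin =", 2 / np.linalg.norm(w.value))
```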
Dual Problem
• Primal optimization problem
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$ (constraint)
• Lagrange function
$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] - 1 \big]$
Optimality in $w, b$ is at a saddle point with $\alpha$.
• Derivatives in $w, b$ need to vanish.
Dual Problem
• Lagrange function
$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] - 1 \big]$
• Derivatives in $w, b$ need to vanish:
$\partial_w L(w, b, \alpha) = w - \sum_i \alpha_i y_i x_i = 0$
$\partial_b L(w, b, \alpha) = \sum_i \alpha_i y_i = 0$
• Plugging these back into $L$ yields
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$
$\text{subject to} \; \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \geq 0$
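The dual is a quadratic program in $\alpha$. A sketch of handing it to an off-the-shelf QP solver, assuming the cvxopt package and placeholder arrays X, y for the training data (none of this is from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard(X, y):
    """Solve the hard-margin dual; returns the Lagrange multipliers alpha."""
    n = len(y)
    K = X @ X.T                                 # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)              # P_ij = y_i y_j <x_i, x_j>
    q = matrix(-np.ones(n))                     # maximizing sum(alpha) = minimizing -sum(alpha)
    G = matrix(-np.eye(n))                      # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))  # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])
```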
Support Vector Machines
primal: $\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$
weight vector: $w = \sum_i y_i \alpha_i x_i$
dual: $\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \geq 0$
Support Vectors
primal: $\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle x_i, w \rangle + b] \geq 1$
weight vector: $w = \sum_i y_i \alpha_i x_i$
Karush-Kuhn-Tucker optimality condition: $\alpha_i \big[ y_i [\langle w, x_i \rangle + b] - 1 \big] = 0$, hence either $\alpha_i = 0$, or $\alpha_i > 0 \Rightarrow y_i [\langle w, x_i \rangle + b] = 1$
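Once the dual is solved, the KKT conditions above give a direct recipe for recovering $w$ and $b$ from the nonzero $\alpha_i$. A sketch (the helper name and tolerance are mine, not from the slides):

```python
import numpy as np

def recover_primal(X, y, alpha, tol=1e-6):
    sv = alpha > tol                      # support vectors: alpha_i > 0
    w = (alpha[sv] * y[sv]) @ X[sv]       # w = sum_i y_i alpha_i x_i
    # KKT: y_i [<w, x_i> + b] = 1 for support vectors  =>  b = y_i - <w, x_i>
    b = np.mean(y[sv] - X[sv] @ w)        # average over SVs for numerical stability
    return w, b, sv
```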
Properties
$w = \sum_i y_i \alpha_i x_i$
• Weight vector $w$ is a weighted linear combination of training instances.
• Only points on the margin matter (ignore the rest and get the same solution).
• Only inner products matter:
  • the dual is a quadratic program
  • we can replace the inner product by a kernel (see the sketch below)
• Keeps instances away from the margin.
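Because only inner products appear in the dual, swapping the Gram matrix $X X^\top$ for a kernel matrix is a one-line change. A sketch with a Gaussian RBF kernel; the kernel choice and the bandwidth gamma are illustrative, not from this slide:

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * d2)

# In svm_dual_hard above, replace  K = X @ X.T  with  K = rbf_gram(X).
```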
Example
[Figure sequence: large margin classifier on example data]
Why large margins?
• Maximum robustness relative to uncertainty
• Symmetry breaking
• Independent of correctly classified instances
• Easy to find for easy problems
[Figure: margin of width $\rho$ separating the 'o' and '+' points]
Support Vector Classifiers
Large Margin Classifier
linear function $f(x) = \langle w, x \rangle + b$ with constraints $\langle w, x \rangle + b \geq 1$ and $\langle w, x \rangle + b \leq -1$
[Figure sequence: data for which a linear separator is impossible]
Large Margin Classifier
Theorem (Minsky & Papert): finding the minimum error separating hyperplane is NP hard.
[Figure: non-separable data; computing the minimum error separator is not a practical route]
Adding slack variables
relaxed constraints: $\langle w, x \rangle + b \geq 1 - \xi$ and $\langle w, x \rangle + b \leq -(1 - \xi)$
Convex optimization problem: minimize the amount of slack.
[Figure sequence: margin violations absorbed by slack variables $\xi$]
Intermezzo: Convex Programs for Dummies
• Primal optimization problem
$\underset{x}{\text{minimize}} \; f(x) \quad \text{subject to} \quad c_i(x) \leq 0$
• Lagrange function
$L(x, \alpha) = f(x) + \sum_i \alpha_i c_i(x)$
• First order optimality conditions in $x$
$\partial_x L(x, \alpha) = \partial_x f(x) + \sum_i \alpha_i \partial_x c_i(x) = 0$
• Solve for $x$ and plug it back into $L$:
$\underset{\alpha}{\text{maximize}} \; L(x(\alpha), \alpha)$ (keep explicit constraints)
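A one-dimensional worked example of this recipe (not on the slide): minimize $f(x) = x^2$ subject to $c(x) = 1 - x \leq 0$. The Lagrangian is $L(x, \alpha) = x^2 + \alpha(1 - x)$; setting $\partial_x L = 2x - \alpha = 0$ gives $x(\alpha) = \alpha/2$. Substituting back, $L(x(\alpha), \alpha) = \alpha - \alpha^2/4$, which is maximized over $\alpha \geq 0$ at $\alpha = 2$, yielding $x = 1$, exactly the constrained minimizer.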
Adding slack variables
• Hard margin problem
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i [\langle w, x_i \rangle + b] \geq 1$
• With slack variables
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i \;\text{and}\; \xi_i \geq 0$
The problem is always feasible. Proof: take $w = 0$, $b = 0$, $\xi_i = 1$ (this also yields an upper bound on the objective).
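A sketch of this soft-margin primal, again assuming cvxpy; C is the slack penalty from the slide and X, y are placeholder training arrays:

```python
import numpy as np
import cvxpy as cp

def soft_margin_primal(X, y, C=1.0):
    n, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value   # xi_i > 0 marks a margin violation
```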
Dual Problem
• Primal optimization problem
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i \;\text{and}\; \xi_i \geq 0$
• Lagrange function
$L(w, b, \xi, \alpha, \eta) = \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] + \xi_i - 1 \big] - \sum_i \eta_i \xi_i$
Optimality in $w, b, \xi$ is at a saddle point with $\alpha, \eta$.
• Derivatives in $w, b, \xi$ need to vanish.
Dual Problem
• Lagrange function
$L(w, b, \xi, \alpha, \eta) = \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big[ y_i [\langle x_i, w \rangle + b] + \xi_i - 1 \big] - \sum_i \eta_i \xi_i$
• Derivatives in $w, b, \xi$ need to vanish:
$\partial_w L(w, b, \xi, \alpha, \eta) = w - \sum_i \alpha_i y_i x_i = 0$
$\partial_b L(w, b, \xi, \alpha, \eta) = \sum_i \alpha_i y_i = 0$
$\partial_{\xi_i} L(w, b, \xi, \alpha, \eta) = C - \alpha_i - \eta_i = 0$
• Plugging these back into $L$ yields the dual; the bound $\alpha_i \leq C$ limits the influence of any single point:
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$
$\text{subject to} \; \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
Karush-Kuhn-Tucker Conditions
dual: $\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
weight vector: $w = \sum_i y_i \alpha_i x_i$
KKT conditions: $\alpha_i \big[ y_i [\langle w, x_i \rangle + b] + \xi_i - 1 \big] = 0$ and $\eta_i \xi_i = 0$, hence
$\alpha_i = 0 \;\Rightarrow\; y_i [\langle w, x_i \rangle + b] \geq 1$
$0 < \alpha_i < C \;\Rightarrow\; y_i [\langle w, x_i \rangle + b] = 1$
$\alpha_i = C \;\Rightarrow\; y_i [\langle w, x_i \rangle + b] \leq 1$
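On the solver side, the only change from the hard-margin dual is the box constraint $\alpha_i \in [0, C]$. A sketch reusing the cvxopt setup from earlier (assumed, not from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_soft(X, y, C=1.0):
    n = len(y)
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(n))
    # Stack  -alpha_i <= 0  and  alpha_i <= C  into a single inequality G alpha <= h.
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    return np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
```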
[Figure sequence: linear soft-margin solutions for C = 1, 2, 5, 10, 20, 50, 100 on four example data sets]
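The effect of C in these figures can be reproduced with any SVM package. A sketch assuming scikit-learn and a synthetic data set (not the one used in the lecture):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, cluster_std=1.5, random_state=0)
for C in [1, 2, 5, 10, 20, 50, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C penalizes slack more heavily, typically leaving fewer support vectors.
    print(f"C={C:4d}: {clf.n_support_.sum()} support vectors")
```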
Solving the optimization problem
• Dual problem
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
• If the problem is small enough (1000s of variables) we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO).
• For larger problems, use the fact that only the support vectors matter and solve in blocks (active set methods).
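For small problems almost any constrained solver works. A sketch using SciPy's SLSQP on the dual (an assumed alternative, not mentioned on the slide), with the Gram or kernel matrix K passed in directly:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_scipy(K, y, C=1.0):
    n = len(y)
    Q = np.outer(y, y) * K
    fun = lambda a: 0.5 * a @ Q @ a - a.sum()        # negated dual objective (minimize)
    jac = lambda a: Q @ a - np.ones(n)
    cons = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
    res = minimize(fun, np.zeros(n), jac=jac, method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=cons)
    return res.x
```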
Nonlinear Separation
The Kernel Trick
• Linear soft margin problem
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i [\langle w, x_i \rangle + b] \geq 1 - \xi_i \;\text{and}\; \xi_i \geq 0$
• Dual problem
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
• Support vector expansion
$f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
The Kernel Trick
• Soft margin problem with feature map $\phi$
$\underset{w,b}{\text{minimize}} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i [\langle w, \phi(x_i) \rangle + b] \geq 1 - \xi_i \;\text{and}\; \xi_i \geq 0$
• Dual problem
$\underset{\alpha}{\text{maximize}} \; -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{and}\; \alpha_i \in [0, C]$
• Support vector expansion
$f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$
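In practice the kernelized classifier is usually fit with a library. A sketch assuming scikit-learn's SVC, with an RBF kernel and parameter values chosen only for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)

# decision_function(x) evaluates sum_i alpha_i y_i k(x_i, x) + b over the support vectors.
print("support vectors per class:", clf.n_support_)
print("f(x) on three points:", clf.decision_function(X[:3]))
```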
[Figure sequence: kernelized (nonlinear) soft-margin solutions for C = 1, 2, 5, 10, 20, 50, 100 on four example data sets]