Machine Learning Basics Lecture 5: SVM II Princeton University COS 495 Instructor: Yingyu Liang
Review: SVM objective
SVM: objective
• Classifier: $f_{w,b}(x) = w^\top x + b$
• Margin: let $y_i \in \{+1, -1\}$; the margin is $\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$
• Support Vector Machine: $\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$ (a small numerical sketch follows)
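As an added illustration (not on the original slide), here is a minimal sketch of computing the margin of a fixed classifier; the arrays X, y, w, b are made-up toy values.

```python
# Sketch: gamma = min_i y_i f_{w,b}(x_i) / ||w|| for a fixed classifier on toy data.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # inputs x_i
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {+1, -1}
w, b = np.array([1.0, 1.0]), 0.0                                    # a fixed classifier

gamma = np.min(y * (X @ w + b)) / np.linalg.norm(w)
print(gamma)   # 2*sqrt(2) ~= 2.83 for this toy setup
```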
SVM: optimization
• Optimization (quadratic programming):
  $\min_{w,b} \frac{1}{2}\|w\|^2$  subject to  $y_i(w^\top x_i + b) \ge 1, \ \forall i$
• Solved by the Lagrange multiplier method (a code sketch follows below):
  $L(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^\top x_i + b) - 1 \right]$
  where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
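A minimal sketch of this primal QP in code (not from the lecture), assuming the cvxpy solver is available; the toy data X, y are made up for illustration.

```python
# Hard-margin SVM primal: min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # n x d inputs
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {+1, -1}

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```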
Lagrange multiplier
Lagrangian
• Consider the optimization problem:
  $\min_x f(x)$  subject to  $h_j(x) = 0, \ 1 \le j \le l$
• Lagrangian:
  $L(x, \boldsymbol{\beta}) = f(x) + \sum_j \beta_j h_j(x)$
  where the $\beta_j$'s are called Lagrange multipliers
• Solved by setting the derivatives of the Lagrangian to 0 (a worked example follows):
  $\frac{\partial L}{\partial x_i} = 0; \quad \frac{\partial L}{\partial \beta_j} = 0$
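A small worked example (added here, not from the original slides): minimize $f(x) = x_1^2 + x_2^2$ subject to $h(x) = x_1 + x_2 - 1 = 0$.

$$
L(x, \beta) = x_1^2 + x_2^2 + \beta (x_1 + x_2 - 1), \qquad
\frac{\partial L}{\partial x_1} = 2 x_1 + \beta = 0, \quad
\frac{\partial L}{\partial x_2} = 2 x_2 + \beta = 0, \quad
\frac{\partial L}{\partial \beta} = x_1 + x_2 - 1 = 0,
$$

giving $x_1 = x_2 = \tfrac{1}{2}$ and $\beta = -1$.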
Generalized Lagrangian
• Consider the optimization problem:
  $\min_x f(x)$  subject to  $g_i(x) \le 0, \ 1 \le i \le k$  and  $h_j(x) = 0, \ 1 \le j \le l$
• Generalized Lagrangian:
  $L(x, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(x) + \sum_i \alpha_i g_i(x) + \sum_j \beta_j h_j(x)$
  where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers
Generalized Lagrangian
• Consider the quantity:
  $\theta_P(x) \equiv \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} L(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
• Why? Because
  $\theta_P(x) = f(x)$ if $x$ satisfies all the constraints, and $\theta_P(x) = +\infty$ if $x$ does not satisfy the constraints
• So minimizing $f(x)$ is the same as minimizing $\theta_P(x)$:
  $\min_x f(x) = \min_x \theta_P(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} L(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
Lagrange duality
• The primal problem:
  $p^* \equiv \min_x f(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} L(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
• The dual problem:
  $d^* \equiv \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_x L(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
• Always true (weak duality): $d^* \le p^*$
• Interesting case: when do we have $d^* = p^*$?
Lagrange duality
• Theorem: under proper conditions, there exist $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that
  $d^* = L(x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$
• Moreover, $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:
  $\frac{\partial L}{\partial x_i} = 0$ (stationarity)
  $\alpha_i g_i(x) = 0$ (dual complementarity)
  $g_i(x) \le 0, \ h_j(x) = 0$ (primal constraints)
  $\alpha_i \ge 0$ (dual constraints)
Lagrange duality
• What are the proper conditions?
• One set of conditions (Slater conditions):
  • $f$ and the $g_i$ are convex, and the $h_j$ are affine
  • There exists an $x$ satisfying all $g_i(x) < 0$
• Other sets of conditions exist; see the Karush-Kuhn-Tucker conditions entry on Wikipedia
SVM: optimization
SVM: optimization
• Optimization (quadratic programming):
  $\min_{w,b} \frac{1}{2}\|w\|^2$  subject to  $y_i(w^\top x_i + b) \ge 1, \ \forall i$
• Generalized Lagrangian:
  $L(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^\top x_i + b) - 1 \right]$
  where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
SVM: optimization
• KKT conditions:
  $\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$  (1)
  $\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; 0 = \sum_i \alpha_i y_i$  (2)
• Plug into $L$ (the substitution is written out below):
  $L(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$  (3)
  combined with $0 = \sum_i \alpha_i y_i$, $\alpha_i \ge 0$
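For completeness, the substitution that the slide summarizes as (3) can be written out (a short derivation, not on the original slide):

$$
\begin{aligned}
L(w, b, \boldsymbol{\alpha})
&= \tfrac{1}{2}\Big\|\sum_i \alpha_i y_i x_i\Big\|^2 - \sum_i \alpha_i y_i \Big(\sum_j \alpha_j y_j x_j\Big)^{\!\top} x_i - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
&= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j,
\end{aligned}
$$

where (1) substitutes $w$ in the first two terms and (2) makes the $b$ term vanish.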
SVM: optimization
• Reduces to the dual problem:
  $\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$
  subject to $\sum_i \alpha_i y_i = 0$, $\alpha_i \ge 0$
• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^\top x + b = \sum_i \alpha_i y_i x_i^\top x + b$
• Note: the dual and the prediction only depend on inner products $x_i^\top x_j$ (a code sketch follows)
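A minimal sketch of solving this dual in code (not from the lecture), again assuming cvxpy and made-up toy data; the quadratic term is written as a squared norm of $\sum_i \alpha_i y_i x_i$ so the solver accepts it directly.

```python
# SVM dual: max sum_i alpha_i - 0.5 * ||sum_i alpha_i y_i x_i||^2
#           s.t. sum_i alpha_i y_i = 0, alpha_i >= 0.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

alpha = cp.Variable(len(y))
v = cp.multiply(y, alpha)                                  # elementwise alpha_i * y_i
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ v))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

w = X.T @ (alpha.value * y)                                # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha.value)                                # a support vector (alpha_i > 0)
b = y[sv] - w @ X[sv]
print("w =", w, "b =", b, "alpha =", alpha.value)
```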
Kernel methods
Features
• Example: map an input image $x$ to a feature vector $\phi(x)$, e.g., extract a color histogram over the Red, Green, and Blue channels. [Figure omitted]
Features
Features
• A proper feature mapping can turn a non-linear problem into a linear one
• Using SVM on the feature space $\{\phi(x_i)\}$: only need $\phi(x_i)^\top \phi(x_j)$
• Conclusion: no need to design $\phi(\cdot)$ explicitly; only need to design the kernel $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ (a sketch of the resulting decision function follows)
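A minimal sketch (function names are illustrative, not from the slides) of how a kernel $k$ replaces every inner product in the learned classifier:

```python
# Sketch: the dual SVM decision function written purely in terms of a kernel k;
# phi is never computed explicitly. alpha and b are assumed to come from the dual solve.
import numpy as np

def decision_function(x, X_train, y_train, alpha, b, k):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b"""
    return sum(a * yi * k(xi, x) for a, yi, xi in zip(alpha, y_train, X_train)) + b

# example: with a linear kernel this reduces to w^T x + b
linear_kernel = lambda u, v: float(np.dot(u, v))
```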
Polynomial kernels
• Fix degree $d$ and constant $c$: $k(x, x') = (x^\top x' + c)^d$
• What is $\phi(x)$? Expand the expression to get $\phi(x)$ (a numerical check is sketched below)
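A small check (not from the slides): for 2-d inputs and $d = 2$, expanding $(x^\top x' + c)^2$ gives an explicit $\phi$, and $\phi(x)^\top \phi(x')$ reproduces the kernel value exactly.

```python
import numpy as np

def phi(x, c):
    """Explicit feature map for the degree-2 polynomial kernel on 2-d input."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, xp, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 1.0
print((x @ xp + c) ** 2)        # kernel value: 4.0
print(phi(x, c) @ phi(xp, c))   # same value via explicit features: 4.0
```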
Polynomial kernels — [Figure omitted] Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar
Gaussian kernels
• Fix bandwidth $\sigma$: $k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$
• Also called radial basis function (RBF) kernels
• What is $\phi(x)$? Consider the un-normalized version $k'(x, x') = \exp(x^\top x' / \sigma^2)$
• Power series expansion (a numerical check follows):
  $k'(x, x') = \sum_{i=0}^{+\infty} \frac{(x^\top x')^i}{\sigma^{2i} \, i!}$
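A quick numerical sketch (added, with illustrative values): evaluating the RBF kernel and checking the power-series expansion of the un-normalized part against a truncated sum.

```python
import math
import numpy as np

def rbf(x, xp, sigma):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x, xp, sigma = np.array([1.0, 0.5]), np.array([0.2, -0.3]), 1.5
print(rbf(x, xp, sigma))

s = float(x @ xp) / sigma ** 2
series = sum(s ** i / math.factorial(i) for i in range(20))   # truncated expansion
print(series, np.exp(s))                                      # nearly identical
```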
Mercer's condition for kernels
• Theorem: $k(x, x')$ has an expansion
  $k(x, x') = \sum_{i=1}^{+\infty} a_i \phi_i(x) \phi_i(x')$
  if and only if for any function $c(x)$,
  $\int \int c(x)\, c(x')\, k(x, x')\, dx\, dx' \ge 0$
  (omitting some technical conditions on $k$ and $c$; an empirical finite-sample check is sketched below)
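A finite-sample analogue of this condition (a sketch, not from the slides): for a valid kernel, the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$ is positive semidefinite on any sample.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                      # Gaussian kernel Gram matrix

print(np.linalg.eigvalsh(K).min())               # >= 0 up to numerical error
```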
Constructing new kernels
• Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{i=0}^{+\infty} a_i k^i(x, x')$
• Example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is $k(x, x') = 2 k_1(x, x') + 3 k_2(x, x')$
• Example: if $k_1(x, x')$ is a kernel, then so is $k(x, x') = \exp(k_1(x, x'))$ (both constructions are checked empirically below)
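An empirical sketch (added, with made-up data) that the two example constructions still give positive semidefinite Gram matrices, using a polynomial kernel as $k_1$ and a Gaussian kernel as $k_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

K1 = (X @ X.T + 1.0) ** 2                                    # polynomial kernel, d=2, c=1
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-sq / 2.0)                                       # Gaussian kernel, sigma=1

for K in (2 * K1 + 3 * K2, np.exp(K1)):                      # 2*k1 + 3*k2 and exp(k1)
    print(np.linalg.eigvalsh(K).min())                       # >= 0 up to numerical error
```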
Kernels vs. neural networks
Features
• Pipeline: input $x$ → extract features (e.g., a color histogram over the Red, Green, and Blue channels) → build hypothesis $y = w^\top \phi(x)$. [Figure omitted]
Features: part of the model
• The hypothesis $y = w^\top \phi(x)$: a nonlinear feature map $\phi$ (the features are part of the model) followed by a linear model on top
Polynomial kernels — [Figure omitted] Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar
Polynomial kernel SVM as a two-layer neural network
• For the degree-2 kernel on $x = (x_1, x_2)$ with constant $c$, the feature map is
  $\phi(x) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2, \; \sqrt{2c}\, x_1, \; \sqrt{2c}\, x_2, \; c)$
  and the classifier is $y = \mathrm{sign}(w^\top \phi(x) + b)$
• The first layer (the feature map) is fixed; if the first layer is also learned, it becomes a two-layer neural network (see the sketch below)
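A minimal sketch of this two-layer view (added, names are illustrative): the first layer is the fixed feature map, and only the second layer's weights are learned.

```python
import numpy as np

def first_layer(x, c=1.0):
    """Fixed, non-learned layer: the degree-2 polynomial feature map on 2-d input."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def predict(x, w, b):
    """Second layer: a learned linear classifier on top of the fixed features."""
    return np.sign(w @ first_layer(x) + b)
```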