Machine Learning Basics, Lecture 5: SVM II (Princeton University COS 495, Instructor: Yingyu Liang)


  1. Machine Learning Basics, Lecture 5: SVM II. Princeton University COS 495. Instructor: Yingyu Liang

  2. Review: SVM objective

  3. SVM: objective
  • Classifier: $f_{w,b}(x) = w^\top x + b$
  • Margin: let $y_j \in \{+1, -1\}$ be the labels; the margin is $\gamma = \min_j \frac{y_j \, f_{w,b}(x_j)}{\|w\|}$
  • Support Vector Machine: $\max_{w,b} \gamma = \max_{w,b} \min_j \frac{y_j \, f_{w,b}(x_j)}{\|w\|}$

  4. SVM: optimization
  • Optimization (quadratic programming):
    $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_j (w^\top x_j + b) \ge 1, \; \forall j$
  • Solved by the Lagrange multiplier method:
    $\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_j \alpha_j \left[ y_j (w^\top x_j + b) - 1 \right]$
    where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
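
To make the optimization concrete, here is a minimal sketch (not from the deck) that solves this primal QP on a made-up 2-D toy dataset, using SciPy's general-purpose SLSQP solver in place of a dedicated QP solver:

```python
# Minimal sketch: hard-margin SVM primal QP on a toy 2-D dataset (illustrative only).
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: x_j in R^2, labels y_j in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0], [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

def objective(theta):
    w = theta[:2]                      # theta = (w_1, w_2, b)
    return 0.5 * np.dot(w, w)          # (1/2) ||w||^2

constraints = [
    # y_j (w^T x_j + b) - 1 >= 0 for every training point j
    {"type": "ineq", "fun": lambda theta, xj=xj, yj=yj: yj * (np.dot(theta[:2], xj) + theta[2]) - 1.0}
    for xj, yj in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
print("margins:", y * (X @ w + b))     # all should be >= 1 (up to solver tolerance)
```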

  5. Lagrange multiplier

  6. Lagrangian
  • Consider the optimization problem:
    $\min_x f(x)$ subject to $h_j(x) = 0, \; \forall \, 1 \le j \le m$
  • Lagrangian:
    $\mathcal{L}(x, \boldsymbol{\beta}) = f(x) + \sum_j \beta_j h_j(x)$
    where the $\beta_j$'s are called Lagrange multipliers

  7. Lagrangian
  • Consider the optimization problem:
    $\min_x f(x)$ subject to $h_j(x) = 0, \; \forall \, 1 \le j \le m$
  • Solved by setting the derivatives of the Lagrangian to 0:
    $\frac{\partial \mathcal{L}}{\partial x_j} = 0; \quad \frac{\partial \mathcal{L}}{\partial \beta_j} = 0$
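
For example (an illustration added here, not a slide), solving $\min x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$ by setting the Lagrangian's derivatives to zero, with SymPy doing the algebra:

```python
# Illustrative example: min x1^2 + x2^2 subject to x1 + x2 - 1 = 0, via the Lagrangian.
import sympy as sp

x1, x2, beta = sp.symbols("x1 x2 beta")
f = x1**2 + x2**2                  # objective f(x)
h = x1 + x2 - 1                    # equality constraint h(x) = 0
L = f + beta * h                   # Lagrangian L(x, beta)

# Set all partial derivatives of L to zero and solve.
stationary = sp.solve([sp.diff(L, v) for v in (x1, x2, beta)], [x1, x2, beta], dict=True)
print(stationary)                  # [{x1: 1/2, x2: 1/2, beta: -1}]
```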

  8. Generalized Lagrangian
  • Consider the optimization problem:
    $\min_x f(x)$ subject to $g_j(x) \le 0, \; \forall \, 1 \le j \le l$ and $h_k(x) = 0, \; \forall \, 1 \le k \le m$
  • Generalized Lagrangian:
    $\mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(x) + \sum_j \alpha_j g_j(x) + \sum_k \beta_k h_k(x)$
    where the $\alpha_j$'s and $\beta_k$'s are called Lagrange multipliers

  9. Generalized Lagrangian
  • Consider the quantity $\theta_P(x) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_j \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
  • Why? $\theta_P(x) = f(x)$ if $x$ satisfies all the constraints, and $\theta_P(x) = +\infty$ if $x$ does not
  • So minimizing $f(x)$ is the same as minimizing $\theta_P(x)$:
    $\min_x f(x) = \min_x \theta_P(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_j \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
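
The case analysis behind the "Why?" bullet, written out (this is the standard argument, spelled out here for completeness):

```latex
% If some g_j(x) > 0, take \alpha_j \to +\infty; if some h_k(x) \neq 0, take \beta_k \to \pm\infty:
% the max blows up. If all constraints hold, the best choice is \alpha_j g_j(x) = 0, \beta_k h_k(x) = 0.
\theta_P(x)
  = \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta}}
    \Big[ f(x) + \sum_j \alpha_j g_j(x) + \sum_k \beta_k h_k(x) \Big]
  = \begin{cases}
      f(x)    & \text{if } g_j(x) \le 0 \ \forall j \ \text{and}\ h_k(x) = 0 \ \forall k, \\
      +\infty & \text{otherwise.}
    \end{cases}
```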

  10. Lagrange duality
  • The primal problem: $p^* := \min_x f(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_j \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
  • The dual problem: $d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_j \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
  • Always true: $d^* \le p^*$
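
The "always true" statement is weak duality (the max-min inequality); a one-line justification, added here for completeness:

```latex
% Weak duality: for any \alpha' \ge 0, \beta', and any x,
%   \min_{x'} L(x', \alpha', \beta') \le L(x, \alpha', \beta') \le \max_{\alpha \ge 0, \beta} L(x, \alpha, \beta).
% Taking the max over (\alpha', \beta') on the left and the min over x on the right gives
d^* \;=\; \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta}} \; \min_{x} \; \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})
\;\;\le\;\;
\min_{x} \; \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta}} \; \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta}) \;=\; p^*.
```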

  11. Lagrange duality
  • The primal problem: $p^* := \min_x f(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_j \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
  • The dual problem: $d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_j \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
  • Interesting case: when do we have $d^* = p^*$?

  12. Lagrange duality
  • Theorem: under proper conditions, there exist $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that
    $d^* = \mathcal{L}(x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$
  • Moreover, $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:
    $\frac{\partial \mathcal{L}}{\partial x_j} = 0, \quad \alpha_j g_j(x) = 0, \quad g_j(x) \le 0, \quad h_k(x) = 0, \quad \alpha_j \ge 0$

  13. Lagrange duality
  • Same theorem as above, with one condition highlighted: $\alpha_j g_j(x) = 0$ is called dual complementarity

  14. Lagrange duality
  • Same theorem as above, with the remaining conditions labeled: $g_j(x) \le 0$ and $h_k(x) = 0$ are the primal constraints, and $\alpha_j \ge 0$ are the dual constraints

  15. Lagrange duality
  • What are the proper conditions?
  • One set of conditions (the Slater conditions):
    • $f$ and the $g_j$'s are convex, the $h_k$'s are affine
    • There exists $x$ satisfying $g_j(x) < 0$ for all $j$
  • Other sets of conditions exist; see the Karush-Kuhn-Tucker conditions entry on Wikipedia

  16. SVM: optimization

  17. SVM: optimization
  • Optimization (quadratic programming):
    $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_j (w^\top x_j + b) \ge 1, \; \forall j$
  • Generalized Lagrangian:
    $\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_j \alpha_j \left[ y_j (w^\top x_j + b) - 1 \right]$
    where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers

  18. SVM: optimization
  • KKT conditions:
    $\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_j \alpha_j y_j x_j$   (1)
    $\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; 0 = \sum_j \alpha_j y_j$   (2)
  • Plug into $\mathcal{L}$:
    $\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_j \alpha_j - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k \, x_j^\top x_k$   (3)
    combined with $\sum_j \alpha_j y_j = 0$ and $\alpha_j \ge 0$
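
The algebra behind step (3), written out (substituting (1) and (2) into the Lagrangian):

```latex
\mathcal{L}(w, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\, w^\top w
    - w^\top \sum_j \alpha_j y_j x_j
    - b \sum_j \alpha_j y_j
    + \sum_j \alpha_j .
% By (1), \sum_j \alpha_j y_j x_j = w, so the first two terms combine into -\tfrac{1}{2} w^\top w;
% by (2), \sum_j \alpha_j y_j = 0, so the b-term vanishes. Expanding w^\top w with (1) gives
\mathcal{L} = \sum_j \alpha_j - \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k \, x_j^\top x_k ,
% which is exactly (3).
```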

  19. SVM: optimization
  • Reduces to the dual problem, which depends on the data only through inner products:
    $\max_{\boldsymbol{\alpha}} \; \sum_j \alpha_j - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k \, x_j^\top x_k$
    subject to $\sum_j \alpha_j y_j = 0$ and $\alpha_j \ge 0$
  • Since $w = \sum_j \alpha_j y_j x_j$, we have $w^\top x + b = \sum_j \alpha_j y_j \, x_j^\top x + b$
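
A minimal sketch (not the deck's code) that solves this dual on the same kind of made-up toy data as before, again with SciPy's SLSQP standing in for a real QP solver, and then predicts using only inner products:

```python
# Minimal sketch: hard-margin SVM dual on toy data; SLSQP stands in for a real QP solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0], [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)
G = (y[:, None] * y[None, :]) * (X @ X.T)                # G_jk = y_j y_k x_j^T x_k

def neg_dual(alpha):                                     # minimize the negative dual objective
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

cons = [{"type": "eq", "fun": lambda a: a @ y}]          # sum_j alpha_j y_j = 0
bounds = [(0.0, None)] * n                               # alpha_j >= 0
res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP", bounds=bounds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                                      # w = sum_j alpha_j y_j x_j
sv = alpha > 1e-6                                        # support vectors: alpha_j > 0
b = np.mean(y[sv] - X[sv] @ w)                           # from y_j (w^T x_j + b) = 1 on SVs
print("alpha =", np.round(alpha, 3))
print("predictions:", np.sign(X @ w + b))                # uses the data only via inner products
```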

  20. Kernel methods

  21. Features
  • Figure: from an image $x$, extract a feature vector $\phi(x)$, e.g., a color histogram over the red, green, and blue channels

  22. Features

  23. Features
  • A proper feature mapping can turn a non-linear problem into a linear one
  • Using SVM on the feature space $\{\phi(x_j)\}$: only the inner products $\phi(x_j)^\top \phi(x_k)$ are needed
  • Conclusion: no need to design $\phi(\cdot)$ itself; only need to design the kernel $k(x_j, x_k) = \phi(x_j)^\top \phi(x_k)$
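
This is exactly what a "precomputed kernel" interface exposes; a short sketch with scikit-learn's SVC(kernel="precomputed") on made-up data (assuming scikit-learn is available):

```python
# Sketch: SVM needs only the Gram matrix k(x_j, x_k), not phi itself; scikit-learn's
# kernel="precomputed" interface makes this explicit. Toy data for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])            # a non-linear concept (XOR-like)

def poly_kernel(A, B, c=1.0, d=2):        # k(x, x') = (x^T x' + c)^d
    return (A @ B.T + c) ** d

K_train = poly_kernel(X, X)               # Gram matrix on training points
clf = SVC(kernel="precomputed").fit(K_train, y)

X_test = rng.normal(size=(5, 2))
K_test = poly_kernel(X_test, X)           # kernel between test and training points
print(clf.predict(K_test))
```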

  24. Polynomial kernels
  • Fix a degree $d$ and a constant $c$: $k(x, x') = (x^\top x' + c)^d$
  • What is $\phi(x)$? Expand the expression to read off $\phi(x)$
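
A quick numerical check of this for $d = 2$ in two dimensions, using the standard explicit feature map for this kernel (illustrative only):

```python
# Check that (x^T x' + c)^2 equals phi(x)^T phi(x') for the explicit degree-2 feature map.
import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

c = 1.0
x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
lhs = (x @ xp + c) ** 2
rhs = phi(x, c) @ phi(xp, c)
print(lhs, rhs, np.isclose(lhs, rhs))     # the two agree
```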

  25. Polynomial kernels Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

  26. Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

  27. Gaussian kernels
  • Fix a bandwidth $\sigma$: $k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$
  • Also called radial basis function (RBF) kernels
  • What is $\phi(x)$? Consider the un-normalized version $k'(x, x') = \exp(x^\top x' / \sigma^2)$
  • Power series expansion: $k'(x, x') = \sum_{j=0}^{+\infty} \frac{(x^\top x')^j}{\sigma^{2j} \, j!}$
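
A quick numerical check (illustrative, with made-up points): truncating the power series approximates the un-normalized kernel, and multiplying by the two normalization factors recovers the RBF kernel:

```python
# Check: a truncated power series approximates k'(x, x') = exp(x^T x' / sigma^2).
import numpy as np
from math import factorial

sigma = 1.5
x, xp = np.array([0.4, -0.7]), np.array([1.1, 0.2])

exact = np.exp(x @ xp / sigma**2)
truncated = sum((x @ xp) ** j / (sigma ** (2 * j) * factorial(j)) for j in range(10))
print(exact, truncated)                   # close for a modest number of terms

# The RBF kernel adds the factors exp(-||x||^2 / 2 sigma^2) and exp(-||x'||^2 / 2 sigma^2):
rbf = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))
normalized = np.exp(-x @ x / (2 * sigma**2)) * np.exp(-xp @ xp / (2 * sigma**2)) * exact
print(rbf, normalized)                    # identical
```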

  28. Mercer's condition for kernels
  • Theorem: $k(x, x')$ has an expansion $k(x, x') = \sum_{j=1}^{+\infty} a_j \, \phi_j(x) \, \phi_j(x')$ if and only if for any function $c(x)$, $\int\!\!\int c(x) \, c(x') \, k(x, x') \, dx \, dx' \ge 0$ (omitting some technical conditions on $k$ and $c$)
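
A finite-sample analogue of this condition is that every Gram matrix $K_{jk} = k(x_j, x_k)$ must be positive semidefinite; a quick spot-check on random data (illustrative only):

```python
# Finite-sample analogue of Mercer's condition: the Gram matrix K with K_jk = k(x_j, x_k)
# is positive semidefinite, i.e. c^T K c >= 0 for every vector c.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))

K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 2.0)   # RBF Gram matrix
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())    # >= 0 up to numerical round-off
```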

  29. Constructing new kernels
  • Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{j=0}^{+\infty} a_j k^j(x, x')$
  • Example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is $k(x, x') = 2 k_1(x, x') + 3 k_2(x, x')$
  • Example: if $k_1(x, x')$ is a kernel, then so is $k(x, x') = \exp(k_1(x, x'))$
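
A quick spot-check of these two examples on random data (illustrative; positive semidefiniteness of the Gram matrix is the necessary condition being verified):

```python
# Spot-check the closure properties on random data: 2*k1 + 3*k2 and exp(k1) should
# again give positive semidefinite Gram matrices.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 4))

K1 = X @ X.T                                                   # linear kernel
K2 = (X @ X.T + 1.0) ** 2                                      # degree-2 polynomial kernel
for name, K in [("2*k1 + 3*k2", 2 * K1 + 3 * K2), ("exp(k1)", np.exp(K1))]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off
```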

  30. Kernels vs. neural networks

  31. Features
  • Figure: from an input $x$, extract features $\phi(x)$ (e.g., a color histogram over the red, green, and blue channels), then build the hypothesis $y = w^\top \phi(x)$

  32. Features: part of the model
  • Build the hypothesis $y = w^\top \phi(x)$: the feature map $\phi(\cdot)$ is the nonlinear part of the model, and the hypothesis on top of the features is a linear model

  33. Polynomial kernels Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

  34. Polynomial kernel SVM as a two-layer neural network
  • For the degree-2 kernel on $x = (x_1, x_2)$, the first layer computes the fixed features
    $\phi(x) = \left( x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2, \; \sqrt{2c}\, x_1, \; \sqrt{2c}\, x_2, \; c \right)$
    and the output layer computes $y = \mathrm{sign}(w^\top \phi(x) + b)$
  • The first layer is fixed; if the first layer is also learned, the model becomes a two-layer neural network
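
A small sketch of this view (illustrative; the dual variables and data below are made up rather than learned): the fixed first layer computes $\phi(x)$, the output layer has weights $w = \sum_j \alpha_j y_j \phi(x_j)$, and the network's decision agrees with the kernel form $\mathrm{sign}(\sum_j \alpha_j y_j k(x_j, x) + b)$:

```python
# Sketch of the "two-layer network" view of a degree-2 polynomial kernel SVM.
# Illustrative only; alpha, y, and the data are made up rather than learned.
import numpy as np

c = 1.0

def phi(X):                                    # fixed first layer: each row x -> phi(x)
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
                     np.full_like(x1, c)], axis=1)

rng = np.random.default_rng(3)
X_train = rng.normal(size=(5, 2))
y_train = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
alpha = rng.uniform(size=5)                    # stand-in for learned dual variables
b = 0.1

w = (alpha * y_train) @ phi(X_train)           # output-layer weights w = sum_j alpha_j y_j phi(x_j)
X_test = rng.normal(size=(3, 2))

net_view = np.sign(phi(X_test) @ w + b)                                   # sign(w^T phi(x) + b)
kernel_view = np.sign((alpha * y_train) @ ((X_train @ X_test.T + c) ** 2) + b)
print(net_view, kernel_view)                   # the two views agree
```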
