
BBM406 Fundamentals of Machine Learning, Lecture 17: Kernel Trick



  1. BBM406 Fundamentals of Machine Learning, Lecture 17: Kernel Trick for SVMs, Risk and Loss, Support Vector Regression 
 Aykut Erdem // Hacettepe University // Fall 2019 
 Photo by Arthur Gretton, CMU: Machine Learning Protestors at G20

  2. Administrative: Deadlines are much closer than they appear 
 • Project progress reports (described on the syllabus) are due soon! 
 Due: December 22, 2019 (11:59pm). Each group should submit a project progress report by December 22, 2019. The report should be 3-4 pages and should describe the following points as clearly as possible: 
 • Problem to be addressed. Give a short description of the problem that you will explore. Explain why you find it interesting. 
 • Related work. Briefly review the major works related to your research topic. 
 • Methodology to be employed. Describe the neural architecture that is expected to form the basis of the project. State whether you will extend an existing method or devise your own approach. 
 • Experimental evaluation. Briefly explain how you will evaluate your results. State which dataset(s) you will employ in your evaluation. Provide your preliminary results (if any).

  3. Last time… Soft-margin Classifier 
 Hard-margin constraints: $\langle w, x \rangle + b \geq 1$ for positive examples and $\langle w, x \rangle + b \leq -1$ for negative examples; for overlapping data a separator without error is impossible. 
 Theorem (Minsky & Papert): Finding the minimum-error separating hyperplane is NP-hard. 
 slide by Alex Smola

  4. Last time… Adding Slack Variables 
 Introduce slack variables $\xi_i \geq 0$ and relax the constraints to $\langle w, x \rangle + b \geq 1 - \xi$ and $\langle w, x \rangle + b \leq -1 + \xi$. 
 Minimize the amount of slack: a convex optimization problem. 
 slide by Alex Smola

  5. Last time… Adding Slack Variables 
 • For $0 < \xi \leq 1$ the point lies inside the margin but is still correctly classified. 
 • For $\xi > 1$ the point is misclassified. 
 Constraints: $\xi_i \geq 0$, $\langle w, x \rangle + b \geq 1 - \xi$ and $\langle w, x \rangle + b \leq -1 + \xi$. 
 Minimize the amount of slack: a convex optimization problem. 
 adopted from Andrew Zisserman

  6. Last time… Adding Slack Variables 
 • Hard margin problem: 
 $\min_{w,b} \; \frac{1}{2}\|w\|^2$ subject to $y_i \left[ \langle w, x_i \rangle + b \right] \geq 1$ 
 • With slack variables: 
 $\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i \left[ \langle w, x_i \rangle + b \right] \geq 1 - \xi_i$ and $\xi_i \geq 0$ 
 The problem is always feasible. Proof: $w = 0$, $b = 0$ and $\xi_i = 1$ (this also yields an upper bound on the optimal objective). 
 slide by Alex Smola

  7. Soft-margin classifier 
 • Optimization problem: 
 $\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i \left[ \langle w, x_i \rangle + b \right] \geq 1 - \xi_i$ and $\xi_i \geq 0$ 
 C is a regularization parameter: 
 • small C allows constraints to be easily ignored → large margin 
 • large C makes constraints hard to ignore → narrow margin 
 • C = ∞ enforces all constraints: hard margin 
 adopted from Andrew Zisserman
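As a rough illustration of this trade-off (my addition, not from the slides), here is a minimal scikit-learn sketch comparing the geometric margin width 2/||w|| and the number of support vectors for a small and a large C on synthetic data; the dataset and the two C values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, slightly overlapping two-class data (arbitrary choice for illustration).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)        # geometric margin width
    n_sv = len(clf.support_vectors_)        # points on or inside the margin
    print(f"C={C:g}: margin width = {margin:.3f}, #support vectors = {n_sv}")
```

Small C tolerates slack and gives a wide margin with many support vectors; large C approaches the hard-margin behavior.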

  8. Last time… Multi-class SVM 
 • Simultaneously learn 3 sets of weights: $w^+$, $w^-$, $w^o$ 
 • How do we guarantee the correct labels? 
 • Need new constraints! The "score" of the correct class must be better than the "score" of the wrong classes. 
 slide by Eric Xing

  9. Last time… Multi-class SVM 
 • As for the binary SVM, we introduce slack variables and maximize the margin. 
 • To predict, we use the class whose score is highest. 
 • Now, can we learn it? 
 slide by Eric Xing

  10. Last time… Kernels • Original data • Data in feature space (implicit) • Solve in feature space using kernels slide by Alex Smola 10

  11. Last time… Quadratic Features 
 Quadratic features in $\mathbb{R}^2$: $\Phi(x) := \left( x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \right)$ 
 Dot product: $\langle \Phi(x), \Phi(x') \rangle = \left\langle \left( x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \right), \left( x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2 \right) \right\rangle = \langle x, x' \rangle^2.$ 
 Insight: the trick works for polynomials of any order $d$ via $\langle x, x' \rangle^d$. 
 slide by Alex Smola
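A quick numeric sanity check of this identity (my addition, not from the slides): a minimal NumPy sketch for 2-dimensional inputs.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for x in R^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(xp)        # dot product computed explicitly in feature space
rhs = (x @ xp) ** 2           # the same value computed implicitly in input space
print(np.isclose(lhs, rhs))   # True
```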

  12. Computational Efficiency 
 Problem: Extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions already give roughly $5 \cdot 10^5$ numbers; for higher-order polynomial features it gets much worse. 
 Solution: Don't compute the features; compute the dot products implicitly. For some features this works… 
 Definition: A kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function of its arguments for which $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ holds for some feature map $\Phi$. 
 This pays off if $k(x, x')$ is much cheaper to compute than $\Phi(x)$… 
 slide by Alex Smola

  13. Last time… Example kernels 
 Examples of kernels $k(x, x')$: 
 • Linear: $\langle x, x' \rangle$ 
 • Laplacian RBF: $\exp\left( -\lambda \|x - x'\| \right)$ 
 • Gaussian RBF: $\exp\left( -\lambda \|x - x'\|^2 \right)$ 
 • Polynomial: $\left( \langle x, x' \rangle + c \right)^d$, $c \geq 0$, $d \in \mathbb{N}$ 
 • B-Spline: $B_{2n+1}(x - x')$ 
 • Cond. Expectation: $\mathbb{E}_c\left[ p(x \mid c)\, p(x' \mid c) \right]$ 
 Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative. 
 slide by Alex Smola
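As a side note (my addition, not from the slides), a minimal NumPy sketch of a few of these kernels. It also checks positive semidefiniteness of a small Gaussian-RBF Gram matrix numerically, a brute-force alternative to the Fourier-transform trick; all parameter values are arbitrary.

```python
import numpy as np

def linear(x, xp):
    return x @ xp

def gaussian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.sum((x - xp) ** 2))

def polynomial(x, xp, c=1.0, d=3):
    return (x @ xp + c) ** d

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
print("linear:", linear(X[0], X[1]), "polynomial:", polynomial(X[0], X[1]))

# Gram matrix of the Gaussian RBF kernel on the sample.
K = np.array([[gaussian_rbf(a, b) for b in X] for a in X])

# A valid (Mercer) kernel must yield a positive semidefinite Gram matrix.
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to round-off
```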

  14. Today • The Kernel Trick for SVMs • Risk and Loss • Support Vector Regression 14

  15. The Kernel Trick for SVMs slide by Alex Smola

  16. The Kernel Trick for SVMs 
 • Linear soft margin problem: 
 $\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i \left[ \langle w, x_i \rangle + b \right] \geq 1 - \xi_i$ and $\xi_i \geq 0$ 
 • Dual problem: 
 $\max_{\alpha} \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$ 
 • Support vector expansion: 
 $f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$ 
 slide by Alex Smola

  17. The Kernel Trick for SVMs 
 • Soft margin problem, now in feature space: 
 $\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i \left[ \langle w, \phi(x_i) \rangle + b \right] \geq 1 - \xi_i$ and $\xi_i \geq 0$ 
 • Dual problem: 
 $\max_{\alpha} \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$ 
 • Support vector expansion: 
 $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$ 
 slide by Alex Smola
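To connect this to practice (my addition, not from the slides): scikit-learn's SVC solves exactly this dual, and its dual_coef_ attribute stores the products $\alpha_i y_i$ for the support vectors, so the decision function can be reproduced from the support vector expansion. The dataset and kernel parameters below are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

gamma = 0.5
clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)

def decision(x):
    """f(x) = sum_i alpha_i y_i k(x_i, x) + b, rebuilt from the fitted model."""
    sv = clf.support_vectors_            # the x_i with alpha_i > 0
    coef = clf.dual_coef_[0]             # alpha_i * y_i
    k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))  # RBF kernel values
    return coef @ k + clf.intercept_[0]

x_test = X[0]
print(np.isclose(decision(x_test), clf.decision_function([x_test])[0]))  # True
```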

  18.–46. [Figure slides: decision boundary (y = 0), margins (y = +1, y = -1) and support vectors of a kernel SVM as C is varied over 1, 2, 5, 10, 20, 50, 100, repeated for several data/kernel settings] slide by Alex Smola

  47.–51. And now with a narrower kernel. [Figure slides: resulting decision boundaries] slide by Alex Smola

  52.–53. And now with a very wide kernel. [Figure slide: resulting decision boundary] slide by Alex Smola

  54. Nonlinear Separation 
 • Increasing C allows for more nonlinearities 
 • Decreases the number of errors 
 • The SV decision boundary need not be contiguous 
 • Kernel width adjusts the function class 
 slide by Alex Smola

  55. Overfitting? 
 • Huge feature space with kernels: should we worry about overfitting? 
 • The SVM objective seeks a solution with a large margin; theory says that a large margin leads to good generalization (we will see this in a couple of lectures). 
 • But everything overfits sometimes! 
 • We can control overfitting by setting C, choosing a better kernel, and varying the parameters of the kernel (width of the Gaussian, etc.). 
 slide by Alex Smola
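As a practical follow-up (my addition, not from the slides), a minimal sketch of how these knobs are usually set: cross-validated grid search over C and the Gaussian kernel width with scikit-learn. The dataset and grid values are arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# Search over the regularization strength C and the RBF width gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```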

  56. Risk and Loss 
 slide by Alex Smola

  57. Loss function point of view 
 • Constrained quadratic program: 
 $\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i \left[ \langle w, x_i \rangle + b \right] \geq 1 - \xi_i$ and $\xi_i \geq 0$ 
 • Risk minimization setting: 
 $\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_i \max\left[ 0, 1 - y_i \left( \langle w, x_i \rangle + b \right) \right]$, where the sum is the empirical risk. 
 This follows from plugging in the minimal feasible slack variable for a given $(w, b)$ pair. 
 slide by Alex Smola
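To make the "minimal slack" argument concrete (my addition, not from the slides), a short NumPy sketch of the unconstrained objective: for any fixed (w, b), the hinge term is the smallest slack satisfying both constraints, so the two formulations take the same value. The data and the (w, b) pair are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = rng.choice([-1, 1], size=20)
C = 1.0

def svm_objective(w, b):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i(<w, x_i> + b))."""
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)  # optimal xi_i for this fixed (w, b)
    return 0.5 * (w @ w) + C * slacks.sum()

print(svm_objective(np.array([0.5, -1.0]), 0.2))
```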

  58. Soft margin as proxy for binary 
 • Soft margin loss: $\max(0, 1 - y f(x))$ 
 • Binary (0/1) loss: $\mathbb{1}\{ y f(x) < 0 \}$ 
 The soft margin loss is a convex upper bound on the binary loss (plotted as a function of the margin $y f(x)$). 
 slide by Alex Smola

  59. More loss functions 
 • Logistic: $\log\left[ 1 + e^{-f(x)} \right]$ 
 • Huberized loss: $0$ if $f(x) > 1$; $\frac{1}{2}(1 - f(x))^2$ if $f(x) \in [0, 1]$; $\frac{1}{2} - f(x)$ if $f(x) < 0$ 
 • Soft margin: $\max(0, 1 - f(x))$ 
 (The plot shows these losses as functions of $f(x)$: asymptotically linear on the left, asymptotically 0 on the right.) 
 slide by Alex Smola
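For concreteness (my addition, not from the slides), a minimal NumPy sketch of these three surrogate losses exactly as written above, vectorized over the margin value f(x):

```python
import numpy as np

def logistic_loss(f):
    return np.log1p(np.exp(-f))

def huberized_loss(f):
    f = np.asarray(f, dtype=float)
    return np.where(f > 1, 0.0,
           np.where(f < 0, 0.5 - f, 0.5 * (1.0 - f) ** 2))

def hinge_loss(f):
    return np.maximum(0.0, 1.0 - f)

f = np.linspace(-2.0, 2.0, 5)
for name, loss in [("logistic", logistic_loss),
                   ("huberized", huberized_loss),
                   ("hinge", hinge_loss)]:
    print(name, np.round(loss(f), 3))
```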

  60. Risk minimization view 
 • Find the function f minimizing the classification error: 
 $R[f] := \mathbb{E}_{(x,y) \sim p(x,y)}\left[ \mathbb{1}\{ y f(x) < 0 \} \right]$ 
 • Compute the empirical average: 
 $R_{\mathrm{emp}}[f] := \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{ y_i f(x_i) < 0 \}$ 
 - Minimization is nonconvex 
 - Overfitting as we minimize the empirical error 
 • Compute a convex upper bound on the loss 
 • Add regularization for capacity control: 
 $R_{\mathrm{reg}}[f] := \frac{1}{m} \sum_{i=1}^{m} \max(0, 1 - y_i f(x_i)) + \lambda \Omega[f]$ 
 (the second term is the regularizer; how do we control λ?) 
 slide by Alex Smola
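A minimal sketch of this regularized empirical risk for a linear model $f(x) = \langle w, x \rangle + b$ with $\Omega[f] = \frac{1}{2}\|w\|^2$ (my choice of regularizer for illustration, not from the slides; the data and λ are arbitrary):

```python
import numpy as np

def regularized_risk(w, b, X, y, lam):
    """(1/m) * sum_i max(0, 1 - y_i f(x_i)) + lam * 0.5 * ||w||^2 for f(x) = <w, x> + b."""
    margins = y * (X @ w + b)
    empirical = np.mean(np.maximum(0.0, 1.0 - margins))  # hinge surrogate risk
    return empirical + lam * 0.5 * (w @ w)                # capacity control term

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))
print(regularized_risk(np.array([1.0, 0.0]), 0.0, X, y, lam=0.1))
```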

  61. Support Vector Regression
