Support Vector Machines & Kernels, Lecture 5



  1. Support Vector Machines & Kernels, Lecture 5. David Sontag, New York University. Slides adapted from Luke Zettlemoyer and Carlos Guestrin.

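  2. Support Vector Machines. QP form vs. the more “natural” form, which is the sum of a regularization term and an empirical loss term. The two forms are equivalent if their regularization constants are matched.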
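
The formulas for the two forms are not reproduced in this transcript. As a hedged sketch, one common way to write them is shown here, assuming m training examples, no bias term, and the constants C and λ; this notation is an assumption and may differ from the slide's own.

```latex
% QP form (soft-margin SVM with slack variables \xi_j); notation assumed
\min_{\mathbf{w},\, \xi} \;\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{j=1}^{m} \xi_j
\quad \text{s.t.} \quad y_j (\mathbf{w} \cdot \mathbf{x}_j) \ge 1 - \xi_j, \;\; \xi_j \ge 0

% More "natural" unconstrained form: regularization + empirical (hinge) loss
\min_{\mathbf{w}} \;\;
\underbrace{\frac{\lambda}{2}\|\mathbf{w}\|^2}_{\text{regularization}}
+ \underbrace{\frac{1}{m} \sum_{j=1}^{m} \max\bigl(0,\; 1 - y_j (\mathbf{w} \cdot \mathbf{x}_j)\bigr)}_{\text{empirical loss}}

% The two have the same minimizer when
C = \frac{1}{\lambda m}
```

Dividing the second objective by λ gives the first with each slack variable set to its smallest feasible value, which is where the relation between C and λ comes from.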

  3. Subgradient method

  4. Subgradient method Step size:
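
A minimal sketch of the subgradient method applied to the regularized hinge-loss objective, assuming the step size η_t = 1/(λt) (the choice used by Pegasos; the slide's own step-size formula is not reproduced here). The function names and the toy data are hypothetical and only for illustration.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def objective(w, X, y, lam):
    """f(w) = lam/2 * ||w||^2 + (1/m) * sum of hinge losses."""
    hinge = sum(max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y))
    return 0.5 * lam * dot(w, w) + hinge / len(X)


def subgradient(w, X, y, lam):
    """One subgradient of f at w: lam*w minus the average of y_i*x_i
    over the examples whose margin is below 1."""
    g = [lam * wj for wj in w]
    m = len(X)
    for xi, yi in zip(X, y):
        if yi * dot(w, xi) < 1.0:
            for j, xij in enumerate(xi):
                g[j] -= yi * xij / m
    return g


def subgradient_method(X, y, lam=0.1, steps=200) -> List[float]:
    w = [0.0] * len(X[0])
    best_w, best_f = list(w), objective(w, X, y, lam)
    for t in range(1, steps + 1):
        eta = 1.0 / (lam * t)  # assumed step size
        g = subgradient(w, X, y, lam)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
        # Subgradient steps are not guaranteed to decrease f, so keep the best iterate.
        f = objective(w, X, y, lam)
        if f < best_f:
            best_w, best_f = list(w), f
    return best_w


# Hypothetical, linearly separable toy data.
X_toy = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y_toy = [1, 1, -1, -1]
w_toy = subgradient_method(X_toy, y_toy)
```

Every iteration touches all m examples, which is the cost the stochastic variant on the following slides avoids.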

  5. Stochastic subgradient 1 Subgradient

  6. PEGASOS. Choosing A_t = S gives the subgradient method; choosing |A_t| = 1 gives stochastic gradient. Each iteration consists of a subgradient step followed by a projection step.
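
The |A_t| = 1 case can be sketched as follows. This is an illustrative version rather than the slide's exact pseudocode: it assumes the step size η_t = 1/(λt), no bias term, and a projection onto the ball of radius 1/√λ. The function name, defaults, and toy data are hypothetical.

```python
import math
import random


def pegasos(X, y, lam=0.1, steps=1000, seed=0):
    """Stochastic subgradient descent on the SVM objective with |A_t| = 1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    radius = 1.0 / math.sqrt(lam)
    for t in range(1, steps + 1):
        i = rng.randrange(len(X))  # A_t = {i}: a single random example
        eta = 1.0 / (lam * t)      # assumed step size
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        # Subgradient step: shrink w (regularizer), then add the hinge term
        # only if example i violated the margin at the current w.
        w = [(1.0 - eta * lam) * wj for wj in w]
        if margin < 1.0:
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
        # Projection step: rescale w back into the ball ||w|| <= 1/sqrt(lam).
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm > radius:
            w = [wj * radius / norm for wj in w]
    return w


# Hypothetical, linearly separable toy data.
X_toy = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y_toy = [1, 1, -1, -1]
w_peg = pegasos(X_toy, y_toy)
```

Each iteration costs time proportional to the number of features only, independent of the number of examples.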

  7. Run-Time of Pegasos • Choosing |A_t| = 1 → run-time required for Pegasos to find an ε-accurate solution with probability ≥ 1 − δ (n = number of features) • Run-time does not depend on the number of examples • Depends on the “difficulty” of the problem (λ and ε)

  8. Experiments • 3 datasets (provided by Joachims) – Reuters CCAT (800K examples, 47k features) – Physics ArXiv (62k examples, 100k features) – Covertype (581k examples, 54 features)

     Training time (in seconds):

                      Pegasos   SVM-Perf   SVM-Light
     Reuters                2         77      20,075
     Covertype              6         85      25,514
     Astro-Physics          2          5          80

  9. What’s Next! • Learn one of the most interesting and exciting recent advancements in machine learning – The “kernel trick” – High dimensional feature spaces at no extra cost • But first, a detour – Constrained optimization!

  10. Constrained optimization. Minimizing the same objective under three settings: no constraint, x* = 0; constraint x ≥ −1, x* = 0; constraint x ≥ 1, x* = 1. How do we solve with constraints? → Lagrange multipliers!

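  11. Lagrange multipliers – Dual variables
      Rewrite the constraint as $(x - b) \ge 0$, add a Lagrange multiplier $\alpha$, and introduce the Lagrangian (objective): $L(x, \alpha) = x^2 - \alpha (x - b)$.
      We will solve: $\min_x \max_{\alpha} L(x, \alpha)$, adding the new constraint $\alpha \ge 0$.
      Why is this equivalent? min is fighting max!
      • $x < b \Rightarrow (x - b) < 0 \Rightarrow \max_\alpha -\alpha (x - b) = \infty$. min won’t let this happen!
      • $x > b,\ \alpha \ge 0 \Rightarrow (x - b) > 0 \Rightarrow \max_\alpha -\alpha (x - b) = 0$, $\alpha^* = 0$. min is cool with 0, and $L(x, \alpha) = x^2$ (original objective).
      • $x = b \Rightarrow \alpha$ can be anything, and $L(x, \alpha) = x^2$ (original objective).
      The min on the outside forces max to behave, so constraints will be satisfied.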
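
The min-max argument can be checked numerically on the one-dimensional problem of minimizing x² subject to x ≥ b. The sketch below is illustrative only: the function names, the grid search, and the grid bounds are assumptions, and the closed-form multiplier comes from the stationarity condition 2x − α = 0.

```python
import math


def lagrangian(x: float, alpha: float, b: float) -> float:
    return x * x - alpha * (x - b)


def inner_max(x: float, b: float) -> float:
    """Value of max over alpha >= 0 of L(x, alpha)."""
    if x < b:
        return math.inf  # alpha -> infinity makes -alpha*(x - b) unbounded
    return x * x         # best alpha is 0 (any alpha works when x == b)


def solve_on_grid(b: float, lo: float = -3.0, hi: float = 3.0, n: int = 6001) -> float:
    """Outer min over x, by brute force on a uniform grid."""
    grid = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
    return min(grid, key=lambda x: inner_max(x, b))


def kkt_solution(b: float):
    """Closed form: x* = max(b, 0), alpha* = 2 * x* when the constraint is active."""
    x_star = max(b, 0.0)
    alpha_star = 2.0 * x_star if b > 0 else 0.0
    return x_star, alpha_star
```

With b = 1 the constraint is active and the minimizer moves to x = 1; with b = −1 the constraint is inactive, the multiplier is 0, and the unconstrained minimizer x = 0 is recovered.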

  12. Dual SVM derivation (1) – the linearly separable case (hard margin SVM). Start from the original optimization problem, rewrite the constraints, and introduce one Lagrange multiplier per example to form the Lagrangian. Our goal now is to solve the resulting min-max problem over the Lagrangian.
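
The equations for this step are not reproduced in the transcript. A standard way to write them, assuming the usual hard-margin notation with weights w, bias b, and one multiplier α_j per example, is sketched here; the slide's own notation may differ.

```latex
% Original (primal) hard-margin problem
\min_{\mathbf{w},\, b} \;\; \frac{1}{2}\|\mathbf{w}\|^2
\quad \text{s.t.} \quad y_j (\mathbf{w} \cdot \mathbf{x}_j + b) \ge 1 \;\; \forall j

% Rewrite each constraint as  y_j (w . x_j + b) - 1 >= 0  and attach alpha_j >= 0
L(\mathbf{w}, b, \alpha) = \frac{1}{2}\|\mathbf{w}\|^2
- \sum_j \alpha_j \bigl[\, y_j (\mathbf{w} \cdot \mathbf{x}_j + b) - 1 \,\bigr]

% Goal
\min_{\mathbf{w},\, b} \; \max_{\alpha \ge 0} \; L(\mathbf{w}, b, \alpha)
```

This follows the same pattern as the one-dimensional example: each constraint g(x) ≥ 0 contributes a term −α g(x) with α ≥ 0.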

  13. Dual SVM derivation (2) – the linearly separable case (hard margin SVM). (Primal) → swap min and max → (Dual). Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!

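  14. Dual SVM derivation (3) – the linearly separable case (hard margin SVM). (Dual) Can solve for the optimal w, b as a function of α: $\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_j \alpha_j y_j \mathbf{x}_j$, and setting this to zero gives $\mathbf{w} = \sum_j \alpha_j y_j \mathbf{x}_j$. Substituting these values back in (and simplifying), we obtain the (Dual) in terms of α alone: the sums run over all training examples, the $\alpha_j$ and $y_j$ are scalars, and the inputs appear only through a dot product.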
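
A hedged reconstruction of the substitution step, assuming the standard hard-margin Lagrangian $L(\mathbf{w}, b, \alpha) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_j \alpha_j [y_j(\mathbf{w} \cdot \mathbf{x}_j + b) - 1]$ (the final dual formula is not reproduced in this transcript, so the form below is the textbook one):

```latex
% Stationarity in w and b
\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_j \alpha_j y_j \mathbf{x}_j = 0
\;\;\Rightarrow\;\; \mathbf{w} = \sum_j \alpha_j y_j \mathbf{x}_j
\qquad
\frac{\partial L}{\partial b} = -\sum_j \alpha_j y_j = 0

% Substituting back into L gives the dual problem
\max_{\alpha \ge 0} \;\; \sum_i \alpha_i
- \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0
```

The training inputs enter only through the dot products $\mathbf{x}_i \cdot \mathbf{x}_j$, which is the property the kernel trick relies on.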

  15. Reminder: What if the data is not linearly separable? Use features of features of features of features… $\phi(x) = \left( x^{(1)}, \ldots, x^{(n)}, x^{(1)} x^{(2)}, x^{(1)} x^{(3)}, \ldots, e^{x^{(1)}}, \ldots \right)^\top$ Feature space can get really large really quickly!

  16. Higher order polynomials (m – input features, d – degree of polynomial). [Plot: number of monomial terms vs. number of input dimensions, with curves for d = 2, d = 3, d = 4.] Grows fast! For d = 6, m = 100: about 1.6 billion terms.
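
The "about 1.6 billion" figure can be reproduced by counting the distinct monomials of degree exactly d in m variables, which is the binomial coefficient C(m + d − 1, d). That this is the count plotted on the slide is an assumption, though it matches the quoted number; the function name is hypothetical.

```python
from math import comb


def num_monomials(m: int, d: int) -> int:
    """Number of distinct monomials of degree exactly d in m input features."""
    return comb(m + d - 1, d)


terms = num_monomials(100, 6)  # 1,609,344,100, i.e. about 1.6 billion
```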

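  17. Dual formulation only depends on dot-products of the features! First, we introduce a feature mapping φ, so that each dot product $\mathbf{x}_i \cdot \mathbf{x}_j$ becomes $\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$. Next, replace the dot product with an equivalent kernel function: $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$.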

  18. Polynomial kernel
      d = 1: $\phi(u) \cdot \phi(v) = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \cdot \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2 = u \cdot v$
      d = 2: $\phi(u) \cdot \phi(v) = \begin{pmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{pmatrix} \cdot \begin{pmatrix} v_1^2 \\ v_1 v_2 \\ v_2 v_1 \\ v_2^2 \end{pmatrix} = u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2 = (u_1 v_1 + u_2 v_2)^2 = (u \cdot v)^2$
      For any d (we will skip proof): $\phi(u) \cdot \phi(v) = (u \cdot v)^d$, where φ contains the polynomials of degree exactly d.
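
The identity can be checked numerically. The sketch below builds the explicit feature map as all ordered products of d coordinates (so u_1 u_2 and u_2 u_1 both appear, as in the d = 2 vector on the slide) and compares its dot product with (u · v)^d. The function names are hypothetical.

```python
from itertools import product
from math import prod


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(u, d):
    """Explicit feature map: one entry per ordered index tuple of length d."""
    return [prod(u[i] for i in idx) for idx in product(range(len(u)), repeat=d)]


def poly_kernel(u, v, d):
    """Same value as dot(phi(u, d), phi(v, d)), without building phi."""
    return dot(u, v) ** d


u, v = (1, 2), (3, 4)
explicit = dot(phi(u, 2), phi(v, 2))  # works in a 4-dimensional feature space
implicit = poly_kernel(u, v, 2)       # works in the original 2 dimensions
```

The explicit map has m^d entries, while the kernel needs one m-dimensional dot product and a power, which is the "no extra cost" point made earlier in the lecture.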

  19. Common kernels • Polynomials of degree exactly d • Polynomials of degree up to d • Gaussian kernels • Sigmoid • And many others: very active area of research!
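
The kernel formulas are not reproduced in this transcript. The sketch below uses commonly cited forms for each kernel in the list; the exact parameterization (the +1 offset, the bandwidth σ, and the sigmoid parameters η and ν) is an assumption and may differ from the slide.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_exact(u, v, d):
    """Polynomials of degree exactly d: (u . v)^d."""
    return dot(u, v) ** d


def poly_up_to(u, v, d):
    """Polynomials of degree up to d: (u . v + 1)^d."""
    return (dot(u, v) + 1) ** d


def gaussian(u, v, sigma=1.0):
    """Gaussian kernel: exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid(u, v, eta=1.0, nu=0.0):
    """Sigmoid kernel: tanh(eta * (u . v) + nu)."""
    return math.tanh(eta * dot(u, v) + nu)
```

Any of these can stand in for the dot product in the dual formulation, since that formulation depends on the inputs only through dot products.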
