Learning From Data, Lecture 25: The Kernel Trick


  1. Learning From Data, Lecture 25: The Kernel Trick. Topics: learning with only inner products; the kernel. M. Magdon-Ismail, CSCI 4100/6100.

  2. Recap: large margin is better; controlling overfitting; non-separable data.
     Soft-margin SVM: $\min_{b,w,\xi}\ \frac{1}{2}w^t w + C\sum_{n=1}^{N}\xi_n$ subject to $y_n(w^t x_n + b) \ge 1 - \xi_n$ and $\xi_n \ge 0$ for $n = 1,\dots,N$.
     [Figure: $E_{\mathrm{out}}$ (roughly 0.04 to 0.08) versus $\gamma(\text{random hyperplane})/\gamma(\text{SVM})$ from 0 to 1, comparing a random separating hyperplane with the SVM. Further panels: $\Phi_2$ + SVM, $\Phi_3$ + SVM, $\Phi_3$ + pseudoinverse algorithm.]
     Theorem: $d_{\mathrm{vc}}(\gamma) \le \lceil R^2/\gamma^2 \rceil + 1$. Also $E_{\mathrm{cv}} \le \frac{\#\,\text{support vectors}}{N}$: a complex hypothesis that does not overfit because it is 'simple', controlled by only a few support vectors.

  3. Recall: mechanics of the nonlinear transform.
     $\mathcal{X}$-space is $\mathbb{R}^d$; $\mathcal{Z}$-space is $\mathbb{R}^{\tilde d}$. The transform maps $x = (1, x_1, \dots, x_d)$ to $z = \Phi(x) = (1, \Phi_1(x), \dots, \Phi_{\tilde d}(x))$.
     (1) Start with the original data $x_1, \dots, x_N$ with labels $y_1, \dots, y_N$ in $\mathcal{X}$; (2) transform the data, $z_n = \Phi(x_n) \in \mathcal{Z}$ (labels unchanged); (3) separate the data in $\mathcal{Z}$-space with weights $\tilde w = (\tilde w_0, \tilde w_1, \dots, \tilde w_{\tilde d})$, giving $\tilde g(z) = \mathrm{sign}(\tilde w^t z)$; (4) classify in $\mathcal{X}$-space via '$\Phi^{-1}$': $g(x) = \tilde g(\Phi(x)) = \mathrm{sign}(\tilde w^t \Phi(x))$.
     There are no weights in $\mathcal{X}$-space; in $\mathcal{Z}$-space, $d_{\mathrm{vc}} = \tilde d + 1$. You have to transform the data to the $\mathcal{Z}$-space.
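To make the four steps concrete, here is a minimal Python sketch of the transform-then-classify pipeline (my own, not from the lecture). The quadratic transform `phi` and the least-squares `fit_linear_separator` are illustrative assumptions; any linear classifier trained in $\mathcal{Z}$-space, such as the SVM of this lecture, fits in the same slot.

```python
import numpy as np

def phi(x):
    """Illustrative 2nd-order transform: (x1, x2) -> (x1, x2, x1^2, x1*x2, x2^2).
    The leading 1 of z is supplied separately as the bias coordinate."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x1 * x2, x2 * x2])

def fit_linear_separator(Z, y):
    """Stand-in for any linear classifier in Z-space (e.g. SVM or the
    pseudoinverse algorithm); here, minimum-norm least squares on (1, z)."""
    Za = np.column_stack([np.ones(len(Z)), Z])   # prepend the 1 coordinate
    return np.linalg.pinv(Za) @ y                # w~ = Za^+ y

# 1. original data in X-space (the toy data set of slide 9)
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])

# 2. transform the data: z_n = Phi(x_n)
Z = np.array([phi(x) for x in X])

# 3. separate the data in Z-space: g~(z) = sign(w~^t z)
w_tilde = fit_linear_separator(Z, y)

# 4. classify in X-space: g(x) = g~(Phi(x)) = sign(w~^t Phi(x))
def g(x):
    return np.sign(w_tilde @ np.concatenate(([1.0], phi(x))))

print([g(x) for x in X])   # should reproduce the labels y
```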

  4. This lecture: how to use nonlinear transforms without physically transforming the data to $\mathcal{Z}$-space.

  5. Primal versus dual.
     Primal: $\min_{b,w}\ \frac{1}{2}w^t w$ subject to $y_n(w^t x_n + b) \ge 1$ for $n = 1,\dots,N$; $d+1$ optimization variables $(b, w)$; hypothesis $g(x) = \mathrm{sign}(w^t x + b)$.
     Dual: $\min_{\alpha}\ \frac{1}{2}\sum_{n,m=1}^{N}\alpha_n\alpha_m y_n y_m (x_n^t x_m) - \sum_{n=1}^{N}\alpha_n$ subject to $\sum_{n=1}^{N}\alpha_n y_n = 0$ and $\alpha_n \ge 0$ for $n = 1,\dots,N$; $N$ optimization variables $\alpha$.
     From the dual solution: $w^* = \sum_{n=1}^{N}\alpha_n^* y_n x_n$ (only the support vectors, with $\alpha_n^* > 0$, contribute) and $b^* = y_s - w^{*t} x_s$ for any $s$ with $\alpha_s^* > 0$, so $g(x) = \mathrm{sign}(w^{*t} x + b^*) = \mathrm{sign}\!\left(\sum_{n=1}^{N}\alpha_n^* y_n\, x_n^t(x - x_s) + y_s\right)$.

  6. Primal versus dual, matrix-vector form.
     Primal: as above, $\min_{b,w}\ \frac{1}{2}w^t w$ subject to $y_n(w^t x_n + b) \ge 1$ for $n = 1,\dots,N$; $d+1$ variables.
     Dual: $\min_{\alpha}\ \frac{1}{2}\alpha^t G\alpha - 1^t\alpha$, where $G_{nm} = y_n y_m x_n^t x_m$, subject to $y^t\alpha = 0$ and $\alpha \ge 0$; $N$ variables. The recovery of $w^*$, $b^*$ and the final hypothesis $g(x) = \mathrm{sign}(w^{*t} x + b^*)$ are unchanged.
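A small NumPy sketch (my own, not from the slides) of how the quadratic-coefficient matrix $G$ is assembled; it uses the fact that $G = X_s X_s^t$, where $X_s$ is the signed data matrix whose $n$-th row is $y_n x_n^t$.

```python
import numpy as np

def signed_gram(X, y):
    """G_nm = y_n y_m x_n^t x_m, i.e. G = Xs Xs^t with Xs the signed data matrix."""
    Xs = y[:, None] * X      # row n is y_n * x_n^t
    return Xs @ Xs.T

# Toy data set used later on slide 9
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(signed_gram(X, y))     # matches the G computed on slide 9
```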

  7. Deriving the dual: the Lagrangian.
     $\mathcal{L}(b, w, \alpha) = \frac{1}{2}w^t w + \sum_{n=1}^{N}\alpha_n\bigl(1 - y_n(w^t x_n + b)\bigr)$, where the $\alpha_n$ are the Lagrange multipliers attached to the constraints. Minimize w.r.t. $(b, w)$, which is unconstrained; maximize w.r.t. $\alpha \ge 0$.
     Intuition: if $1 - y_n(w^t x_n + b) > 0$, then $\alpha_n \to \infty$ gives $\mathcal{L} \to \infty$; since $(b, w)$ is chosen to minimize $\mathcal{L}$, this forces $1 - y_n(w^t x_n + b) \le 0$. If $1 - y_n(w^t x_n + b) < 0$, then maximizing $\mathcal{L}$ w.r.t. $\alpha_n$ gives $\alpha_n = 0$ (the non-support vectors).
     Conclusion: at the optimum $\alpha_n\bigl(y_n(w^t x_n + b) - 1\bigr) = 0$, so $\mathcal{L} = \frac{1}{2}w^t w$ is minimized and the constraints $1 - y_n(w^t x_n + b) \le 0$ are satisfied. Formally: use the KKT conditions to transform the primal.

  8. Unconstrained minimization w.r.t. $(b, w)$.
     $\mathcal{L} = \frac{1}{2}w^t w - \sum_{n=1}^{N}\alpha_n\bigl(y_n(w^t x_n + b) - 1\bigr)$.
     Set $\partial\mathcal{L}/\partial b = 0$: $-\sum_{n=1}^{N}\alpha_n y_n = 0 \ \Rightarrow\ \sum_{n=1}^{N}\alpha_n y_n = 0$.
     Set $\partial\mathcal{L}/\partial w = 0$: $w - \sum_{n=1}^{N}\alpha_n y_n x_n = 0 \ \Rightarrow\ w = \sum_{n=1}^{N}\alpha_n y_n x_n$.
     Substituting back: $\mathcal{L} = \frac{1}{2}w^t w - w^t\sum_{n=1}^{N}\alpha_n y_n x_n - b\sum_{n=1}^{N}\alpha_n y_n + \sum_{n=1}^{N}\alpha_n = -\frac{1}{2}w^t w + \sum_{n=1}^{N}\alpha_n = -\frac{1}{2}\sum_{n,m=1}^{N}\alpha_n\alpha_m y_n y_m\, x_n^t x_m + \sum_{n=1}^{N}\alpha_n$.
     Maximizing this over $\alpha \ge 0$ is equivalent to: minimize $\frac{1}{2}\alpha^t G\alpha - 1^t\alpha$ (with $G_{nm} = y_n y_m x_n^t x_m$) subject to $y^t\alpha = 0$, $\alpha \ge 0$. Then $w = \sum_{n=1}^{N}\alpha_n^* y_n x_n$, and $\alpha_s > 0 \Rightarrow y_s(w^t x_s + b) - 1 = 0 \Rightarrow b = y_s - w^t x_s$.

  9. Example: our toy data set.
     Data, signed data matrix, and $G$: $X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}$, $y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$, $X_s = \begin{bmatrix} 0 & 0 \\ -2 & -2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}$, $G = X_s X_s^t = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 8 & -4 & -6 \\ 0 & -4 & 4 & 6 \\ 0 & -6 & 6 & 9 \end{bmatrix}$.
     Cast the dual SVM, $\min_\alpha \frac{1}{2}\alpha^t G\alpha - 1^t\alpha$ s.t. $y^t\alpha = 0$, $\alpha \ge 0$, as a generic quadratic program $\min_u \frac{1}{2}u^t Q u + p^t u$ s.t. $Au \ge c$, with $u = \alpha$, $Q = G$, $p = -1_N$, $A = \begin{bmatrix} y^t \\ -y^t \\ I_N \end{bmatrix}$, $c = 0_{N+2}$.
     Running $\mathrm{QP}(Q, p, A, c)$ gives $\alpha^* = \left(\tfrac{1}{2}, \tfrac{1}{2}, 1, 0\right)$, so $w = \sum_{n=1}^{N}\alpha_n^* y_n x_n = (1, -1)$, $b = y_1 - w^t x_1 = -1$ (the separator is $x_1 - x_2 - 1 = 0$), and the margin is $\gamma = 1/\lVert w\rVert = 1/\sqrt{2}$.
     [Figure: the four data points with the separator $x_1 - x_2 - 1 = 0$, each point labeled with its $\alpha_n^*$.]
     Non-support vectors have $\alpha_n = 0$; only support vectors can have $\alpha_n > 0$.
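The numbers on this slide can be checked directly; here is a short verification sketch assuming NumPy (the value of $\alpha^*$ is taken from the slide, not recomputed):

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = np.array([0.5, 0.5, 1.0, 0.0])       # alpha* from the QP on this slide

# Dual constraints: y^t alpha = 0 and alpha >= 0
assert np.isclose(y @ alpha, 0.0) and np.all(alpha >= 0)

# Recover the separator: w = sum_n alpha_n y_n x_n and b = y_s - w^t x_s (alpha_s > 0)
w = (alpha * y) @ X                          # -> (1, -1)
b = y[0] - w @ X[0]                          # -> -1, i.e. x1 - x2 - 1 = 0
gamma = 1.0 / np.linalg.norm(w)              # margin 1/sqrt(2)

# y_n (w^t x_n + b): the support vectors sit exactly on the margin (value 1),
# the non-support vector (alpha_4 = 0) lies strictly outside it (value 2)
print(w, b, gamma, y * (X @ w + b))
```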

  10. Dual QP algorithm for hard-margin linear SVM.
     1: Input: $X$, $y$.
     2: Let $p = -1_N$ (the $N$-vector of $-1$'s) and $c = 0_{N+2}$ (the $(N+2)$-vector of zeros). Construct the signed data matrix $X_s$ with rows $y_n x_n^t$, then $Q = X_s X_s^t$ and $A = \begin{bmatrix} y^t \\ -y^t \\ I_{N\times N} \end{bmatrix}$, so that the dual constraints $y^t\alpha = 0$, $\alpha \ge 0$ become $A\alpha \ge c$. (Some packages allow equality and bound constraints directly and can solve this type of QP without stacking $\pm y^t$.)
     3: $\alpha^* \leftarrow \mathrm{QP}(Q, p, A, c)$.
     4: Return $w^* = \sum_{\alpha_n^* > 0} \alpha_n^* y_n x_n$ and $b^* = y_s - w^{*t} x_s$ for some $s$ with $\alpha_s^* > 0$.
     5: The final hypothesis is $g(x) = \mathrm{sign}(w^{*t} x + b^*)$.
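A runnable sketch of this algorithm, assuming the cvxopt QP solver (any QP package works; cvxopt accepts the equality constraint $y^t\alpha = 0$ directly, as the note in step 2 allows, so $\pm y^t$ need not be stacked into $A$). The tiny ridge added to $Q$ is a common numerical safeguard, not part of the algorithm.

```python
import numpy as np
from cvxopt import matrix, solvers           # assumed dependency: pip install cvxopt

def hard_margin_svm_dual(X, y):
    """Dual QP for the hard-margin linear SVM, solved with a generic QP solver."""
    N = len(y)
    Xs = y[:, None] * X                       # signed data matrix
    Q = Xs @ Xs.T + 1e-10 * np.eye(N)         # G_nm = y_n y_m x_n^t x_m (+ tiny ridge)
    p = -np.ones(N)
    solvers.options['show_progress'] = False
    # cvxopt solves: min 1/2 a^t Q a + p^t a  s.t.  G_in a <= h_in,  A_eq a = b_eq
    sol = solvers.qp(matrix(Q), matrix(p),
                     matrix(-np.eye(N)), matrix(np.zeros(N)),   # alpha >= 0
                     matrix(y.reshape(1, -1)), matrix(0.0))     # y^t alpha = 0
    alpha = np.ravel(sol['x'])
    sv = alpha > 1e-8                         # support vectors
    w = (alpha[sv] * y[sv]) @ X[sv]
    s = np.flatnonzero(sv)[0]                 # any support vector index
    b = y[s] - w @ X[s]
    return w, b, alpha

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b, alpha = hard_margin_svm_dual(X, y)
print(w, b, alpha)    # expect roughly w = (1, -1), b = -1, alpha = (1/2, 1/2, 1, 0)
```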

  11. Primal versus dual (non-separable).
     Primal: $\min_{b,w,\xi}\ \frac{1}{2}w^t w + C\sum_{n=1}^{N}\xi_n$ subject to $y_n(w^t x_n + b) \ge 1 - \xi_n$ and $\xi_n \ge 0$ for $n = 1,\dots,N$; $N + d + 1$ optimization variables $(b, w, \xi)$; $g(x) = \mathrm{sign}(w^t x + b)$.
     Dual: $\min_{\alpha}\ \frac{1}{2}\alpha^t G\alpha - 1^t\alpha$ subject to $y^t\alpha = 0$ and $0 \le \alpha \le C$; $N$ optimization variables $\alpha$. The only change from the separable dual is the upper bound $\alpha \le C$.
     Recover $w^* = \sum_{n=1}^{N}\alpha_n^* y_n x_n$ and $b^* = y_s - w^{*t} x_s$ for any $s$ with $0 < \alpha_s^* < C$, so $g(x) = \mathrm{sign}(w^{*t} x + b^*) = \mathrm{sign}\!\left(\sum_{n=1}^{N}\alpha_n^* y_n\, x_n^t(x - x_s) + y_s\right)$.
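In the QP sketch given after slide 10, the soft-margin version changes only the inequality block: $\alpha \ge 0$ becomes $0 \le \alpha \le C$. A minimal sketch of that change, assuming the same cvxopt-style interface (inequalities written as G_in a <= h_in):

```python
import numpy as np

def soft_margin_inequalities(N, C):
    """Box constraint 0 <= alpha <= C, written as -alpha <= 0 and alpha <= C."""
    G_in = np.vstack([-np.eye(N), np.eye(N)])
    h_in = np.concatenate([np.zeros(N), C * np.ones(N)])
    return G_in, h_in

# Everything else in the dual QP is unchanged; b* is then computed from a
# margin support vector with 0 < alpha_s* < C.
```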

  12. The dual SVM is an inner-product algorithm ($\mathcal{X}$-space).
     $\min_{\alpha}\ \frac{1}{2}\alpha^t G\alpha - 1^t\alpha$ subject to $y^t\alpha = 0$, $0 \le \alpha \le C$, with $G_{nm} = y_n y_m (x_n^t x_m)$.
     $g(x) = \mathrm{sign}\!\left(\sum_{\alpha_n^* > 0}\alpha_n^* y_n (x_n^t x) + b^*\right)$, where $b^* = y_s - \sum_{\alpha_n^* > 0}\alpha_n^* y_n (x_n^t x_s)$ for some $s$ with $0 < \alpha_s^* < C$. The data appear only through inner products.

  13. The dual SVM is an inner-product algorithm ($\mathcal{Z}$-space).
     $\min_{\alpha}\ \frac{1}{2}\alpha^t G\alpha - 1^t\alpha$ subject to $y^t\alpha = 0$, $0 \le \alpha \le C$, with $G_{nm} = y_n y_m (z_n^t z_m)$, where $z_n = \Phi(x_n)$.
     $g(x) = \mathrm{sign}\!\left(\sum_{\alpha_n^* > 0}\alpha_n^* y_n (z_n^t z) + b^*\right)$, where $b^* = y_s - \sum_{\alpha_n^* > 0}\alpha_n^* y_n (z_n^t z_s)$ for some $s$ with $0 < \alpha_s^* < C$, and $z = \Phi(x)$.
     Can we compute $z^t z'$ without needing $z = \Phi(x)$, i.e., without visiting $\mathcal{Z}$-space? Can we compute $z^t z'$ efficiently?
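The point of slides 12 and 13 is that the dual touches the data only through inner products. A small sketch (the names `gram`, `inner`, and `phi` are my own): the same code builds $G$ in $\mathcal{X}$-space or in $\mathcal{Z}$-space, depending solely on which inner-product routine is plugged in. The $\mathcal{Z}$-space version below still computes $z^t z'$ by explicitly visiting $\mathcal{Z}$-space, which is exactly the step the slide asks whether we can avoid.

```python
import numpy as np

def gram(X, y, inner):
    """G_nm = y_n y_m inner(x_n, x_m): the dual sees the data only through 'inner'."""
    N = len(y)
    G = np.empty((N, N))
    for n in range(N):
        for m in range(N):
            G[n, m] = y[n] * y[m] * inner(X[n], X[m])
    return G

# X-space inner product: x^t x'
x_dot = lambda x, xp: x @ xp

# Z-space inner product computed the naive way, by explicitly transforming
# (phi is an illustrative 2nd-order transform)
def phi(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

z_dot = lambda x, xp: phi(x) @ phi(xp)

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(gram(X, y, x_dot))   # G for the X-space dual (slide 12)
print(gram(X, y, z_dot))   # G for the Z-space dual (slide 13); rest of the QP unchanged
```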

  14. The dual SVM is an inner-product algorithm ($\mathcal{Z}$-space, continued).
     The question, restated: can we compute $z^t z'$ without needing $z = \Phi(x)$, i.e., without ever visiting $\mathcal{Z}$-space? That is where the kernel comes in (next: The Kernel).
