
Introduction to Machine Learning - CS725. Instructor: Prof. Ganesh Ramakrishnan.



  1. Introduction to Machine Learning - CS725. Instructor: Prof. Ganesh Ramakrishnan. Lecture 13 - KKT Conditions, Duality, SVR Dual

  2. KKT conditions for SVR

The Lagrangian is
$$L(\mathbf{w}, b, \xi, \xi^*, \alpha, \alpha^*, \mu, \mu^*) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*) + \sum_{i=1}^{m}\alpha_i\left(y_i - \mathbf{w}^\top\varphi(\mathbf{x}_i) - b - \epsilon - \xi_i\right) + \sum_{i=1}^{m}\alpha_i^*\left(b + \mathbf{w}^\top\varphi(\mathbf{x}_i) - y_i - \epsilon - \xi_i^*\right) - \sum_{i=1}^{m}\mu_i\xi_i - \sum_{i=1}^{m}\mu_i^*\xi_i^*$$

Differentiating the Lagrangian w.r.t. $\mathbf{w}$: $\mathbf{w} - \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)\varphi(\mathbf{x}_i) = 0$, i.e., $\mathbf{w} = \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)\varphi(\mathbf{x}_i)$
Differentiating the Lagrangian w.r.t. $\xi_i$: $C - \alpha_i - \mu_i = 0$, i.e., $\alpha_i + \mu_i = C$
Differentiating the Lagrangian w.r.t. $\xi_i^*$: $\alpha_i^* + \mu_i^* = C$
Differentiating the Lagrangian w.r.t. $b$: $\sum_{i=1}^{m}(\alpha_i^* - \alpha_i) = 0$
Complementary slackness: $\alpha_i\left(y_i - \mathbf{w}^\top\varphi(\mathbf{x}_i) - b - \epsilon - \xi_i\right) = 0$ AND $\mu_i\xi_i = 0$ AND $\alpha_i^*\left(b + \mathbf{w}^\top\varphi(\mathbf{x}_i) - y_i - \epsilon - \xi_i^*\right) = 0$ AND $\mu_i^*\xi_i^* = 0$
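The stationarity conditions above determine $\mathbf{w}$ directly from the dual variables. Below is a minimal sketch of that relationship, assuming the identity feature map $\varphi(\mathbf{x}) = \mathbf{x}$ (a linear kernel); the dual values are hand-picked purely for illustration, not the output of any solver.

```python
import numpy as np

# Sketch of the stationarity conditions, assuming phi(x) = x (linear kernel).
# alpha and alpha_star are hypothetical values chosen only to illustrate the
# algebra; a real solver would produce them from the dual problem.
X = np.array([[0.0], [1.0], [2.0], [3.0]])        # m = 4 training inputs
alpha      = np.array([0.0, 0.3, 0.0, 0.2])
alpha_star = np.array([0.2, 0.0, 0.3, 0.0])

# w = sum_i (alpha_i - alpha_i^*) phi(x_i)
w = ((alpha - alpha_star)[:, None] * X).sum(axis=0)

# Condition from differentiating w.r.t. b: sum_i (alpha_i^* - alpha_i) = 0
print(w, np.isclose((alpha_star - alpha).sum(), 0.0))
```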

  3. For Support Vector Regression, since the original objective and the constraints are convex, any $(\mathbf{w}, b, \alpha, \alpha^*, \mu, \mu^*, \xi, \xi^*)$ that satisfies the necessary KKT conditions is optimal (the conditions are also sufficient).

  4. Some observations

$\alpha_i, \alpha_i^* \geq 0$, $\mu_i, \mu_i^* \geq 0$, $\alpha_i + \mu_i = C$ and $\alpha_i^* + \mu_i^* = C$
Thus, $\alpha_i, \mu_i, \alpha_i^*, \mu_i^* \in [0, C]$, $\forall i$
If $0 < \alpha_i < C$, then $0 < \mu_i < C$ (since $\alpha_i + \mu_i = C$)
$\mu_i\xi_i = 0$ and $\alpha_i(y_i - \mathbf{w}^\top\varphi(\mathbf{x}_i) - b - \epsilon - \xi_i) = 0$ are complementary slackness conditions
So $0 < \alpha_i < C$ implies $\mu_i > 0$, hence $\xi_i = 0$; and since $\alpha_i > 0$, also $y_i - \mathbf{w}^\top\varphi(\mathbf{x}_i) - b = \epsilon + \xi_i = \epsilon$
All such points lie on the boundary of the $\epsilon$ band
Using any point $\mathbf{x}_j$ on the margin (that is, with $\alpha_j \in (0, C)$), we can recover $b$ as $b = y_j - \mathbf{w}^\top\varphi(\mathbf{x}_j) - \epsilon$ (a code sketch follows below)
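A minimal sketch of this recovery of $b$, again assuming a linear kernel and dual variables already obtained elsewhere (e.g. from a QP solver); the helper name `recover_b` is hypothetical.

```python
import numpy as np

# Recover b from a margin point, assuming phi(x) = x and dual variables
# alpha, alpha_star already computed by some solver (hypothetical inputs).
def recover_b(X, y, alpha, alpha_star, C, eps, tol=1e-8):
    w = ((alpha - alpha_star)[:, None] * X).sum(axis=0)   # w = sum (a_i - a_i*) x_i
    # Points with 0 < alpha_j < C have xi_j = 0, so y_j - w.x_j - b = eps there.
    on_margin = np.where((alpha > tol) & (alpha < C - tol))[0]
    j = on_margin[0]                                       # any such point will do
    return y[j] - X[j] @ w - eps
```

In practice one often averages this estimate over all margin points rather than using a single $j$, which is more robust to numerical error in the solver.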

  5. Support Vector Regression Dual Objective

  6. Weak Duality

$$L^*(\alpha, \alpha^*, \mu, \mu^*) = \min_{\mathbf{w}, b, \xi, \xi^*} L(\mathbf{w}, b, \xi, \xi^*, \alpha, \alpha^*, \mu, \mu^*)$$

By the weak duality theorem, we have:
$$\min_{\mathbf{w}, b, \xi, \xi^*} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*) \;\geq\; L^*(\alpha, \alpha^*, \mu, \mu^*)$$
s.t. $y_i - \mathbf{w}^\top\varphi(\mathbf{x}_i) - b \leq \epsilon + \xi_i$, $\mathbf{w}^\top\varphi(\mathbf{x}_i) + b - y_i \leq \epsilon + \xi_i^*$, and $\xi_i, \xi_i^* \geq 0$, $\forall i = 1, \ldots, m$

The above is true for any $\alpha_i, \alpha_i^* \geq 0$ and $\mu_i, \mu_i^* \geq 0$. Thus,

  7. Weak Duality (continued)

Since the bound holds for any $\alpha_i, \alpha_i^* \geq 0$ and $\mu_i, \mu_i^* \geq 0$, we can take the maximum over the dual variables:
$$\min_{\mathbf{w}, b, \xi, \xi^*} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*) \;\geq\; \max_{\alpha, \alpha^*, \mu, \mu^*} L^*(\alpha, \alpha^*, \mu, \mu^*)$$
s.t. $y_i - \mathbf{w}^\top\varphi(\mathbf{x}_i) - b \leq \epsilon + \xi_i$, $\mathbf{w}^\top\varphi(\mathbf{x}_i) + b - y_i \leq \epsilon + \xi_i^*$, and $\xi_i, \xi_i^* \geq 0$, $\forall i = 1, \ldots, m$

  8. Dual objective

$$L^*(\alpha, \alpha^*, \mu, \mu^*) = \min_{\mathbf{w}, b, \xi, \xi^*} L(\mathbf{w}, b, \xi, \xi^*, \alpha, \alpha^*, \mu, \mu^*)$$

In the case of SVR, we have a strictly convex objective and linear constraints, so the KKT conditions are necessary and sufficient and strong duality holds:
$$\min_{\mathbf{w}, b, \xi, \xi^*} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*) \;=\; \max_{\alpha, \alpha^*, \mu, \mu^*} L^*(\alpha, \alpha^*, \mu, \mu^*)$$
s.t. $y_i - \mathbf{w}^\top\varphi(\mathbf{x}_i) - b \leq \epsilon + \xi_i$, $\mathbf{w}^\top\varphi(\mathbf{x}_i) + b - y_i \leq \epsilon + \xi_i^*$, and $\xi_i, \xi_i^* \geq 0$, $\forall i = 1, \ldots, m$

This value is attained precisely at the $(\mathbf{w}, b, \xi, \xi^*, \alpha, \alpha^*, \mu, \mu^*)$ that satisfies the necessary (and sufficient) KKT optimality conditions. Given strong duality, we can equivalently solve $\max_{\alpha, \alpha^*, \mu, \mu^*} L^*(\alpha, \alpha^*, \mu, \mu^*)$.
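Strong duality can be checked numerically: at the optimum the primal and dual objective values coincide. The sketch below assumes a linear kernel, dual variables coming from some solver (such as the QP sketch near the end of these notes), and uses the simplified dual objective derived on the following slides; the helper name `primal_dual_gap` is hypothetical.

```python
import numpy as np

# Check the primal-dual gap for SVR, assuming phi(x) = x and dual variables
# alpha, alpha_star (and an offset b) obtained from some solver.
def primal_dual_gap(X, y, alpha, alpha_star, b, C, eps):
    d = alpha - alpha_star
    w = X.T @ d                                    # w = sum_i (a_i - a_i*) x_i
    resid = y - X @ w - b
    xi      = np.maximum(0.0,  resid - eps)        # slack above the eps tube
    xi_star = np.maximum(0.0, -resid - eps)        # slack below the eps tube
    primal = 0.5 * w @ w + C * np.sum(xi + xi_star)
    K = X @ X.T                                    # linear-kernel Gram matrix
    dual = -0.5 * d @ K @ d - eps * np.sum(alpha + alpha_star) + y @ d
    # Nonnegative for any dual-feasible (alpha, alpha_star) by weak duality,
    # and approximately zero at the optimum by strong duality.
    return primal - dual
```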

  9. Recall the Lagrangian:
$$L(\mathbf{w}, b, \xi, \xi^*, \alpha, \alpha^*, \mu, \mu^*) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*) + \sum_{i=1}^{m}\alpha_i\left(y_i - \mathbf{w}^\top\varphi(\mathbf{x}_i) - b - \epsilon - \xi_i\right) + \sum_{i=1}^{m}\alpha_i^*\left(\mathbf{w}^\top\varphi(\mathbf{x}_i) + b - y_i - \epsilon - \xi_i^*\right) - \sum_{i=1}^{m}(\mu_i\xi_i + \mu_i^*\xi_i^*)$$

We eliminate $\mathbf{w}, b, \xi_i, \xi_i^*$ in favour of $\alpha$, $\alpha^*$, $\mu$ and $\mu^*$ by using the KKT conditions derived earlier: $\mathbf{w} = \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)\varphi(\mathbf{x}_i)$, $\sum_{i=1}^{m}(\alpha_i - \alpha_i^*) = 0$, $\alpha_i + \mu_i = C$ and $\alpha_i^* + \mu_i^* = C$.
Thus, we get:

  10. Substituting $\mathbf{w} = \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)\varphi(\mathbf{x}_i)$ into the Lagrangian gives:
$$L(\mathbf{w}, b, \xi, \xi^*, \alpha, \alpha^*, \mu, \mu^*) = \frac{1}{2}\sum_{i}\sum_{j}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\varphi^\top(\mathbf{x}_i)\varphi(\mathbf{x}_j) + \sum_{i}\bigl(\xi_i(C - \alpha_i - \mu_i) + \xi_i^*(C - \alpha_i^* - \mu_i^*)\bigr) - b\sum_{i}(\alpha_i - \alpha_i^*) - \epsilon\sum_{i}(\alpha_i + \alpha_i^*) + \sum_{i} y_i(\alpha_i - \alpha_i^*) - \sum_{i}\sum_{j}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\varphi^\top(\mathbf{x}_i)\varphi(\mathbf{x}_j)$$

  11. Since $\alpha_i + \mu_i = C$, $\alpha_i^* + \mu_i^* = C$ and $\sum_{i}(\alpha_i - \alpha_i^*) = 0$, the $\xi$, $\xi^*$ and $b$ terms vanish, leaving:
$$= -\frac{1}{2}\sum_{i}\sum_{j}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\varphi^\top(\mathbf{x}_i)\varphi(\mathbf{x}_j) - \epsilon\sum_{i}(\alpha_i + \alpha_i^*) + \sum_{i} y_i(\alpha_i - \alpha_i^*)$$
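For reference, the simplified dual objective can be written as a small function of the dual variables. This is a sketch assuming a precomputed Gram matrix $K$ with $K_{ij} = \varphi^\top(\mathbf{x}_i)\varphi(\mathbf{x}_j)$; the function name is hypothetical.

```python
import numpy as np

# The simplified SVR dual objective above, as a function of the dual variables.
# K is the m x m Gram matrix, y the targets, eps the tube width.
def svr_dual_objective(alpha, alpha_star, K, y, eps):
    d = alpha - alpha_star
    return -0.5 * d @ K @ d - eps * np.sum(alpha + alpha_star) + y @ d
```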

  12. Kernel function: $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi^\top(\mathbf{x}_i)\varphi(\mathbf{x}_j)$

$\mathbf{w} = \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)\varphi(\mathbf{x}_i)$ $\Rightarrow$ the final decision function is
$$f(\mathbf{x}) = \mathbf{w}^\top\varphi(\mathbf{x}) + b = \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)K(\mathbf{x}_i, \mathbf{x}) + y_j - \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)K(\mathbf{x}_i, \mathbf{x}_j) - \epsilon$$
where $\mathbf{x}_j$ is any point with $\alpha_j \in (0, C)$.
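A sketch of this kernelized prediction in code. The RBF kernel and its width `gamma` are assumptions made for the example; any kernel of the form $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi^\top(\mathbf{x}_i)\varphi(\mathbf{x}_j)$ plugs in the same way, and `svr_predict` is a hypothetical helper name.

```python
import numpy as np

# Kernelized SVR prediction matching the decision function above.
# The RBF kernel is an assumption for this sketch; any valid kernel works.
def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)                     # K[i, j] = k(A[i], B[j])

def svr_predict(X_train, alpha, alpha_star, b, X_new, gamma=0.5):
    # f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b
    K = rbf_kernel(X_train, X_new, gamma)
    return (alpha - alpha_star) @ K + b
```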

  13. Recall the similarity with the kernelized expression for Ridge Regression. The dual optimization problem to compute the $\alpha$'s for SVR is:
$$\max_{\alpha, \alpha^*} \; -\frac{1}{2}\sum_{i}\sum_{j}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(\mathbf{x}_i, \mathbf{x}_j) - \epsilon\sum_{i}(\alpha_i + \alpha_i^*) + \sum_{i} y_i(\alpha_i - \alpha_i^*)$$
$$\text{s.t.} \quad \sum_{i=1}^{m}(\alpha_i - \alpha_i^*) = 0, \quad \alpha_i, \alpha_i^* \in [0, C], \; \forall i$$
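The slides end here, so the following is a minimal sketch of solving this dual as a standard QP under the stated constraints. It assumes the cvxopt package, a precomputed Gram matrix K, and a tiny diagonal jitter for numerical stability; the helper name `solve_svr_dual` is hypothetical, and a production implementation (e.g. an SMO-style solver) would differ.

```python
import numpy as np
from cvxopt import matrix, solvers

# Solve the SVR dual as a QP in z = [alpha; alpha_star] (2m variables):
# minimizing (1/2) z^T P z + q^T z is equivalent to maximizing the dual above.
def solve_svr_dual(K, y, C, eps):
    m = len(y)
    P = np.block([[K, -K], [-K, K]]) + 1e-8 * np.eye(2 * m)   # PSD + jitter
    q = np.concatenate([eps - y, eps + y])                    # linear term
    G = np.vstack([-np.eye(2 * m), np.eye(2 * m)])            # box constraints
    h = np.concatenate([np.zeros(2 * m), C * np.ones(2 * m)]) # 0 <= z <= C
    A = np.concatenate([np.ones(m), -np.ones(m)])[None, :]    # sum(a - a*) = 0
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h),
                     matrix(A), matrix(0.0))
    z = np.array(sol['x']).ravel()
    return z[:m], z[m:]                                       # alpha, alpha_star
```

With $\alpha, \alpha^*$ in hand, $b$ is recovered from any margin point as on slide 4, and predictions follow the kernelized decision function on slide 12.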
