Lecture 26: Support Vector Classification, Unsupervised Learning

Lecture 26: Support Vector Classification, Unsupervised Learning - PowerPoint PPT Presentation



1. Lecture 26: Support Vector Classification, Unsupervised Learning. Instructor: Prof. Ganesh Ramakrishnan. October 27, 2016.

2. Support Vector Classification

3. The perceptron does not find the best separating hyperplane; it finds any separating hyperplane. In case the initial w does not classify all the examples, the separating hyperplane corresponding to the final w∗ will often pass through an example. The separating hyperplane does not provide enough breathing space; this is what SVMs address, and we already saw that for regression! We now quickly do the same for classification.

4. Support Vector Classification: Separable Case. With w, φ ∈ ℝ^m:
   w^⊤ φ(x) + b ≥ +1 for y = +1
   w^⊤ φ(x) + b ≤ −1 for y = −1
   There is a large margin separating the +ve and −ve examples.
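As a quick illustration, here is a minimal numpy sketch that checks these separable-case constraints for a candidate (w, b). The toy data and the choice of the identity feature map φ(x) = x are assumptions made only for this example, not part of the lecture.

```python
# Check the separable-case constraints w^T phi(x) + b >= +1 (y = +1) and
# w^T phi(x) + b <= -1 (y = -1) for a candidate hyperplane (w, b).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -1.0], [-1.5, -2.5]])  # phi(x) = x
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0   # a candidate separating hyperplane

scores = X @ w + b
satisfied = y * scores >= 1        # combined form: y (w^T x + b) >= 1
print(satisfied)                   # True for every example => separated with margin
```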

5. Support Vector Classification: Non-separable Case. When the examples are not linearly separable, we need to consider the slackness ξ_i (always ≥ 0) of each example x^(i) (how far a misclassified point is from the separating hyperplane):
   w^⊤ φ(x^(i)) + b ≥ +1 − ξ_i (for y^(i) = +1)
   w^⊤ φ(x^(i)) + b ≤ −1 + ξ_i (for y^(i) = −1)
   Multiplying both sides by y^(i), we get: y^(i) (w^⊤ φ(x^(i)) + b) ≥ 1 − ξ_i, ∀ i = 1, …, n
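A small numeric sketch of this idea: for a fixed candidate (w, b), the smallest slack that satisfies the combined constraint is ξ_i = max(0, 1 − y^(i)(w^⊤ φ(x^(i)) + b)). The data points below are assumptions for illustration, again with φ taken to be the identity map.

```python
# Compute the minimal slacks xi_i = max(0, 1 - y^(i) (w^T x^(i) + b)) for a given (w, b).
import numpy as np

X = np.array([[2.0, 2.0], [0.2, -0.1], [-2.0, -1.0], [1.0, 0.5]])
y = np.array([+1, +1, -1, -1])     # the last point lies on the wrong side of the hyperplane
w, b = np.array([1.0, 1.0]), 0.0

margins = y * (X @ w + b)          # y^(i) (w^T x^(i) + b)
xi = np.maximum(0.0, 1.0 - margins)
print(xi)                          # zero slack where the margin constraint already holds
```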

6. Maximize the margin: we maximize (φ(x+) − φ(x−))^⊤ [w / ∥w∥]. Here, x+ and x− lie on the boundaries of the margin. Recall that w is perpendicular to the separating surface. We project the vectors φ(x+) and φ(x−) onto w and normalize by ∥w∥, since we are only concerned with the direction of w and not its magnitude.

7. Simplifying the margin expression. Maximize the margin (φ(x+) − φ(x−))^⊤ [w / ∥w∥].
   At x+: y+ = +1, ξ+ = 0; hence (w^⊤ φ(x+) + b) = 1 … (1)
   At x−: y− = −1, ξ− = 0; hence −(w^⊤ φ(x−) + b) = 1 … (2)
   Adding (2) to (1): w^⊤ (φ(x+) − φ(x−)) = 2.
   Thus, the margin expression to maximize is 2 / ∥w∥.
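A rough numerical check of this identity (my own construction, not from the slides): fit a linear SVM on separable toy data with a large C so the slacks stay near zero, then compare the projection (φ(x+) − φ(x−))^⊤ [w / ∥w∥] with 2 / ∥w∥. scikit-learn is assumed to be available, and φ is the identity map.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

sv = clf.support_vectors_
sv_labels = y[clf.support_]
x_plus = sv[sv_labels == +1][0]               # a support vector on the +1 boundary
x_minus = sv[sv_labels == -1][0]              # a support vector on the -1 boundary

proj = (x_plus - x_minus) @ (w / np.linalg.norm(w))
print(proj, 2.0 / np.linalg.norm(w))          # the two numbers should (roughly) agree
```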

8. Formulating the objective. Problem at hand: find w∗, b∗ that maximize the margin:
   (w∗, b∗) = argmax_{w,b} 2 / ∥w∥ s.t. y^(i) (w^⊤ φ(x^(i)) + b) ≥ 1 − ξ_i and ξ_i ≥ 0, ∀ i = 1, …, n
   However, as ξ_i → ∞, 1 − ξ_i → −∞. Thus, with arbitrarily large values of ξ_i, the constraints become easily satisfiable for any w, which defeats the purpose. Hence, we also want to minimize the ξ_i's, e.g., minimize ∑ ξ_i.

9. Objective:
   (w∗, b∗, ξ∗_i) = argmin_{w,b,ξ_i} (1/2) ∥w∥² + C ∑_{i=1}^{n} ξ_i
   s.t. y^(i) (w^⊤ φ(x^(i)) + b) ≥ 1 − ξ_i and ξ_i ≥ 0, ∀ i = 1, …, n
   Instead of maximizing 2 / ∥w∥, we minimize (1/2) ∥w∥², since 2 / ∥w∥ is monotonically decreasing with respect to (1/2) ∥w∥².
   C determines the trade-off between the error ∑ ξ_i and the margin 2 / ∥w∥.
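This primal can be written down almost verbatim as a convex program. Below is a minimal sketch using cvxpy (the library, the toy data, and the identity feature map are assumptions for illustration only): minimize (1/2)∥w∥² + C ∑ ξ_i subject to y^(i)(w^⊤ x^(i) + b) ≥ 1 − ξ_i and ξ_i ≥ 0.

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 1.0], [0.3, 0.2], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, +1, -1, -1])
n, m = X.shape
C = 1.0

w = cp.Variable(m)
b = cp.Variable()
xi = cp.Variable(n)

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # learned hyperplane
print(xi.value)           # per-example slacks; a larger C pushes these toward 0
```

Varying C illustrates the trade-off stated on the slide: large C penalizes slack heavily (narrower margin, fewer violations), small C tolerates slack (wider margin).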

10. Support Vector Machines: Dual Objective

11. 2 Approaches to Showing Kernelized Form for Dual (generalized from the derivation of Kernel Logistic Regression, Tutorial 7, Problem 3). See http://qwone.com/~jason/writing/kernel.pdf for a list of kernelized objectives.
   Approach 1: The Reproducing Kernel Hilbert Space and the Representer Theorem.
   Approach 2: Derive using first principles (provided for completeness in Tutorial 9).

12. Approach 1: Special case of the Representer Theorem & Reproducing Kernel Hilbert Space (RKHS).
   Let X be the space of examples such that D = {x^(1), x^(2), …, x^(m)} ⊆ X and, for any x ∈ X, K(·, x) : X → ℜ.
   (Optional) The solution f∗ ∈ H (Hilbert space) to the following problem
     f∗ = argmin_{f ∈ H} ∑_{i=1}^{m} E(f(x^(i)), y^(i)) + Ω(∥f∥_K)
   can always be written as f∗(x) = ∑_{i=1}^{m} α_i K(x, x^(i)), provided Ω(∥f∥_K) is a monotonically increasing function of ∥f∥_K. H is the Hilbert space and K(·, x) : X → ℜ is called the Reproducing (RKHS) Kernel. (Generalized from the derivation of Kernel Logistic Regression, Tutorial 7, Problem 3; see http://qwone.com/~jason/writing/kernel.pdf for a list of kernelized objectives. Proof provided in the optional slide deck at the end.)
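One concrete instance of this form (my own illustrative sketch, not from the lecture) is kernel ridge regression: squared loss E plus a monotone penalty on ∥f∥_K. scikit-learn exposes the expansion coefficients α_i as dual_coef_, so we can verify numerically that the learned function really is f∗(x) = ∑_i α_i K(x, x^(i)); the toy data and kernel parameters below are assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)

model = KernelRidge(kernel="rbf", gamma=0.5, alpha=1.0).fit(X, y)

x_new = rng.normal(size=(3, 2))
pred_library = model.predict(x_new)
# rebuild the prediction by hand from the kernel expansion over the training points
pred_manual = rbf_kernel(x_new, X, gamma=0.5) @ model.dual_coef_
print(np.allclose(pred_library, pred_manual))   # True: f*(x) = sum_i alpha_i K(x, x^(i))
```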

13. Approach 1: Special case of the Representer Theorem & Reproducing Kernel Hilbert Space (RKHS), contd.
   (Optional) The solution f∗ ∈ H (Hilbert space) to the following problem
     f∗ = argmin_{f ∈ H} ∑_{i=1}^{m} E(f(x^(i)), y^(i)) + Ω(∥f∥_K)
   can always be written as f∗(x) = ∑_{i=1}^{m} α_i K(x, x^(i)), provided Ω(∥f∥_K) is a ….
   More specifically, if f(x) = w^T φ(x) + b and K(x′, x) = φ^T(x) φ(x′), then the solution w∗ ∈ ℜ^n to the following problem
     (w∗, b∗) = argmin_{w,b} ∑_{i=1}^{m} E(f(x^(i)), y^(i)) + Ω(∥w∥²)
   can always be written as φ^T(x) w∗ + b = ∑_{i=1}^{m} α_i K(x, x^(i)), provided Ω(∥w∥²) is a monotonically increasing function of ∥w∥². ℜ^(n+1) is the Hilbert space and K(·, x) : X → ℜ is the Reproducing (RKHS) Kernel.
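For the SVM specifically, this is exactly the form a fitted kernel classifier exposes: the decision function is a kernel expansion over the support vectors. The sketch below (toy data and kernel settings are assumptions; scikit-learn is assumed available) rebuilds the decision function from dual_coef_, which stores the signed α_i, and checks it against the library's output.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_new = rng.normal(size=(5, 2))
f_library = clf.decision_function(x_new)
# rebuild the decision function from the kernel expansion over support vectors
K = rbf_kernel(x_new, clf.support_vectors_, gamma=0.5)
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]
print(np.allclose(f_library, f_manual))   # True: the dual solution is a kernel expansion
```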
