Support Vector Machines (SVMs). Semi-Supervised Learning.


  1. Support Vector Machines (SVMs) • Semi-Supervised Learning • Semi-Supervised SVMs. Maria-Florina Balcan, 03/25/2015

  2. Support Vector Machines (SVMs). One of the most theoretically well-motivated and practically most effective classification algorithms in machine learning. Directly motivated by Margins and Kernels!

  3. Geometric Margin. WLOG consider homogeneous linear separators [w_0 = 0]. Definition: the margin of example x w.r.t. a linear separator w is the distance from x to the plane w ⋅ x = 0. If ||w|| = 1, the margin of x w.r.t. w is |w ⋅ x|. (Figure: the margins of two examples x_1 and x_2 w.r.t. the separator w.)

  4. Geometric Margin. Definition: the margin of example x w.r.t. a linear separator w is the distance from x to the plane w ⋅ x = 0. Definition: the margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S. Definition: the margin γ of a set of examples S is the maximum γ_w over all linear separators w. (Figure: positive and negative examples at distance γ on either side of the separator w.)
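
A minimal numpy sketch of these two definitions, on toy data of my own; margin_of_example and margin_of_set are hypothetical helper names, not anything from the deck. Dividing by ||w|| makes the computation valid for any nonzero w, which reduces to |w ⋅ x| when ||w|| = 1.

```python
import numpy as np

def margin_of_example(w, x):
    # Distance from x to the plane w . x = 0; equals |w . x| when ||w|| = 1.
    return abs(np.dot(w, x)) / np.linalg.norm(w)

def margin_of_set(w, X):
    # gamma_w: the smallest margin over all examples (rows of X).
    return min(margin_of_example(w, x) for x in X)

w = np.array([3.0, 4.0])                                  # any nonzero separator
X = np.array([[1.0, 2.0], [2.0, -1.0], [0.5, 0.5]])       # toy examples
print(margin_of_example(w, X[0]))                         # margin of one example
print(margin_of_set(w, X))                                # gamma_w for the whole set
```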

  5. Margin: Important Theme in ML. Both sample-complexity and algorithmic implications. Sample/mistake-bound complexity: if the margin is large, the number of mistakes the Perceptron makes is small (independent of the dimension of the space)! If the margin γ is large and the algorithm produces a large-margin classifier, then the amount of data needed depends only on R/γ [Bartlett & Shawe-Taylor '99]. Algorithmic implications: this suggests searching for a large-margin classifier... SVMs.
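
As a small illustration of the mistake-bound claim, here is a homogeneous Perceptron sketch on a toy large-margin dataset of my own; perceptron_mistakes is a hypothetical helper, not code from the deck.

```python
import numpy as np

def perceptron_mistakes(X, y, passes=10):
    # Homogeneous Perceptron: count the mistakes it makes over a few passes.
    # On a large-margin dataset this count stays small, independent of dimension.
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(passes):
        for x, label in zip(X, y):
            if label * np.dot(w, x) <= 0:   # mistake: update w
                w += label * x
                mistakes += 1
    return w, mistakes

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])  # large-margin toy set
y = np.array([1, 1, -1, -1])
print(perceptron_mistakes(X, y))   # converges after very few mistakes
```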

  6. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. First, assume we know a lower bound on the margin γ. Input: γ, S = {(x_1, y_1), ..., (x_m, y_m)}. Find: some w where ||w||^2 = 1 and, for all i, y_i (w ⋅ x_i) ≥ γ. Output: w, a separator of margin γ over S. This is the realizable case, where the data is linearly separable by margin γ.
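
A small sketch of the realizable-case requirement above, written as a check of whether a candidate w (normalized to unit norm) certifies margin γ on S; certifies_margin and the toy data are my own, not the deck's.

```python
import numpy as np

def certifies_margin(w, X, y, gamma):
    # Realizable case: does the (normalized) w satisfy y_i (w . x_i) >= gamma for all i?
    w = w / np.linalg.norm(w)               # enforce ||w|| = 1
    return bool(np.all(y * (X @ w) >= gamma))

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(certifies_margin(np.array([1.0, 1.0]), X, y, gamma=1.0))   # True on this toy set
```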

  7. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. E.g., search for the best possible γ. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find: some w and the maximum γ where ||w||^2 = 1 and, for all i, y_i (w ⋅ x_i) ≥ γ. Output: the maximum-margin separator over S.

  8. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Maximize γ under the constraints: ||w||^2 = 1 and, for all i, y_i (w ⋅ x_i) ≥ γ.

  9. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Maximize γ (the objective function) under the constraints: ||w||^2 = 1 and, for all i, y_i (w ⋅ x_i) ≥ γ. This is a constrained optimization problem. A famous example of constrained optimization is linear programming, where the objective function is linear and the constraints are linear (in)equalities.
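
For concreteness, here is a tiny linear program solved with scipy.optimize.linprog (assuming SciPy is available; the numbers are my own and unrelated to SVMs), just to show what a linear objective with linear inequality constraints looks like in code.

```python
import numpy as np
from scipy.optimize import linprog

# Tiny linear program: minimize -x1 - x2
# subject to x1 + 2*x2 <= 4, 3*x1 + x2 <= 6, x1 >= 0, x2 >= 0.
c = np.array([-1.0, -1.0])                      # linear objective (linprog minimizes)
A_ub = np.array([[1.0, 2.0], [3.0, 1.0]])       # linear inequality constraints
b_ub = np.array([4.0, 6.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, res.fun)                           # optimal point and objective value
```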

  10. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Maximize γ under the constraints: ||w||^2 = 1 and, for all i, y_i (w ⋅ x_i) ≥ γ. The constraint ||w||^2 = 1 is non-linear; in fact, it's even non-convex. (Figure: the unit circle w_1^2 + w_2^2 = 1 in the (w_1, w_2) plane.)

  11. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Maximize γ under the constraints: ||w||^2 = 1 and, for all i, y_i (w ⋅ x_i) ≥ γ. Let w' = w/γ; then maximizing γ is equivalent to minimizing ||w'||^2 (since ||w'||^2 = 1/γ^2). So, dividing both sides by γ and writing in terms of w', we get: Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Minimize ||w'||^2 under the constraints: for all i, y_i (w' ⋅ x_i) ≥ 1. (Figure: the separator w' with the margin hyperplanes w' ⋅ x = -1 and w' ⋅ x = 1.)
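
A quick numeric sanity check of this rescaling argument, on toy data of my own: for a unit-norm w with margin γ on S, the rescaled w' = w/γ has ||w'||^2 = 1/γ^2 and satisfies y_i (w' ⋅ x_i) ≥ 1 for all i.

```python
import numpy as np

# Toy data and a unit-norm separator w with margin gamma on it.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0]) / np.sqrt(2.0)

gamma = np.min(y * (X @ w))                           # margin of w on this set
w_prime = w / gamma                                   # the rescaled separator w' = w / gamma
print(np.linalg.norm(w_prime) ** 2, 1 / gamma ** 2)   # ||w'||^2 equals 1/gamma^2
print(np.all(y * (X @ w_prime) >= 1 - 1e-12))         # all constraints y_i (w' . x_i) >= 1 hold
```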

  12. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find argmin_w ||w||^2 s.t.: for all i, y_i (w ⋅ x_i) ≥ 1. This is a constrained optimization problem. The objective is convex (quadratic) and all constraints are linear, so it can be solved efficiently (in poly time) using standard quadratic programming (QP) software.
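
A sketch of this QP using the cvxpy modeling library (assuming it is installed); the toy separable data are my own, and this is one possible encoding rather than the deck's.

```python
import numpy as np
import cvxpy as cp   # convex-optimization modeling package (assumed available)

# Hard-margin SVM primal as a QP on toy, linearly separable data.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-2.5, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
objective = cp.Minimize(cp.sum_squares(w))            # ||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]            # y_i (w . x_i) >= 1 for all i
cp.Problem(objective, constraints).solve()

print(w.value)                                        # maximum-margin separator
print(1 / np.linalg.norm(w.value))                    # its geometric margin, gamma = 1/||w||
```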

  13. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Issue 1: we now have two objectives: maximize the margin and minimize the number of misclassifications. Answer 1: let's optimize their sum: minimize ||w||^2 + C (# misclassifications), where C is some trade-off constant. Issue 2: this is computationally hard (NP-hard), even if we didn't care about the margin and only minimized the number of mistakes [Guruswami-Raghavendra '06].

  15. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace "# mistakes" with an upper bound called the "hinge loss". The hard-margin problem (minimize ||w'||^2 s.t., for all i, y_i (w' ⋅ x_i) ≥ 1) becomes: Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find argmin_{w, ξ_1, ..., ξ_m} ||w||^2 + C Σ_i ξ_i s.t.: for all i, y_i (w ⋅ x_i) ≥ 1 − ξ_i and ξ_i ≥ 0. The ξ_i are "slack variables".
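
A sketch of this slack-variable formulation, again using cvxpy (assumed available) on toy data of my own in which one point of each class sits on the wrong side, so some ξ_i must be positive at the optimum.

```python
import numpy as np
import cvxpy as cp   # assumed available

# Soft-margin SVM primal with explicit slack variables. In this toy set one point of
# each class lies on the wrong side, so no separator through the origin is perfect.
X = np.array([[2.0, 2.0], [1.0, 1.5], [-1.0, -1.0],
              [-2.0, -2.0], [-1.0, -1.5], [1.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m, d = X.shape
C = 1.0                                               # trade-off constant

w = cp.Variable(d)
xi = cp.Variable(m)                                   # slack variables xi_i
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value)       # separator
print(xi.value)      # nonzero slack on the points with functional margin < 1
```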

  16. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace "# mistakes" with an upper bound called the "hinge loss". Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find argmin_{w, ξ_1, ..., ξ_m} ||w||^2 + C Σ_i ξ_i s.t.: for all i, y_i (w ⋅ x_i) ≥ 1 − ξ_i and ξ_i ≥ 0. The ξ_i are "slack variables". C controls the relative weighting between the twin goals of making ||w||^2 small (making the margin large) and ensuring that most examples have functional margin ≥ 1. Hinge loss: ℓ(w, x, y) = max(0, 1 − y (w ⋅ x)).

  17. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace "# mistakes" with an upper bound called the "hinge loss". Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find argmin_{w, ξ_1, ..., ξ_m} ||w||^2 + C Σ_i ξ_i s.t.: for all i, y_i (w ⋅ x_i) ≥ 1 − ξ_i and ξ_i ≥ 0. That is, replace the number of misclassifications with the hinge loss: instead of ||w||^2 + C (# misclassifications), minimize ||w||^2 + C Σ_i ℓ(w, x_i, y_i), where ℓ(w, x, y) = max(0, 1 − y (w ⋅ x)).
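
A small numpy sketch of the hinge loss (hinge_loss is my own helper name, and the data and w are made up), which also checks the property that justifies the replacement: the hinge loss upper-bounds the number of mistakes.

```python
import numpy as np

def hinge_loss(w, X, y):
    # l(w, x, y) = max(0, 1 - y (w . x)), computed per example
    return np.maximum(0.0, 1.0 - y * (X @ w))

X = np.array([[2.0, 2.0], [1.0, 1.5], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])            # the third point is misclassified by w below
w = np.array([0.5, 0.5])

losses = hinge_loss(w, X, y)
mistakes = (y * (X @ w) <= 0).astype(float)    # 0/1 loss
print(losses)                                  # per-example hinge loss
print(losses.sum() >= mistakes.sum())          # hinge loss upper-bounds # mistakes: True
```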

  18. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace "# mistakes" with an upper bound called the "hinge loss". Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find argmin_{w, ξ_1, ..., ξ_m} ||w||^2 + C Σ_i ξ_i s.t.: for all i, y_i (w ⋅ x_i) ≥ 1 − ξ_i and ξ_i ≥ 0. Σ_i ξ_i is the total amount we have to move the points to get them on the correct side of the lines w ⋅ x = +1/−1, where the distance between the lines w ⋅ x = 0 and w ⋅ x = 1 counts as "1 unit". Hinge loss: ℓ(w, x, y) = max(0, 1 − y (w ⋅ x)).

  19. What if the data is far from being linearly separable? Example (figure: two image classes shown side by side): no good linear separator in the pixel representation. SVM philosophy: "Use a Kernel".
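
A sketch of this philosophy using scikit-learn (assuming it is available): on a concentric-circles dataset, a linear SVM is near chance while an RBF-kernel SVM fits almost perfectly. The dataset and parameters are my own illustration, not the deck's example.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: far from linearly separable in the original coordinates.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
kernel_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear SVM training accuracy:", linear_svm.score(X, y))       # near chance
print("RBF-kernel SVM training accuracy:", kernel_svm.score(X, y))   # near 1.0
```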

  20. Support Vector Machines (SVMs). Primal form: Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find argmin_{w, ξ_1, ..., ξ_m} ||w||^2 + C Σ_i ξ_i s.t.: for all i, y_i (w ⋅ x_i) ≥ 1 − ξ_i and ξ_i ≥ 0. Which is equivalent to the Lagrangian dual: Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i ⋅ x_j) − Σ_i α_i s.t.: for all i, 0 ≤ α_i ≤ C, and Σ_i y_i α_i = 0.
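
A sketch of the Lagrangian dual as a QP in cvxpy (assumed available) on toy data of my own; the quadratic term is written as (1/2)||Σ_i α_i y_i x_i||^2, which equals the double sum above.

```python
import numpy as np
import cvxpy as cp   # assumed available

# Lagrangian dual of the soft-margin SVM as a QP on toy data.
X = np.array([[2.0, 2.0], [1.0, 1.5], [-2.0, -2.0], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
C = 1.0

alpha = cp.Variable(m)
# (1/2) sum_ij y_i y_j alpha_i alpha_j (x_i . x_j)  ==  (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Minimize(0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)) - cp.sum(alpha))
constraints = [alpha >= 0, alpha <= C, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()

print(alpha.value)   # dual variables alpha_i
```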

  21. SVMs (Lagrangian Dual). Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i ⋅ x_j) − Σ_i α_i s.t.: for all i, 0 ≤ α_i ≤ C, and Σ_i y_i α_i = 0. The final classifier is w = Σ_i α_i y_i x_i. The points x_i for which α_i ≠ 0 are called the "support vectors".
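
A sketch using scikit-learn's linear SVC to read off the support vectors and reconstruct w = Σ_i α_i y_i x_i (scikit-learn stores y_i α_i for the support vectors in dual_coef_). The toy data are my own; note that scikit-learn also fits an intercept term, which the homogeneous setting on these slides omits.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data; SVC stores y_i * alpha_i for the support vectors in dual_coef_,
# so w = sum_i alpha_i y_i x_i is dual_coef_ @ (support vectors).
X = np.array([[2.0, 2.0], [1.0, 1.5], [-2.0, -2.0], [-1.0, -1.5], [3.0, 1.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1, 1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

support_vectors = X[clf.support_]                 # the x_i with alpha_i != 0
w_from_alphas = clf.dual_coef_ @ support_vectors  # sum_i alpha_i y_i x_i
print(support_vectors)
print(w_from_alphas, clf.coef_)                   # the two agree
```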
