  1. Introduction to Support Vector Machines Andreas Maletti Technische Universität Dresden Fakultät Informatik June 15, 2006

  2. 1 The Problem 2 The Basics 3 The Proposed Solution

  3. Learning by Machines
     • Rote Learning: memorization (Hash tables)
     • Reinforcement: feedback at end (Q [Watkins 89])
     • Clustering: grouping data (CMLIB [Hartigan 75])
     • Induction: generalizing examples (ID3 [Quinlan 79])
     • Analogy: representation similarity (JUPA [Yvon 94])
     • Discovery: unsupervised, no goal
     • Genetic Alg.: simulated evolution (GABIL [DeJong 93])

  4. Supervised Learning
     Definition. Supervised Learning: given nontrivial training data (labels known), predict test data (labels unknown)
     Implementations
     • Rote Learning: Hash tables
     • Clustering: Nearest Neighbor [Cover, Hart 67]
     • Induction: Neural Networks [McCulloch, Pitts 43], Decision Trees [Hunt 66], SVMs [Vapnik et al 92]

  5. Problem Description—General Problem Classify a given input • binary classification: two classes • multi-class classification: several, but finitely many classes • regression: infinitely many classes Major Applications • Handwriting recognition • Cheminformatics (Quantitative Structure-Activity Relationship) • Pattern recognition • Spam detection (HP Labs, Palo Alto)

  6. Problem Description—Specific Electricity Load Prediction Challenge 2001 • Power plant that supports energy demand of a region • Excess production expensive • Load varies substantially • Challenge won by libSVM [Chang, Lin 06] Problem • given: load and temperature for 730 days ( ≈ 70kB data) • predict: load for the next 365 days

  7. Example Data: plot of the 1997 load (two curves, 12:00 and 24:00) against the day of the year; the load axis ranges from about 400 to 850.

  8. Problem Description—Formal Definition (cf. [Lin 01])
     Given a training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$ of correctly classified input data vectors $\vec{x} \in \mathbb{R}^n$, where
     • every input data vector appears at most once in $S$
     • there exist input data vectors $\vec{p}$ and $\vec{n}$ such that $(\vec{p}, 1) \in S$ as well as $(\vec{n}, -1) \in S$ (non-trivial),
     successfully classify unseen input data vectors.

  9. Linear Classification [Vapnik 63]
     • Given: a training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$
     • Goal: find a hyperplane that separates $\mathbb{R}^n$ into halves that contain only elements of one class

  10. Representation of the Hyperplane
      Definition (hyperplane): $\vec{n} \cdot (\vec{x} - \vec{x}_0) = 0$
      • $\vec{n} \in \mathbb{R}^n$: weight vector
      • $\vec{x} \in \mathbb{R}^n$: input vector
      • $\vec{x}_0 \in \mathbb{R}^n$: offset
      Alternatively: $\vec{w} \cdot \vec{x} + b = 0$
      Decision Function
      • training set $S = \{(\vec{x}_i, y_i) \mid 1 \le i \le k\}$
      • separating hyperplane $\vec{w} \cdot \vec{x} + b = 0$ for $S$
      Decision: $\vec{w} \cdot \vec{x}_i + b > 0$ if $y_i = 1$ and $\vec{w} \cdot \vec{x}_i + b < 0$ if $y_i = -1$, hence $f(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x} + b)$.
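
      The decision rule above reduces to a sign test. Below is a minimal sketch in Python (not from the slides); the weight vector and offset are hypothetical values chosen only for illustration:

      ```python
      import numpy as np

      def linear_decision(w, b, x):
          """f(x) = sgn(w . x + b): classify x with the separating hyperplane w . x + b = 0."""
          return 1 if np.dot(w, x) + b > 0 else -1

      # Hypothetical hyperplane x1 + x2 - 1 = 0 in R^2
      w, b = np.array([1.0, 1.0]), -1.0
      print(linear_decision(w, b, np.array([2.0, 2.0])))   # +1 (positive side)
      print(linear_decision(w, b, np.array([0.0, 0.0])))   # -1 (negative side)
      ```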

  11. Learn Hyperplane
      Problem
      • Given: training set $S$
      • Goal: coefficients $\vec{w}$ and $b$ of a separating hyperplane
      • Difficulty: several or no candidates for $\vec{w}$ and $b$
      Solution (cf. Vapnik's statistical learning theory)
      Select admissible $\vec{w}$ and $b$ with maximal margin (minimal distance to any input data vector).
      Observation
      We can scale $\vec{w}$ and $b$ such that $\vec{w} \cdot \vec{x}_i + b \ge 1$ if $y_i = 1$ and $\vec{w} \cdot \vec{x}_i + b \le -1$ if $y_i = -1$.

  12. Maximizing the Margin
      • Closest points $\vec{x}_+$ and $\vec{x}_-$ (with $\vec{w} \cdot \vec{x}_\pm + b = \pm 1$)
      • Distance between the hyperplanes $\vec{w} \cdot \vec{x} + b = \pm 1$:
        $\dfrac{(\vec{w} \cdot \vec{x}_+ + b) - (\vec{w} \cdot \vec{x}_- + b)}{\|\vec{w}\|} = \dfrac{2}{\|\vec{w}\|} = \dfrac{2}{\sqrt{\vec{w} \cdot \vec{w}}}$
      • $\max_{\vec{w}, b} \dfrac{2}{\sqrt{\vec{w} \cdot \vec{w}}} \;\equiv\; \min_{\vec{w}, b} \dfrac{1}{2}(\vec{w} \cdot \vec{w})$
      Basic (Primal) Support Vector Machine Form
      target: $\min_{\vec{w}, b} \; \frac{1}{2}(\vec{w} \cdot \vec{w})$
      subject to: $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 \quad (i = 1, \ldots, k)$
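
      As a concrete illustration of this basic primal form, here is a sketch that solves it directly with a general-purpose optimizer on a hypothetical, linearly separable toy data set; a real SVM solver works on the dual form introduced later, and the data is made up only for illustration:

      ```python
      import numpy as np
      from scipy.optimize import minimize

      # Hypothetical, linearly separable toy data in R^2
      X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
      y = np.array([1, 1, -1, -1])

      def objective(v):
          w = v[:2]                      # v = (w1, w2, b)
          return 0.5 * np.dot(w, w)      # target: (1/2) (w . w)

      constraints = [                     # y_i (w . x_i + b) - 1 >= 0 for every i
          {"type": "ineq", "fun": lambda v, i=i: y[i] * (np.dot(v[:2], X[i]) + v[2]) - 1}
          for i in range(len(X))
      ]

      res = minimize(objective, x0=np.zeros(3), constraints=constraints)
      w, b = res.x[:2], res.x[2]
      print("w =", w.round(3), "b =", round(b, 3), "margin =", round(2 / np.linalg.norm(w), 3))
      ```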

  13. Non-separable Data
      Problem: maybe a linear separating hyperplane does not exist!
      Solution: allow training errors $\xi_i$, penalized by a large penalty parameter $C$.
      Standard (Primal) Support Vector Machine Form
      target: $\min_{\vec{w}, b, \vec{\xi}} \; \frac{1}{2}(\vec{w} \cdot \vec{w}) + C \left( \sum_{i=1}^{k} \xi_i \right)$
      subject to: $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (i = 1, \ldots, k)$
      If $\xi_i > 1$, then $\vec{x}_i$ is misclassified.

  14. Higher Dimensional Feature Spaces Problem Data not separable because target function is essentially nonlinear! Approach Potentially separable in higher dimensional space • Map input vectors nonlinearly into high dimensional space (feature space) • Perform separation there

  15. Higher Dimensional Feature Spaces
      Literature
      • Classic approach [Cover 65]
      • “Kernel trick” [Boser, Guyon, Vapnik 92]
      • Extension to soft margin [Cortes, Vapnik 95]
      Example (cf. [Lin 01]): mapping $\phi$ from $\mathbb{R}^3$ into the feature space $\mathbb{R}^{10}$:
      $\phi(\vec{x}) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_3, x_1^2, x_2^2, x_3^2, \sqrt{2} x_1 x_2, \sqrt{2} x_1 x_3, \sqrt{2} x_2 x_3)$
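
      A quick numeric check of this map, and of the identity $\phi(\vec{x}_i) \cdot \phi(\vec{x}_j) = (1 + \vec{x}_i \cdot \vec{x}_j)^2$ used on the kernel slide below; the two input vectors are arbitrary illustration values:

      ```python
      import numpy as np

      def phi(x):
          """Explicit feature map R^3 -> R^10 from the slide."""
          x1, x2, x3 = x
          s = np.sqrt(2)
          return np.array([1, s * x1, s * x2, s * x3, x1**2, x2**2, x3**2,
                           s * x1 * x2, s * x1 * x3, s * x2 * x3])

      x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])  # arbitrary vectors
      print(np.dot(phi(x), phi(z)))    # inner product computed in the feature space
      print((1 + np.dot(x, z)) ** 2)   # same value from the polynomial kernel (1 + x . z)^2
      ```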

  16. Adapted Standard Form
      Definition: Standard (Primal) Support Vector Machine Form
      target: $\min_{\vec{w}, b, \vec{\xi}} \; \frac{1}{2}(\vec{w} \cdot \vec{w}) + C \left( \sum_{i=1}^{k} \xi_i \right)$
      subject to: $y_i(\vec{w} \cdot \phi(\vec{x}_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (i = 1, \ldots, k)$
      Note that $\vec{w}$ is now a vector in a high dimensional space.

  17. How to Solve?
      Problem: find $\vec{w}$ and $b$ from the standard SVM form.
      Solution: solve via the Lagrangian dual [Bazaraa et al 93]:
      $\max_{\vec{\alpha} \ge 0, \vec{\pi} \ge 0} \left( \min_{\vec{w}, b, \vec{\xi}} L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\pi}) \right)$
      where
      $L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\pi}) = \dfrac{\vec{w} \cdot \vec{w}}{2} + C \sum_{i=1}^{k} \xi_i + \sum_{i=1}^{k} \alpha_i \bigl( 1 - \xi_i - y_i(\vec{w} \cdot \phi(\vec{x}_i) + b) \bigr) - \sum_{i=1}^{k} \pi_i \xi_i$

  18. Simplifying the Dual [Chen et al 03]
      Standard (Dual) Support Vector Machine Form
      target: $\min_{\vec{\alpha}} \; \frac{1}{2} \vec{\alpha}^T Q \vec{\alpha} - \sum_{i=1}^{k} \alpha_i$
      subject to: $\vec{y} \cdot \vec{\alpha} = 0, \quad 0 \le \alpha_i \le C \quad (i = 1, \ldots, k)$
      where $Q_{ij} = y_i y_j \bigl( \phi(\vec{x}_i) \cdot \phi(\vec{x}_j) \bigr)$
      Solution: we obtain $\vec{w}$ as $\vec{w} = \sum_{i=1}^{k} \alpha_i y_i \phi(\vec{x}_i)$
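
      A sketch of this dual form solved with a general-purpose optimizer on the same hypothetical toy data as above, using the identity map as $\phi$; the value of $C$ is an arbitrary choice, and dedicated solvers such as libSVM use decomposition methods instead (see slide 24):

      ```python
      import numpy as np
      from scipy.optimize import minimize

      # Hypothetical toy data; phi is the identity map, so phi(x_i) . phi(x_j) = x_i . x_j
      X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
      y = np.array([1.0, 1.0, -1.0, -1.0])
      k, C = len(X), 10.0                            # C chosen arbitrarily

      Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ij = y_i y_j (x_i . x_j)

      def dual_objective(a):
          return 0.5 * a @ Q @ a - a.sum()           # (1/2) a^T Q a - sum_i a_i

      res = minimize(dual_objective, x0=np.zeros(k), bounds=[(0.0, C)] * k,
                     constraints=[{"type": "eq", "fun": lambda a: y @ a}])
      alpha = res.x
      w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
      sv = np.flatnonzero(alpha > 1e-6)               # support vectors have alpha_i > 0
      b = y[sv[0]] - w @ X[sv[0]]                     # from w . x_i + b = y_i at a margin support vector
      print("alpha =", alpha.round(3), "w =", w.round(3), "b =", round(b, 3))
      ```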

  19. Where is the Benefit?
      • $\vec{\alpha} \in \mathbb{R}^k$ (dimension independent of the feature space)
      • Only inner products in feature space
      Kernel Trick
      • Inner products efficiently calculated on input vectors via a kernel $K$ with $K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j)$
      • Select an appropriate feature space
      • Avoid the nonlinear transformation into the feature space
      • Benefit from the better separation properties of the feature space

  20. Kernels
      Example: mapping into feature space $\phi \colon \mathbb{R}^3 \to \mathbb{R}^{10}$, $\phi(\vec{x}) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \ldots, \sqrt{2} x_2 x_3)$
      Kernel: $K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j) = (1 + \vec{x}_i \cdot \vec{x}_j)^2$
      Popular Kernels
      • Gaussian Radial Basis Function: $g(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \| \vec{x}_i - \vec{x}_j \|^2)$ (feature space is an infinite dimensional Hilbert space)
      • Polynomial: $g(\vec{x}_i, \vec{x}_j) = (\vec{x}_i \cdot \vec{x}_j + 1)^d$
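
      Both popular kernels are one-liners on the input vectors. A short sketch; the parameter values γ = 0.5 and d = 2 are illustrative choices, not from the slides:

      ```python
      import numpy as np

      def rbf_kernel(xi, xj, gamma=0.5):
          """Gaussian RBF kernel: exp(-gamma * ||xi - xj||^2)."""
          diff = xi - xj
          return np.exp(-gamma * np.dot(diff, diff))

      def polynomial_kernel(xi, xj, d=2):
          """Polynomial kernel: (xi . xj + 1)^d."""
          return (np.dot(xi, xj) + 1) ** d

      xi, xj = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])  # arbitrary vectors
      print(rbf_kernel(xi, xj), polynomial_kernel(xi, xj))
      ```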

  21. The Decision Function
      Observation
      • No need for $\vec{w}$, because
        $f(\vec{x}) = \operatorname{sgn}\bigl( \vec{w} \cdot \phi(\vec{x}) + b \bigr) = \operatorname{sgn}\left( \sum_{i=1}^{k} \alpha_i y_i \bigl( \phi(\vec{x}_i) \cdot \phi(\vec{x}) \bigr) + b \right)$
      • Uses only those $\vec{x}_i$ (the support vectors) where $\alpha_i > 0$
      Few points, the borderline points, determine the separation.
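
      A sketch of this kernelized decision function (not from the slides); it assumes the α values, support vectors, and b come from a dual solve like the one above, and any of the kernels above can be passed in:

      ```python
      import numpy as np

      def decision_function(x, support_vectors, alphas, labels, b, kernel):
          """f(x) = sgn( sum_i alpha_i y_i K(x_i, x) + b ), summed over the support vectors only."""
          total = sum(a * yi * kernel(xi, x)
                      for a, yi, xi in zip(alphas, labels, support_vectors))
          return int(np.sign(total + b))
      ```

      With the toy dual solution sketched earlier, for example, decision_function(x, X[sv], alpha[sv], y[sv], b, np.dot) reproduces the same linear classifier.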

  22. Support Vectors

  23. Support Vector Machines
      Definition
      • Given: kernel $K$ and training set $S$
      • Goal: decision function $f$
      target: $\min_{\vec{\alpha}} \; \frac{1}{2} \vec{\alpha}^T Q \vec{\alpha} - \sum_{i=1}^{k} \alpha_i$ where $Q_{ij} = y_i y_j K(\vec{x}_i, \vec{x}_j)$
      subject to: $\vec{y} \cdot \vec{\alpha} = 0, \quad 0 \le \alpha_i \le C \quad (i = 1, \ldots, k)$
      decide: $f(\vec{x}) = \operatorname{sgn}\left( \sum_{i=1}^{k} \alpha_i y_i K(\vec{x}_i, \vec{x}) + b \right)$

  24. Quadratic Programming
      • Suppose $Q$ (a $k \times k$ matrix) is fully dense
      • 70,000 training points → 70,000 variables
      • 70,000² · 4 B ≈ 19 GB: a huge problem
      • Traditional methods (Newton, quasi-Newton) cannot be directly applied
      • Current methods:
        • Decomposition [Osuna et al 97], [Joachims 98], [Platt 98]
        • Nearest point of two convex hulls [Keerthi et al 99]
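
      (Check of the memory estimate, assuming one 4-byte entry per matrix element as the slide states: $70{,}000^2 \cdot 4\,\mathrm{B} = 4.9 \times 10^{9} \cdot 4\,\mathrm{B} = 1.96 \times 10^{10}\,\mathrm{B} \approx 19.6\,\mathrm{GB}$.)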

  25. Sample Implementation
      www.kernel-machines.org
      • Main forum on kernel machines
      • Lists over 250 active researchers
      • 43 competing implementations
      libSVM [Chang, Lin 06]
      • Supports binary and multi-class classification and regression
      • Beginners’ guide to SVM classification
      • “Out of the box” system (automatic data scaling, parameter selection)
      • Won the EUNITE and IJCNN challenges
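
      For context, a minimal sketch of training a kernel SVM in Python via scikit-learn, whose SVC class wraps libSVM; the synthetic data and the parameter values C and gamma are illustrative, not taken from the slides:

      ```python
      import numpy as np
      from sklearn.svm import SVC

      # Synthetic, nonlinearly separable toy data: label depends on the distance from the origin
      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 2))
      y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

      clf = SVC(kernel="rbf", C=10.0, gamma=0.5)   # Gaussian RBF kernel (slide 20)
      clf.fit(X, y)
      print("number of support vectors:", len(clf.support_vectors_))
      print("training accuracy:", clf.score(X, y))
      ```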

  26. Application Accuracy (automatic training using libSVM)
      Application      Training Data   Features   Classes   Accuracy
      Astroparticle    3,089           4          2         96.9%
      Bioinformatics   391             20         3         85.2%
      Vehicle          1,243           21         2         87.8%

  27. References Books • Statistical Learning Theory (Vapnik). Wiley, 1998 • Advances in Kernel Methods—Support Vector Learning (Schölkopf, Burges, Smola). MIT Press, 1999 • An Introduction to Support Vector Machines (Cristianini, Shawe-Taylor). Cambridge Univ., 2000 • Support Vector Machines—Theory and Applications (Wang). Springer, 2005

  28. References Seminal Papers • A training algorithm for optimal margin classifiers (Boser, Guyon, Vapnik). COLT’92, ACM Press. • Support vector networks (Cortes, Vapnik). Machine Learning 20, 1995 • Fast training of support vector machines using sequential minimal optimization (Platt). In Advances in Kernel Methods , MIT Press, 1999 • Improvements to Platt’s SMO algorithm for SVM classifier design (Keerthi, Shevade, Bhattacharyya, Murthy). Technical Report, 1999
