  1. Introduction to Support Vector Machines Andreas Maletti Technische Universität Dresden Fakultät Informatik June 15, 2006

  2. 1 The Problem 2 The Basics 3 The Proposed Solution

  3. Learning by Machines
     • Rote Learning: memorization (Hash tables)
     • Reinforcement: feedback at end (Q [Watkins 89])
     • Clustering: grouping data (CMLIB [Hartigan 75])
     • Induction: generalizing examples (ID3 [Quinlan 79])
     • Analogy: representation similarity (JUPA [Yvon 94])
     • Discovery: unsupervised, no goal
     • Genetic Alg.: simulated evolution (GABIL [DeJong 93])

  4. Supervised Learning
     Definition. Supervised Learning: given nontrivial training data (labels known), predict test data (labels unknown)
     Implementations
     • Rote Learning: Hash tables
     • Clustering: Nearest Neighbor [Cover, Hart 67]
     • Induction: Neural Networks [McCulloch, Pitts 43], Decision Trees [Hunt 66], SVMs [Vapnik et al 92]

  5. Problem Description—General Problem Classify a given input • binary classification: two classes • multi-class classification: several, but finitely many classes • regression: infinitely many classes Major Applications • Handwriting recognition • Cheminformatics (Quantitative Structure-Activity Relationship) • Pattern recognition • Spam detection (HP Labs, Palo Alto)

  6. Problem Description—Specific Electricity Load Prediction Challenge 2001 • Power plant that supports energy demand of a region • Excess production expensive • Load varies substantially • Challenge won by libSVM [Chang, Lin 06] Problem • given: load and temperature for 730 days ( ≈ 70kB data) • predict: load for the next 365 days

  7. Example Data: plot of the 1997 load (two curves, 12:00 and 24:00) against the day of the year; the load axis ranges from about 400 to 850.

  8. Problem Description—Formal Definition (cf. [Lin 01])
     Given a training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$ of correctly classified input data vectors $\vec{x} \in \mathbb{R}^n$, where
     • every input data vector appears at most once in $S$
     • there exist input data vectors $\vec{p}$ and $\vec{n}$ such that $(\vec{p}, 1) \in S$ as well as $(\vec{n}, -1) \in S$ (non-trivial),
     successfully classify unseen input data vectors.

  9. Linear Classification [Vapnik 63]
     • Given: a training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$
     • Goal: find a hyperplane that separates $\mathbb{R}^n$ into halves that contain only elements of one class

  10. Representation of the Hyperplane
      Definition (hyperplane): $\vec{n} \cdot (\vec{x} - \vec{x}_0) = 0$
      • $\vec{n} \in \mathbb{R}^n$: weight vector
      • $\vec{x} \in \mathbb{R}^n$: input vector
      • $\vec{x}_0 \in \mathbb{R}^n$: offset
      Alternatively: $\vec{w} \cdot \vec{x} + b = 0$
      Decision Function
      • training set $S = \{(\vec{x}_i, y_i) \mid 1 \le i \le k\}$
      • separating hyperplane $\vec{w} \cdot \vec{x} + b = 0$ for $S$
      Decision: $\vec{w} \cdot \vec{x}_i + b > 0$ if $y_i = 1$ and $\vec{w} \cdot \vec{x}_i + b < 0$ if $y_i = -1$, hence $f(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x} + b)$.
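
      The decision rule above reduces to a sign test. Below is a minimal sketch in Python (not from the slides); the weight vector and offset are hypothetical values chosen only for illustration:

      ```python
      import numpy as np

      def linear_decision(w, b, x):
          """f(x) = sgn(w . x + b): classify x with the separating hyperplane w . x + b = 0."""
          return 1 if np.dot(w, x) + b > 0 else -1

      # Hypothetical hyperplane x1 + x2 - 1 = 0 in R^2
      w, b = np.array([1.0, 1.0]), -1.0
      print(linear_decision(w, b, np.array([2.0, 2.0])))   # +1 (positive side)
      print(linear_decision(w, b, np.array([0.0, 0.0])))   # -1 (negative side)
      ```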

  11. Learn Hyperplane
      Problem
      • Given: training set $S$
      • Goal: coefficients $\vec{w}$ and $b$ of a separating hyperplane
      • Difficulty: several or no candidates for $\vec{w}$ and $b$
      Solution (cf. Vapnik's statistical learning theory)
      Select admissible $\vec{w}$ and $b$ with maximal margin (minimal distance to any input data vector).
      Observation
      We can scale $\vec{w}$ and $b$ such that $\vec{w} \cdot \vec{x}_i + b \ge 1$ if $y_i = 1$ and $\vec{w} \cdot \vec{x}_i + b \le -1$ if $y_i = -1$.

  12. Maximizing the Margin
      • Closest points $\vec{x}_+$ and $\vec{x}_-$ (with $\vec{w} \cdot \vec{x}_\pm + b = \pm 1$)
      • Distance between the hyperplanes $\vec{w} \cdot \vec{x} + b = \pm 1$:
        $\dfrac{(\vec{w} \cdot \vec{x}_+ + b) - (\vec{w} \cdot \vec{x}_- + b)}{\|\vec{w}\|} = \dfrac{2}{\|\vec{w}\|} = \dfrac{2}{\sqrt{\vec{w} \cdot \vec{w}}}$
      • $\max_{\vec{w}, b} \dfrac{2}{\sqrt{\vec{w} \cdot \vec{w}}} \;\equiv\; \min_{\vec{w}, b} \dfrac{1}{2}(\vec{w} \cdot \vec{w})$
      Basic (Primal) Support Vector Machine Form
      target: $\min_{\vec{w}, b} \; \frac{1}{2}(\vec{w} \cdot \vec{w})$
      subject to: $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 \quad (i = 1, \ldots, k)$
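
      As a concrete illustration of this basic primal form, here is a sketch that solves it directly with a general-purpose optimizer on a hypothetical, linearly separable toy data set; a real SVM solver works on the dual form introduced later, and the data is made up only for illustration:

      ```python
      import numpy as np
      from scipy.optimize import minimize

      # Hypothetical, linearly separable toy data in R^2
      X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
      y = np.array([1, 1, -1, -1])

      def objective(v):
          w = v[:2]                      # v = (w1, w2, b)
          return 0.5 * np.dot(w, w)      # target: (1/2) (w . w)

      constraints = [                     # y_i (w . x_i + b) - 1 >= 0 for every i
          {"type": "ineq", "fun": lambda v, i=i: y[i] * (np.dot(v[:2], X[i]) + v[2]) - 1}
          for i in range(len(X))
      ]

      res = minimize(objective, x0=np.zeros(3), constraints=constraints)
      w, b = res.x[:2], res.x[2]
      print("w =", w.round(3), "b =", round(b, 3), "margin =", round(2 / np.linalg.norm(w), 3))
      ```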

  13. Non-separable Data
      Problem: maybe a linear separating hyperplane does not exist!
      Solution: allow training errors $\xi_i$, penalized by a large penalty parameter $C$.
      Standard (Primal) Support Vector Machine Form
      target: $\min_{\vec{w}, b, \vec{\xi}} \; \frac{1}{2}(\vec{w} \cdot \vec{w}) + C \left( \sum_{i=1}^{k} \xi_i \right)$
      subject to: $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (i = 1, \ldots, k)$
      If $\xi_i > 1$, then $\vec{x}_i$ is misclassified.

  14. Higher Dimensional Feature Spaces Problem Data not separable because target function is essentially nonlinear! Approach Potentially separable in higher dimensional space • Map input vectors nonlinearly into high dimensional space (feature space) • Perform separation there

  15. Higher Dimensional Feature Spaces
      Literature
      • Classic approach [Cover 65]
      • “Kernel trick” [Boser, Guyon, Vapnik 92]
      • Extension to soft margin [Cortes, Vapnik 95]
      Example (cf. [Lin 01]): mapping $\phi$ from $\mathbb{R}^3$ into the feature space $\mathbb{R}^{10}$:
      $\phi(\vec{x}) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_3, x_1^2, x_2^2, x_3^2, \sqrt{2} x_1 x_2, \sqrt{2} x_1 x_3, \sqrt{2} x_2 x_3)$
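
      A quick numeric check of this map, and of the identity $\phi(\vec{x}_i) \cdot \phi(\vec{x}_j) = (1 + \vec{x}_i \cdot \vec{x}_j)^2$ used on the kernel slide below; the two input vectors are arbitrary illustration values:

      ```python
      import numpy as np

      def phi(x):
          """Explicit feature map R^3 -> R^10 from the slide."""
          x1, x2, x3 = x
          s = np.sqrt(2)
          return np.array([1, s * x1, s * x2, s * x3, x1**2, x2**2, x3**2,
                           s * x1 * x2, s * x1 * x3, s * x2 * x3])

      x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])  # arbitrary vectors
      print(np.dot(phi(x), phi(z)))    # inner product computed in the feature space
      print((1 + np.dot(x, z)) ** 2)   # same value from the polynomial kernel (1 + x . z)^2
      ```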

  16. Adapted Standard Form
      Definition: Standard (Primal) Support Vector Machine Form
      target: $\min_{\vec{w}, b, \vec{\xi}} \; \frac{1}{2}(\vec{w} \cdot \vec{w}) + C \left( \sum_{i=1}^{k} \xi_i \right)$
      subject to: $y_i(\vec{w} \cdot \phi(\vec{x}_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (i = 1, \ldots, k)$
      Note that $\vec{w}$ is now a vector in a high dimensional space.

  17. How to Solve?
      Problem: find $\vec{w}$ and $b$ from the standard SVM form.
      Solution: solve via the Lagrangian dual [Bazaraa et al 93]:
      $\max_{\vec{\alpha} \ge 0, \vec{\pi} \ge 0} \left( \min_{\vec{w}, b, \vec{\xi}} L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\pi}) \right)$
      where
      $L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\pi}) = \dfrac{\vec{w} \cdot \vec{w}}{2} + C \sum_{i=1}^{k} \xi_i + \sum_{i=1}^{k} \alpha_i \bigl( 1 - \xi_i - y_i(\vec{w} \cdot \phi(\vec{x}_i) + b) \bigr) - \sum_{i=1}^{k} \pi_i \xi_i$

  18. Simplifying the Dual [Chen et al 03]
      Standard (Dual) Support Vector Machine Form
      target: $\min_{\vec{\alpha}} \; \frac{1}{2} \vec{\alpha}^T Q \vec{\alpha} - \sum_{i=1}^{k} \alpha_i$
      subject to: $\vec{y} \cdot \vec{\alpha} = 0, \quad 0 \le \alpha_i \le C \quad (i = 1, \ldots, k)$
      where $Q_{ij} = y_i y_j \bigl( \phi(\vec{x}_i) \cdot \phi(\vec{x}_j) \bigr)$
      Solution: we obtain $\vec{w}$ as $\vec{w} = \sum_{i=1}^{k} \alpha_i y_i \phi(\vec{x}_i)$
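
      A sketch of this dual form solved with a general-purpose optimizer on the same hypothetical toy data as above, using the identity map as $\phi$; the value of $C$ is an arbitrary choice, and dedicated solvers such as libSVM use decomposition methods instead (see slide 24):

      ```python
      import numpy as np
      from scipy.optimize import minimize

      # Hypothetical toy data; phi is the identity map, so phi(x_i) . phi(x_j) = x_i . x_j
      X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
      y = np.array([1.0, 1.0, -1.0, -1.0])
      k, C = len(X), 10.0                            # C chosen arbitrarily

      Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ij = y_i y_j (x_i . x_j)

      def dual_objective(a):
          return 0.5 * a @ Q @ a - a.sum()           # (1/2) a^T Q a - sum_i a_i

      res = minimize(dual_objective, x0=np.zeros(k), bounds=[(0.0, C)] * k,
                     constraints=[{"type": "eq", "fun": lambda a: y @ a}])
      alpha = res.x
      w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
      sv = np.flatnonzero(alpha > 1e-6)               # support vectors have alpha_i > 0
      b = y[sv[0]] - w @ X[sv[0]]                     # from w . x_i + b = y_i at a margin support vector
      print("alpha =", alpha.round(3), "w =", w.round(3), "b =", round(b, 3))
      ```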

  19. Where is the Benefit?
      • $\vec{\alpha} \in \mathbb{R}^k$ (dimension independent of the feature space)
      • Only inner products in feature space
      Kernel Trick
      • Inner products efficiently calculated on input vectors via a kernel $K$ with $K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j)$
      • Select an appropriate feature space
      • Avoid the nonlinear transformation into the feature space
      • Benefit from the better separation properties of the feature space

  20. Kernels
      Example: mapping into feature space $\phi \colon \mathbb{R}^3 \to \mathbb{R}^{10}$, $\phi(\vec{x}) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \ldots, \sqrt{2} x_2 x_3)$
      Kernel: $K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j) = (1 + \vec{x}_i \cdot \vec{x}_j)^2$
      Popular Kernels
      • Gaussian Radial Basis Function: $g(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \| \vec{x}_i - \vec{x}_j \|^2)$ (feature space is an infinite dimensional Hilbert space)
      • Polynomial: $g(\vec{x}_i, \vec{x}_j) = (\vec{x}_i \cdot \vec{x}_j + 1)^d$
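
      Both popular kernels are one-liners on the input vectors. A short sketch; the parameter values γ = 0.5 and d = 2 are illustrative choices, not from the slides:

      ```python
      import numpy as np

      def rbf_kernel(xi, xj, gamma=0.5):
          """Gaussian RBF kernel: exp(-gamma * ||xi - xj||^2)."""
          diff = xi - xj
          return np.exp(-gamma * np.dot(diff, diff))

      def polynomial_kernel(xi, xj, d=2):
          """Polynomial kernel: (xi . xj + 1)^d."""
          return (np.dot(xi, xj) + 1) ** d

      xi, xj = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])  # arbitrary vectors
      print(rbf_kernel(xi, xj), polynomial_kernel(xi, xj))
      ```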

  21. The Decision Function
      Observation
      • No need for $\vec{w}$, because
        $f(\vec{x}) = \operatorname{sgn}\bigl( \vec{w} \cdot \phi(\vec{x}) + b \bigr) = \operatorname{sgn}\left( \sum_{i=1}^{k} \alpha_i y_i \bigl( \phi(\vec{x}_i) \cdot \phi(\vec{x}) \bigr) + b \right)$
      • Uses only those $\vec{x}_i$ (the support vectors) where $\alpha_i > 0$
      Few points, the borderline points, determine the separation.
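
      A sketch of this kernelized decision function (not from the slides); it assumes the α values, support vectors, and b come from a dual solve like the one above, and any of the kernels above can be passed in:

      ```python
      import numpy as np

      def decision_function(x, support_vectors, alphas, labels, b, kernel):
          """f(x) = sgn( sum_i alpha_i y_i K(x_i, x) + b ), summed over the support vectors only."""
          total = sum(a * yi * kernel(xi, x)
                      for a, yi, xi in zip(alphas, labels, support_vectors))
          return int(np.sign(total + b))
      ```

      With the toy dual solution sketched earlier, for example, decision_function(x, X[sv], alpha[sv], y[sv], b, np.dot) reproduces the same linear classifier.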

  22. Support Vectors

  23. Support Vector Machines
      Definition
      • Given: kernel $K$ and training set $S$
      • Goal: decision function $f$
      target: $\min_{\vec{\alpha}} \; \frac{1}{2} \vec{\alpha}^T Q \vec{\alpha} - \sum_{i=1}^{k} \alpha_i$ where $Q_{ij} = y_i y_j K(\vec{x}_i, \vec{x}_j)$
      subject to: $\vec{y} \cdot \vec{\alpha} = 0, \quad 0 \le \alpha_i \le C \quad (i = 1, \ldots, k)$
      decide: $f(\vec{x}) = \operatorname{sgn}\left( \sum_{i=1}^{k} \alpha_i y_i K(\vec{x}_i, \vec{x}) + b \right)$

  24. Quadratic Programming
      • Suppose $Q$ (a $k \times k$ matrix) is fully dense
      • 70,000 training points → 70,000 variables
      • 70,000² · 4 B ≈ 19 GB: a huge problem
      • Traditional methods (Newton, quasi-Newton) cannot be directly applied
      • Current methods:
        • Decomposition [Osuna et al 97], [Joachims 98], [Platt 98]
        • Nearest point of two convex hulls [Keerthi et al 99]
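
      (Check of the memory estimate, assuming one 4-byte entry per matrix element as the slide states: $70{,}000^2 \cdot 4\,\mathrm{B} = 4.9 \times 10^{9} \cdot 4\,\mathrm{B} = 1.96 \times 10^{10}\,\mathrm{B} \approx 19.6\,\mathrm{GB}$.)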

  25. Sample Implementation
      www.kernel-machines.org
      • Main forum on kernel machines
      • Lists over 250 active researchers
      • 43 competing implementations
      libSVM [Chang, Lin 06]
      • Supports binary and multi-class classification and regression
      • Beginners’ guide to SVM classification
      • “Out of the box” system (automatic data scaling, parameter selection)
      • Won the EUNITE and IJCNN challenges
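
      For context, a minimal sketch of training a kernel SVM in Python via scikit-learn, whose SVC class wraps libSVM; the synthetic data and the parameter values C and gamma are illustrative, not taken from the slides:

      ```python
      import numpy as np
      from sklearn.svm import SVC

      # Synthetic, nonlinearly separable toy data: label depends on the distance from the origin
      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 2))
      y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

      clf = SVC(kernel="rbf", C=10.0, gamma=0.5)   # Gaussian RBF kernel (slide 20)
      clf.fit(X, y)
      print("number of support vectors:", len(clf.support_vectors_))
      print("training accuracy:", clf.score(X, y))
      ```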

  26. Application Accuracy (automatic training using libSVM)
      Application      Training Data   Features   Classes   Accuracy
      Astroparticle    3,089           4          2         96.9%
      Bioinformatics   391             20         3         85.2%
      Vehicle          1,243           21         2         87.8%

  27. References Books • Statistical Learning Theory (Vapnik). Wiley, 1998 • Advances in Kernel Methods—Support Vector Learning (Schölkopf, Burges, Smola). MIT Press, 1999 • An Introduction to Support Vector Machines (Cristianini, Shawe-Taylor). Cambridge Univ., 2000 • Support Vector Machines—Theory and Applications (Wang). Springer, 2005

  28. References Seminal Papers • A training algorithm for optimal margin classifiers (Boser, Guyon, Vapnik). COLT’92, ACM Press. • Support vector networks (Cortes, Vapnik). Machine Learning 20, 1995 • Fast training of support vector machines using sequential minimal optimization (Platt). In Advances in Kernel Methods , MIT Press, 1999 • Improvements to Platt’s SMO algorithm for SVM classifier design (Keerthi, Shevade, Bhattacharyya, Murthy). Technical Report, 1999
