Statistical Learning Theory and Support Vector Machines

Gert Cauwenberghs
Johns Hopkins University
gert@jhu.edu

520.776 Learning on Silicon
http://bach.ece.jhu.edu/gert/courses/776
OUTLINE

• Introduction to Statistical Learning Theory
  – VC Dimension, Margin and Generalization
  – Support Vectors
  – Kernels
• Cost Functions and Dual Formulation
  – Classification
  – Regression
  – Probability Estimation
• Implementation: Practical Considerations
  – Sparsity
  – Incremental Learning
• Hybrid SVM-HMM MAP Sequence Estimation
  – Forward Decoding Kernel Machines (FDKM)
  – Phoneme Sequence Recognition (TIMIT)
Generalization and Complexity

– Generalization is the key to supervised learning, for classification or regression.
– Statistical Learning Theory offers a principled approach to understanding and
  controlling generalization performance.
  • The complexity of the hypothesis class of functions determines generalization
    performance.
  • Complexity relates to the effective number of function parameters, but effective
    control of the margin yields low complexity even for an infinite number of
    parameters.
VC Dimension and Generalization Performance
Vapnik and Chervonenkis, 1974

– For a discrete hypothesis space H of functions, with probability $1-\delta$:

  $E[y \neq f(\mathbf{x})] \;\le\; \frac{1}{m}\sum_{i=1}^{m}[y_i \neq f(\mathbf{x}_i)] \;+\; \sqrt{\tfrac{1}{2m}\ln\tfrac{2|H|}{\delta}}$

  (generalization error ≤ empirical (training) error + complexity term)

  where $f = \arg\min_{f \in H} \sum_{i=1}^{m}[y_i \neq f(\mathbf{x}_i)]$ minimizes the empirical error over the
  m training samples $\{\mathbf{x}_i, y_i\}$, and $|H|$ is the cardinality of H.

– For a continuous hypothesis function space H, with probability $1-\delta$:

  $E[y \neq f(\mathbf{x})] \;\le\; \frac{1}{m}\sum_{i=1}^{m}[y_i \neq f(\mathbf{x}_i)] \;+\; \sqrt{\tfrac{c}{m}\left(d + \ln\tfrac{1}{\delta}\right)}$

  where d is the VC dimension of H: the largest number of points $\mathbf{x}_i$ that can be
  completely "shattered" (separated in all possible combinations) by elements of H.

– For linear classifiers $f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}\cdot\mathbf{x} + b)$ in N dimensions, the VC dimension
  equals the number of parameters, N + 1.
– For linear classifiers with margin $\rho$ over a domain contained within diameter D,
  the VC dimension is bounded by $(D/\rho)^2$.
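As a numeric illustration, a small Python sketch that evaluates the two bounds above. This is a sketch only: it assumes the constants as reconstructed above, and treats the unspecified constant c in the VC-dimension bound as a user-supplied parameter.

```python
import numpy as np

def finite_class_bound(train_err, m, card_H, delta=0.05):
    """Bound for a discrete hypothesis space of cardinality |H|."""
    return train_err + np.sqrt(np.log(2 * card_H / delta) / (2 * m))

def vc_bound(train_err, m, d, delta=0.05, c=1.0):
    """Bound for a continuous hypothesis space with VC dimension d (constant c assumed)."""
    return train_err + np.sqrt(c / m * (d + np.log(1 / delta)))

# Example: 1% training error on m = 10,000 samples.
print(finite_class_bound(0.01, 10_000, card_H=1_000_000))   # ~0.04
print(vc_bound(0.01, 10_000, d=50))                          # ~0.08
```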
Learning to Classify Linearly Separable Data

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$
Optimal Margin Separating Hyperplane
Vapnik and Lerner, 1963
Vapnik and Chervonenkis, 1974

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$

Classifier:   $y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
Constraints:  $y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1$
Objective:    $\min_{\mathbf{w},b} \|\mathbf{w}\|$    (margin $= 1/\|\mathbf{w}\|$)
Support Vectors
Boser, Guyon and Vapnik, 1992

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$

$y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1$
$\min_{\mathbf{w},b} \|\mathbf{w}\|$

– support vectors:
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) = 1, \quad i \in S$
$\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{X}_i$
Support Vector Machine (SVM)
Boser, Guyon and Vapnik, 1992

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$

$y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1$
$\min_{\mathbf{w},b} \|\mathbf{w}\|$

– support vectors:
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) = 1, \quad i \in S$
$\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{X}_i$
$y = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, \mathbf{X}_i\cdot\mathbf{X} + b\right)$
Soft Margin SVM
Cortes and Vapnik, 1995

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$

$y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
$\min_{\mathbf{w},b}\;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \left[1 - y_i(\mathbf{w}\cdot\mathbf{X}_i + b)\right]_+$

– support vectors (margin and error vectors):
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \le 1, \quad i \in S$
$\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{X}_i$
$y = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, \mathbf{X}_i\cdot\mathbf{X} + b\right)$
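As an aside, the soft-margin objective above is straightforward to evaluate directly; a minimal numpy sketch (function and variable names are illustrative):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i [1 - y_i (w . x_i + b)]_+  (hinge loss)."""
    z = y * (X @ w + b)                  # margins z_i = y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - z)     # zero for points beyond the margin
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```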
Kernel Machines
Mercer, 1909; Aizerman et al., 1964
Boser, Guyon and Vapnik, 1992

Feature map $\Phi(\cdot)$:
$\mathbf{X}_i = \Phi(\mathbf{x}_i), \qquad \mathbf{X} = \Phi(\mathbf{x}), \qquad \mathbf{X}_i\cdot\mathbf{X} = \Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x})$

$y = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, \Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}) + b\right) = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, \mathbf{X}_i\cdot\mathbf{X} + b\right)$

Mercer's condition: $K(\mathbf{x}_i, \mathbf{x}) = \Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x})$

$y = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b\right)$
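A minimal sketch of the kernelized decision rule, assuming the support vectors, their coefficients α_i, labels y_i, and bias b have already been obtained from training; any Mercer kernel (such as those on the next slide) can be passed in:

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    """y = sign( sum_{i in S} alpha_i y_i K(x_i, x) + b )."""
    s = sum(a_i * y_i * kernel(x_i, x)
            for x_i, a_i, y_i in zip(support_vectors, alphas, labels))
    return np.sign(s + b)
```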
Some Valid Kernels
Boser, Guyon and Vapnik, 1992

– Polynomial (splines, etc.):
  $K(\mathbf{x}_i, \mathbf{x}) = (1 + \mathbf{x}_i\cdot\mathbf{x})^{\nu}$
– Gaussian (radial basis function networks):
  $K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{x}\|^2}{2\sigma^2}\right)$
– Sigmoid (two-layer perceptron):
  $K(\mathbf{x}_i, \mathbf{x}) = \tanh(\mathbf{x}_i\cdot\mathbf{x} + L)$    (a valid kernel only for certain L)

[Figure: the kernel machine drawn as a two-layer network — kernels $k(\mathbf{x}_i, \mathbf{x})$
weighted by $\alpha_i y_i$, summed, and passed through a sign function.]
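The three kernels above, written out as plain functions (ν, σ, and L are free parameters; as noted, the sigmoid kernel satisfies Mercer's condition only for certain L):

```python
import numpy as np

def polynomial_kernel(xi, x, nu=3):
    """K(x_i, x) = (1 + x_i . x)^nu"""
    return (1.0 + np.dot(xi, x)) ** nu

def gaussian_kernel(xi, x, sigma=1.0):
    """K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, x, L=-1.0):
    """K(x_i, x) = tanh(x_i . x + L) -- a valid kernel only for certain L."""
    return np.tanh(np.dot(xi, x) + L)
```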
Other Ways to Arrive at Kernels…

• Smoothness constraints in non-parametric regression [Wahba, ≪1999]
  – Splines are radially symmetric kernels.
  – The smoothness constraint in the Fourier domain relates directly to the (Fourier
    transform of the) kernel.
• Reproducing Kernel Hilbert Spaces (RKHS) [Poggio, 1990]
  – The class of functions $f(\mathbf{x}) = \sum_i c_i\,\varphi_i(\mathbf{x})$ with orthogonal basis $\varphi_i(\mathbf{x})$ forms a
    reproducing kernel Hilbert space.
  – Regularization by minimizing the norm over the Hilbert space yields a kernel
    expansion similar to that of SVMs.
• Gaussian processes [MacKay, 1998]
  – A Gaussian prior on the Hilbert coefficients yields a Gaussian posterior on the
    output, with covariance given by kernels in input space.
  – Bayesian inference predicts the output label distribution for a new input vector
    given the old (training) input vectors and output labels.
Gaussian Processes
Neal, 1994; MacKay, 1998; Opper and Winther, 2000

– Bayes: posterior ∝ evidence × prior
  $P(\mathbf{w} \mid y, \mathbf{x}) \propto P(y \mid \mathbf{x}, \mathbf{w})\, P(\mathbf{w})$
– Hilbert space expansion, with additive white noise:
  $y = f(\mathbf{x}) + n = \sum_i w_i\,\varphi_i(\mathbf{x}) + n$
– A uniform Gaussian prior on the Hilbert coefficients,
  $P(\mathbf{w}) = N(0, \sigma_w^2 I),$
  yields a Gaussian posterior on the output,
  $P(y \mid \mathbf{x}) = N(0, C), \qquad C = Q + \sigma_\nu^2 I,$
  with kernel covariance
  $Q_{nm} = \sigma_w^2 \sum_i \varphi_i(\mathbf{x}_n)\,\varphi_i(\mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m).$
– Incremental learning can proceed directly through recursive computation of the
  inverse covariance (using a matrix inversion lemma).
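A minimal Gaussian-process sketch following the slide: build the covariance $C = Q + \sigma_\nu^2 I$ from a kernel and form the Gaussian predictive distribution for a new input. A direct linear solve is used for clarity; the recursive inverse-covariance update via the matrix inversion lemma is not shown.

```python
import numpy as np

def gp_predict(X, y, x_new, kernel, noise_var=0.1):
    """Predictive mean and variance at x_new under a zero-mean GP prior."""
    Q = np.array([[kernel(a, b) for b in X] for a in X])  # kernel covariance Q_nm
    C = Q + noise_var * np.eye(len(X))                    # C = Q + sigma_nu^2 * I
    k_star = np.array([kernel(a, x_new) for a in X])      # covariance with new input
    mean = k_star @ np.linalg.solve(C, y)
    var = kernel(x_new, x_new) + noise_var - k_star @ np.linalg.solve(C, k_star)
    return mean, var
```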
Kernel Machines: A General Framework

$y = f(\mathbf{w}\cdot\Phi(\mathbf{x}) + b) = f(\mathbf{w}\cdot\mathbf{X} + b)$

$\min_{\mathbf{w},b}:\;\; \varepsilon = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i g(z_i)$

    Structural Risk + Empirical Risk   (SVMs)
    Smoothness + Fidelity              (Regularization Networks)
    Log Prior + Log Evidence           (Gaussian Processes)

– $g(\cdot)$: convex cost function
– $z_i$: "margin" of each data point
  Classification: $z_i = y_i(\mathbf{w}\cdot\mathbf{X}_i + b)$
  Regression:     $z_i = y_i - (\mathbf{w}\cdot\mathbf{X}_i + b)$
Optimality Conditions

$\min_{\mathbf{w},b}:\;\; \varepsilon = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i g(z_i), \qquad z_i = y_i(\mathbf{w}\cdot\mathbf{X}_i + b)$   (classification)

– First-order conditions:
  $\frac{d\varepsilon}{d\mathbf{w}} \equiv 0:\quad \mathbf{w} = -C\sum_i g'(z_i)\, y_i \mathbf{X}_i = \sum_i \alpha_i y_i \mathbf{X}_i$
  $\frac{d\varepsilon}{db} \equiv 0:\quad 0 = -C\sum_i g'(z_i)\, y_i = \sum_i \alpha_i y_i$
  with
  $\alpha_i = -C\,g'(z_i)$
  $z_i = \sum_j Q_{ij}\alpha_j + b\,y_i, \qquad Q_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

– Sparsity: $\alpha_i = 0$ requires $g'(z_i) = 0$.
Sparsity

Soft-Margin SVM Classification:
  $y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
  $g(z_i) = \left[1 - y_i(\mathbf{w}\cdot\mathbf{X}_i + b)\right]_+$
  $\alpha_i = 0$ for $z_i > 1$ (points beyond the margin drop out: sparse solution)

Logistic Probability Regression:
  $\Pr(y \mid \mathbf{X}) = \left(1 + e^{-y(\mathbf{w}\cdot\mathbf{X} + b)}\right)^{-1}$
  $g(z_i) = \log\left(1 + e^{-y_i(\mathbf{w}\cdot\mathbf{X}_i + b)}\right)$
  $\alpha_i > 0$ for all $i$ (no sparsity)
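A short sketch contrasting the two costs: the hinge derivative $g'(z)$ vanishes for $z > 1$, so the corresponding $\alpha_i = -C\,g'(z_i)$ are exactly zero there, while the logistic derivative never vanishes, so every training point retains a nonzero coefficient (the margin values below are illustrative):

```python
import numpy as np

def hinge_grad(z):       # g'(z) = -1 for z < 1, 0 for z > 1 (kink at z = 1)
    return np.where(z < 1.0, -1.0, 0.0)

def logistic_grad(z):    # g'(z) = -1 / (1 + exp(z)), never exactly zero
    return -1.0 / (1.0 + np.exp(z))

z = np.array([-1.0, 0.5, 2.0])   # margins of three hypothetical training points
C = 10.0
print(-C * hinge_grad(z))        # [10. 10.  0.]  -> sparse: third point drops out
print(-C * logistic_grad(z))     # all strictly positive -> no sparsity
```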
Dual Formulation (Legendre Transformation)

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{X}_i, \qquad \alpha_i = -C\,g'(z_i)$
$z_i = \sum_j Q_{ij}\alpha_j + b\,y_i, \qquad \sum_i \alpha_i y_i = 0$

Eliminating the unknowns $z_i$:
$z_i = \sum_j Q_{ij}\alpha_j + b\,y_i = g'^{-1}\!\left(-\tfrac{\alpha_i}{C}\right)$

yields the equivalent of the first-order conditions of a "dual" functional $\varepsilon_2$ to be
minimized in $\alpha_i$:

$\min_{\alpha_i}:\;\; \varepsilon_2 = \tfrac{1}{2}\sum_i\sum_j \alpha_i Q_{ij} \alpha_j - C\sum_i G\!\left(\tfrac{\alpha_i}{C}\right)$
subject to: $\sum_i \alpha_i y_i \equiv 0$

with Lagrange parameter $b$, and "potential function"
$G(u) = \int^{u} g'^{-1}(-v)\,dv$
Soft-Margin SVM Classification
Cortes and Vapnik, 1995

$\min_{\alpha_i}:\;\; \varepsilon_2 = \tfrac{1}{2}\sum_i\sum_j \alpha_i Q_{ij} \alpha_j - \sum_i \alpha_i$
subject to: $\sum_i \alpha_i y_i \equiv 0 \quad$ and $\quad 0 \le \alpha_i \le C, \;\; \forall i$
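To tie the pieces together, a small end-to-end sketch that solves the dual program above with a general-purpose constrained optimizer (SciPy's SLSQP) and then recovers b from a margin support vector. This is for illustration only, not an efficient SVM solver; names are illustrative, and y is assumed to be a float array of ±1 labels.

```python
import numpy as np
from scipy.optimize import minimize

def train_dual_svm(X, y, C=1.0, kernel=lambda a, b: a @ b):
    """Solve min 0.5 a'Qa - sum(a)  s.t.  a.y = 0,  0 <= a_i <= C."""
    m = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    Q = np.outer(y, y) * K                       # Q_ij = y_i y_j K(x_i, x_j)

    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
                   np.zeros(m),
                   jac=lambda a: Q @ a - 1.0,
                   bounds=[(0.0, C)] * m,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}],
                   method="SLSQP")
    alpha = res.x

    # Recover b from a margin support vector (0 < alpha_i < C), assuming one exists.
    i = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0][0]
    b = y[i] - (alpha * y) @ K[:, i]
    return alpha, b
```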