Statistical Learning Theory and Support Vector Machines

Gert Cauwenberghs
Johns Hopkins University
gert@jhu.edu

520.776 Learning on Silicon
http://bach.ece.jhu.edu/gert/courses/776
OUTLINE

• Introduction to Statistical Learning Theory
  – VC Dimension, Margin and Generalization
  – Support Vectors
  – Kernels
• Cost Functions and Dual Formulation
  – Classification
  – Regression
  – Probability Estimation
• Implementation: Practical Considerations
  – Sparsity
  – Incremental Learning
• Hybrid SVM-HMM MAP Sequence Estimation
  – Forward Decoding Kernel Machines (FDKM)
  – Phoneme Sequence Recognition (TIMIT)
Generalization and Complexity

– Generalization is the key to supervised learning, for classification or regression.
– Statistical Learning Theory offers a principled approach to understanding and
  controlling generalization performance.
  • The complexity of the hypothesis class of functions determines generalization
    performance.
  • Complexity relates to the effective number of function parameters, but effective
    control of the margin yields low complexity even for an infinite number of
    parameters.
VC Dimension and Generalization Performance
Vapnik and Chervonenkis, 1974

– For a discrete hypothesis space H of functions, with probability $1-\delta$:

  $E[y \neq f(\mathbf{x})] \;\le\; \frac{1}{m}\sum_{i=1}^{m}[y_i \neq f(\mathbf{x}_i)] \;+\; \sqrt{\tfrac{1}{2m}\ln\tfrac{2|H|}{\delta}}$

  (generalization error ≤ empirical (training) error + complexity term)

  where $f = \arg\min_{f \in H} \sum_{i=1}^{m}[y_i \neq f(\mathbf{x}_i)]$ minimizes the empirical error over the
  m training samples $\{\mathbf{x}_i, y_i\}$, and $|H|$ is the cardinality of H.

– For a continuous hypothesis function space H, with probability $1-\delta$:

  $E[y \neq f(\mathbf{x})] \;\le\; \frac{1}{m}\sum_{i=1}^{m}[y_i \neq f(\mathbf{x}_i)] \;+\; \sqrt{\tfrac{c}{m}\left(d + \ln\tfrac{1}{\delta}\right)}$

  where d is the VC dimension of H: the largest number of points $\mathbf{x}_i$ that can be
  completely "shattered" (separated in all possible combinations) by elements of H.

– For linear classifiers $f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}\cdot\mathbf{x} + b)$ in N dimensions, the VC dimension
  equals the number of parameters, N + 1.
– For linear classifiers with margin $\rho$ over a domain contained within diameter D,
  the VC dimension is bounded by $(D/\rho)^2$.
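As a numeric illustration, a small Python sketch that evaluates the two bounds above. This is a sketch only: it assumes the constants as reconstructed above, and treats the unspecified constant c in the VC-dimension bound as a user-supplied parameter.

```python
import numpy as np

def finite_class_bound(train_err, m, card_H, delta=0.05):
    """Bound for a discrete hypothesis space of cardinality |H|."""
    return train_err + np.sqrt(np.log(2 * card_H / delta) / (2 * m))

def vc_bound(train_err, m, d, delta=0.05, c=1.0):
    """Bound for a continuous hypothesis space with VC dimension d (constant c assumed)."""
    return train_err + np.sqrt(c / m * (d + np.log(1 / delta)))

# Example: 1% training error on m = 10,000 samples.
print(finite_class_bound(0.01, 10_000, card_H=1_000_000))   # ~0.04
print(vc_bound(0.01, 10_000, d=50))                          # ~0.08
```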
Learning to Classify Linearly Separable Data

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$
Optimal Margin Separating Hyperplane
Vapnik and Lerner, 1963
Vapnik and Chervonenkis, 1974

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$

Classifier:   $y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
Constraints:  $y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1$
Objective:    $\min_{\mathbf{w},b} \|\mathbf{w}\|$    (margin $= 1/\|\mathbf{w}\|$)
Support Vectors
Boser, Guyon and Vapnik, 1992

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$

$y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1$
$\min_{\mathbf{w},b} \|\mathbf{w}\|$

– support vectors:
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) = 1, \quad i \in S$
$\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{X}_i$
Support Vector Machine (SVM)
Boser, Guyon and Vapnik, 1992

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$

$y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1$
$\min_{\mathbf{w},b} \|\mathbf{w}\|$

– support vectors:
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) = 1, \quad i \in S$
$\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{X}_i$
$y = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, \mathbf{X}_i\cdot\mathbf{X} + b\right)$
Soft Margin SVM
Cortes and Vapnik, 1995

– vectors $\mathbf{X}_i$
– labels $y_i = \pm 1$

$y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
$\min_{\mathbf{w},b}\;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \left[1 - y_i(\mathbf{w}\cdot\mathbf{X}_i + b)\right]_+$

– support vectors (margin and error vectors):
$y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \le 1, \quad i \in S$
$\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{X}_i$
$y = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, \mathbf{X}_i\cdot\mathbf{X} + b\right)$
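As an aside, the soft-margin objective above is straightforward to evaluate directly; a minimal numpy sketch (function and variable names are illustrative):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i [1 - y_i (w . x_i + b)]_+  (hinge loss)."""
    z = y * (X @ w + b)                  # margins z_i = y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - z)     # zero for points beyond the margin
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```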
Kernel Machines
Mercer, 1909; Aizerman et al., 1964
Boser, Guyon and Vapnik, 1992

Feature map $\Phi(\cdot)$:
$\mathbf{X}_i = \Phi(\mathbf{x}_i), \qquad \mathbf{X} = \Phi(\mathbf{x}), \qquad \mathbf{X}_i\cdot\mathbf{X} = \Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x})$

$y = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, \Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}) + b\right) = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, \mathbf{X}_i\cdot\mathbf{X} + b\right)$

Mercer's condition: $K(\mathbf{x}_i, \mathbf{x}) = \Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x})$

$y = \mathrm{sign}\left(\sum_{i \in S} \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b\right)$
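A minimal sketch of the kernelized decision rule, assuming the support vectors, their coefficients α_i, labels y_i, and bias b have already been obtained from training; any Mercer kernel (such as those on the next slide) can be passed in:

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    """y = sign( sum_{i in S} alpha_i y_i K(x_i, x) + b )."""
    s = sum(a_i * y_i * kernel(x_i, x)
            for x_i, a_i, y_i in zip(support_vectors, alphas, labels))
    return np.sign(s + b)
```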
Some Valid Kernels
Boser, Guyon and Vapnik, 1992

– Polynomial (splines, etc.):
  $K(\mathbf{x}_i, \mathbf{x}) = (1 + \mathbf{x}_i\cdot\mathbf{x})^{\nu}$
– Gaussian (radial basis function networks):
  $K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{x}\|^2}{2\sigma^2}\right)$
– Sigmoid (two-layer perceptron):
  $K(\mathbf{x}_i, \mathbf{x}) = \tanh(\mathbf{x}_i\cdot\mathbf{x} + L)$    (a valid kernel only for certain L)

[Figure: the kernel machine drawn as a two-layer network — kernels $k(\mathbf{x}_i, \mathbf{x})$
weighted by $\alpha_i y_i$, summed, and passed through a sign function.]
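The three kernels above, written out as plain functions (ν, σ, and L are free parameters; as noted, the sigmoid kernel satisfies Mercer's condition only for certain L):

```python
import numpy as np

def polynomial_kernel(xi, x, nu=3):
    """K(x_i, x) = (1 + x_i . x)^nu"""
    return (1.0 + np.dot(xi, x)) ** nu

def gaussian_kernel(xi, x, sigma=1.0):
    """K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, x, L=-1.0):
    """K(x_i, x) = tanh(x_i . x + L) -- a valid kernel only for certain L."""
    return np.tanh(np.dot(xi, x) + L)
```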
Other Ways to Arrive at Kernels…

• Smoothness constraints in non-parametric regression [Wahba, ≪1999]
  – Splines are radially symmetric kernels.
  – The smoothness constraint in the Fourier domain relates directly to the (Fourier
    transform of the) kernel.
• Reproducing Kernel Hilbert Spaces (RKHS) [Poggio, 1990]
  – The class of functions $f(\mathbf{x}) = \sum_i c_i\,\varphi_i(\mathbf{x})$ with orthogonal basis $\varphi_i(\mathbf{x})$ forms a
    reproducing kernel Hilbert space.
  – Regularization by minimizing the norm over the Hilbert space yields a kernel
    expansion similar to that of SVMs.
• Gaussian processes [MacKay, 1998]
  – A Gaussian prior on the Hilbert coefficients yields a Gaussian posterior on the
    output, with covariance given by kernels in input space.
  – Bayesian inference predicts the output label distribution for a new input vector
    given the old (training) input vectors and output labels.
Gaussian Processes
Neal, 1994; MacKay, 1998; Opper and Winther, 2000

– Bayes: posterior ∝ evidence × prior
  $P(\mathbf{w} \mid y, \mathbf{x}) \propto P(y \mid \mathbf{x}, \mathbf{w})\, P(\mathbf{w})$
– Hilbert space expansion, with additive white noise:
  $y = f(\mathbf{x}) + n = \sum_i w_i\,\varphi_i(\mathbf{x}) + n$
– A uniform Gaussian prior on the Hilbert coefficients,
  $P(\mathbf{w}) = N(0, \sigma_w^2 I),$
  yields a Gaussian posterior on the output,
  $P(y \mid \mathbf{x}) = N(0, C), \qquad C = Q + \sigma_\nu^2 I,$
  with kernel covariance
  $Q_{nm} = \sigma_w^2 \sum_i \varphi_i(\mathbf{x}_n)\,\varphi_i(\mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m).$
– Incremental learning can proceed directly through recursive computation of the
  inverse covariance (using a matrix inversion lemma).
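A minimal Gaussian-process sketch following the slide: build the covariance $C = Q + \sigma_\nu^2 I$ from a kernel and form the Gaussian predictive distribution for a new input. A direct linear solve is used for clarity; the recursive inverse-covariance update via the matrix inversion lemma is not shown.

```python
import numpy as np

def gp_predict(X, y, x_new, kernel, noise_var=0.1):
    """Predictive mean and variance at x_new under a zero-mean GP prior."""
    Q = np.array([[kernel(a, b) for b in X] for a in X])  # kernel covariance Q_nm
    C = Q + noise_var * np.eye(len(X))                    # C = Q + sigma_nu^2 * I
    k_star = np.array([kernel(a, x_new) for a in X])      # covariance with new input
    mean = k_star @ np.linalg.solve(C, y)
    var = kernel(x_new, x_new) + noise_var - k_star @ np.linalg.solve(C, k_star)
    return mean, var
```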
Kernel Machines: A General Framework

$y = f(\mathbf{w}\cdot\Phi(\mathbf{x}) + b) = f(\mathbf{w}\cdot\mathbf{X} + b)$

$\min_{\mathbf{w},b}:\;\; \varepsilon = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i g(z_i)$

    Structural Risk + Empirical Risk   (SVMs)
    Smoothness + Fidelity              (Regularization Networks)
    Log Prior + Log Evidence           (Gaussian Processes)

– $g(\cdot)$: convex cost function
– $z_i$: "margin" of each data point
  Classification: $z_i = y_i(\mathbf{w}\cdot\mathbf{X}_i + b)$
  Regression:     $z_i = y_i - (\mathbf{w}\cdot\mathbf{X}_i + b)$
Optimality Conditions

$\min_{\mathbf{w},b}:\;\; \varepsilon = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i g(z_i), \qquad z_i = y_i(\mathbf{w}\cdot\mathbf{X}_i + b)$   (classification)

– First-order conditions:
  $\frac{d\varepsilon}{d\mathbf{w}} \equiv 0:\quad \mathbf{w} = -C\sum_i g'(z_i)\, y_i \mathbf{X}_i = \sum_i \alpha_i y_i \mathbf{X}_i$
  $\frac{d\varepsilon}{db} \equiv 0:\quad 0 = -C\sum_i g'(z_i)\, y_i = \sum_i \alpha_i y_i$
  with
  $\alpha_i = -C\,g'(z_i)$
  $z_i = \sum_j Q_{ij}\alpha_j + b\,y_i, \qquad Q_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

– Sparsity: $\alpha_i = 0$ requires $g'(z_i) = 0$.
Sparsity

Soft-Margin SVM Classification:
  $y = \mathrm{sign}(\mathbf{w}\cdot\mathbf{X} + b)$
  $g(z_i) = \left[1 - y_i(\mathbf{w}\cdot\mathbf{X}_i + b)\right]_+$
  $\alpha_i = 0$ for $z_i > 1$ (points beyond the margin drop out: sparse solution)

Logistic Probability Regression:
  $\Pr(y \mid \mathbf{X}) = \left(1 + e^{-y(\mathbf{w}\cdot\mathbf{X} + b)}\right)^{-1}$
  $g(z_i) = \log\left(1 + e^{-y_i(\mathbf{w}\cdot\mathbf{X}_i + b)}\right)$
  $\alpha_i > 0$ for all $i$ (no sparsity)
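A short sketch contrasting the two costs: the hinge derivative $g'(z)$ vanishes for $z > 1$, so the corresponding $\alpha_i = -C\,g'(z_i)$ are exactly zero there, while the logistic derivative never vanishes, so every training point retains a nonzero coefficient (the margin values below are illustrative):

```python
import numpy as np

def hinge_grad(z):       # g'(z) = -1 for z < 1, 0 for z > 1 (kink at z = 1)
    return np.where(z < 1.0, -1.0, 0.0)

def logistic_grad(z):    # g'(z) = -1 / (1 + exp(z)), never exactly zero
    return -1.0 / (1.0 + np.exp(z))

z = np.array([-1.0, 0.5, 2.0])   # margins of three hypothetical training points
C = 10.0
print(-C * hinge_grad(z))        # [10. 10.  0.]  -> sparse: third point drops out
print(-C * logistic_grad(z))     # all strictly positive -> no sparsity
```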
Dual Formulation (Legendre Transformation)

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{X}_i, \qquad \alpha_i = -C\,g'(z_i)$
$z_i = \sum_j Q_{ij}\alpha_j + b\,y_i, \qquad \sum_i \alpha_i y_i = 0$

Eliminating the unknowns $z_i$:
$z_i = \sum_j Q_{ij}\alpha_j + b\,y_i = g'^{-1}\!\left(-\tfrac{\alpha_i}{C}\right)$

yields the equivalent of the first-order conditions of a "dual" functional $\varepsilon_2$ to be
minimized in $\alpha_i$:

$\min_{\alpha_i}:\;\; \varepsilon_2 = \tfrac{1}{2}\sum_i\sum_j \alpha_i Q_{ij} \alpha_j - C\sum_i G\!\left(\tfrac{\alpha_i}{C}\right)$
subject to: $\sum_i \alpha_i y_i \equiv 0$

with Lagrange parameter $b$, and "potential function"
$G(u) = \int^{u} g'^{-1}(-v)\,dv$
Soft-Margin SVM Classification
Cortes and Vapnik, 1995

$\min_{\alpha_i}:\;\; \varepsilon_2 = \tfrac{1}{2}\sum_i\sum_j \alpha_i Q_{ij} \alpha_j - \sum_i \alpha_i$
subject to: $\sum_i \alpha_i y_i \equiv 0 \quad$ and $\quad 0 \le \alpha_i \le C, \;\; \forall i$
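To tie the pieces together, a small end-to-end sketch that solves the dual program above with a general-purpose constrained optimizer (SciPy's SLSQP) and then recovers b from a margin support vector. This is for illustration only, not an efficient SVM solver; names are illustrative, and y is assumed to be a float array of ±1 labels.

```python
import numpy as np
from scipy.optimize import minimize

def train_dual_svm(X, y, C=1.0, kernel=lambda a, b: a @ b):
    """Solve min 0.5 a'Qa - sum(a)  s.t.  a.y = 0,  0 <= a_i <= C."""
    m = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    Q = np.outer(y, y) * K                       # Q_ij = y_i y_j K(x_i, x_j)

    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
                   np.zeros(m),
                   jac=lambda a: Q @ a - 1.0,
                   bounds=[(0.0, C)] * m,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}],
                   method="SLSQP")
    alpha = res.x

    # Recover b from a margin support vector (0 < alpha_i < C), assuming one exists.
    i = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0][0]
    b = y[i] - (alpha * y) @ K[:, i]
    return alpha, b
```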