Statistical Learning Theory and Support Vector Machines


  1. Statistical Learning Theory and Support Vector Machines
     Gert Cauwenberghs
     Johns Hopkins University
     gert@jhu.edu
     520.776 Learning on Silicon
     http://bach.ece.jhu.edu/gert/courses/776

  2. Statistical Learning Theory and Support Vector Machines
     OUTLINE
     • Introduction to Statistical Learning Theory
       – VC Dimension, Margin and Generalization
       – Support Vectors
       – Kernels
     • Cost Functions and Dual Formulation
       – Classification
       – Regression
       – Probability Estimation
     • Implementation: Practical Considerations
       – Sparsity
       – Incremental Learning
     • Hybrid SVM-HMM MAP Sequence Estimation
       – Forward Decoding Kernel Machines (FDKM)
       – Phoneme Sequence Recognition (TIMIT)

  3. Generalization and Complexity
     – Generalization is the key to supervised learning, for classification or regression.
     – Statistical Learning Theory offers a principled approach to understanding and controlling generalization performance.
       • The complexity of the hypothesis class of functions determines generalization performance.
       • Complexity relates to the effective number of function parameters, but effective control of the margin yields low complexity even for an infinite number of parameters.

  4. VC Dimension and Generalization Performance
     Vapnik and Chervonenkis, 1974
     – For a discrete hypothesis space H of functions, with probability 1 − δ:
         E[y ≠ f(x)]  ≤  (1/m) Σ_{i=1..m} (y_i ≠ f(x_i))  +  sqrt( ln(2|H|/δ) / (2m) )
         (generalization error ≤ empirical (training) error + complexity term)
       where f = argmin_{f ∈ H} Σ_{i=1..m} (y_i ≠ f(x_i)) minimizes the empirical error over the m training samples {x_i, y_i}, and |H| is the cardinality of H.
     – For a continuous hypothesis function space H, with probability 1 − δ:
         E[y ≠ f(x)]  ≤  (1/m) Σ_{i=1..m} (y_i ≠ f(x_i))  +  sqrt( (c/m) ( d ln m + ln(1/δ) ) )
       where d is the VC dimension of H, the largest number of points x_i completely “shattered” (separated in all possible combinations) by elements of H.
     – For linear classifiers f(x) = sgn(w · x + b) in N dimensions, the VC dimension is the number of parameters, N + 1.
     – For linear classifiers with margin ρ over a domain contained within diameter D, the VC dimension is bounded by (D/ρ)².
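As a rough illustration of the finite-|H| bound reconstructed above, the short Python sketch below evaluates the complexity term sqrt(ln(2|H|/δ)/(2m)) for a few sample sizes; the values of |H|, m and δ are made-up examples, not numbers from the slides.

# Illustrative only: evaluate the complexity term of the finite-|H| bound,
#   sqrt( ln(2|H|/delta) / (2m) ),
# for made-up values of |H|, m and delta.
import math

def complexity_term(card_H, m, delta):
    """Gap allowed between generalization and training error."""
    return math.sqrt(math.log(2 * card_H / delta) / (2 * m))

for m in (100, 1000, 10000):
    gap = complexity_term(card_H=1e6, m=m, delta=0.05)
    print(f"m = {m:6d}  ->  bound on error gap ~ {gap:.3f}")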

  5. Learning to Classify Linearly Separable Data
     – vectors X_i
     – labels y_i = ±1

  6. Optimal Margin Separating Hyperplane
     Vapnik and Lerner, 1963
     Vapnik and Chervonenkis, 1974
     – vectors X_i, labels y_i = ±1
     – classifier: y = sign(w · X + b)
     – margin constraints: y_i (w · X_i + b) ≥ 1
     – objective (maximize the margin 1/||w||):  min_{w,b} ||w||

  7. Support Vectors
     Boser, Guyon and Vapnik, 1992
     – vectors X_i, labels y_i = ±1
     – classifier: y = sign(w · X + b)
     – constraints: y_i (w · X_i + b) ≥ 1;  objective: min_{w,b} ||w||
     – support vectors: y_i (w · X_i + b) = 1, i ∈ S
     – weight expansion: w = Σ_{i∈S} α_i y_i X_i

  8. Support Vector Machine (SVM)
     Boser, Guyon and Vapnik, 1992
     – vectors X_i, labels y_i = ±1
     – classifier: y = sign(w · X + b);  constraints: y_i (w · X_i + b) ≥ 1;  objective: min_{w,b} ||w||
     – support vectors: y_i (w · X_i + b) = 1, i ∈ S
     – expansion: w = Σ_{i∈S} α_i y_i X_i, so that  y = sign( Σ_{i∈S} α_i y_i X_i · X + b )
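A minimal sketch of the support-vector expansion, assuming scikit-learn is available (it is not part of the slides): fit a linear SVM on toy data and check that the weight vector equals the weighted sum over the support vectors only, w = Σ_{i∈S} α_i y_i X_i.

# Sketch (scikit-learn assumed): the separating hyperplane is a weighted
# sum of the support vectors only,  w = sum_{i in S} alpha_i y_i X_i.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)   # toy data
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors i in S
w_from_sv = clf.dual_coef_ @ clf.support_vectors_
print("number of support vectors:", len(clf.support_))
print("w (sklearn)      :", clf.coef_.ravel())
print("w (sum over S)   :", w_from_sv.ravel())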

  9. Soft Margin SVM
     Cortes and Vapnik, 1995
     – vectors X_i, labels y_i = ±1
     – classifier: y = sign(w · X + b)
     – objective:  min_{w,b}  (1/2)||w||² + C Σ_i [1 − y_i (w · X_i + b)]_+
     – support vectors (margin and error vectors): y_i (w · X_i + b) ≤ 1, i ∈ S
     – expansion: w = Σ_{i∈S} α_i y_i X_i, so that  y = sign( Σ_{i∈S} α_i y_i X_i · X + b )
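The sketch below (again assuming scikit-learn, with an arbitrary toy dataset and C = 1) evaluates the soft-margin objective (1/2)||w||² + C Σ_i [1 − y_i(w · X_i + b)]_+ for a fitted model and counts the margin and error vectors, i.e. the points with y_i(w · X_i + b) ≤ 1.

# Sketch (scikit-learn assumed): evaluate the soft-margin primal objective
# and flag the margin/error vectors y_i (w.X_i + b) <= 1.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=1)
y = 2 * y01 - 1                      # relabel to +/-1 as on the slide
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

w, b = clf.coef_.ravel(), clf.intercept_[0]
margins = y * (X @ w + b)            # z_i = y_i (w.X_i + b)
hinge = np.maximum(0.0, 1.0 - margins)
objective = 0.5 * w @ w + C * hinge.sum()

print("primal objective                 :", objective)
print("margin/error vectors (z_i <= 1)  :", int(np.sum(margins <= 1 + 1e-6)))
print("support vectors reported by SVC  :", len(clf.support_))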

  10. Kernel Machines
     Mercer, 1909; Aizerman et al., 1964
     Boser, Guyon and Vapnik, 1992
     – feature map Φ(·): x → X = Φ(x), X_i = Φ(x_i), so that  X_i · X = Φ(x_i) · Φ(x)
     – classifier: y = sign( Σ_{i∈S} α_i y_i Φ(x_i) · Φ(x) + b ) = sign( Σ_{i∈S} α_i y_i X_i · X + b )
     – kernel (Mercer’s condition): K(x_i, x) = Φ(x_i) · Φ(x)
     – kernel expansion: y = sign( Σ_{i∈S} α_i y_i K(x_i, x) + b )
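A small sketch of the kernel expansion, assuming scikit-learn and a Gaussian kernel with an arbitrary width: the decision function of a trained kernel SVM is recomputed by hand from kernel evaluations against the support vectors, f(x) = Σ_{i∈S} α_i y_i K(x_i, x) + b, and compared with the library's output.

# Sketch (scikit-learn assumed): the decision function of a kernel machine
# is computed entirely from kernel evaluations against the support vectors.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

x_new = X[:5]                                  # a few query points
# Gaussian kernel K(x_i, x) = exp(-gamma ||x_i - x||^2)
d2 = ((clf.support_vectors_[:, None, :] - x_new[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * d2)                        # |S| x 5 kernel matrix

f_manual = clf.dual_coef_ @ K + clf.intercept_ # sum_i alpha_i y_i K(x_i,x) + b
print("manual f(x) :", f_manual.ravel())
print("sklearn f(x):", clf.decision_function(x_new))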

  11. Some Valid Kernels
     Boser, Guyon and Vapnik, 1992
     – Polynomial (splines etc.):  K(x_i, x) = (1 + x_i · x)^ν
     – Gaussian (radial basis function networks):  K(x_i, x) = exp( −||x − x_i||² / (2σ²) )
     – Sigmoid (two-layer perceptron):  K(x_i, x) = tanh(L + x_i · x), only for certain L
     [Diagram: two-layer network view of the kernel expansion — kernel units k(x, x_i) weighted by α_i y_i feed a sign output unit]
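The three kernels listed above can be written in a few lines of numpy, as sketched below; x and x_i are 1-D feature vectors and the parameter values ν, σ and L are arbitrary illustrative choices.

# Minimal numpy sketch of the three kernels listed above.
import numpy as np

def poly_kernel(xi, x, nu=3):
    return (1.0 + xi @ x) ** nu

def gaussian_kernel(xi, x, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, x, L=1.0):
    # positive semi-definite only for certain L (and input scalings)
    return np.tanh(L + xi @ x)

xi, x = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(xi, x), gaussian_kernel(xi, x), sigmoid_kernel(xi, x))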

  12. Other Ways to Arrive at Kernels…
     • Smoothness constraints in non-parametric regression [Wahba <<1999]
       – Splines are radially symmetric kernels.
       – A smoothness constraint in the Fourier domain relates directly to (the Fourier transform of) a kernel.
     • Reproducing Kernel Hilbert Spaces (RKHS) [Poggio 1990]
       – The class of functions f(x) = Σ_i c_i φ_i(x) with orthogonal basis φ_i(x) forms a reproducing kernel Hilbert space.
       – Regularization by minimizing the norm over the Hilbert space yields a kernel expansion similar to that of SVMs.
     • Gaussian processes [MacKay 1998]
       – A Gaussian prior on the Hilbert coefficients yields a Gaussian posterior on the output, with covariance given by kernels in input space.
       – Bayesian inference predicts the output label distribution for a new input vector given old (training) input vectors and output labels.

  13. Gaussian Processes
     Neal, 1994; MacKay, 1998; Opper and Winther, 2000
     – Bayes (posterior ∝ evidence × prior):  P(w | y, x) ∝ P(y | x, w) P(w)
     – Hilbert space expansion, with additive white noise:  y = f(x) + n = Σ_i w_i φ_i(x) + n
     – A uniform Gaussian prior on the Hilbert coefficients,  P(w) = N(0, σ_w² I),
       yields a Gaussian distribution on the output:  P(y | x) = N(0, C),  with  C = Q + σ_ν² I
       and kernel covariance  Q_nm = Σ_i σ_w² φ_i(x_n) φ_i(x_m) = k(x_n, x_m).
     – Incremental learning can proceed directly through recursive computation of the inverse covariance (using a matrix inversion lemma).
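A numpy-only sketch of Gaussian-process prediction built from the covariance C = Q + σ_ν² I on the slide; the RBF kernel, the sine training data, and the noise level are illustrative choices rather than anything specified in the slides.

# Sketch (numpy only) of GP regression with C = Q + sigma_nu^2 I.
import numpy as np

def k(a, b, ell=1.0):
    # RBF kernel between two 1-D input arrays (illustrative choice)
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 20)
sigma_nu = 0.1
y_train = np.sin(x_train) + sigma_nu * rng.standard_normal(20)
x_test = np.array([-1.0, 0.0, 2.5])

Q = k(x_train, x_train)                    # kernel covariance on the inputs
C = Q + sigma_nu ** 2 * np.eye(20)         # C = Q + sigma_nu^2 I
K_star = k(x_test, x_train)                # cross-covariances

# Bayesian inference: Gaussian predictive distribution over test outputs
mean = K_star @ np.linalg.solve(C, y_train)
cov = k(x_test, x_test) - K_star @ np.linalg.solve(C, K_star.T)
print("predictive mean:", mean)
print("predictive std :", np.sqrt(np.diag(cov)))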

  14. Kernel Machines: A General Framework
     y = f( w · Φ(x) + b ) = f( w · X + b )
     min_{w,b}  ε = (1/2)||w||² + C Σ_i g(z_i)
       – structural risk + empirical risk (SVMs)
       – smoothness + fidelity (regularization networks)
       – log prior + log evidence (Gaussian processes)
     – g(·): convex cost function
     – z_i: “margin” of each data point
         classification:  z_i = y_i (w · X_i + b)
         regression:      z_i = y_i − (w · X_i + b)
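To make the framework concrete, the sketch below writes the unified objective ε = (1/2)||w||² + C Σ_i g(z_i) as a Python function with a pluggable convex cost g, using the hinge and logistic costs from the later slides; the data, w, b and C here are arbitrary.

# Sketch: unified objective  eps = (1/2)||w||^2 + C * sum_i g(z_i)
# with two example convex costs g (data and parameters are arbitrary).
import numpy as np

def hinge(z):               # soft-margin SVM cost
    return np.maximum(0.0, 1.0 - z)

def logistic(z):            # logistic-regression cost
    return np.log1p(np.exp(-z))

def objective(w, b, X, y, C, g):
    z = y * (X @ w + b)     # classification margin z_i = y_i (w.X_i + b)
    return 0.5 * w @ w + C * np.sum(g(z))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(50))
w, b = np.array([1.0, 0.0]), 0.0
print("hinge objective   :", objective(w, b, X, y, C=1.0, g=hinge))
print("logistic objective:", objective(w, b, X, y, C=1.0, g=logistic))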

  15. Optimality Conditions
     min_{w,b}  ε = (1/2)||w||² + C Σ_i g(z_i),   with  z_i = y_i (w · X_i + b)   (classification)
     – First-order conditions:
         dε/dw ≡ 0:   w = −C Σ_i g′(z_i) y_i X_i = Σ_i α_i y_i X_i
         dε/db ≡ 0:   0 = −C Σ_i g′(z_i) y_i = Σ_i α_i y_i
       with:  α_i = −C g′(z_i)
              z_i = Σ_j Q_ij α_j + b y_i
              Q_ij = y_i y_j K(x_i, x_j)
     – Sparsity:  α_i = 0 requires g′(z_i) = 0
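The first-order conditions can be checked numerically on a trained soft-margin SVM, as in the sketch below (scikit-learn assumed, toy data). For the hinge cost, α_i = −C g′(z_i) means α_i = C on error vectors (z_i < 1), 0 < α_i < C on the margin (z_i = 1), and α_i = 0 elsewhere, with Σ_i α_i y_i = 0.

# Sketch (scikit-learn assumed): check the first-order conditions for the
# hinge cost on a trained soft-margin SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=2)
y = 2 * y01 - 1
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

w, b = clf.coef_.ravel(), clf.intercept_[0]
z = y * (X @ w + b)                           # margins z_i

alpha = np.zeros(len(X))                      # recover alpha_i >= 0
alpha[clf.support_] = np.abs(clf.dual_coef_.ravel())

print("sum_i alpha_i y_i =", alpha @ y)       # should be ~0
print("alpha = C on error vectors (z < 1):", np.allclose(alpha[z < 1 - 1e-2], C))
print("alpha = 0 off the margin   (z > 1):", np.allclose(alpha[z > 1 + 1e-2], 0.0))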

  16. Sparsity
     – Soft-margin SVM classification:
         y = sign(w · X + b)
         g(z_i) = [1 − y_i (w · X_i + b)]_+ = [1 − z_i]_+
         α_i = 0 for z_i > 1;  α_i > 0 only for margin and error vectors (z_i ≤ 1), so the expansion is sparse.
     – Logistic probability regression:
         Pr(y | X) = ( 1 + e^{−y (w · X + b)} )^{−1}
         g(z_i) = log( 1 + e^{−y_i (w · X_i + b)} ) = log( 1 + e^{−z_i} )
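A tiny numpy sketch of why only the hinge cost gives sparsity: α_i = −C g′(z_i) vanishes exactly where g′(z_i) = 0, which happens for the hinge cost whenever z_i > 1 but never for the logistic cost. The grid of z values below is arbitrary.

# Sketch: g'(z) decides sparsity, since alpha_i = -C g'(z_i) = 0 requires
# g'(z_i) = 0; the hinge cost is flat for z > 1, the logistic cost never is.
import numpy as np

z = np.linspace(-2, 4, 13)
hinge_grad = np.where(z < 1.0, -1.0, 0.0)          # d/dz [1 - z]_+  (z != 1)
logistic_grad = -1.0 / (1.0 + np.exp(z))           # d/dz log(1 + e^-z)

print("z              :", z)
print("hinge    g'(z) :", hinge_grad)
print("logistic g'(z) :", np.round(logistic_grad, 3))
print("fraction with g'(z) = 0 (hinge)   :", np.mean(hinge_grad == 0.0))
print("fraction with g'(z) = 0 (logistic):", np.mean(logistic_grad == 0.0))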

  17. Dual Formulation (Legendre transformation)
     From the first-order conditions:
         w = Σ_i α_i y_i X_i,    α_i = −C g′(z_i)
         z_i = Σ_j Q_ij α_j + b y_i,    Σ_i α_i y_i = 0
     Eliminating the unknowns z_i:
         z_i = Σ_j Q_ij α_j + b y_i = g′⁻¹(−α_i / C)
     yields the equivalent of the first-order conditions of a “dual” functional ε₂ to be minimized in the α_i:
         min_{α_i}:  ε₂ = (1/2) Σ_i Σ_j α_i Q_ij α_j − C Σ_i G(α_i / C)
         subject to:  Σ_i α_i y_i ≡ 0
     with Lagrange parameter b, and “potential function”
         G(u) = ∫_0^u g′⁻¹(−v) dv

  18. Soft-Margin SVM Classification
     Cortes and Vapnik, 1995
     min_{α_i}:  ε₂ = (1/2) Σ_i Σ_j α_i Q_ij α_j − Σ_i α_i
     subject to:  Σ_i α_i y_i ≡ 0   and   0 ≤ α_i ≤ C, ∀i
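As a didactic check of the box-constrained dual above (not how SVMs are trained in practice), the sketch below solves it with a generic constrained optimizer, assuming scipy and scikit-learn are available and using a small arbitrary toy dataset, then recovers w = Σ_i α_i y_i X_i and compares it with scikit-learn's linear SVC.

# Sketch (scipy and scikit-learn assumed): solve the soft-margin dual QP
# directly and compare the recovered hyperplane with sklearn's SVC.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=40, centers=2, cluster_std=1.5, random_state=3)
y = (2 * y01 - 1).astype(float)
C = 1.0
Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ij = y_i y_j K(x_i, x_j)

def dual(alpha):                               # (1/2) a'Qa - sum_i a_i
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def dual_grad(alpha):
    return Q @ alpha - 1.0

res = minimize(dual, x0=np.zeros(len(y)), jac=dual_grad, method="SLSQP",
               bounds=[(0.0, C)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})

w_dual = (res.x * y) @ X                       # w = sum_i alpha_i y_i X_i
w_ref = SVC(kernel="linear", C=C).fit(X, y).coef_.ravel()
print("w from dual QP :", w_dual)
print("w from sklearn :", w_ref)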
