Introduction to Statistical Learning and Kernel Machines


  1. Introduction to Statistical Learning and Kernel Machines. Hichem SAHBI, CNRS, UPMC, June 2018.

  2. Outline
     - Introduction to Statistical Learning: definitions, probability tools, generalization bounds, machine learning algorithms.
     - Kernel Machines, supervised and unsupervised learning: the Representer Theorem, supervised learning (support vector machines and regression), kernel design (kernel combination, CDK kernels, ...), unsupervised learning (kernel PCA and CCA).


  4. Sections:
     1. The Representer Theorem
     2. Supervised Learning (SVMs and SVRs)
     3. Kernel Design
     4. Unsupervised Learning (Kernel PCA and CCA)

  5. Pattern Recognition Problems
     Given a pattern (observation) X ∈ 𝒳, the goal is to predict the unknown label Y of X.
     - Character recognition (OCR): X is an image, Y is a letter.
     - Face detection (resp. recognition): X is an image, Y indicates the presence of a face in the picture (resp. the identity).
     - Text classification: X is a text, Y is a category (topic, spam/non-spam, ...).
     - Medical diagnosis: X is a set of features (age, genome, ...), Y is the risk.

  6. Section 1: The Representer Theorem

  7. Regularization, Kernel Methods and the Representer Theorem
     min_{g ∈ 𝒢} R_n(g) + λ Ω(g),  λ ≥ 0   (Tikhonov)
     For a particular regularizer Ω(g) and class 𝒢, the solution of the above problem (Kimeldorf and Wahba, 1971) is
     g_α(·) = Σ_{i=1}^n α_i k(·, X_i),
     where {(X_i, Y_i)}_{i=1}^n ⊆ 𝒳 × 𝒴 is a fixed training set and k is a kernel: symmetric, continuous on 𝒳 × 𝒳 and positive definite (Mercer, 1909), with k(X, X') = ⟨Φ(X), Φ(X')⟩.
     Typical empirical risks:
     - R_n(g) = (1/n) Σ_{i=1}^n (g(X_i) − Y_i)²   (kernel regression),
     - R_n(g) = (1/2n) Σ_{i=1}^n (sign[g(X_i)] − Y_i)²   (e.g., max-margin classifier, SVMs).
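To make the theorem concrete, here is a minimal sketch (my own illustration, not from the slides) of kernel regression with the squared loss above; assuming the regularizer Ω(g) is the squared RKHS norm (i.e., kernel ridge regression), the coefficients of g_α(·) = Σ_i α_i k(·, X_i) are obtained by solving the linear system (K + nλI)α = Y.

```python
# A minimal sketch (assumption: squared-loss risk with a squared-RKHS-norm
# regularizer, i.e. kernel ridge regression) of the representer theorem in action.
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 6.0, 40))               # toy 1-D inputs
Y = np.sin(X) + 0.1 * rng.normal(size=40)            # noisy targets
sigma, lam = 0.5, 1e-2                               # kernel scale and lambda

K = np.exp(-(X[:, None] - X[None, :]) ** 2 / sigma**2)         # Gram matrix K_ij = k(X_i, X_j)
alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), Y)  # (K + n*lambda*I) alpha = Y

def g(x):
    # g_alpha(x) = sum_i alpha_i k(x, X_i), exactly the form given by the theorem.
    return alpha @ np.exp(-(X - x) ** 2 / sigma**2)

print("g(1.0) =", round(float(g(1.0)), 3), " sin(1.0) =", round(float(np.sin(1.0)), 3))
```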

  8. Section 2: Supervised Learning (SVMs and SVRs)

  9. Support Vector Machines

  10. Support Vector Machines (Large Margin Classifiers)
      Let {(X_1, Y_1), ..., (X_n, Y_n)} be a training set generated i.i.d.
      [Figure: separating hyperplane w'x + b = 0 with margin hyperplanes w'x + b = ±1; the margin width is 2/‖w‖.]
      Primal problem:
      min_{w,b} (1/2) w'w   s.t.  Y_i (w'X_i + b) − 1 ≥ 0, ∀i.
      Optimality conditions lead to w = Σ_i α_i Y_i X_i and the dual form
      max_{α} Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j Y_i Y_j ⟨X_i, X_j⟩   s.t.  α_i ≥ 0, ∀i  and  Σ_i α_i Y_i = 0.

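As an illustration (my own sketch, not the author's code), the dual above is a small quadratic program that a generic constrained solver can handle; the example below uses scipy's SLSQP method on a made-up, linearly separable 2-D dataset and then recovers w and b from the optimal multipliers.

```python
# A minimal sketch (assumption: scipy available; toy data invented for illustration):
# solving the hard-margin SVM dual with a generic constrained optimizer.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])  # toy points
Y = np.array([1.0, 1.0, -1.0, -1.0])                                # toy labels
n = len(Y)

G = (Y[:, None] * Y[None, :]) * (X @ X.T)    # G_ij = Y_i Y_j <X_i, X_j>

def neg_dual(alpha):
    # Negate the dual objective because scipy minimizes.
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, np.zeros(n),
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ Y}])  # sum_i alpha_i Y_i = 0
alpha = res.x

w = (alpha * Y) @ X                           # w = sum_i alpha_i Y_i X_i
sv = int(np.argmax(alpha))                    # index of a support vector (alpha_i > 0)
b = Y[sv] - w @ X[sv]                         # from Y_i (w'X_i + b) = 1
print("w =", w, " b =", round(float(b), 3), " alpha =", np.round(alpha, 3))
```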

  12. VC Dimension of Large Margin Classifiers
      The set of hyperplane classifiers with margin at least M has a VC dimension upper bounded by h ≤ r²/M², where r is the radius of the smallest sphere containing all the patterns X.

  13. Interpretation of Lagrange Multipliers
      - α_i > 0 implies Y_i(w'X_i + b) = 1: X_i is a support vector.
      - α_i = 0 implies Y_i(w'X_i + b) > 1: X_i is a useless vector (it does not affect the solution).
      [Figure: margin hyperplanes w'x + b = −1, 0, +1; margin width 2/‖w‖.]

  14. Classification Function
      [Figure: margin hyperplanes w'x + b = −1, 0, +1.]
      Classification function:
      g_α(X) − b = ⟨w, X⟩ = Σ_{Y_i=+1} α_i ⟨X_i, X⟩ − Σ_{Y_i=−1} α_i ⟨X_i, X⟩.
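The two sums are simply the expansion of ⟨w, X⟩ with w = Σ_i α_i Y_i X_i. The short check below (my own illustration, with made-up numbers) verifies this identity numerically.

```python
# A minimal sketch (illustration only, with hypothetical alphas and data): the
# slide's expansion sum_{Y_i=+1} alpha_i <X_i, X> - sum_{Y_i=-1} alpha_i <X_i, X>
# equals <w, X> when w = sum_i alpha_i Y_i X_i.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # hypothetical training points
Y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])      # hypothetical labels
alpha = np.array([0.3, 0.0, 0.5, 0.2, 0.6])    # hypothetical multipliers
x_new = rng.normal(size=3)                     # a new point to classify

w = (alpha * Y) @ X                            # w = sum_i alpha_i Y_i X_i
expansion = (sum(alpha[i] * (X[i] @ x_new) for i in range(5) if Y[i] > 0)
             - sum(alpha[i] * (X[i] @ x_new) for i in range(5) if Y[i] < 0))
print(np.isclose(expansion, w @ x_new))        # True: both compute <w, x_new>
```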

  15. Linear Soft-SVMs
      Introduce slack variables {ξ_1, ..., ξ_n} to allow misclassification, trading off a large margin against misclassification errors:
      min_{w,b,ξ} (1/2) w'w + C Σ_{i=1}^n ξ_i   s.t.  Y_i(w'X_i + b) + ξ_i ≥ 1 and ξ_i ≥ 0, ∀i.
      [Figure: margin hyperplanes w'x + b = −1, 0, +1 with a misclassified point at slack distance ξ_i; margin width 2/‖w‖.]

  16. Dual Formulation
      max_{α} Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j Y_i Y_j ⟨X_i, X_j⟩   s.t.  0 ≤ α_i ≤ C, ∀i  and  Σ_i α_i Y_i = 0.
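A small implementation note (my own observation, not from the slides): compared with the hard-margin QP sketch given earlier, this dual keeps the same objective and equality constraint; only the bound α_i ≥ 0 is replaced by the box constraint 0 ≤ α_i ≤ C.

```python
# Soft-margin variant of the earlier hard-margin QP sketch (assumption: same toy
# data with n = 4 points). Only the bounds passed to scipy change.
C = 1.0                     # misclassification trade-off from the primal
n = 4                       # number of training samples in the toy example
bounds = [(0.0, C)] * n     # 0 <= alpha_i <= C replaces alpha_i >= 0
```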

  17. Non-Linear SVMs
      [Figure: the mapping Φ sends the input points X into a feature space.]
      max_{α} Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j Y_i Y_j ⟨Φ(X_i), Φ(X_j)⟩   s.t.  α_i ≥ 0, ∀i  and  Σ_i α_i Y_i = 0.
      g_α(X) − b = Σ_{Y_i=+1} α_i ⟨Φ(X_i), Φ(X)⟩ − Σ_{Y_i=−1} α_i ⟨Φ(X_i), Φ(X)⟩.
      The product ⟨Φ(X), Φ(X')⟩ defines a kernel k(X, X').
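For concreteness, a minimal sketch (my own illustration, not the author's code) of the kernelized dual: compared with the linear QP sketch earlier, the only change is that the matrix of inner products is replaced by a kernel matrix, here a Gaussian kernel on a made-up two-cluster dataset.

```python
# A minimal sketch (toy data and parameters invented for illustration): a
# hard-margin SVM in the feature space induced by a Gaussian kernel.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)), rng.normal(2.5, 0.5, (10, 2))])
Y = np.array([1.0] * 10 + [-1.0] * 10)
n, sigma = len(Y), 1.0

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / sigma**2)                    # k(X_i, X_j) = <Phi(X_i), Phi(X_j)>
G = (Y[:, None] * Y[None, :]) * K

res = minimize(lambda a: -(a.sum() - 0.5 * a @ G @ a), np.zeros(n),
               bounds=[(0.0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ Y}])
alpha = res.x

def g(x):
    # g_alpha(x) = sum_i alpha_i Y_i k(X_i, x)
    return (alpha * Y) @ np.exp(-((X - x) ** 2).sum(-1) / sigma**2)

sv = int(np.argmax(alpha))                    # recover b from a support vector
b = Y[sv] - g(X[sv])
print("training accuracy:", float(np.mean(np.sign([g(x) + b for x in X]) == Y)))
```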

  18. Kernels
      Kernels are symmetric and positive (semi-)definite functions that measure similarity between data. Positive semi-definiteness means that
      ∀ X_1, ..., X_n ∈ 𝒳, ∀ c_1, ..., c_n ∈ ℝ:  Σ_{i,j} c_i c_j k(X_i, X_j) ≥ 0,
      or equivalently k(X, X') = ⟨Φ(X), Φ(X')⟩ for some mapping Φ; the Gram (kernel) matrix K with K_ij = k(X_i, X_j) then has non-negative eigenvalues.
      Kernels on vectorial data: linear ⟨X, X'⟩, polynomial (1 + ⟨X, X'⟩)^p, Gaussian exp(−‖X − X'‖²/σ²), etc.
      Kernels can be designed using closure operations (addition, product, exponentiation, etc.), as illustrated in the sketch below.
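As a quick numerical check (my own sketch, with made-up data), the block below builds the Gram matrices of the three kernels listed above, plus a sum and a product obtained by closure, and verifies that their smallest eigenvalues are non-negative up to rounding.

```python
# A minimal sketch (toy data invented for illustration): Gram matrices of the
# linear, polynomial and Gaussian kernels, plus sum/product combinations, and a
# positive semi-definiteness check via their smallest eigenvalue.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))                          # 20 toy points in R^4

lin = X @ X.T                                         # <X, X'>
poly = (1.0 + X @ X.T) ** 3                           # (1 + <X, X'>)^p with p = 3
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
gauss = np.exp(-sq / 1.0**2)                          # exp(-||X - X'||^2 / sigma^2), sigma = 1

for name, K in [("linear", lin), ("polynomial", poly), ("gaussian", gauss),
                ("sum", lin + gauss), ("product", lin * gauss)]:  # closure operations
    print(name, "min eigenvalue:", round(float(np.linalg.eigvalsh(K).min()), 6))
```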

  19. Examples (Linear vs Gaussian)

  20. Gaussian Kernel
      k(X, X') = ⟨Φ(X), Φ(X')⟩ = exp(−‖X − X'‖²/σ²).
      The dimension of the corresponding feature space H is infinite.
      The Gaussian kernel has good generalization properties, but it requires a careful selection of the scale parameter σ, usually through a tedious cross-validation process.
      [Figure: generalization error as a function of the scale parameter, with over-fitting at small scales, over-smoothing at large scales, and a trade-off in between.]
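A minimal sketch of that cross-validation step (assuming scikit-learn is available; the dataset and grid are invented for illustration). Note that scikit-learn parameterizes the RBF kernel as exp(−γ‖X − X'‖²), so γ plays the role of 1/σ² in the slide's notation.

```python
# A minimal sketch (assumption: scikit-learn available; toy dataset): selecting
# the Gaussian kernel scale by grid search with 5-fold cross validation.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, Y = make_moons(n_samples=200, noise=0.2, random_state=0)   # toy non-linear data

grid = GridSearchCV(
    SVC(kernel="rbf", C=1.0),
    param_grid={"gamma": np.logspace(-3, 3, 13)},   # candidate values of 1/sigma^2
    cv=5,                                           # 5-fold cross validation
)
grid.fit(X, Y)
print("best gamma:", grid.best_params_["gamma"],
      " cv accuracy:", round(float(grid.best_score_), 3))
```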
