

  1. Machine Learning Theory: Kernel Methods
     Hamid Beigy, Sharif University of Technology, April 20, 2020

  2. Table of contents
     1. Motivation
     2. Kernel methods
     3. Basic kernel operations in feature space
     4. Kernel-based algorithms
     5. Summary

  3. Motivation

  4. Introduction
     ◮ Most learning algorithms are linear and cannot classify non-linearly separable data.
     ◮ How do we separate two classes that are not linearly separable?
     ◮ Linear separation is impossible in most problems.
     ◮ Apply a non-linear mapping from the input space to a high-dimensional feature space: φ : X → H.
     ◮ Generalization ability is independent of dim(H); it depends only on the margin ρ and the sample size m.

  5. Kernel methods

  6. Ideas of kernels
     ◮ Most datasets are not linearly separable.
     ◮ Instances that are not linearly separable in R may become linearly separable in R² under the mapping φ(x) = (x, x²).
     ◮ In this case, we have two options (a sketch of the first option follows below):
       ◮ Increase the dimensionality of the dataset by introducing a mapping φ.
       ◮ Use a more complex model for the classifier.
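     A minimal sketch in Python (not from the slides; the points and labels are made up for illustration) of the first option: 1-D data that is not linearly separable becomes separable after the mapping φ(x) = (x, x²).

```python
# Data at x in {-2, -1, 1, 2} with labels depending on |x| is not linearly
# separable on the real line, but becomes separable after phi(x) = (x, x^2):
# a threshold on the second coordinate (a horizontal line) splits the classes.
xs = [-2.0, -1.0, 1.0, 2.0]
labels = [+1, -1, -1, +1]          # +1 for |x| >= 2, -1 for |x| <= 1

def phi(x):
    """Explicit feature map from R to R^2."""
    return (x, x * x)

for x, y in zip(xs, labels):
    print(f"x = {x:+.1f}, label = {y:+d}, phi(x) = {phi(x)}")
# In feature space the rule "x^2 >= 2.5 => +1" is a linear separator,
# since it depends only on the second coordinate.
```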

  7. Ideas of kernels
     ◮ To classify a non-linearly separable dataset, we use a mapping φ.
     ◮ For example, let x = (x₁, x₂)ᵀ, z = (z₁, z₂, z₃)ᵀ, and φ : R² → R³.
     ◮ If we use the mapping z = φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ, the dataset becomes linearly separable in R³.
     ◮ Mapping the dataset to higher dimensions has two major problems:
       ◮ In high dimensions, there is a risk of over-fitting.
       ◮ In high dimensions, the computational cost is higher.
     ◮ The generalization capability in higher dimensions is ensured by using large-margin classifiers.
     ◮ The mapping is implicit, not explicit.

  8. Kernels
     ◮ Kernel methods avoid explicitly transforming each point x in the input space into the mapped point φ(x) in the feature space.
     ◮ Instead, the inputs are represented via their m × m pairwise similarity values.
     ◮ The similarity function, called a kernel, is chosen so that it represents a dot product in some high-dimensional feature space.
     ◮ The kernel can be computed without directly constructing φ.
     ◮ The pairwise similarity values between the points of S are collected in the m × m kernel matrix

           K = [ k(x₁, x₁)  k(x₁, x₂)  ⋯  k(x₁, xₘ)
                 k(x₂, x₁)  k(x₂, x₂)  ⋯  k(x₂, xₘ)
                    ⋮           ⋮       ⋱     ⋮
                 k(xₘ, x₁)  k(xₘ, x₂)  ⋯  k(xₘ, xₘ) ]

     ◮ The function K(x_i, x_j) is called the kernel function and is defined as follows.

     Definition (Kernel)
     A function K : X × X → R is a kernel if
       1. there exists φ : X → R^N such that K(x, y) = ⟨φ(x), φ(y)⟩;
       2. the range of φ is called the feature space;
       3. N can be very large.
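     A small illustrative sketch (not from the slides) of assembling the kernel matrix from pairwise similarities only, using the quadratic kernel k(x, z) = (⟨x, z⟩)² and a made-up three-point dataset; NumPy is assumed.

```python
import numpy as np

def poly2_kernel(x, z):
    """k(x, z) = (<x, z>)^2, computed without constructing phi."""
    return float(np.dot(x, z)) ** 2

S = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [-2.0, 0.3]])          # m = 3 points in R^2

m = len(S)
K = np.array([[poly2_kernel(S[i], S[j]) for j in range(m)] for i in range(m)])
print(K)                              # symmetric m x m kernel (Gram) matrix
```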

  9. Kernels (example)
     ◮ Let φ : R² → R³ be defined as φ(x) = (x₁², x₂², √2 x₁x₂).
     ◮ Then ⟨φ(x), φ(z)⟩ equals

           ⟨φ(x), φ(z)⟩ = ⟨(x₁², x₂², √2 x₁x₂), (z₁², z₂², √2 z₁z₂)⟩
                        = (x₁z₁ + x₂z₂)²
                        = (⟨x, z⟩)²
                        = K(x, z).

     ◮ [Figure: the mapping Φ takes data that is not linearly separable in the input space (x₁, x₂) to a linearly separable arrangement in the feature space (z₁, z₃).]
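     A quick numerical sanity check (not part of the slides) that the explicit map and the kernel agree; the sample vectors x and z are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def K(x, z):
    """Kernel K(x, z) = (<x, z>)^2, evaluated in the input space."""
    return float(np.dot(x, z)) ** 2

x = np.array([0.7, -1.3])
z = np.array([2.0, 0.5])
print(np.dot(phi(x), phi(z)))   # dot product in the feature space R^3
print(K(x, z))                  # kernel value in the input space R^2
# The two numbers agree (up to floating-point rounding).
```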

  10. Kernels (example)
     ◮ Let φ₁ : R² → R³ be defined as φ₁(x) = (x₁², x₂², √2 x₁x₂).
     ◮ Then ⟨φ₁(x), φ₁(z)⟩ equals

           ⟨φ₁(x), φ₁(z)⟩ = ⟨(x₁², x₂², √2 x₁x₂), (z₁², z₂², √2 z₁z₂)⟩ = (⟨x, z⟩)² = K(x, z).

     ◮ Let φ₂ : R² → R⁴ be defined as φ₂(x) = (x₁², x₂², x₁x₂, x₂x₁).
     ◮ Then ⟨φ₂(x), φ₂(z)⟩ equals

           ⟨φ₂(x), φ₂(z)⟩ = ⟨(x₁², x₂², x₁x₂, x₂x₁), (z₁², z₂², z₁z₂, z₂z₁)⟩ = (⟨x, z⟩)² = K(x, z).

     ◮ The feature space can grow really large and really quickly.
     ◮ Let K be the kernel K(x, z) = (⟨x, z⟩)^d = ⟨φ(x), φ(z)⟩ for x, z ∈ Rⁿ.
     ◮ The dimension of the feature space equals C(d + n − 1, d) = (d + n − 1)! / (d! (n − 1)!).
     ◮ For n = 100 and d = 6, there are about 1.6 billion terms (see the check below).
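     A rough check of the combinatorial claim (not from the slides); `math.comb` from the Python standard library is assumed.

```python
from math import comb

# Feature-space dimension C(d + n - 1, d) for the monomial features of the
# kernel K(x, z) = (<x, z>)^d.
n, d = 100, 6
dim = comb(d + n - 1, d)
print(dim)            # 1609344100, i.e. about 1.6 billion features
# Evaluating K(x, z) = (<x, z>)^d costs only O(n), while constructing phi(x)
# explicitly would require storing all of these coordinates.
```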

  11. Mercer's condition
     ◮ Kernel methods have the following benefits.
       Efficiency: K is often more efficient to compute than φ and the dot product.
       Flexibility: K can be chosen arbitrarily so long as the existence of φ is guaranteed (Mercer's condition).

     Theorem (Mercer's condition)
     For all functions c that are square integrable (i.e., ∫ c(x)² dx < ∞), other than the zero function, the following property holds:

           ∫∫ c(x) K(x, z) c(z) dx dz ≥ 0.

     ◮ This theorem states that K : X × X → R is a kernel if the kernel matrix K is positive semi-definite (PSD).
     ◮ Suppose x, z ∈ Rⁿ and consider the kernel K(x, z) = (⟨x, z⟩)².
     ◮ It is a valid kernel because

           K(x, z) = (Σ_{i=1}^n x_i z_i)(Σ_{j=1}^n x_j z_j) = Σ_{i=1}^n Σ_{j=1}^n (x_i x_j)(z_i z_j) = ⟨φ(x), φ(z)⟩,

       where the mapping φ for n = 2 is φ(x) = (x₁x₁, x₁x₂, x₂x₁, x₂x₂)ᵀ.
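     A small empirical illustration (not from the slides) of the PSD characterization: the Gram matrix of K(x, z) = (⟨x, z⟩)² on randomly drawn points has no negative eigenvalues beyond round-off. The dataset size and dimension are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))               # 20 random points in R^5

K = (X @ X.T) ** 2                         # Gram matrix of the quadratic kernel
eigvals = np.linalg.eigvalsh(K)            # eigenvalues of the symmetric matrix
print(eigvals.min() >= -1e-8)              # True: K is PSD up to round-off
```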

  12. Polynomial kernels (example)
     ◮ Consider the polynomial kernel K(x, z) = (⟨x, z⟩ + c)^d for all x, z ∈ Rⁿ.
     ◮ For n = 2 and d = 2,

           K(x, z) = (x₁z₁ + x₂z₂ + c)²
                   = ⟨(x₁², x₂², √2 x₁x₂, √(2c) x₁, √(2c) x₂, c), (z₁², z₂², √2 z₁z₂, √(2c) z₁, √(2c) z₂, c)⟩.

     ◮ Using the second-degree polynomial kernel with c = 1, the XOR points map to
           (1, 1)   → (1, 1, +√2, +√2, +√2, 1)
           (−1, 1)  → (1, 1, −√2, −√2, +√2, 1)
           (1, −1)  → (1, 1, −√2, +√2, −√2, 1)
           (−1, −1) → (1, 1, +√2, −√2, −√2, 1)
     ◮ The original data is not linearly separable, but the mapped data is.
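     The sketch below (not from the slides) maps the four XOR points with the explicit feature map derived above and shows that the coordinate √2 x₁x₂ alone separates the two classes.

```python
import numpy as np

def phi(x1, x2, c=1.0):
    """Explicit feature map of the degree-2 polynomial kernel (<x, z> + c)^2."""
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

points = [(1, 1), (-1, 1), (1, -1), (-1, -1)]
labels = [+1, -1, -1, +1]                     # XOR labeling

for (x1, x2), y in zip(points, labels):
    z = phi(x1, x2)
    print(f"({x1:+d},{x2:+d}) label {y:+d} -> third coordinate {z[2]:+.3f}")
# The sign of the third coordinate matches the label, so a hyperplane that
# thresholds this single coordinate separates the mapped data.
```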

  13. Some valid kernels
     ◮ Polynomial kernels: consider the kernel defined by

           K(x, z) = (⟨x, z⟩ + c)^d,

       where d is the degree of the polynomial, specified by the user, and c is a constant.
     ◮ Radial basis function kernels: consider the kernel defined by

           K(x, z) = exp(−‖x − z‖² / (2σ²)).

       The width σ is specified by the user. This kernel corresponds to an infinite-dimensional mapping φ.
     ◮ Sigmoid kernel: consider the kernel defined by

           K(x, z) = tanh(β₀ ⟨x, z⟩ + β₁).

       This kernel only meets Mercer's condition for certain values of β₀ and β₁.
     ◮ Homework: derive the VC-dimension for the above kernels.
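     For concreteness, a hedged sketch (not from the slides) of the three kernels as plain Python functions; the default parameter values are arbitrary choices.

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """K(x, z) = (<x, z> + c)^d."""
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))

def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    """K(x, z) = tanh(beta0 <x, z> + beta1); PDS only for some beta0, beta1."""
    return np.tanh(beta0 * np.dot(x, z) + beta1)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```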

  14. Reproducing kernel Hilbert space
     ◮ We give the crucial property of PDS kernels, which is to induce an inner product in a Hilbert space.

     Lemma (Cauchy–Schwarz inequality for PDS kernels)
     Let K be a PDS kernel. Then, for any x, z ∈ X,

           K(x, z)² ≤ K(x, x) K(z, z).

     Theorem (Reproducing kernel Hilbert space (RKHS))
     Let K : X × X → R be a PDS kernel. Then, there exists a Hilbert space H and a mapping φ from X to H such that for all x, y ∈ X,

           K(x, y) = ⟨φ(x), φ(y)⟩.

     ◮ This theorem implies that PDS kernels can be used to implicitly define a feature space.
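     A quick numerical spot check (not from the slides) of the Cauchy–Schwarz inequality for a PDS kernel, here the RBF kernel on random pairs; it is only a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(x, z, sigma=1.0):
    """PDS RBF kernel exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))

for _ in range(1000):
    x, z = rng.normal(size=3), rng.normal(size=3)
    # Check K(x, z)^2 <= K(x, x) K(z, z) up to a small numerical tolerance.
    assert rbf(x, z) ** 2 <= rbf(x, x) * rbf(z, z) + 1e-12
print("Cauchy-Schwarz held on all sampled pairs.")
```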

  15. Normalized kernel
     ◮ For any kernel K, we can associate a normalized kernel K_n defined by

           K_n(x, z) = 0                                 if K(x, x) = 0 or K(z, z) = 0,
           K_n(x, z) = K(x, z) / √(K(x, x) K(z, z))      otherwise.

     Lemma (Normalized PDS kernels)
     Let K be a PDS kernel. Then, the normalized kernel K_n associated to K is PDS.

     Proof.
     1. Let {x₁, …, xₘ} ⊆ X and let c be an arbitrary vector in Rᵐ.
     2. We will show that Σ_{i,j=1}^m c_i c_j K_n(x_i, x_j) ≥ 0.
     3. By the Cauchy–Schwarz lemma for PDS kernels, if K(x_i, x_i) = 0, then K(x_i, x_j) = 0 and thus K_n(x_i, x_j) = 0 for all j ∈ {1, 2, …, m}.
     4. We can therefore assume that K(x_i, x_i) > 0 for all i ∈ {1, 2, …, m}.
     5. Then, the sum can be rewritten as follows:

           Σ_{i,j=1}^m c_i c_j K_n(x_i, x_j) = Σ_{i,j=1}^m c_i c_j K(x_i, x_j) / √(K(x_i, x_i) K(x_j, x_j))
                                             = Σ_{i,j=1}^m c_i c_j ⟨φ(x_i), φ(x_j)⟩ / (‖φ(x_i)‖_H ‖φ(x_j)‖_H)
                                             = ‖ Σ_{i=1}^m c_i φ(x_i) / ‖φ(x_i)‖_H ‖²_H ≥ 0.
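     A minimal sketch (not from the slides) of the normalization applied to a whole Gram matrix, with the zero-diagonal case handled as in the definition above; the example Gram matrix is built from a made-up dataset.

```python
import numpy as np

def normalize_kernel_matrix(K):
    """Entrywise K_n(x, z) = K(x, z) / sqrt(K(x, x) K(z, z)), 0 where undefined."""
    diag = np.diag(K).copy()
    denom = np.sqrt(np.outer(diag, diag))
    Kn = np.zeros_like(K, dtype=float)
    nonzero = denom > 0                      # entries with K(x, x), K(z, z) > 0
    Kn[nonzero] = K[nonzero] / denom[nonzero]
    return Kn

X = np.random.default_rng(2).normal(size=(5, 3))
K = (X @ X.T + 1.0) ** 2                     # degree-2 polynomial Gram matrix
Kn = normalize_kernel_matrix(K)
print(np.diag(Kn))                           # all ones: each point has unit norm
```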

  16. Closure properties of PDS kernels
     ◮ The following theorem provides closure guarantees for all of these operations (an empirical check follows below).

     Theorem (Closure properties of PDS kernels)
     PDS kernels are closed under
       1. sum
       2. product
       3. tensor product
       4. pointwise limit
       5. composition with a power series Σ_{k=1}^∞ a_k x^k with a_k ≥ 0 for all k ∈ N.

     Proof.
     We only prove closure under sum. Consider two valid kernel matrices K₁ and K₂.
     1. For any c ∈ Rᵐ, we have cᵀK₁c ≥ 0 and cᵀK₂c ≥ 0.
     2. This implies that cᵀK₁c + cᵀK₂c ≥ 0.
     3. Hence, we have cᵀ(K₁ + K₂)c ≥ 0.
     4. Thus K = K₁ + K₂ is a valid kernel.

     ◮ Homework: prove the other closure properties of PDS kernels.
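     The following sketch (not from the slides) illustrates closure under sum empirically: the sum of two PSD Gram matrices built on the same points has no negative eigenvalues beyond round-off. It complements, but does not replace, the proof above.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 4))

# Pairwise squared distances, then two PSD Gram matrices on the same points.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K1 = np.exp(-sq_dists / 2.0)            # RBF Gram matrix, sigma = 1
K2 = (X @ X.T + 1.0) ** 2               # degree-2 polynomial Gram matrix

min_eig = np.linalg.eigvalsh(K1 + K2).min()
print(min_eig >= -1e-8)                 # True: the sum is PSD up to round-off
```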
