  1. Kernels. Course of Machine Learning, Master Degree in Computer Science, Giorgio Gambosi, a.a. 2018-2019

  2. Idea
  • Thus far, we have been assuming that each object that we deal with can be represented as a fixed-size feature vector x ∈ ℝ^d.
  • For certain kinds of objects (text documents, protein sequences, parse trees, etc.) it is not clear how to best represent them in this way.
    1. First approach: define a generative model of the data (with latent variables) and represent an object by the inferred values of its latent variables.
    2. Second approach: do not rely on a vector representation, but just assume a similarity measure between objects is defined.

  3. Representation by pairwise comparison
  Idea:
  • Define a comparison function κ : χ × χ → ℝ.
  • Represent a set of data items x_1, ..., x_n by the n × n Gram matrix G, with G_ij = κ(x_i, x_j).
  • G is always an n × n matrix, whatever the nature of the data: the same algorithm will work for any type of data (vectors, strings, ...).
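A minimal sketch of this idea (NumPy assumed; the string-overlap comparison function is a toy choice, not from the slides). The same Gram-matrix construction applies whatever the data items are; counting shared characters is itself a valid kernel, being the dot product of character-indicator vectors.

```python
import numpy as np

def kappa(a, b):
    """Toy comparison function on strings: number of shared characters.
    This is a valid kernel: the dot product of character-indicator vectors."""
    return float(len(set(a) & set(b)))

def gram_matrix(items, kappa):
    """Build the n x n Gram matrix G with G[i, j] = kappa(items[i], items[j])."""
    n = len(items)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kappa(items[i], items[j])
    return G

strings = ["kernel", "gram", "matrix"]
print(gram_matrix(strings, kappa))  # 3 x 3 matrix, whatever the nature of the data
```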

  4. Kernel definition
  Given a set χ, a function κ : χ² → ℝ is a kernel on χ if there exists a Hilbert space H (essentially, a vector space with a dot product ·) and a map φ : χ → H such that for all x_1, x_2 ∈ χ we have κ(x_1, x_2) = φ(x_1) · φ(x_2).
  We shall consider the particular but common case when H = ℝ^d for some d > 0, φ(x) = (ϕ_1(x), ..., ϕ_d(x)) and φ(x_1) · φ(x_2) = φ(x_1)^T φ(x_2).
  φ is called a feature map and H a feature space of κ.

  5. Kernel definition: positive semidefiniteness
  Positive semidefiniteness of κ is a relevant property in this framework.
  Given a set χ, a function κ : χ² → ℝ is positive semidefinite if for all n ∈ ℕ and (x_1, ..., x_n) ∈ χ^n the corresponding Gram matrix is positive semidefinite, that is, z^T G z ≥ 0 for all vectors z ∈ ℝ^n.

  6. Why is positive semidefiniteness relevant?
  Let κ : χ × χ → ℝ. Then κ is a kernel iff for all sets {x_1, x_2, ..., x_n} the corresponding Gram matrix G is symmetric and positive semidefinite.
  Only if: if G_ij = φ(x_i)^T φ(x_j), then clearly G_ij = G_ji. Moreover, for any z ∈ ℝ^n,
  z^T G z = ∑_{i=1}^n ∑_{j=1}^n z_i G_ij z_j
          = ∑_{i=1}^n ∑_{j=1}^n z_i φ(x_i)^T φ(x_j) z_j
          = ∑_{i=1}^n ∑_{j=1}^n z_i ( ∑_{k=1}^d ϕ_k(x_i) ϕ_k(x_j) ) z_j
          = ∑_{k=1}^d ( ∑_{i=1}^n z_i ϕ_k(x_i) )² ≥ 0
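A quick numeric illustration of the "only if" direction (a sketch, assuming NumPy and an arbitrary random feature map): a Gram matrix of pairwise dot products always satisfies z^T G z ≥ 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary feature vectors phi(x_1), ..., phi(x_n) as rows of Phi (n x d).
Phi = rng.normal(size=(6, 3))
G = Phi @ Phi.T                      # G_ij = phi(x_i)^T phi(x_j)

# z^T G z = || sum_i z_i phi(x_i) ||^2 >= 0 for any z.
for _ in range(5):
    z = rng.normal(size=6)
    assert z @ G @ z >= -1e-12       # non-negative, up to rounding error
```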

  7. Why are positive definite kernels relevant?
  If: given {x_1, x_2, ..., x_n}, if G is positive definite it is possible to compute an eigenvector decomposition G = U Λ U^T, where Λ is the diagonal matrix of the eigenvalues λ_i > 0 and the columns of U are the corresponding eigenvectors. Then, denoting by u_i the i-th row of U,
  G_ij = (Λ^{1/2} u_i)^T (Λ^{1/2} u_j)
  Then, if we define φ(x_i) = Λ^{1/2} u_i, we get
  κ(x_i, x_j) = φ(x_i)^T φ(x_j) = G_ij
  This result is valid only w.r.t. the domain {x_1, x_2, ..., x_n}. For the general case, consider n → ∞ (as, for example, in Gaussian processes).
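A sketch of this "if" direction on a finite domain (NumPy assumed; the Gram matrix here is generated from random features purely for convenience): from the eigendecomposition of a positive definite G we recover feature vectors whose dot products reproduce G.

```python
import numpy as np

rng = np.random.default_rng(1)

# Any positive definite Gram matrix (built here from random features for convenience).
Phi = rng.normal(size=(5, 5))
G = Phi @ Phi.T + 1e-6 * np.eye(5)

# Spectral decomposition: eigh returns G = V diag(lam) V^T, eigenvectors in V's columns.
lam, V = np.linalg.eigh(G)

# Feature map on the finite domain {x_1, ..., x_n}: phi(x_i) = Lambda^{1/2} (i-th row of V).
Phi_rec = V * np.sqrt(lam)           # row i is phi(x_i)

# Dot products of the reconstructed feature vectors reproduce G.
assert np.allclose(Phi_rec @ Phi_rec.T, G)
```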

  8. Why are positive definite kernels relevant?
  Using positive definite kernels makes it possible to apply the kernel trick wherever useful.
  Kernel trick: any algorithm which processes finite-dimensional vectors in such a way as to consider only pairwise dot products can be applied to higher (possibly infinite) dimensional vectors by replacing each dot product with a suitable application of a positive definite kernel.
  • Many practical applications.
  • Vectors in the new space are manipulated only implicitly, through pairwise dot products, which are computed by evaluating the kernel function on the original pair of vectors.
  Example: support vector machines. Also, many linear models for regression and classification can be reformulated in terms of a dual representation involving only dot products.

  9. Dual representations: example
  Regularized sum of squares in regression with predefined basis functions φ(x):
  J(w) = (1/2) ∑_{i=1}^n (w^T φ(x_i) − t_i)² + (λ/2) w^T w
       = (1/2) (Φw − t)^T (Φw − t) + (λ/2) w^T w
  where, by definition of Φ ∈ ℝ^{n×d}, Φ_ij = ϕ_j(x_i).
  Setting ∂J(w)/∂w = 0, the resulting solution is
  ŵ = (Φ^T Φ + λ I_d)^{-1} Φ^T t = Φ^T (Φ Φ^T + λ I_n)^{-1} t
  since it is possible to prove that for any matrix A ∈ ℝ^{r×c},
  (A^T A + λ I_c)^{-1} A^T = A^T (A A^T + λ I_r)^{-1}
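A numeric check of the two expressions for ŵ (a sketch, assuming NumPy; sizes, data and λ are arbitrary): inverting the d × d matrix on the primal side and the n × n matrix on the dual side gives the same weight vector.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 8, 3, 0.5

Phi = rng.normal(size=(n, d))        # design matrix, Phi[i, j] = phi_j(x_i)
t = rng.normal(size=n)

# Primal solution: solve a d x d system, (Phi^T Phi + lam I_d)^{-1} Phi^T t.
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)

# Dual solution: solve an n x n system, then map back: Phi^T (Phi Phi^T + lam I_n)^{-1} t.
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), t)

assert np.allclose(w_primal, w_dual)
```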

  10. Dual representations: example
  If we define the dual variables a = (Φ Φ^T + λ I_n)^{-1} t, we get w = Φ^T a. By substituting Φ^T a for w we express the cost function in terms of a, instead of w, introducing a dual formulation of J:
  J(a) = (1/2) a^T Φ Φ^T Φ Φ^T a + (1/2) t^T t − a^T Φ Φ^T t + (λ/2) a^T Φ Φ^T a
       = (1/2) a^T G G a + (1/2) t^T t − a^T G t + (λ/2) a^T G a
  where G = Φ Φ^T is the Gram matrix, such that by definition
  G_ij = ∑_{k=1}^d ϕ_k(x_i) ϕ_k(x_j) = φ(x_i)^T φ(x_j)

  11. Dual representations: example
  Setting the gradient ∂J(a)/∂a = 0, we obtain
  â = (G + λ I_n)^{-1} t
  We can use this to make predictions in a different way:
  y(x) = w^T φ(x) = a^T Φ φ(x) = t^T (G + λ I_n)^{-1} Φ φ(x) = k(x)^T (G + λ I_n)^{-1} t
  where
  k(x) = Φ φ(x) = (φ(x_1)^T φ(x), ..., φ(x_n)^T φ(x))^T = (κ(x_1, x), ..., κ(x_n, x))^T = (κ_1(x), ..., κ_n(x))^T
  The prediction can be done in terms of dot products between the different pairs of φ(x), or in terms of the kernel function κ(x_i, x_j) = φ(x_i)^T φ(x_j).
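Putting the dual solution to work: a minimal kernel ridge regression sketch (NumPy assumed; the Gaussian kernel and the toy 1-D data are assumptions, not part of the slides). It computes â = (G + λ I_n)^{-1} t and predicts with y(x) = k(x)^T â, touching the inputs only through κ.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian kernel kappa(a, b) = exp(-gamma * ||a - b||^2) (an assumed choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(30, 1))             # training inputs x_1, ..., x_n
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)  # noisy targets
lam = 0.1

# Gram matrix G_ij = kappa(x_i, x_j) and dual variables a = (G + lam I_n)^{-1} t.
G = np.array([[rbf(xi, xj) for xj in X] for xi in X])
a = np.linalg.solve(G + lam * np.eye(len(X)), t)

def predict(x_new):
    """y(x) = k(x)^T a, with k(x) = (kappa(x_1, x), ..., kappa(x_n, x))."""
    k = np.array([rbf(xi, x_new) for xi in X])
    return k @ a

print(predict(np.array([0.5])), np.sin(0.5))     # prediction vs. true function value
```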

  12. Dual representations: another example
  • As is well known, a perceptron is a linear classifier with prediction y(x) = w^T x.
  • Its update rule is: if x_i is misclassified, that is, if w^T x_i t_i < 0, then w := w + t_i x_i.
  • If we assume a zero initial value for all w_k, then w is the sum of all items that have been considered misclassified by the algorithm, where each item is weighted by its target and by the number of times it has been considered.
  • We may then define a dual formulation by setting w = ∑_{k=1}^n a_k t_k x_k, which results in the prediction y(x) = ∑_{k=1}^n a_k t_k x_k^T x
  • and in the update rule: if x_i is misclassified, that is, if t_i ∑_{k=1}^n a_k t_k x_k^T x_i < 0, then a_i := a_i + 1.
  • A kernelized perceptron can then be defined with y(x) = ∑_{k=1}^n a_k t_k φ(x_k)^T φ(x), or with y(x) = ∑_{k=1}^n a_k t_k κ(x_k, x), by just using a positive definite kernel κ (a sketch in code follows below).
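A sketch of the kernelized perceptron just described (NumPy assumed; the polynomial kernel and the XOR-style toy data are assumptions). Each a_k counts how many times x_k has been misclassified, and both training and prediction access the data only through κ.

```python
import numpy as np

def poly_kernel(a, b, degree=2):
    """Polynomial kernel (1 + a^T b)^degree (an assumed choice of kappa)."""
    return (1.0 + a @ b) ** degree

def kernel_perceptron(X, t, kappa, epochs=50):
    """Dual perceptron: a[k] counts how many times x_k has been misclassified."""
    n = len(X)
    a = np.zeros(n)
    K = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            y_i = np.sum(a * t * K[:, i])   # y(x_i) = sum_k a_k t_k kappa(x_k, x_i)
            if t[i] * y_i <= 0:             # misclassified (or undecided): update
                a[i] += 1
    return a

def predict(x, X, t, a, kappa):
    return np.sign(np.sum(a * t * np.array([kappa(xk, x) for xk in X])))

# Toy data: XOR-like labels, not linearly separable in the original space,
# but separable in the feature space induced by the quadratic kernel.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([-1.0, 1.0, 1.0, -1.0])
a = kernel_perceptron(X, t, poly_kernel)
print([predict(x, X, t, a, poly_kernel) for x in X])  # -> [-1.0, 1.0, 1.0, -1.0]
```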

  13. Kernelization: one more example
  • The k-nn classifier selects the label of the nearest neighbour: assume the Euclidean distance is considered,
    ||x_i − x_j||² = x_i^T x_i + x_j^T x_j − 2 x_i^T x_j
  • We can now replace the dot products by a valid positive definite kernel, and we obtain
    d(x_i, x_j)² = κ(x_i, x_i) + κ(x_j, x_j) − 2 κ(x_i, x_j)
  • This is a kernelized nearest-neighbour classifier: we do not explicitly compute the feature vectors (see the sketch below).
  Why refer to the dual representation?
  • While in the original formulation of linear regression w can be derived by inverting the m × m matrix Φ^T Φ, in the dual formulation computing a requires inverting the n × n matrix G + λ I_n.
  • Since usually n ≫ m, this seems to lead to a loss of efficiency.
  • However, the dual approach makes it possible to refer only to the kernel function, and not to the set of basis functions: this makes it possible to implicitly use feature spaces of very high dimension (even infinite).
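A sketch of the kernelized distance used in a 1-nearest-neighbour rule (NumPy assumed; the Gaussian kernel and the toy points are assumptions): the feature vectors are never computed, only κ is evaluated.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian kernel (an assumed choice); any positive definite kernel works here."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_dist2(a, b, kappa):
    """Squared distance in feature space, computed only through kernel evaluations."""
    return kappa(a, a) + kappa(b, b) - 2.0 * kappa(a, b)

def nn_classify(x, X_train, labels, kappa):
    """1-nearest-neighbour in the (implicit) feature space."""
    d2 = [kernel_dist2(x, xi, kappa) for xi in X_train]
    return labels[int(np.argmin(d2))]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
labels = ["a", "a", "b"]
print(nn_classify(np.array([2.5, 2.6]), X_train, labels, rbf))  # -> "b"
```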

  14. Dealing with kernels
  Since not all functions κ : χ² → ℝ are positive definite kernels, some method to define them must be applied.
  • The straightforward way is just to define a basis function φ and set κ(x_1, x_2) = φ(x_1)^T φ(x_2). κ is then a positive definite kernel, since
    1. φ(x_1)^T φ(x_2) = φ(x_2)^T φ(x_1)
    2. ∑_{i=1}^n ∑_{j=1}^n c_i c_j κ(x_i, x_j) = ∑_{i=1}^n ∑_{j=1}^n c_i c_j φ(x_i)^T φ(x_j) = ||∑_{i=1}^n c_i φ(x_i)||² ≥ 0
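The first method as code: a sketch (NumPy assumed) in which an explicit, arbitrarily chosen feature map φ defines κ, which is then positive definite by construction.

```python
import numpy as np

def phi(x):
    """An arbitrary explicit feature map R^2 -> R^3 (an assumption for illustration)."""
    return np.array([x[0], x[1], x[0] * x[1]])

def kappa(x1, x2):
    """kappa(x1, x2) = phi(x1)^T phi(x2): positive definite by construction."""
    return phi(x1) @ phi(x2)

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(kappa(x1, x2), kappa(x2, x1))   # symmetric by construction
```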

  15. Dealing with kernels
  • A second method defines a candidate kernel function κ directly: in order to ensure that such a function is a valid kernel, apply Mercer's theorem and prove that κ is a positive definite kernel by showing that it is symmetric and that the corresponding Gram matrix G is positive semidefinite for all possible sets of items. In this case we do not define φ.

  16. A simple positive definite kernel
  Let χ = ℝ: the function κ : ℝ² → ℝ defined as κ(x_1, x_2) = x_1 x_2 is a positive definite kernel. In fact,
  • x_1 x_2 = x_2 x_1
  • ∑_{i=1}^n ∑_{j=1}^n c_i c_j κ(x_i, x_j) = ∑_{i=1}^n ∑_{j=1}^n c_i c_j x_i x_j = (∑_{i=1}^n c_i x_i)² ≥ 0

  17. Another simple positive definite kernel
  Let χ = ℝ^d: the function κ : χ² → ℝ defined as κ(x_1, x_2) = x_1^T x_2 is a positive definite kernel. In fact,
  • x_1^T x_2 = x_2^T x_1
  • ∑_{i=1}^n ∑_{j=1}^n c_i c_j κ(x_i, x_j) = ∑_{i=1}^n ∑_{j=1}^n c_i c_j x_i^T x_j = ||∑_{i=1}^n c_i x_i||² ≥ 0

  18. Dealing with kernels
  • A third method again defines a candidate kernel function κ directly: in order to ensure that such a function is a valid kernel, a basis function φ must be found such that κ(x_1, x_2) = φ(x_1)^T φ(x_2) for all x_1, x_2 (see the sketch below).
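A sketch of this third method (NumPy assumed) for the directly defined kernel κ(x_1, x_2) = (x_1^T x_2)² on ℝ²: an explicit φ whose dot product reproduces κ, checked numerically on random points.

```python
import numpy as np

def kappa(x, y):
    """Kernel defined directly: kappa(x, y) = (x^T y)^2 on R^2."""
    return (x @ y) ** 2

def phi(x):
    """A feature map found for this kappa: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(4)
for _ in range(5):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(kappa(x, y), phi(x) @ phi(y))   # kappa(x, y) = phi(x)^T phi(y)
```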
