Kernels
Course of Machine Learning, Master Degree in Computer Science
Giorgio Gambosi, a.a. 2018-2019
Idea
• Thus far, we have been assuming that each object we deal with can be represented as a fixed-size feature vector $x \in \mathbb{R}^d$
• For certain kinds of objects (text documents, protein sequences, parse trees, etc.) it is not clear how to best represent them in this way
  1. first approach: define a generative model of the data (with latent variables) and represent an object by the inferred values of its latent variables
  2. second approach: do not rely on a vector representation, but just assume a similarity measure between objects is defined
Representation by pairwise comparison
Idea
• Define a comparison function $\kappa: \chi \times \chi \to \mathbb{R}$
• Represent a set of data items $x_1, \dots, x_n$ by the $n \times n$ Gram matrix $G$, with $G_{ij} = \kappa(x_i, x_j)$
• $G$ is always an $n \times n$ matrix, whatever the nature of the data: the same algorithm will work for any type of data (vectors, strings, …)
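A minimal sketch of this idea in Python (assuming NumPy). The helper `gram_matrix` and the toy comparison function `shared_chars` are names introduced here purely for illustration; the toy comparison is not necessarily a valid kernel.

```python
import numpy as np

def gram_matrix(items, kappa):
    """Build the n x n Gram matrix G with G[i, j] = kappa(items[i], items[j])."""
    n = len(items)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kappa(items[i], items[j])
    return G

# Toy comparison function on strings (illustrative only, not necessarily a
# positive semidefinite kernel): number of distinct shared characters.
def shared_chars(s1, s2):
    return float(len(set(s1) & set(s2)))

print(gram_matrix(["spam", "maps", "ham"], shared_chars))  # always an n x n matrix
```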
Kernel definition
Given a set $\chi$, a function $\kappa: \chi^2 \to \mathbb{R}$ is a kernel on $\chi$ if there exists a Hilbert space $H$ (essentially, a vector space with dot product $\cdot$) and a map $\phi: \chi \to H$ such that for all $x_1, x_2 \in \chi$ we have
$$\kappa(x_1, x_2) = \phi(x_1) \cdot \phi(x_2)$$
We shall consider the particular but common case when $H = \mathbb{R}^d$ for some $d > 0$, $\phi(x) = (\varphi_1(x), \dots, \varphi_d(x))$ and $\phi(x_1) \cdot \phi(x_2) = \phi(x_1)^T \phi(x_2)$.
$\phi$ is called a feature map and $H$ a feature space of $\kappa$.
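As a small numerical illustration (my own example, not from the slides): with the feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ on $\chi = \mathbb{R}^2$, the induced kernel is $\kappa(x_1, x_2) = (x_1^T x_2)^2$, which can be evaluated without ever computing $\phi$ explicitly.

```python
import numpy as np

# Feature map phi: R^2 -> R^3 (chosen here purely for illustration)
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

# Induced kernel: kappa(x1, x2) = phi(x1) . phi(x2) = (x1 . x2)^2
def kappa(x1, x2):
    return float(x1 @ x2) ** 2

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x1) @ phi(x2))   # dot product in the feature space H = R^3
print(kappa(x1, x2))       # same value, computed directly in the input space
```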
Kernel definition
Positive semidefiniteness
Positive semidefiniteness of $\kappa$ is a relevant property in this framework.
Given a set $\chi$, a function $\kappa: \chi^2 \to \mathbb{R}$ is positive semidefinite if for all $n \in \mathbb{N}$ and $(x_1, \dots, x_n) \in \chi^n$ the corresponding Gram matrix is positive semidefinite, that is $z^T G z \geq 0$ for all vectors $z \in \mathbb{R}^n$.
Why is positive semidefiniteness relevant?
Let $\kappa: \chi \times \chi \to \mathbb{R}$. Then $\kappa$ is a kernel iff for all sets $\{x_1, x_2, \dots, x_n\}$ the corresponding Gram matrix $G$ is symmetric and positive semidefinite.
Only if: since $G_{ij} = \phi(x_i)^T \phi(x_j)$, clearly $G_{ij} = G_{ji}$. Moreover, for any $z \in \mathbb{R}^n$,
$$z^T G z = \sum_{i=1}^n \sum_{j=1}^n z_i G_{ij} z_j = \sum_{i=1}^n \sum_{j=1}^n z_i \phi(x_i)^T \phi(x_j) z_j = \sum_{k=1}^d \left( \sum_{i=1}^n \sum_{j=1}^n z_i \varphi_k(x_i) \varphi_k(x_j) z_j \right) = \sum_{k=1}^d \left( \sum_{i=1}^n z_i \varphi_k(x_i) \right)^2 \geq 0$$
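A quick numerical sanity check of this direction of the argument (assuming NumPy; the explicit feature map used here is an arbitrary illustrative choice): the Gram matrix of any kernel built from a feature map is symmetric and has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 items in R^3
Phi = np.column_stack([X, X**2])        # an explicit feature map phi (illustrative)
G = Phi @ Phi.T                         # Gram matrix: G_ij = phi(x_i)^T phi(x_j)

print(np.allclose(G, G.T))                       # symmetric
print(np.linalg.eigvalsh(G).min() >= -1e-9)      # eigenvalues >= 0 (up to round-off)

z = rng.normal(size=20)
print(z @ G @ z >= -1e-9)                        # z^T G z >= 0 for an arbitrary z
```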
Why are positive definite kernels relevant?
If: given $\{x_1, x_2, \dots, x_n\}$, if $G$ is positive semidefinite it is possible to compute an eigenvector decomposition
$$G = U^T \Lambda U$$
where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_i \geq 0$ and $U$ is orthogonal (its rows are the corresponding eigenvectors); let $u_i$ denote the $i$-th column of $U$. Then,
$$G_{ij} = (\Lambda^{\frac{1}{2}} u_i)^T (\Lambda^{\frac{1}{2}} u_j)$$
Then if we define $\phi(x_i) = \Lambda^{\frac{1}{2}} u_i$ we get
$$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j) = G_{ij}$$
This result is valid only with respect to the domain $\{x_1, x_2, \dots, x_n\}$. For the general case, consider $n \to \infty$ (as for example in Gaussian processes).
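A minimal sketch of this construction (assuming NumPy; variable names are my own): starting from a positive semidefinite Gram matrix over a finite set, an explicit feature vector is recovered for each item and reproduces $G$ exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
G = A @ A.T                       # a symmetric positive semidefinite Gram matrix over 5 items

w, V = np.linalg.eigh(G)          # G = V diag(w) V^T, columns of V are eigenvectors
w = np.clip(w, 0.0, None)         # guard against tiny negative eigenvalues from round-off
Phi = V * np.sqrt(w)              # row i is Lambda^{1/2} applied to the i-th row of V

print(np.allclose(Phi @ Phi.T, G))   # phi(x_i)^T phi(x_j) = G_ij for all i, j
```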
Why are positive definite kernels relevant?
Using positive definite kernels makes it possible to apply the kernel trick wherever useful.
Kernel trick
Any algorithm which processes finite-dimensional vectors in such a way as to consider only pairwise dot products can be applied to higher (possibly infinite) dimensional vectors by replacing each dot product with a suitable application of a positive definite kernel.
• Many practical applications
• Vectors in the new space are manipulated only implicitly, through pairwise dot products, computed by evaluating the kernel function on the original pair of vectors
Example: support vector machines. Also, many linear models for regression and classification can be reformulated in terms of a dual representation involving only dot products.
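As a small illustration of the trick (my own example, assuming NumPy): the squared norm of the mean of the mapped points, $\|\frac{1}{n}\sum_i \phi(x_i)\|^2 = \frac{1}{n^2}\sum_i\sum_j \kappa(x_i, x_j)$, involves only dot products, so it can be computed from kernel evaluations alone, without ever forming $\phi(x_i)$.

```python
import numpy as np

def mean_norm_sq_in_feature_space(X, kappa):
    """||(1/n) sum_i phi(x_i)||^2 computed purely from kernel evaluations."""
    n = len(X)
    G = np.array([[kappa(xi, xj) for xj in X] for xi in X])
    return G.sum() / n**2

# With the kernel (x1 . x2)^2 this matches the explicit feature-space computation
phi = lambda x: np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])
X = np.random.default_rng(4).normal(size=(10, 2))
print(mean_norm_sq_in_feature_space(X, lambda a, b: float(a @ b) ** 2))
print(float(np.sum(np.mean([phi(x) for x in X], axis=0) ** 2)))   # same value
```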
Dual representations: example
Regularized sum of squares in regression with a predefined basis function $\phi(x)$:
$$J(w) = \frac{1}{2} \sum_{i=1}^n \left( w^T \phi(x_i) - t_i \right)^2 + \frac{\lambda}{2} w^T w = \frac{1}{2} (\Phi w - t)^T (\Phi w - t) + \frac{\lambda}{2} w^T w$$
where by definition of $\Phi \in \mathbb{R}^{n \times d}$ it is $\Phi_{ij} = \varphi_j(x_i)$.
Setting $\frac{\partial J(w)}{\partial w} = 0$, the resulting solution is
$$\hat{w} = (\Phi^T \Phi + \lambda I_d)^{-1} \Phi^T t = \Phi^T (\Phi \Phi^T + \lambda I_n)^{-1} t$$
since it is possible to prove that for any matrix $A \in \mathbb{R}^{r \times c}$ it is
$$(A^T A + \lambda I_c)^{-1} A^T = A^T (A A^T + \lambda I_r)^{-1}$$
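A quick numerical check of the last identity, i.e. of the two equivalent expressions for $\hat{w}$ (assuming NumPy; sizes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 4, 0.1
Phi = rng.normal(size=(n, d))            # design matrix, Phi_ij = phi_j(x_i)
t = rng.normal(size=n)

w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)   # d x d system
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), t)     # n x n system

print(np.allclose(w_primal, w_dual))     # the two expressions coincide
```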
Dual representations: example
If we define the dual variables $a = (\Phi \Phi^T + \lambda I_n)^{-1} t$, we get $\hat{w} = \Phi^T a$. By substituting $\Phi^T a$ for $w$ we express the cost function in terms of $a$ instead of $w$, introducing a dual formulation of $J$:
$$J(a) = \frac{1}{2} a^T \Phi \Phi^T \Phi \Phi^T a + \frac{1}{2} t^T t - a^T \Phi \Phi^T t + \frac{\lambda}{2} a^T \Phi \Phi^T a = \frac{1}{2} a^T G G a + \frac{1}{2} t^T t - a^T G t + \frac{\lambda}{2} a^T G a$$
where $G = \Phi \Phi^T$ is the Gram matrix, such that by definition
$$G_{ij} = \sum_{k=1}^d \varphi_k(x_i) \varphi_k(x_j) = \phi(x_i)^T \phi(x_j)$$
Dual representations: example
Setting the gradient $\frac{\partial J(a)}{\partial a} = 0$, it results
$$\hat{a} = (G + \lambda I_n)^{-1} t$$
We can use this to make predictions in a different way:
$$y(x) = w^T \phi(x) = a^T \Phi \phi(x) = t^T (G + \lambda I_n)^{-1} \Phi \phi(x) = k(x)^T (G + \lambda I_n)^{-1} t$$
where
$$k(x) = \Phi \phi(x) = (\phi(x_1)^T \phi(x), \dots, \phi(x_n)^T \phi(x))^T = (\kappa(x_1, x), \dots, \kappa(x_n, x))^T = (\kappa_1(x), \dots, \kappa_n(x))^T$$
The prediction can be done in terms of dot products between different pairs of $\phi(x)$, or in terms of the kernel function $\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$.
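A minimal sketch of this dual (kernel) prediction rule in Python, assuming NumPy; the function and variable names are hypothetical, and the linear kernel plus synthetic data are used only to make the snippet self-contained.

```python
import numpy as np

def kernel_ridge_predict(X_train, t, X_new, kappa, lam):
    """y(x) = k(x)^T (G + lam * I_n)^{-1} t, using only kernel evaluations."""
    n = len(X_train)
    G = np.array([[kappa(xi, xj) for xj in X_train] for xi in X_train])
    a = np.linalg.solve(G + lam * np.eye(n), t)                 # dual variables a
    K_new = np.array([[kappa(xi, x) for xi in X_train] for x in X_new])
    return K_new @ a                                            # one prediction per new point

rng = np.random.default_rng(3)
X_train = rng.normal(size=(30, 2))
t = X_train @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=30)
X_new = rng.normal(size=(5, 2))
linear = lambda a, b: float(a @ b)                              # linear kernel x1^T x2
print(kernel_ridge_predict(X_train, t, X_new, linear, lam=0.5))
```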
Dual representations: another example
• As is well known, a perceptron is a linear classifier with prediction $y(x) = w^T x$
• Its update rule is: if $x_i$ is misclassified, that is $w^T x_i t_i < 0$, then $w := w + t_i x_i$
• If we assume a zero initial value for all $w_k$, then $w$ is the sum of all items that have been considered as misclassified by the algorithm, each multiplied by its target and weighted by the number of times it has been misclassified
• We may then define a dual formulation by setting $w = \sum_{k=1}^n a_k t_k x_k$, which results in the prediction $y(x) = \sum_{k=1}^n a_k t_k x_k^T x$
• and update rule: if $x_i$ is misclassified, that is $t_i \sum_{k=1}^n a_k t_k x_k^T x_i < 0$, then $a_i := a_i + 1$
• a kernelized perceptron can be defined with $y(x) = \sum_{k=1}^n a_k t_k \phi(x_k)^T \phi(x)$ or with $y(x) = \sum_{k=1}^n a_k t_k \kappa(x_k, x)$, by just using a positive definite kernel $\kappa$ (a sketch follows below)
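A minimal sketch of the kernelized perceptron described above, assuming NumPy and labels in {−1, +1}; the function names, the epoch limit, and the choice of treating boundary points as misclassified are my own.

```python
import numpy as np

def kernel_perceptron_fit(X, t, kappa, epochs=10):
    """Dual perceptron: a[k] counts how often x_k was misclassified during training."""
    n = len(X)
    t = np.asarray(t, dtype=float)
    a = np.zeros(n)
    G = np.array([[kappa(xi, xj) for xj in X] for xi in X])   # precomputed Gram matrix
    for _ in range(epochs):
        for i in range(n):
            y_i = np.sum(a * t * G[:, i])        # y(x_i) = sum_k a_k t_k kappa(x_k, x_i)
            if t[i] * y_i <= 0:                   # misclassified (or on the boundary)
                a[i] += 1
    return a

def kernel_perceptron_predict(X_train, t, a, x, kappa):
    k = np.array([kappa(xk, x) for xk in X_train])
    return np.sign(np.sum(a * np.asarray(t, dtype=float) * k))   # sum_k a_k t_k kappa(x_k, x)
```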
Kernelization: one more example
• The k-nn classifier selects the label of the nearest neighbor: assume the Euclidean distance is considered
$$||x_i - x_j||^2 = x_i^T x_i + x_j^T x_j - 2 x_i^T x_j$$
• We can now replace the dot products by a valid positive definite kernel and we obtain:
$$d(x_i, x_j)^2 = \kappa(x_i, x_i) + \kappa(x_j, x_j) - 2 \kappa(x_i, x_j)$$
• This is a kernelized nearest-neighbor classifier: we do not explicitly compute vectors (see the sketch after this slide)
Why refer to the dual representation?
• While in the original formulation of linear regression $w$ can be derived by inverting the $d \times d$ matrix $\Phi^T \Phi$, in the dual formulation computing $a$ requires inverting the $n \times n$ matrix $G + \lambda I_n$.
• Since usually $n \gg d$, this seems to lead to a loss of efficiency.
• However, the dual approach makes it possible to refer only to the kernel function $\kappa$, and not to the set of basis functions: this makes it possible to implicitly use feature spaces of very high dimension (much larger than $n$, even infinite).
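A minimal sketch of the kernelized nearest-neighbour classifier referred to above (assuming NumPy; function names are my own):

```python
import numpy as np

def kernel_distance_sq(x1, x2, kappa):
    """Squared distance in feature space, d(x1, x2)^2, from kernel evaluations only."""
    return kappa(x1, x1) + kappa(x2, x2) - 2.0 * kappa(x1, x2)

def kernel_1nn_predict(X_train, t, x, kappa):
    """Label of the training item nearest to x, with distances measured in feature space."""
    d2 = [kernel_distance_sq(x, xi, kappa) for xi in X_train]
    return t[int(np.argmin(d2))]
```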
Dealing with kernels
Since not all functions $\kappa: \chi \times \chi \to \mathbb{R}$ are positive definite kernels, some method to define them must be applied.
• The straightforward way is just to define a basis function $\phi$ and define $\kappa(x_1, x_2) = \phi(x_1)^T \phi(x_2)$. $\kappa$ is a positive definite kernel since
  1. $\phi(x_1)^T \phi(x_2) = \phi(x_2)^T \phi(x_1)$
  2. $\sum_{i=1}^n \sum_{j=1}^n c_i c_j \kappa(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n c_i c_j \phi(x_i)^T \phi(x_j) = \left\| \sum_{i=1}^n c_i \phi(x_i) \right\|^2 \geq 0$
Dealing with kernels
• a second method defines a possible kernel function $\kappa$ directly: in order to ensure that such a function is a valid kernel, apply Mercer's theorem and prove that $\kappa$ is a positive definite kernel by showing it is symmetric and the corresponding Gram matrix $G$ is positive semidefinite for all possible sets of items. In this case we do not define $\phi$.
A simple positive definite kernel
Let $\chi = \mathbb{R}$: the function $\kappa: \mathbb{R}^2 \to \mathbb{R}$ defined as $\kappa(x_1, x_2) = x_1 x_2$ is a positive definite kernel. In fact,
• $x_1 x_2 = x_2 x_1$
• $\sum_{i=1}^n \sum_{j=1}^n c_i c_j \kappa(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n c_i c_j x_i x_j = \left( \sum_{i=1}^n c_i x_i \right)^2 \geq 0$
Another simple positive definite kernel
Let $\chi = \mathbb{R}^d$: the function $\kappa: \chi^2 \to \mathbb{R}$ defined as $\kappa(x_1, x_2) = x_1^T x_2$ is a positive definite kernel. In fact,
• $x_1^T x_2 = x_2^T x_1$
• $\sum_{i=1}^n \sum_{j=1}^n c_i c_j \kappa(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n c_i c_j x_i^T x_j = \left\| \sum_{i=1}^n c_i x_i \right\|^2 \geq 0$
Dealing with kernels
• a third method again defines a possible kernel function $\kappa$ directly: in order to ensure that such a function is a valid kernel, a basis function $\phi$ must be found such that $\kappa(x_1, x_2) = \phi(x_1)^T \phi(x_2)$ for all $x_1, x_2$