Natural Language Processing and Information Retrieval: Kernel Methods
Alessandro Moschitti
Department of Information and Communication Technology, University of Trento
Email: moschitti@dit.unitn.it
Linear Classifier
- The equation of a hyperplane is f(x) = x · w + b = 0, with x, w ∈ ℝⁿ and b ∈ ℝ
- x is the vector representing the example to classify
- w is the gradient of the hyperplane
- The classification function is h(x) = sign(f(x))
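As a minimal sketch (assuming NumPy and made-up weights, not values from the slides), the classification function above can be written as:

```python
import numpy as np

def linear_classifier(x, w, b):
    """Classify x with the hyperplane f(x) = x . w + b (w and b assumed given)."""
    return np.sign(np.dot(x, w) + b)

# toy usage with illustrative weights
w = np.array([1.0, -2.0])
b = 0.5
print(linear_classifier(np.array([3.0, 1.0]), w, b))  # -> 1.0, since 3 - 2 + 0.5 > 0
```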
The main idea of Kernel Functions
- Map the vectors into a space where they are linearly separable: x → φ(x)
- [figure: positive examples (x) and negative examples (o), not linearly separable in the input space, become linearly separable after the mapping φ]
A mapping example
- Given two masses m1 and m2, one of which is constrained, apply a force fa to the mass m1
- Experiments: the features are m1, m2 and fa
- We want to learn a classifier that tells when the mass m1 will get far away from m2
- If we consider Newton's law of gravitation, f(m1, m2, r) = C m1 m2 / r², we need to find when f(m1, m2, r) < fa
A mapping example (2)
- In general, x = (x1, ..., xn) → φ(x) = (φ1(x), ..., φn(x))
- The gravitational law is not linear, so we change space: (fa, m1, m2, r) → (k, x, y, z) = (ln fa, ln m1, ln m2, ln r)
- As ln f(m1, m2, r) = ln C + ln m1 + ln m2 − 2 ln r = c + x + y − 2z
- We need the hyperplane ln fa − ln m1 − ln m2 + 2 ln r − ln C = 0
- In the new coordinates this is the linear decision (1, 1, −2) · (x, y, z) − ln fa + ln C = 0, so we can decide without error whether the mass will get far away or not
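A minimal sketch of this change of space, assuming NumPy; the function names are illustrative and C defaults to Newton's gravitational constant only as an example value:

```python
import numpy as np

def phi(fa, m1, m2, r):
    """Map the raw features to log space, where the gravitational law becomes linear."""
    return np.array([np.log(fa), np.log(m1), np.log(m2), np.log(r)])

def escapes(fa, m1, m2, r, C=6.674e-11):
    """Decide whether m1 gets far away from m2, i.e. f(m1, m2, r) < fa.
    In log space this is the linear test ln C + x + y - 2z - k < 0,
    with (k, x, y, z) = (ln fa, ln m1, ln m2, ln r)."""
    k, x, y, z = phi(fa, m1, m2, r)
    return np.log(C) + x + y - 2 * z - k < 0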
A kernel-based Machine
Perceptron training:

  w0 ← 0; b0 ← 0; k ← 0; R ← max_{1 ≤ i ≤ ℓ} ||xi||
  do
    for i = 1 to ℓ
      if yi (wk · xi + bk) ≤ 0 then
        wk+1 = wk + η yi xi
        bk+1 = bk + η yi R²
        k = k + 1
      endif
    endfor
  while an error is found
  return k, (wk, bk)
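A minimal NumPy sketch of the pseudocode above (variable names are mine; labels are assumed to be in {−1, +1}):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron as in the pseudocode above.
    X: (l, n) array of examples, y: (l,) array of labels in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    l, n = X.shape
    w, b = np.zeros(n), 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):
        errors = False
        for i in range(l):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:   # mistake: update
                w += eta * y[i] * X[i]
                b += eta * y[i] * R ** 2
                errors = True
        if not errors:                              # no mistakes in a full pass: stop
            break
    return w, b
```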
Dual Representation for Classification
- At each step of perceptron training, only a training example is added, with a certain weight: w = Σ_{j=1..ℓ} αj yj xj
- So the classification function becomes h(x) = sgn(w · x + b) = sgn( Σ_{j=1..ℓ} αj yj xj · x + b )
- Note that the data only appears in the scalar product
Dual Representation for Learning
- ... as does the updating rule: if yi ( Σ_{j=1..ℓ} αj yj xj · xi + b ) ≤ 0 then αi = αi + η
- The learning rate η only affects the rescaling of the hyperplane, not the algorithm, so we can fix η = 1
Dual Perceptron algorithm and Kernel functions
- We can rewrite the classification function as
  h(x) = sgn( w_φ · φ(x) + b_φ ) = sgn( Σ_{j=1..ℓ} αj yj φ(xj) · φ(x) + b_φ ) = sgn( Σ_{j=1..ℓ} αj yj k(xj, x) + b_φ )
- as well as the updating rule: if yi ( Σ_{j=1..ℓ} αj yj k(xj, xi) + b_φ ) ≤ 0 then αi = αi + η
- The learning rate does not affect the algorithm, so we set η = 1
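A minimal sketch of the resulting dual (kernel) perceptron, assuming NumPy and labels in {−1, +1}; function names are illustrative, and the bias update mirrors the primal pseudocode shown earlier:

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, max_epochs=100):
    """Dual perceptron: learn the coefficients alpha_j and bias b using only
    kernel evaluations k(x_j, x_i); eta is fixed to 1 as on the slide."""
    y = np.asarray(y, dtype=float)
    l = len(X)
    alpha, b = np.zeros(l), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])
    R2 = np.max(np.diag(K))          # R^2 = max_i ||phi(x_i)||^2, as in the primal algorithm
    for _ in range(max_epochs):
        errors = False
        for i in range(l):
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += 1.0      # eta = 1
                b += y[i] * R2       # bias update mirroring the primal pseudocode
                errors = True
        if not errors:
            break
    return alpha, b

def kernel_perceptron_predict(x, X, y, alpha, b, kernel):
    """h(x) = sgn( sum_j alpha_j y_j k(x_j, x) + b )."""
    score = sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y, X)) + b
    return np.sign(score)
```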
Dual optimization problem of SVMs
Kernels in Support Vector Machines
- In Soft Margin SVMs we maximize:
- By using kernel functions we rewrite the problem as:
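The two objectives on this slide did not survive the conversion to text; for reference, the standard soft-margin dual, which the slide presumably showed, is

```latex
\max_{\alpha}\;\; \sum_{i=1}^{m}\alpha_i
  \;-\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \,\mathbf{x}_i \cdot \mathbf{x}_j
\qquad \text{s.t. } 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{m} y_i \alpha_i = 0
```

and the kernelized version simply replaces the dot product with a kernel evaluation:

```latex
\max_{\alpha}\;\; \sum_{i=1}^{m}\alpha_i
  \;-\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \, k(\mathbf{x}_i, \mathbf{x}_j)
\qquad \text{s.t. } 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{m} y_i \alpha_i = 0
```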
Kernel Function Definition
- Kernels are the scalar product of mapping functions, k(x, z) = φ(x) · φ(z), where φ(x) = (φ1(x), φ2(x), ..., φm(x)), x ∈ ℝⁿ, φ(x) ∈ ℝᵐ
The Kernel Gram Matrix
- With KM-based learning, the sole information used from the training data set is the Kernel Gram Matrix:

    K_training = | k(x1, x1)  k(x1, x2)  ...  k(x1, xm) |
                 | k(x2, x1)  k(x2, x2)  ...  k(x2, xm) |
                 |    ...        ...     ...     ...    |
                 | k(xm, x1)  k(xm, x2)  ...  k(xm, xm) |

- If the kernel is valid, K is symmetric and positive semi-definite
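A minimal sketch, assuming NumPy, of building the Gram matrix and numerically checking the symmetry and positive semi-definiteness claims above:

```python
import numpy as np

def gram_matrix(X, kernel):
    """K_training[i, j] = k(x_i, x_j) over the whole training set."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

# sanity check with a linear kernel on random data
X = np.random.randn(5, 3)
K = gram_matrix(X, lambda a, b: np.dot(a, b))
assert np.allclose(K, K.T)                       # symmetric
assert np.min(np.linalg.eigvalsh(K)) >= -1e-10   # positive semi-definite (up to numerics)
```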
Valid Kernels
Valid Kernels (cont'd)
- If the Gram matrix is positive semi-definite, then we can find a mapping φ implementing the kernel function
Mercer's Theorem (finite space)
- Let us consider K = ( k(xi, xj) )_{i,j=1..n}
- K symmetric ⇒ ∃ V: K = V Λ V′ (Takagi factorization of a complex-symmetric matrix), where:
  - Λ is the diagonal matrix of the eigenvalues λt of K
  - vt = (vti)_{i=1..n} are the eigenvectors, i.e. the columns of V
- Let us assume the eigenvalues are non-negative; then we can define the mapping Φ: xi ↦ ( √λt vti )_{t=1..n}, for i = 1, .., n
Mercer's Theorem (sufficient conditions)
- Therefore Φ(xi) · Φ(xj) = Σ_{t=1..n} λt vti vtj = (V Λ V′)_ij = K_ij = k(xi, xj)
- which implies that K is a kernel function
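A small numerical check of this finite-space argument, assuming NumPy: build a Gram matrix, eigendecompose it, and recover a feature map Φ whose dot products reproduce K:

```python
import numpy as np

X = np.random.randn(4, 3)
K = X @ X.T                                  # symmetric, PSD by construction
lam, V = np.linalg.eigh(K)                   # K = V diag(lam) V'
lam = np.clip(lam, 0.0, None)                # clip tiny negative numerical noise
Phi = V * np.sqrt(lam)                       # Phi[i, t] = sqrt(lam_t) * v_ti
assert np.allclose(Phi @ Phi.T, K)           # Phi(x_i) . Phi(x_j) = K_ij
```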
Mercer's Theorem (necessary conditions)
- Suppose K has a negative eigenvalue λs with eigenvector vs, and consider the point z = Σ_{i=1..n} vsi Φ(xi) = √Λ V′ vs
- Its squared norm is ||z||² = z · z = vs′ V √Λ √Λ V′ vs = vs′ K vs = λs vs′ vs = λs ||vs||² < 0
- This contradicts the geometry of the space
Is it a valid kernel?
- A hand-built similarity matrix M may not be a valid kernel, so we can use M′ · M instead, which is always a valid Gram matrix
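A minimal sketch, assuming NumPy, of why M′M helps: an arbitrary symmetric similarity matrix can have negative eigenvalues, while M′M never does:

```python
import numpy as np

M = np.array([[1.0, 0.9],
              [0.9, 0.2]])          # an arbitrary symmetric "similarity" matrix
print(np.linalg.eigvalsh(M))        # has a negative eigenvalue -> not a valid Gram matrix
K = M.T @ M                         # M'M is always symmetric positive semi-definite
print(np.linalg.eigvalsh(K))        # all eigenvalues >= 0
```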
Valid Kernel operations (see the sketch below)
- k(x, z) = k1(x, z) + k2(x, z)
- k(x, z) = k1(x, z) * k2(x, z)
- k(x, z) = α k1(x, z), with α ≥ 0
- k(x, z) = f(x) f(z)
- k(x, z) = k1(φ(x), φ(z))
- k(x, z) = x′ B z, with B symmetric positive semi-definite
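A minimal sketch (plain Python closures; names are mine) of how such combinations can be built from existing kernel functions:

```python
import numpy as np

def k_sum(k1, k2):
    """Sum of two valid kernels is a valid kernel."""
    return lambda x, z: k1(x, z) + k2(x, z)

def k_prod(k1, k2):
    """Product of two valid kernels is a valid kernel."""
    return lambda x, z: k1(x, z) * k2(x, z)

def k_scale(alpha, k1):
    """Non-negative scaling of a valid kernel (assumes alpha >= 0)."""
    return lambda x, z: alpha * k1(x, z)

# example: a degree-2 polynomial-style kernel built from a linear one
linear = lambda x, z: float(np.dot(x, z))
one = lambda x, z: 1.0                       # valid: f(x)f(z) with f = 1
poly2 = k_prod(k_sum(linear, one), k_sum(linear, one))   # (x . z + 1)^2
```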
Basic Kernels for unstructured data
- Linear kernel
- Polynomial kernel
- Lexical kernel
- String kernel
Linear Kernel
- In Text Categorization documents are word vectors, with a 1 in the position of each word they contain:
  φ(dx) = x = (0, .., 1, .., 0, .., 1, .., 0, ..)   with 1s for buy, acquisition, stocks, sell, market
  φ(dz) = z = (0, .., 1, .., 0, .., 1, .., 0, ..)   with 1s for buy, company, stocks, sell
- The dot product x · z counts the number of features in common
- This provides a sort of similarity
Feature Conjunction (polynomial Kernel)
- The initial vectors are mapped into a higher-dimensional space:
  Φ(<x1, x2>) → (x1², x2², √2 x1 x2, √2 x1, √2 x2, 1)
- More expressive, as the conjunction feature x1 x2 encodes e.g. Stock+Market vs. Downtown+Market
- We can smartly compute the scalar product (see the sketch below):
  Φ(x) · Φ(z) = (x1², x2², √2 x1 x2, √2 x1, √2 x2, 1) · (z1², z2², √2 z1 z2, √2 z1, √2 z2, 1)
              = x1² z1² + x2² z2² + 2 x1 x2 z1 z2 + 2 x1 z1 + 2 x2 z2 + 1
              = (x1 z1 + x2 z2 + 1)² = (x · z + 1)² = K_Poly(x, z)
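A minimal sketch, assuming NumPy, verifying the identity above on a toy pair of vectors:

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 feature map for 2-dimensional input, as on the slide."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def k_poly(x, z):
    """Polynomial kernel of degree 2: (x . z + 1)^2."""
    return (np.dot(x, z) + 1.0) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi2(x), phi2(z)))   # 4.0, via the explicit 6-dimensional mapping
print(k_poly(x, z))               # 4.0, computed directly in the original 2-d space
```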
Document Similarity
- [figure: two documents, Doc 1 and Doc 2, linked through shared words such as industry, company, telephone, product, market]
Lexical Semantic Kernel [CoNLL 2005]
- The document similarity is the SK function: SK(d1, d2) = Σ_{w1 ∈ d1, w2 ∈ d2} s(w1, w2)
- where s is any similarity function between words, e.g. WordNet similarity [Basili et al., 2005] or LSA [Cristianini et al., 2002]
- Good results when training data is small
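A minimal sketch of the SK function, with a toy, hypothetical word-similarity standing in for the WordNet/LSA similarities cited above:

```python
def lexical_semantic_kernel(d1, d2, sim):
    """SK(d1, d2) = sum over word pairs (w1, w2) of s(w1, w2).
    d1, d2: lists of words; sim: any word-similarity function."""
    return sum(sim(w1, w2) for w1 in d1 for w2 in d2)

# toy similarity (hypothetical): 1.0 for identical words, 0.5 for a related pair, else 0
RELATED = {("company", "industry"), ("industry", "company")}
def toy_sim(w1, w2):
    return 1.0 if w1 == w2 else (0.5 if (w1, w2) in RELATED else 0.0)

print(lexical_semantic_kernel(["market", "company"], ["industry", "market"], toy_sim))  # 1.5
```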
Using character sequences
- φ("bank") = x = (0, .., 1, .., 0, .., 1, .., 0, ..)   with 1s for the substrings bank, ank, bnk, bk, b, ...
- φ("rank") = z = (1, .., 0, .., 0, .., 1, .., 0, ..)   with 1s for the substrings rank, ank, rnk, rk, r, ...
- x · z counts the number of common substrings: x · z = φ("bank") · φ("rank") = k("bank", "rank")
String Kernel
- Given two strings, the number of matches between their substrings is evaluated
- E.g. for "Bank" and "Rank":
  - B, a, n, k, Ba, Ban, Bank, Bk, an, ank, nk, ..
  - R, a, n, k, Ra, Ran, Rank, Rk, an, ank, nk, ..
- String kernels can be applied to whole sentences and texts
- The feature space is huge, but there are efficient algorithms
Formal Definition
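The formula on this slide was lost in the conversion (only a fragment of the span definition survives). For reference, the standard gap-weighted subsequence (string) kernel, which this slide presumably stated, is

```latex
\phi_u(s) = \sum_{\mathbf{i}\,:\,u = s[\mathbf{i}]} \lambda^{\,l(\mathbf{i})},
\qquad
K_p(s,t) = \sum_{u \in \Sigma^{p}} \phi_u(s)\,\phi_u(t)
         = \sum_{u \in \Sigma^{p}} \;\sum_{\mathbf{i}:\,u=s[\mathbf{i}]} \;\sum_{\mathbf{j}:\,u=t[\mathbf{j}]} \lambda^{\,l(\mathbf{i})+l(\mathbf{j})}
```

where i = (i1, ..., ip) ranges over the index sequences that pick out the subsequence u in s, l(i) = ip − i1 + 1 is the span of the match, and 0 < λ ≤ 1 penalizes gaps.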
Kernel between Bank and Rank
An example of string kernel computation
Efficient Evaluation
- Dynamic Programming technique
- Evaluate the spectrum string kernels (substrings of size p)
- Sum the contributions of the different spectra
Efficient Evaluation
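The recursion tables on this slide were lost in the conversion; the following is a sketch of the standard dynamic-programming evaluation of the gap-weighted subsequence kernel (a Lodhi et al. style recursion), which may differ in detail from the exact tables shown on the slides:

```python
def subsequence_kernel(s, t, p, lam=0.5):
    """Gap-weighted subsequence kernel K_p(s, t) computed by dynamic programming."""
    n, m = len(s), len(t)
    # Kp[q][i][j] = K'_q(s[:i], t[:j]); K'_0 = 1 everywhere
    Kp = [[[1.0] * (m + 1) for _ in range(n + 1)]]
    for q in range(1, p):
        Kpp = [[0.0] * (m + 1) for _ in range(n + 1)]    # auxiliary table K''_q
        Kq = [[0.0] * (m + 1) for _ in range(n + 1)]     # K'_q
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if s[i - 1] == t[j - 1]:
                    Kpp[i][j] = lam * (Kpp[i][j - 1] + lam * Kp[q - 1][i - 1][j - 1])
                else:
                    Kpp[i][j] = lam * Kpp[i][j - 1]
                Kq[i][j] = lam * Kq[i - 1][j] + Kpp[i][j]
        Kp.append(Kq)
    # K_p(s, t): sum lambda^2 * K'_{p-1} over positions whose last characters match
    return sum(lam * lam * Kp[p - 1][i - 1][j - 1]
               for i in range(1, n + 1) for j in range(1, m + 1)
               if s[i - 1] == t[j - 1])

print(subsequence_kernel("Gatta", "Cata", p=1, lam=1.0))  # 6.0: the matches a,a,t,t,a,a below
print(subsequence_kernel("bank", "rank", p=2, lam=1.0))   # 3.0: common subsequences an, ak, nk
```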
An example: SK("Gatta", "Cata")
- First, evaluate the SK with size p = 1, i.e. the character matches "a", "a", "t", "t", "a", "a"
- Store this in the table SK_{p=1}
Evaluating DP2
- Evaluate the weight of the string of size p in case a character will be matched
- This is done by multiplying the double summation by the number of substrings of size p−1