Natural Language Processing and Information Retrieval: Kernel Methods
Alessandro Moschitti
Department of Information and Communication Technology, University of Trento
Email: moschitti@dit.unitn.it
Linear Classifier
- The equation of a hyperplane is f(x) = x · w + b = 0, with x, w ∈ ℝⁿ and b ∈ ℝ
- x is the vector representing the example to classify
- w is the gradient of the hyperplane
- The classification function is h(x) = sign(f(x))
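As a minimal sketch (assuming NumPy and made-up weights, not values from the slides), the classification function above can be written as:

```python
import numpy as np

def linear_classifier(x, w, b):
    """Classify x with the hyperplane f(x) = x . w + b (w and b assumed given)."""
    return np.sign(np.dot(x, w) + b)

# toy usage with illustrative weights
w = np.array([1.0, -2.0])
b = 0.5
print(linear_classifier(np.array([3.0, 1.0]), w, b))  # -> 1.0, since 3 - 2 + 0.5 > 0
```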
The main idea of Kernel Functions
- Map the vectors into a space where they are linearly separable: x → φ(x)
- [figure: positive examples (x) and negative examples (o), not linearly separable in the input space, become linearly separable after the mapping φ]
A mapping example
- Given two masses m1 and m2, one of which is constrained, apply a force fa to the mass m1
- Experiments: the features are m1, m2 and fa
- We want to learn a classifier that tells when the mass m1 will get far away from m2
- If we consider Newton's law of gravitation, f(m1, m2, r) = C m1 m2 / r², we need to find when f(m1, m2, r) < fa
A mapping example (2)
- In general, x = (x1, ..., xn) → φ(x) = (φ1(x), ..., φn(x))
- The gravitational law is not linear, so we change space: (fa, m1, m2, r) → (k, x, y, z) = (ln fa, ln m1, ln m2, ln r)
- As ln f(m1, m2, r) = ln C + ln m1 + ln m2 − 2 ln r = c + x + y − 2z
- We need the hyperplane ln fa − ln m1 − ln m2 + 2 ln r − ln C = 0
- In the new coordinates this is the linear decision (1, 1, −2) · (x, y, z) − ln fa + ln C = 0, so we can decide without error whether the mass will get far away or not
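A minimal sketch of this change of space, assuming NumPy; the function names are illustrative and C defaults to Newton's gravitational constant only as an example value:

```python
import numpy as np

def phi(fa, m1, m2, r):
    """Map the raw features to log space, where the gravitational law becomes linear."""
    return np.array([np.log(fa), np.log(m1), np.log(m2), np.log(r)])

def escapes(fa, m1, m2, r, C=6.674e-11):
    """Decide whether m1 gets far away from m2, i.e. f(m1, m2, r) < fa.
    In log space this is the linear test ln C + x + y - 2z - k < 0,
    with (k, x, y, z) = (ln fa, ln m1, ln m2, ln r)."""
    k, x, y, z = phi(fa, m1, m2, r)
    return np.log(C) + x + y - 2 * z - k < 0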
A kernel-based Machine
Perceptron training:

  w0 ← 0; b0 ← 0; k ← 0; R ← max_{1 ≤ i ≤ ℓ} ||xi||
  do
    for i = 1 to ℓ
      if yi (wk · xi + bk) ≤ 0 then
        wk+1 = wk + η yi xi
        bk+1 = bk + η yi R²
        k = k + 1
      endif
    endfor
  while an error is found
  return k, (wk, bk)
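A minimal NumPy sketch of the pseudocode above (variable names are mine; labels are assumed to be in {−1, +1}):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron as in the pseudocode above.
    X: (l, n) array of examples, y: (l,) array of labels in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    l, n = X.shape
    w, b = np.zeros(n), 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):
        errors = False
        for i in range(l):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:   # mistake: update
                w += eta * y[i] * X[i]
                b += eta * y[i] * R ** 2
                errors = True
        if not errors:                              # no mistakes in a full pass: stop
            break
    return w, b
```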
Dual Representation for Classification
- At each step of perceptron training, only a training example is added, with a certain weight: w = Σ_{j=1..ℓ} αj yj xj
- So the classification function becomes h(x) = sgn(w · x + b) = sgn( Σ_{j=1..ℓ} αj yj xj · x + b )
- Note that the data only appears in the scalar product
Dual Representation for Learning
- ... as does the updating rule: if yi ( Σ_{j=1..ℓ} αj yj xj · xi + b ) ≤ 0 then αi = αi + η
- The learning rate η only affects the rescaling of the hyperplane, not the algorithm, so we can fix η = 1
Dual Perceptron algorithm and Kernel functions
- We can rewrite the classification function as
  h(x) = sgn( w_φ · φ(x) + b_φ ) = sgn( Σ_{j=1..ℓ} αj yj φ(xj) · φ(x) + b_φ ) = sgn( Σ_{j=1..ℓ} αj yj k(xj, x) + b_φ )
- as well as the updating rule: if yi ( Σ_{j=1..ℓ} αj yj k(xj, xi) + b_φ ) ≤ 0 then αi = αi + η
- The learning rate does not affect the algorithm, so we set η = 1
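A minimal sketch of the resulting dual (kernel) perceptron, assuming NumPy and labels in {−1, +1}; function names are illustrative, and the bias update mirrors the primal pseudocode shown earlier:

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, max_epochs=100):
    """Dual perceptron: learn the coefficients alpha_j and bias b using only
    kernel evaluations k(x_j, x_i); eta is fixed to 1 as on the slide."""
    y = np.asarray(y, dtype=float)
    l = len(X)
    alpha, b = np.zeros(l), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])
    R2 = np.max(np.diag(K))          # R^2 = max_i ||phi(x_i)||^2, as in the primal algorithm
    for _ in range(max_epochs):
        errors = False
        for i in range(l):
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += 1.0      # eta = 1
                b += y[i] * R2       # bias update mirroring the primal pseudocode
                errors = True
        if not errors:
            break
    return alpha, b

def kernel_perceptron_predict(x, X, y, alpha, b, kernel):
    """h(x) = sgn( sum_j alpha_j y_j k(x_j, x) + b )."""
    score = sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y, X)) + b
    return np.sign(score)
```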
Dual optimization problem of SVMs
Kernels in Support Vector Machines
- In Soft Margin SVMs we maximize:
- By using kernel functions we rewrite the problem as:
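The two objectives on this slide did not survive the conversion to text; for reference, the standard soft-margin dual, which the slide presumably showed, is

```latex
\max_{\alpha}\;\; \sum_{i=1}^{m}\alpha_i
  \;-\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \,\mathbf{x}_i \cdot \mathbf{x}_j
\qquad \text{s.t. } 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{m} y_i \alpha_i = 0
```

and the kernelized version simply replaces the dot product with a kernel evaluation:

```latex
\max_{\alpha}\;\; \sum_{i=1}^{m}\alpha_i
  \;-\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \, k(\mathbf{x}_i, \mathbf{x}_j)
\qquad \text{s.t. } 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{m} y_i \alpha_i = 0
```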
Kernel Function Definition
- Kernels are the scalar product of mapping functions, k(x, z) = φ(x) · φ(z), where φ(x) = (φ1(x), φ2(x), ..., φm(x)), x ∈ ℝⁿ, φ(x) ∈ ℝᵐ
The Kernel Gram Matrix
- With KM-based learning, the sole information used from the training data set is the Kernel Gram Matrix:

    K_training = | k(x1, x1)  k(x1, x2)  ...  k(x1, xm) |
                 | k(x2, x1)  k(x2, x2)  ...  k(x2, xm) |
                 |    ...        ...     ...     ...    |
                 | k(xm, x1)  k(xm, x2)  ...  k(xm, xm) |

- If the kernel is valid, K is symmetric and positive semi-definite
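A minimal sketch, assuming NumPy, of building the Gram matrix and numerically checking the symmetry and positive semi-definiteness claims above:

```python
import numpy as np

def gram_matrix(X, kernel):
    """K_training[i, j] = k(x_i, x_j) over the whole training set."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

# sanity check with a linear kernel on random data
X = np.random.randn(5, 3)
K = gram_matrix(X, lambda a, b: np.dot(a, b))
assert np.allclose(K, K.T)                       # symmetric
assert np.min(np.linalg.eigvalsh(K)) >= -1e-10   # positive semi-definite (up to numerics)
```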
Valid Kernels
Valid Kernels (cont'd)
- If the Gram matrix is positive semi-definite, then we can find a mapping φ implementing the kernel function
Mercer's Theorem (finite space)
- Let us consider K = ( k(xi, xj) )_{i,j=1..n}
- K symmetric ⇒ ∃ V: K = V Λ V′ (Takagi factorization of a complex-symmetric matrix), where:
  - Λ is the diagonal matrix of the eigenvalues λt of K
  - vt = (vti)_{i=1..n} are the eigenvectors, i.e. the columns of V
- Let us assume the eigenvalues are non-negative; then we can define the mapping Φ: xi ↦ ( √λt vti )_{t=1..n}, for i = 1, .., n
Mercer's Theorem (sufficient conditions)
- Therefore Φ(xi) · Φ(xj) = Σ_{t=1..n} λt vti vtj = (V Λ V′)_ij = K_ij = k(xi, xj)
- which implies that K is a kernel function
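A small numerical check of this finite-space argument, assuming NumPy: build a Gram matrix, eigendecompose it, and recover a feature map Φ whose dot products reproduce K:

```python
import numpy as np

X = np.random.randn(4, 3)
K = X @ X.T                                  # symmetric, PSD by construction
lam, V = np.linalg.eigh(K)                   # K = V diag(lam) V'
lam = np.clip(lam, 0.0, None)                # clip tiny negative numerical noise
Phi = V * np.sqrt(lam)                       # Phi[i, t] = sqrt(lam_t) * v_ti
assert np.allclose(Phi @ Phi.T, K)           # Phi(x_i) . Phi(x_j) = K_ij
```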
Mercer's Theorem (necessary conditions)
- Suppose K has a negative eigenvalue λs with eigenvector vs, and consider the point z = Σ_{i=1..n} vsi Φ(xi) = √Λ V′ vs
- Its squared norm is ||z||² = z · z = vs′ V √Λ √Λ V′ vs = vs′ K vs = λs vs′ vs = λs ||vs||² < 0
- This contradicts the geometry of the space
Is it a valid kernel?
- A hand-built similarity matrix M may not be a valid kernel, so we can use M′ · M instead, which is always a valid Gram matrix
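A minimal sketch, assuming NumPy, of why M′M helps: an arbitrary symmetric similarity matrix can have negative eigenvalues, while M′M never does:

```python
import numpy as np

M = np.array([[1.0, 0.9],
              [0.9, 0.2]])          # an arbitrary symmetric "similarity" matrix
print(np.linalg.eigvalsh(M))        # has a negative eigenvalue -> not a valid Gram matrix
K = M.T @ M                         # M'M is always symmetric positive semi-definite
print(np.linalg.eigvalsh(K))        # all eigenvalues >= 0
```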
Valid Kernel operations (see the sketch below)
- k(x, z) = k1(x, z) + k2(x, z)
- k(x, z) = k1(x, z) * k2(x, z)
- k(x, z) = α k1(x, z), with α ≥ 0
- k(x, z) = f(x) f(z)
- k(x, z) = k1(φ(x), φ(z))
- k(x, z) = x′ B z, with B symmetric positive semi-definite
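A minimal sketch (plain Python closures; names are mine) of how such combinations can be built from existing kernel functions:

```python
import numpy as np

def k_sum(k1, k2):
    """Sum of two valid kernels is a valid kernel."""
    return lambda x, z: k1(x, z) + k2(x, z)

def k_prod(k1, k2):
    """Product of two valid kernels is a valid kernel."""
    return lambda x, z: k1(x, z) * k2(x, z)

def k_scale(alpha, k1):
    """Non-negative scaling of a valid kernel (assumes alpha >= 0)."""
    return lambda x, z: alpha * k1(x, z)

# example: a degree-2 polynomial-style kernel built from a linear one
linear = lambda x, z: float(np.dot(x, z))
one = lambda x, z: 1.0                       # valid: f(x)f(z) with f = 1
poly2 = k_prod(k_sum(linear, one), k_sum(linear, one))   # (x . z + 1)^2
```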
Basic Kernels for unstructured data
- Linear kernel
- Polynomial kernel
- Lexical kernel
- String kernel
Linear Kernel
- In Text Categorization documents are word vectors, with a 1 in the position of each word they contain:
  φ(dx) = x = (0, .., 1, .., 0, .., 1, .., 0, ..)   with 1s for buy, acquisition, stocks, sell, market
  φ(dz) = z = (0, .., 1, .., 0, .., 1, .., 0, ..)   with 1s for buy, company, stocks, sell
- The dot product x · z counts the number of features in common
- This provides a sort of similarity
Feature Conjunction (polynomial Kernel)
- The initial vectors are mapped into a higher-dimensional space:
  Φ(<x1, x2>) → (x1², x2², √2 x1 x2, √2 x1, √2 x2, 1)
- More expressive, as the conjunction feature x1 x2 encodes e.g. Stock+Market vs. Downtown+Market
- We can smartly compute the scalar product (see the sketch below):
  Φ(x) · Φ(z) = (x1², x2², √2 x1 x2, √2 x1, √2 x2, 1) · (z1², z2², √2 z1 z2, √2 z1, √2 z2, 1)
              = x1² z1² + x2² z2² + 2 x1 x2 z1 z2 + 2 x1 z1 + 2 x2 z2 + 1
              = (x1 z1 + x2 z2 + 1)² = (x · z + 1)² = K_Poly(x, z)
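A minimal sketch, assuming NumPy, verifying the identity above on a toy pair of vectors:

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 feature map for 2-dimensional input, as on the slide."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def k_poly(x, z):
    """Polynomial kernel of degree 2: (x . z + 1)^2."""
    return (np.dot(x, z) + 1.0) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi2(x), phi2(z)))   # 4.0, via the explicit 6-dimensional mapping
print(k_poly(x, z))               # 4.0, computed directly in the original 2-d space
```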
Document Similarity
- [figure: two documents, Doc 1 and Doc 2, linked through shared words such as industry, company, telephone, product, market]
Lexical Semantic Kernel [CoNLL 2005]
- The document similarity is the SK function: SK(d1, d2) = Σ_{w1 ∈ d1, w2 ∈ d2} s(w1, w2)
- where s is any similarity function between words, e.g. WordNet similarity [Basili et al., 2005] or LSA [Cristianini et al., 2002]
- Good results when training data is small
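A minimal sketch of the SK function, with a toy, hypothetical word-similarity standing in for the WordNet/LSA similarities cited above:

```python
def lexical_semantic_kernel(d1, d2, sim):
    """SK(d1, d2) = sum over word pairs (w1, w2) of s(w1, w2).
    d1, d2: lists of words; sim: any word-similarity function."""
    return sum(sim(w1, w2) for w1 in d1 for w2 in d2)

# toy similarity (hypothetical): 1.0 for identical words, 0.5 for a related pair, else 0
RELATED = {("company", "industry"), ("industry", "company")}
def toy_sim(w1, w2):
    return 1.0 if w1 == w2 else (0.5 if (w1, w2) in RELATED else 0.0)

print(lexical_semantic_kernel(["market", "company"], ["industry", "market"], toy_sim))  # 1.5
```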
Using character sequences
- φ("bank") = x = (0, .., 1, .., 0, .., 1, .., 0, ..)   with 1s for the substrings bank, ank, bnk, bk, b, ...
- φ("rank") = z = (1, .., 0, .., 0, .., 1, .., 0, ..)   with 1s for the substrings rank, ank, rnk, rk, r, ...
- x · z counts the number of common substrings: x · z = φ("bank") · φ("rank") = k("bank", "rank")
String Kernel
- Given two strings, the number of matches between their substrings is evaluated
- E.g. for "Bank" and "Rank":
  - B, a, n, k, Ba, Ban, Bank, Bk, an, ank, nk, ..
  - R, a, n, k, Ra, Ran, Rank, Rk, an, ank, nk, ..
- String kernels can be applied to whole sentences and texts
- The feature space is huge, but there are efficient algorithms
Formal Definition
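The formula on this slide was lost in the conversion (only a fragment of the span definition survives). For reference, the standard gap-weighted subsequence (string) kernel, which this slide presumably stated, is

```latex
\phi_u(s) = \sum_{\mathbf{i}\,:\,u = s[\mathbf{i}]} \lambda^{\,l(\mathbf{i})},
\qquad
K_p(s,t) = \sum_{u \in \Sigma^{p}} \phi_u(s)\,\phi_u(t)
         = \sum_{u \in \Sigma^{p}} \;\sum_{\mathbf{i}:\,u=s[\mathbf{i}]} \;\sum_{\mathbf{j}:\,u=t[\mathbf{j}]} \lambda^{\,l(\mathbf{i})+l(\mathbf{j})}
```

where i = (i1, ..., ip) ranges over the index sequences that pick out the subsequence u in s, l(i) = ip − i1 + 1 is the span of the match, and 0 < λ ≤ 1 penalizes gaps.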
Kernel between Bank and Rank
An example of string kernel computation
Efficient Evaluation
- Dynamic Programming technique
- Evaluate the spectrum string kernels (substrings of size p)
- Sum the contributions of the different spectra
Efficient Evaluation
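The recursion tables on this slide were lost in the conversion; the following is a sketch of the standard dynamic-programming evaluation of the gap-weighted subsequence kernel (a Lodhi et al. style recursion), which may differ in detail from the exact tables shown on the slides:

```python
def subsequence_kernel(s, t, p, lam=0.5):
    """Gap-weighted subsequence kernel K_p(s, t) computed by dynamic programming."""
    n, m = len(s), len(t)
    # Kp[q][i][j] = K'_q(s[:i], t[:j]); K'_0 = 1 everywhere
    Kp = [[[1.0] * (m + 1) for _ in range(n + 1)]]
    for q in range(1, p):
        Kpp = [[0.0] * (m + 1) for _ in range(n + 1)]    # auxiliary table K''_q
        Kq = [[0.0] * (m + 1) for _ in range(n + 1)]     # K'_q
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if s[i - 1] == t[j - 1]:
                    Kpp[i][j] = lam * (Kpp[i][j - 1] + lam * Kp[q - 1][i - 1][j - 1])
                else:
                    Kpp[i][j] = lam * Kpp[i][j - 1]
                Kq[i][j] = lam * Kq[i - 1][j] + Kpp[i][j]
        Kp.append(Kq)
    # K_p(s, t): sum lambda^2 * K'_{p-1} over positions whose last characters match
    return sum(lam * lam * Kp[p - 1][i - 1][j - 1]
               for i in range(1, n + 1) for j in range(1, m + 1)
               if s[i - 1] == t[j - 1])

print(subsequence_kernel("Gatta", "Cata", p=1, lam=1.0))  # 6.0: the matches a,a,t,t,a,a below
print(subsequence_kernel("bank", "rank", p=2, lam=1.0))   # 3.0: common subsequences an, ak, nk
```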
An example: SK("Gatta", "Cata")
- First, evaluate the SK with size p = 1, i.e. the character matches "a", "a", "t", "t", "a", "a"
- Store this in the table SK_{p=1}
Evaluating DP2
- Evaluate the weight of the string of size p in case a character will be matched
- This is done by multiplying the double summation by the number of substrings of size p−1