Kernel Methods for Network Analysis: An Introduction
Chiranjib Bhattacharyya, Machine Learning Lab, Dept. of CSA, IISc
chiru@csa.iisc.ernet.in | http://drona.csa.iisc.ernet.in/~chiru
13th Jan, 2013
Computational Biology: Which super-family does this protein structure belong to?
Multimedia: Who are the actors?
Social Networks: How can one run a successful ad campaign on this network?
Data Representation as a vector
Data Representation as a vector
[Figure: a feature map takes each raw object to its vector representation]
When we have feature maps: Linear Classifiers, Principal Component Analysis.
Problem: Feature maps are not readily available, but a similarity may be readily available.
Kernel functions: a formal notion of similarity. Kernel functions are essentially similarity functions. One can easily generalize many existing algorithms using kernel functions; this is sometimes called the kernel trick. Kernels can also help integrate different sources of data.
Agenda
1 Kernel Trick: SVMs and non-linear classification; Principal Component Analysis; what can we compute with the dot product in feature spaces?
2 Mathematical Foundations: RKHS, Representer theorem
3 Kernels on Graphs (aka Networks): kernels on vertices of a graph; kernels on graphs
4 Advanced Topics: Multiple Kernel Learning
PART 1: KERNEL TRICK
The problem of classification
Given: training data D = {(x_i, y_i) | i = 1, ..., m}, with observation x_i and class label y_i ∈ {−1, 1}.
Find: a classifier f : X → {−1, 1}, of the form f(x) = sign(w⊤x + b).
Regularized risk

min_{w,b}  (1/2)‖w‖² + C ∑_{i=1}^m max(1 − y_i(w⊤x_i + b), 0)

where the first term is the regularization and the second is the (hinge-loss) risk.

The SVM formulation

min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
subject to  y_i(w⊤x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0  ∀ i ∈ [m]
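A minimal numpy sketch (all names are illustrative, not from the slides) that evaluates this regularized hinge-loss objective for a given (w, b):

import numpy as np

def regularized_risk(w, b, X, y, C):
    # hinge loss: max(1 - y_i (w^T x_i + b), 0) for each example
    margins = y * (X @ w + b)
    hinge = np.maximum(1.0 - margins, 0.0)
    # (1/2)||w||^2 regularization plus C times the total hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()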
SVM dual formulation

maximize_α  ∑_{i=1}^m α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j x_i⊤x_j
subject to  0 ≤ α_i,  ∑_{i=1}^m α_i y_i = 0

The weight vector and the classifier are recovered from the dual variables:

w = ∑_{i=1}^m α_i y_i x_i,    f(x) = sign(∑_{i=1}^m α_i y_i x_i⊤x + b)
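As a sanity check, one can fit a linear SVM with scikit-learn and confirm that the primal weight vector equals the dual combination ∑_i α_i y_i x_i; this is a rough sketch on synthetic data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True: w = sum_i alpha_i y_i x_i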
C-SVM in feature spaces

Let us work with a feature map Φ(x), and denote by K(x, z) = Φ(x)⊤Φ(z) the dot product between any pair of examples computed in the feature space. The dual and the classifier depend on Φ only through K:

maximize_α  ∑_{i=1}^m α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j K(x_i, x_j)
subject to  0 ≤ α_i,  ∑_i α_i y_i = 0

f(x) = sign(∑_{i=1}^m α_i y_i K(x_i, x) + b)
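In practice the kernel trick amounts to handing the learner a Gram matrix instead of feature vectors. A small sketch using scikit-learn's precomputed-kernel interface (the Gaussian kernel and the synthetic data are just placeholder choices):

import numpy as np
from sklearn.svm import SVC

def gram_rbf(A, B, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.RandomState(1)
X_train = rng.randn(40, 3)
y_train = np.sign(X_train[:, 0] * X_train[:, 1])   # a non-linear labelling rule
X_test = rng.randn(5, 3)

K_train = gram_rbf(X_train, X_train)               # m x m Gram matrix
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)

K_test = gram_rbf(X_test, X_train)                 # kernel between test and train points
print(clf.predict(K_test))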
Principal Component Analysis (PCA)

Principal directions: given X = [x_1, ..., x_m], find the directions of maximum variance (Jolliffe 2002). The direction of maximum variance, v, satisfies

(1/m) XX⊤ v = λ v    (assuming centered data, Xe = 0)

Writing v = Xα gives (1/m) XX⊤Xα = λXα, leading to the eigenvalue problem

(1/m) K α = λ α,   where K_ij = (X⊤X)_ij = x_i⊤x_j.
Nonlinear component analysis (Schölkopf et al. 1996)

Compute PCA in feature space: replace x_i⊤x_j by Φ(x_i)⊤Φ(x_j).

Principal component of x:
  in input space: v⊤x
  in feature space: ∑_{i=1}^m α_i K(x_i, x)
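A compact numpy sketch of kernel PCA along these lines; centering of the kernel matrix is included since the derivation above assumed centered data, and all names are illustrative:

import numpy as np

def kernel_pca(K, n_components=2):
    # K is the (m, m) Gram matrix, K[i, j] = k(x_i, x_j).
    # Returns the (m, n_components) projections sum_i alpha_i K(x_i, x_j).
    m = K.shape[0]
    # centre the kernel matrix in feature space: K_c = (I - 1/m) K (I - 1/m)
    one = np.ones((m, m)) / m
    K_c = K - one @ K - K @ one + one @ K @ one
    # eigenvectors of K_c are the alpha's of the eigenvalue problem above
    eigvals, eigvecs = np.linalg.eigh(K_c)
    order = np.argsort(eigvals)[::-1][:n_components]
    alphas, lambdas = eigvecs[:, order], eigvals[order]
    # normalise so that the feature-space directions v = sum_i alpha_i Phi(x_i) have unit norm
    alphas = alphas / np.sqrt(np.maximum(lambdas, 1e-12))
    return K_c @ alphas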
We just need the dot product

Let x ∈ IR² and Φ(x) = [x_1², x_2², √2 x_1x_2]⊤. Then

K(x, z) = Φ(x)⊤Φ(z) = x_1²z_1² + 2x_1x_2z_1z_2 + x_2²z_2² = (x⊤z)²

More generally, K(x, z) = (x⊤z)^r is a dot product in a feature space of dimension C(d+r−1, r) for x, z ∈ IR^d. If d = 256 and r = 4, the feature space size is 635,376. However, if we know K, one can still solve the SVM formulation without explicitly evaluating Φ.
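A quick numerical check of the two-dimensional example (a throwaway sketch):

import numpy as np

def phi(x):
    # explicit degree-2 feature map for x in R^2
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))      # 1.0
print((x @ z) ** 2)         # 1.0, the same value without ever building phi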
What can we compute with the dot product in feature spaces?
Norms, Distances

Norm: ‖Φ(x)‖ = √⟨Φ(x), Φ(x)⟩ = √K(x, x)

Normalized features:
  Φ̂(x) = Φ(x)/‖Φ(x)‖,   K̂(x, z) = Φ̂(x)⊤Φ̂(z) = K(x, z)/√(K(x, x) K(z, z))

Distances:
  ‖Φ(x) − Φ(z)‖² = (Φ(x) − Φ(z))⊤(Φ(x) − Φ(z)) = K(x, x) + K(z, z) − 2K(x, z)

If Φ is normalized, K(x, x) = 1, and then ‖Φ(x) − Φ(z)‖² = 2 − 2K(x, z).
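All of these quantities need only the Gram matrix. A brief sketch, where the polynomial kernel is just an example choice:

import numpy as np

X = np.random.RandomState(2).randn(6, 4)
K = (X @ X.T + 1.0) ** 2                      # some kernel matrix, e.g. polynomial

d = np.sqrt(np.diag(K))                        # ||Phi(x_i)|| = sqrt(K(x_i, x_i))
K_hat = K / np.outer(d, d)                     # normalized kernel, diag(K_hat) == 1
D2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K   # squared feature-space distances
print(np.allclose(np.diag(K_hat), 1.0), (D2 >= -1e-9).all())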
In the sequel we will formalize these notions, discuss the conditions on K, and construct kernels K for graphs.
PART 2: MATHEMATICAL FOUNDATIONS
Definition of Kernel functions
Kernel function

K : X × X → IR is a kernel function if
  K(x, z) = K(z, x) (symmetric), and
  K is positive semidefinite, i.e. for all n and all x_1, ..., x_n ∈ X, the matrix K_ij = K(x_i, x_j) is psd.

Recall that a matrix K ∈ IR^{n×n} is psd if u⊤Ku ≥ 0 for all u ∈ IR^n.
Examples of Kernel functions

K(x, z) = φ(x)⊤φ(z), where φ : X → IR^d, is a kernel function:
  K is symmetric, i.e. K(x, z) = K(z, x).
  Positive semidefinite: let D = {x_1, x_2, ..., x_n} be an arbitrarily chosen set of n elements of X and define K_ij = φ(x_i)⊤φ(x_j). For any u ∈ IR^n it is straightforward to see that

u⊤Ku = ‖∑_{i=1}^n u_i φ(x_i)‖² ≥ 0
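One can also see this numerically: any Gram matrix built from an explicit feature map has non-negative eigenvalues. A small illustrative sketch:

import numpy as np

rng = np.random.RandomState(3)
Phi = rng.randn(10, 5)                 # rows are phi(x_1), ..., phi(x_10)
K = Phi @ Phi.T                        # K_ij = phi(x_i)^T phi(x_j)

u = rng.randn(10)
print(u @ K @ u >= -1e-12)             # u^T K u = ||sum_i u_i phi(x_i)||^2 >= 0
print((np.linalg.eigvalsh(K) >= -1e-9).all())   # all eigenvalues are non-negative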
Examples of Kernel functions

Linear: K(x, z) = x⊤z, with Φ(x) = x.

Polynomial: K(x, z) = (x⊤z)^r, with features indexed by multi-indices (t_1, ..., t_d), ∑_{i=1}^d t_i = r:

  Φ_{t_1 t_2 ... t_d}(x) = √(r! / (t_1! t_2! ... t_d!)) · x_1^{t_1} x_2^{t_2} ... x_d^{t_d}

Gaussian: K(x, z) = e^{−γ‖x−z‖²}.
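For concreteness, the three kernels as small numpy functions (a sketch, not tied to any particular library API):

import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, r=3):
    return (x @ z) ** r

def gaussian_kernel(x, z, gamma=0.5):
    diff = x - z
    return np.exp(-gamma * diff @ diff)

x, z = np.array([1.0, 0.5]), np.array([-0.5, 2.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))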
Kernel Construction

Let K_1 and K_2 be two valid kernels. The following are also valid kernels:
  K(x, y) = φ(x)⊤φ(y)
  K(u, v) = K_1(u, v) K_2(u, v)
  K = αK_1 + βK_2, with α, β ≥ 0
  K̂(x, y) = K(x, y) / √(K(x, x) K(y, y))
Kernel Construction

These rules let us build the Gaussian kernel from scratch:
  K(x, y) = x⊤y is a kernel (explicit feature map φ(x) = x).
  K(x, y) = (x⊤y)^i is a kernel (product of kernels).
  αK_1 + βK_2 with α, β ≥ 0 is a kernel, hence so is any finite sum of the terms above.
  Therefore K(x, y) = lim_{N→∞} ∑_{i=0}^N (x⊤y)^i / i! = e^{x⊤y} is a kernel.
  Normalizing, K̂(x, y) = K(x, y)/√(K(x, x)K(y, y)) = e^{−(1/2)‖x−y‖²} is a kernel.
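A quick numerical check that normalizing e^{x⊤y} indeed yields the Gaussian kernel with γ = 1/2 (the input values are arbitrary, purely for illustration):

import numpy as np

x = np.array([0.3, -1.2])
y = np.array([1.0, 0.4])

K = lambda a, b: np.exp(a @ b)                        # e^{x^T y}, a valid kernel
K_hat = K(x, y) / np.sqrt(K(x, x) * K(y, y))          # normalized kernel
print(np.isclose(K_hat, np.exp(-0.5 * np.sum((x - y) ** 2))))   # True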
Kernel function and feature map

A theorem due to Mercer guarantees a feature map for symmetric, psd kernel functions. Loosely stated: for a symmetric kernel K : X × X → IR, there exists an expansion K(x, z) = Φ(x)⊤Φ(z) iff

∫_X ∫_X g(x) g(z) K(x, z) dx dz ≥ 0   for all square-integrable g.
What is a Dot product (aka Inner Product)?

Let X be a vector space. A dot product ⟨·,·⟩ satisfies:
  Symmetry: ⟨u, v⟩ = ⟨v, u⟩ for u, v ∈ X
  Bilinearity: ⟨αu + βv, w⟩ = α⟨u, w⟩ + β⟨v, w⟩ for u, v, w ∈ X
  Positive (semi)definiteness: ⟨u, u⟩ ≥ 0 for u ∈ X, and ⟨u, u⟩ = 0 iff u = 0

The induced norm is ‖x‖ = √⟨x, x⟩, and ‖x‖ = 0 ⟹ x = 0.
Examples of Dot products

  X = IR^n, ⟨u, v⟩ = u⊤v
  X = IR^n, ⟨u, v⟩ = ∑_{i=1}^n λ_i u_i v_i, with λ_i ≥ 0
  X = L_2(X) = {f : ∫_X f(x)² dx < ∞}, with ⟨f, g⟩ = ∫_X f(x) g(x) dx for f, g ∈ X