Latent Wishart Processes for Relational Kernel Learning

Wu-Jun Li
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Hong Kong, China

Joint work with Zhihua Zhang and Dit-Yan Yeung

AISTATS 2009
Contents
1 Introduction
2 Preliminaries
  Gaussian Processes
  Wishart Processes
3 Latent Wishart Processes
  Model Formulation
  Learning
  Out-of-Sample Extension
4 Relation to Existing Work
5 Experiments
6 Conclusion and Future Work
Introduction
Relational Learning

Traditional machine learning models:
- Assumption: i.i.d.
- Advantage: simple

Many real-world applications:
- Relational: instances are related (linked) to each other
- Autocorrelation: statistical dependency between the values of a random variable on related objects (non-i.i.d.)
- E.g., web pages, protein-protein interaction data

Relational learning: an emerging research area attempting to represent, reason, and learn in domains with complex relational structure [Getoor & Taskar, 2007].

Application areas: web mining, social network analysis, bioinformatics, marketing, etc.
Introduction
Relational Kernel Learning

Kernel function:
- To characterize the similarity between data instances: $K(x_i, x_j)$
- E.g., $K(\text{cat}, \text{tiger}) > K(\text{cat}, \text{elephant})$
- Positive semidefiniteness (p.s.d.)

Kernel learning: to learn an appropriate kernel matrix or kernel function for a kernel-based learning method.

Relational kernel learning (RKL): to learn an appropriate kernel matrix or kernel function for relational data by incorporating relational information between instances into the learning process.
Preliminaries: Gaussian Processes
Stochastic Processes and Gaussian Processes

Stochastic processes: a stochastic process (or random process) $y(x)$ is specified by giving the joint distribution for any finite set of instances $\{x_1, \ldots, x_n\}$ in a consistent manner.

Gaussian processes: a Gaussian process is a distribution over functions $y(x)$ such that the values of $y(x)$ evaluated at an arbitrary set of points $\{x_1, \ldots, x_n\}$ jointly have a Gaussian distribution. Assuming $y(x)$ has zero mean, the specification of a Gaussian process is completed by giving the covariance function of $y(x)$ evaluated at any two values of $x$, given by the kernel function $K(\cdot,\cdot)$:
$$\mathbb{E}[y(x_i)\, y(x_j)] = K(x_i, x_j).$$
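A minimal numerical sketch of this definition: function values at a finite set of points are drawn from the joint Gaussian whose covariance matrix is given by the kernel. The RBF kernel and its length-scale below are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def rbf_kernel(X, length_scale=1.0):
    """Squared-exponential kernel: K[i, j] = exp(-||x_i - x_j||^2 / (2 * l^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 50)[:, None]           # 50 input points x_1, ..., x_n
K = rbf_kernel(X)                            # covariance of (y(x_1), ..., y(x_n))
y = rng.multivariate_normal(np.zeros(50), K + 1e-8 * np.eye(50))  # one GP sample path
```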
Preliminaries: Wishart Processes
Wishart Processes

Wishart distribution: an $n \times n$ random symmetric positive definite matrix $A$ is said to have a Wishart distribution with parameters $n$, $q$, and $n \times n$ scale matrix $\Sigma \succ 0$, written as $A \sim \mathcal{W}_n(q, \Sigma)$, if its p.d.f. is given by
$$\frac{|A|^{(q-n-1)/2}}{2^{qn/2}\,\Gamma_n(q/2)\,|\Sigma|^{q/2}} \exp\!\left(-\frac{1}{2}\operatorname{tr}(\Sigma^{-1} A)\right), \quad q \ge n.$$
Here $\Sigma \succ 0$ means that $\Sigma$ is positive definite (p.d.).

Wishart processes: given an input space $\mathcal{X} = \{x_1, x_2, \ldots\}$, the kernel function $\{A(x_i, x_j) \mid x_i, x_j \in \mathcal{X}\}$ is said to be a Wishart process (WP) if for any $n \in \mathbb{N}$ and $\{x_1, \ldots, x_n\} \subseteq \mathcal{X}$, the $n \times n$ random matrix $A = [A(x_i, x_j)]_{i,j=1}^n$ follows a Wishart distribution.
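A short sketch of this distribution using SciPy, which parameterizes the Wishart by degrees of freedom `df` and a scale matrix; the particular $\Sigma$ below is an illustrative choice.

```python
import numpy as np
from scipy.stats import wishart

n, q = 4, 10                                  # matrix size n, degrees of freedom q >= n
Sigma = np.eye(n) + 0.3 * np.ones((n, n))     # a positive definite scale matrix

A = wishart.rvs(df=q, scale=Sigma, random_state=0)   # one draw A ~ W_n(q, Sigma)
logp = wishart.logpdf(A, df=q, scale=Sigma)          # density at A (the p.d.f. above)
print(A.shape, logp)
```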
Preliminaries: Wishart Processes
Relationship between GP and WP

For any kernel function $A: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there exists a function $B: \mathcal{X} \to \mathcal{F}$ such that $A(x_i, x_j) = B(x_i)' B(x_j)$, where $\mathcal{X}$ is the input space and $\mathcal{F} \subset \mathbb{R}^q$ is some latent (feature) space (in general the feature space may also be infinite-dimensional).

Our previous result: $A(x_i, x_j)$ is a Wishart process iff $\{B_k(x)\}_{k=1}^q$ are $q$ mutually independent Gaussian processes.

Let $A = [A(x_i, x_j)]_{i,j=1}^n$ and $B = [B(x_1), \ldots, B(x_n)]' = [b_1, \ldots, b_n]'$. Then the $b_i$ are the latent vectors, and $A = BB'$ is a linear kernel in the latent space but a nonlinear kernel w.r.t. the input space.

Theorem. Let $\Sigma$ be an $n \times n$ positive definite matrix. Then $A$ is distributed according to the Wishart distribution $\mathcal{W}_n(q, \Sigma)$ if and only if $B$ is distributed according to the (matrix-variate) Gaussian distribution $\mathcal{N}_{n,q}(0, \Sigma \otimes I_q)$.
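A small numerical check of the theorem, under the assumption that the $q$ columns of $B$ are drawn as i.i.d. $\mathcal{N}_n(0, \Sigma)$ vectors (i.e., $B \sim \mathcal{N}_{n,q}(0, \Sigma \otimes I_q)$): the sample mean of $A = BB'$ over many draws should approach $q\Sigma$, as expected for $\mathcal{W}_n(q, \Sigma)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 4, 10
Sigma = np.eye(n) + 0.3 * np.ones((n, n))
L = np.linalg.cholesky(Sigma)                 # Sigma = L L'

def draw_A():
    # Columns of B are i.i.d. N_n(0, Sigma), so B ~ N_{n,q}(0, Sigma ⊗ I_q).
    B = L @ rng.standard_normal((n, q))
    return B @ B.T                            # A = B B' ~ W_n(q, Sigma)

mean_A = np.mean([draw_A() for _ in range(20000)], axis=0)
print(np.max(np.abs(mean_A - q * Sigma)))     # should be small, since E[A] = q * Sigma
```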
Preliminaries: Wishart Processes
GP and WP in a Nutshell

- Gaussian distribution: each sampled instance is a finite-dimensional vector, $v = (v_1, \ldots, v_d)'$.
- Wishart distribution: each sampled instance is a finite-dimensional p.s.d. matrix, $M \succeq 0$.
- Gaussian process: each sampled instance is an infinite-dimensional function, $f(\cdot)$.
- Wishart process: each sampled instance is an infinite-dimensional p.s.d. function, $g(\cdot, \cdot)$.
Latent Wishart Processes: Model Formulation
Relational Data

$\{(x_i, y_i, z_{ik}) \mid i, k = 1, \ldots, n\}$

[Figure: instance $(x_i, y_i)$ is linked to $(x_j, y_j)$ with $z_{ij} = 1$ and to $(x_k, y_k)$ with $z_{ik} = 1$, but not to $(x_l, y_l)$, for which $z_{il} = 0$.]

- $x_i = (x_{i1}, \ldots, x_{ip})'$: input feature vector for instance $i$
- $y_i$: label for instance $i$
- $z_{ik} = 1$ if there exists a link between $x_i$ and $x_k$; 0 otherwise. $z_{ik} = z_{ki}$ and $z_{ii} = 0$. $Z = [z_{ik}]_{i,k=1}^n$.
Latent Wishart Processes: Model Formulation
LWP Model

Goal: to learn a target kernel function $A(x_i, x_k)$ which takes both the input attributes and the relational information into consideration.

LWP: let $a_{ik} = A(x_i, x_k)$. Then $A = [a_{ik}]_{i,k=1}^n$ is a latent p.s.d. matrix. We model $A$ by a Wishart distribution $\mathcal{W}_n(q, \Sigma)$, which implies that $A(x_i, x_k)$ follows a Wishart process.
Latent Wishart Processes: Model Formulation
LWP Model

Prior:
$$p(A) = \mathcal{W}_n\big(q, \beta(K + \lambda I)\big),$$
where $K = [K(x_i, x_k)]_{i,k=1}^n$ with $K(x_i, x_k)$ being a kernel function defined on the input attributes, $\beta > 0$, and $\lambda$ is a very small number to make $\Sigma \succ 0$.

Likelihood:
$$p(Z \mid A) = \prod_{i=1}^{n} \prod_{k=i+1}^{n} s_{ik}^{z_{ik}} (1 - s_{ik})^{1 - z_{ik}}, \quad \text{with } s_{ik} = \frac{\exp(a_{ik}/2)}{1 + \exp(a_{ik}/2)}.$$

Posterior:
$$p(A \mid Z) \propto p(Z \mid A)\, p(A).$$

The input attributes and relational information are seamlessly integrated via the Bayesian approach.
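A sketch of the unnormalized log-posterior of this model; the helper name and the default $q = n$ are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import expit  # logistic sigmoid

def lwp_log_posterior(A, Z, K, beta=1.0, lam=1e-4, q=None):
    """log p(Z|A) + log p(A) up to a constant (illustrative helper; requires q >= n)."""
    n = A.shape[0]
    q = n if q is None else q
    Sigma = beta * (K + lam * np.eye(n))              # scale matrix of the Wishart prior
    log_prior = wishart.logpdf(A, df=q, scale=Sigma)  # log W_n(q, beta(K + lam I))
    s = expit(A / 2.0)                                # s_ik = exp(a_ik/2) / (1 + exp(a_ik/2))
    iu = np.triu_indices(n, k=1)                      # pairs with i < k
    log_lik = np.sum(Z[iu] * np.log(s[iu]) + (1 - Z[iu]) * np.log(1 - s[iu]))
    return log_lik + log_prior
```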
Latent Wishart Processes: Learning
Maximum A Posteriori (MAP) Estimation

Optimization via MAP estimation:
$$\operatorname*{argmax}_{A} \; \log\{p(Z \mid A)\, p(A)\}.$$

The theorem shows that finding the MAP estimate of $A$ is equivalent to finding the MAP estimate of $B$. Hence, we maximize the following:
$$
\begin{aligned}
L(B) &= \log\{p(Z \mid B)\, p(B)\}
= \sum_{i \neq k} \log p(z_{ik} \mid b_i, b_k) + \log p(B) \\
&= \sum_{i \neq k} \Big[ \frac{z_{ik}\, b_i' b_k}{2} - \log\big(1 + \exp(b_i' b_k / 2)\big) \Big]
- \frac{1}{2} \operatorname{tr}\!\Big( \frac{(K + \lambda I)^{-1}}{\beta}\, B B' \Big) + C \\
&= \sum_{i \neq k} \Big[ \frac{z_{ik}\, b_i' b_k}{2} - \log\big(1 + \exp(b_i' b_k / 2)\big) \Big]
- \frac{1}{2} \sum_{i,k} \sigma_{ik}\, b_i' b_k + C,
\end{aligned}
$$
where $[\sigma_{ik}]_{i,k=1}^n = \frac{(K + \lambda I)^{-1}}{\beta}$ and $C$ is a constant independent of $B$.
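A minimal sketch of the objective $L(B)$ up to the constant $C$, assuming the matrix of $\sigma_{ik}$ values has already been precomputed; the function name is illustrative.

```python
import numpy as np

def lwp_objective(B, Z, sigma):
    """L(B) up to the constant C; sigma = [sigma_ik] = (K + lam*I)^{-1} / beta (sketch)."""
    G = B @ B.T                                        # G[i, k] = b_i' b_k
    n = G.shape[0]
    off = ~np.eye(n, dtype=bool)                       # sum over i != k
    log_lik = np.sum(Z[off] * G[off] / 2.0 - np.log1p(np.exp(G[off] / 2.0)))
    log_prior = -0.5 * np.sum(sigma * G)               # -1/2 * tr(sigma B B')
    return log_lik + log_prior
```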
Latent Wishart Processes: Learning
MAP Estimation

Block quasi-Newton method to solve the maximization of $L(B)$ w.r.t. $B$. The Fisher score vector and Hessian matrix of $L$ w.r.t. $b_i$ are
$$\frac{\partial L}{\partial b_i} = \sum_{j \neq i} (z_{ij} - s_{ij} - \sigma_{ij})\, b_j - \sigma_{ii}\, b_i,$$
$$\frac{\partial^2 L}{\partial b_i \partial b_i'} = -\frac{1}{2} \sum_{j \neq i} s_{ij}(1 - s_{ij})\, b_j b_j' - \sigma_{ii}\, I_q \triangleq -H_i.$$

Update equations:
$$b_i(t+1) = b_i(t) + \gamma\, H_i(t)^{-1} \left. \frac{\partial L}{\partial b_i} \right|_{B = B(t)}, \quad i = 1, \ldots, n,$$
where $\gamma$ is the step size.
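A sketch of one sweep of these block updates over $b_1, \ldots, b_n$; the sequential update order, default step size, and function name are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np
from scipy.special import expit

def block_newton_sweep(B, Z, sigma, gamma=0.5):
    """One pass of the block quasi-Newton updates; sigma = (K + lam*I)^{-1} / beta (sketch)."""
    n, q = B.shape
    B = B.copy()
    for i in range(n):
        s_i = expit(B @ B[i] / 2.0)                    # s_ij for all j
        mask = np.arange(n) != i                        # indices j != i
        # Gradient: sum_j (z_ij - s_ij - sigma_ij) b_j - sigma_ii b_i
        grad = (Z[i, mask] - s_i[mask] - sigma[i, mask]) @ B[mask] - sigma[i, i] * B[i]
        # H_i = 1/2 * sum_j s_ij (1 - s_ij) b_j b_j' + sigma_ii I_q
        H_i = 0.5 * (B[mask].T * (s_i[mask] * (1 - s_i[mask]))) @ B[mask] + sigma[i, i] * np.eye(q)
        B[i] = B[i] + gamma * np.linalg.solve(H_i, grad)
    return B
```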
Latent Wishart Processes: Out-of-Sample Extension
Embedding for Test Data

Let
$$Z = \begin{pmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{pmatrix}
\quad \text{and} \quad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
where $Z_{11}, \Sigma_{11}$ are $n_1 \times n_1$ matrices and $Z_{22}, \Sigma_{22}$ are $n_2 \times n_2$ matrices. The $n_1$ instances corresponding to $Z_{11}, \Sigma_{11}$ are training data and the $n_2$ instances corresponding to $Z_{22}, \Sigma_{22}$ are new test data.

Similarly, we partition
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
= \begin{pmatrix} B_1 B_1' & B_1 B_2' \\ B_2 B_1' & B_2 B_2' \end{pmatrix},
\qquad
B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}.$$

Because $B \sim \mathcal{N}_{n,q}(0, \Sigma \otimes I_q)$, we have $B_1 \sim \mathcal{N}_{n_1,q}(0, \Sigma_{11} \otimes I_q)$ and
$$B_2 \mid B_1 \sim \mathcal{N}_{n_2,q}\big(\Sigma_{21} \Sigma_{11}^{-1} B_1,\; \Sigma_{22 \cdot 1} \otimes I_q\big),
\quad \text{where } \Sigma_{22 \cdot 1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}.$$
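A sketch of one way to use this conditional: take the conditional mean $\Sigma_{21}\Sigma_{11}^{-1}B_1$ as the embedding of the test instances. Treating the mean as the test embedding, and the helper name, are assumptions of this sketch rather than the slides' stated procedure.

```python
import numpy as np

def embed_test_data(B1, Sigma, n1):
    """Embed n2 test instances from the learned training embedding B1 (illustrative sketch).

    Uses E[B2 | B1] = Sigma_21 Sigma_11^{-1} B1, which follows from B ~ N_{n,q}(0, Sigma ⊗ I_q).
    """
    Sigma11 = Sigma[:n1, :n1]
    Sigma21 = Sigma[n1:, :n1]
    B2 = Sigma21 @ np.linalg.solve(Sigma11, B1)        # Sigma_21 Sigma_11^{-1} B1
    B_full = np.vstack([B1, B2])
    A = B_full @ B_full.T                              # kernel over training + test instances
    return B2, A
```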
Relation to Existing Work
Comparison with RGP [Chu et al., 2007] and XGP [Silva et al., 2008]

RGP and XGP:
- Learn only one GP.
- $p(B \mid Z)$ is itself a prediction function, with $B$ being a vector of function values for all input points.
- The learned kernel, which is the covariance matrix of the posterior distribution $p(B \mid Z)$, is $(K^{-1} + \Pi^{-1})^{-1}$ in RGP and $(K + \Pi)$ in XGP, where $\Pi$ is a kernel matrix capturing the link information.

LWP:
- Learns multiple ($q$) GPs.
- Treats $A = BB'$ as the learned kernel matrix.