Latent Wishart Processes for Relational Kernel Learning


  1. Latent Wishart Processes for Relational Kernel Learning
     Wu-Jun Li
     Department of Computer Science and Engineering
     Hong Kong University of Science and Technology, Hong Kong, China
     Joint work with Zhihua Zhang and Dit-Yan Yeung
     (AISTATS 2009)

  2. Contents
     1 Introduction
     2 Preliminaries: Gaussian Processes, Wishart Processes
     3 Latent Wishart Processes: Model Formulation, Learning, Out-of-Sample Extension
     4 Relation to Existing Work
     5 Experiments
     6 Conclusion and Future Work

  3. Introduction: Relational Learning
     Traditional machine learning models:
     - Assumption: i.i.d.
     - Advantage: simple
     Many real-world applications are relational:
     - Relational: instances are related (linked) to each other
     - Autocorrelation: statistical dependency between the values of a random variable on related objects (non-i.i.d.)
     - E.g., web pages, protein-protein interaction data
     Relational learning: an emerging research area attempting to represent, reason, and learn in domains with complex relational structure [Getoor & Taskar, 2007].
     Application areas: web mining, social network analysis, bioinformatics, marketing, etc.

  4. Introduction: Relational Kernel Learning
     Kernel function:
     - Characterizes the similarity between data instances: K(x_i, x_j), e.g., K(cat, tiger) > K(cat, elephant)
     - Must be positive semidefinite (p.s.d.)
     Kernel learning: to learn an appropriate kernel matrix or kernel function for a kernel-based learning method.
     Relational kernel learning (RKL): to learn an appropriate kernel matrix or kernel function for relational data by incorporating relational information between instances into the learning process.
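
To make the p.s.d. requirement concrete, the following sketch (an illustration added here, not taken from the slides) builds an RBF kernel matrix on a few toy feature vectors and checks that its eigenvalues are non-negative; the bandwidth gamma and the toy data are arbitrary.

    import numpy as np

    def rbf_kernel(X, gamma=0.5):
        """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2): a p.s.d. kernel on input attributes."""
        sq_norms = np.sum(X**2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
        return np.exp(-gamma * sq_dists)

    X = np.array([[0.0, 1.0], [0.2, 0.9], [3.0, -1.0]])   # toy instances
    K = rbf_kernel(X)
    # p.s.d. check: all eigenvalues of a kernel matrix must be >= 0 (up to numerical error)
    print(np.linalg.eigvalsh(K) >= -1e-10)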

  5. Preliminaries: Stochastic Processes and Gaussian Processes
     Stochastic processes: a stochastic process (or random process) y(x) is specified by giving the joint distribution for any finite set of instances {x_1, ..., x_n} in a consistent manner.
     Gaussian processes: a Gaussian process is a distribution over functions y(x) such that the values of y(x) evaluated at an arbitrary set of points {x_1, ..., x_n} jointly have a Gaussian distribution.
     Assuming y(x) has zero mean, the specification of a Gaussian process is completed by giving the covariance function of y(x) evaluated at any two values of x, given by the kernel function K(·, ·):

       E[y(x_i) y(x_j)] = K(x_i, x_j).
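
A minimal numerical sketch of this finite-dimensional view (my own illustration, with an arbitrary RBF covariance): evaluating the kernel at n points gives an n × n covariance matrix, and a sample of the function values is just one draw from the corresponding zero-mean multivariate Gaussian.

    import numpy as np

    n = 50
    x = np.linspace(0.0, 5.0, n)                          # n input points
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)       # K(x_i, x_j), assumed RBF covariance
    K += 1e-8 * np.eye(n)                                 # jitter for numerical stability
    # y(x_1), ..., y(x_n) are jointly Gaussian with zero mean and covariance K
    y = np.random.default_rng(0).multivariate_normal(np.zeros(n), K)
    print(y[:5])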

  6. Preliminaries: Wishart Processes
     Wishart distribution: an n × n random symmetric positive definite matrix A is said to have a Wishart distribution with parameters n, q, and n × n scale matrix Σ ≻ 0, written A ∼ W_n(q, Σ), if its p.d.f. is given by

       p(A) = |A|^{(q − n − 1)/2} / ( 2^{qn/2} Γ_n(q/2) |Σ|^{q/2} ) · exp( −(1/2) tr(Σ^{−1} A) ),   q ≥ n.

     Here Σ ≻ 0 means that Σ is positive definite (p.d.).
     Wishart processes: given an input space X = {x_1, x_2, ...}, the kernel function {A(x_i, x_j) | x_i, x_j ∈ X} is said to be a Wishart process (WP) if for any n ∈ N and {x_1, ..., x_n} ⊆ X, the n × n random matrix A = [A(x_i, x_j)]_{i,j=1}^n follows a Wishart distribution.
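
As a numerical companion to the definition (not part of the slides), scipy's wishart distribution can draw W_n(q, Σ) samples and evaluate the density above; the toy Σ is arbitrary, and the density requires q ≥ n as stated.

    import numpy as np
    from scipy.stats import wishart

    n, q = 3, 5                                    # q >= n so the density exists
    Sigma = np.array([[2.0, 0.3, 0.0],
                      [0.3, 1.0, 0.2],
                      [0.0, 0.2, 1.5]])            # scale matrix, positive definite
    A = wishart.rvs(df=q, scale=Sigma, random_state=0)   # one n x n sample from W_n(q, Sigma)
    print(np.linalg.eigvalsh(A))                   # all positive: A is symmetric positive definite
    print(wishart.logpdf(A, df=q, scale=Sigma))    # log of the density given above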

  7. Preliminaries: Relationship between GP and WP
     For any kernel function A : X × X → R, there exists a function B : X → F such that A(x_i, x_j) = B(x_i)′B(x_j), where X is the input space and F ⊂ R^q is some latent (feature) space (in general the feature space may also be infinite-dimensional).
     Our previous result: A(x_i, x_j) is a Wishart process iff {B_k(x)}_{k=1}^q are q mutually independent Gaussian processes.
     Let A = [A(x_i, x_j)]_{i,j=1}^n and B = [B(x_1), ..., B(x_n)]′ = [b_1, ..., b_n]′. Then the b_i are the latent vectors, and A = BB′ is a linear kernel in the latent space but a nonlinear kernel w.r.t. the input space.
     Theorem: Let Σ be an n × n positive definite matrix. Then A is distributed according to the Wishart distribution W_n(q, Σ) if and only if B is distributed according to the (matrix-variate) Gaussian distribution N_{n,q}(0, Σ ⊗ I_q).
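
The theorem can be checked constructively (a sketch with arbitrary n, q, and Σ, not from the paper): drawing B ∼ N_{n,q}(0, Σ ⊗ I_q) amounts to drawing its q columns as independent N_n(0, Σ) vectors, i.e. q mutually independent GPs evaluated at the n points, and A = BB′ is then a W_n(q, Σ) draw, so the sample mean of A should approach qΣ.

    import numpy as np

    rng = np.random.default_rng(0)
    n, q = 4, 6
    M = rng.standard_normal((n, n))
    Sigma = M @ M.T + n * np.eye(n)                 # an arbitrary positive definite scale matrix

    def sample_A(rng):
        # B ~ N_{n,q}(0, Sigma ⊗ I_q): the q columns are independent N_n(0, Sigma) draws,
        # i.e. q mutually independent Gaussian processes evaluated at the n points.
        B = rng.multivariate_normal(np.zeros(n), Sigma, size=q).T   # n x q
        return B @ B.T                              # A = BB', a W_n(q, Sigma) draw

    samples = [sample_A(rng) for _ in range(10000)]
    mean_A = np.mean(samples, axis=0)
    # relative error of the Monte Carlo mean against E[A] = q * Sigma (should be small)
    print(np.linalg.norm(mean_A - q * Sigma) / np.linalg.norm(q * Sigma))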

  8. Preliminaries: GP and WP in a Nutshell
     Gaussian distribution: each sampled instance is a finite-dimensional vector, v = (v_1, ..., v_d)′.
     Wishart distribution: each sampled instance is a finite-dimensional p.s.d. matrix, M ⪰ 0.
     Gaussian process: each sampled instance is an infinite-dimensional function, f(·).
     Wishart process: each sampled instance is an infinite-dimensional p.s.d. function, g(·, ·).

  9. Latent Wishart Processes: Relational Data
     Data: {(x_i, y_i, z_ik) | i, k = 1, ..., n}
     [Figure: instances (x_i, y_i), (x_j, y_j), (x_k, y_k), (x_l, y_l), with links z_ij = 1 and z_ik = 1 and no link z_il = 0.]
     x_i = (x_i1, ..., x_ip)′: input feature vector for instance i
     y_i: label for instance i
     z_ik = 1 if there exists a link between x_i and x_k; 0 otherwise. z_ik = z_ki and z_ii = 0. Z = [z_ik]_{i,k=1}^n.
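
A small illustration of this notation with made-up links (not data from the paper): the relational information is just a symmetric 0/1 adjacency matrix Z with zero diagonal.

    import numpy as np

    n = 4
    links = [(0, 1), (0, 3), (2, 3)]        # hypothetical undirected links between instances
    Z = np.zeros((n, n), dtype=int)
    for i, k in links:
        Z[i, k] = Z[k, i] = 1               # z_ik = z_ki = 1 if i and k are linked
    np.fill_diagonal(Z, 0)                  # z_ii = 0 by convention
    print(Z)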

  10. Latent Wishart Processes: LWP Model
     Goal: to learn a target kernel function A(x_i, x_k) which takes both the input attributes and the relational information into consideration.
     LWP: let a_ik = A(x_i, x_k). Then A = [a_ik]_{i,k=1}^n is a latent p.s.d. matrix. We model A by a Wishart distribution W_n(q, Σ), which implies that A(x_i, x_k) follows a Wishart process.

  11. Latent Wishart Processes: LWP Model
     Prior: p(A) = W_n(q, β(K + λI)), where K = [K(x_i, x_k)]_{i,k=1}^n with K(x_i, x_k) a kernel function defined on the input attributes, β > 0, and λ a very small number to make Σ ≻ 0.
     Likelihood:

       p(Z | A) = ∏_{i=1}^n ∏_{k=i+1}^n s_ik^{z_ik} (1 − s_ik)^{1 − z_ik},   with s_ik = exp(a_ik/2) / (1 + exp(a_ik/2)).

     Posterior: p(A | Z) ∝ p(Z | A) p(A)
     The input attributes and the relational information are seamlessly integrated via the Bayesian approach.
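
The likelihood transcribes directly into code. The sketch below (toy A and Z, arbitrary sizes, my own illustration) evaluates log p(Z | A) by summing the Bernoulli terms over unordered pairs, with s_ik the logistic function of a_ik/2.

    import numpy as np

    def log_likelihood(Z, A):
        """log p(Z | A) = sum_{i<k} [ z_ik log s_ik + (1 - z_ik) log(1 - s_ik) ]."""
        n = Z.shape[0]
        s = 1.0 / (1.0 + np.exp(-A / 2.0))      # s_ik = exp(a_ik/2) / (1 + exp(a_ik/2))
        iu = np.triu_indices(n, k=1)            # pairs with i < k
        return np.sum(Z[iu] * np.log(s[iu]) + (1 - Z[iu]) * np.log(1 - s[iu]))

    # toy example: A = BB' for a random latent B, Z a random symmetric link matrix
    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 2))
    A = B @ B.T
    U = np.triu((rng.random((5, 5)) < 0.3).astype(int), 1)
    Z = U + U.T
    print(log_likelihood(Z, A))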

  12. Latent Wishart Processes: Maximum A Posteriori (MAP) Estimation
     Optimization via MAP estimation:

       argmax_A [ log p(Z | A) p(A) ]

     The theorem shows that finding the MAP estimate of A is equivalent to finding the MAP estimate of B. Hence, we maximize the following:

       L(B) = log { p(Z | B) p(B) }
            = ∑_{i ≠ k} log p(z_ik | b_i, b_k) + log p(B)
            = ∑_{i ≠ k} [ z_ik b_i′b_k / 2 − log(1 + exp(b_i′b_k / 2)) ] − (1/2) tr( ((K + λI)^{−1} / β) BB′ ) + C
            = ∑_{i ≠ k} [ z_ik b_i′b_k / 2 − log(1 + exp(b_i′b_k / 2)) ] − (1/2) ∑_{i,k} σ_ik b_i′b_k + C,

     where [σ_ik]_{i,k=1}^n = (K + λI)^{−1} / β and C is a constant independent of B.
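
The last line of L(B) can be evaluated directly from B, Z, K, λ, and β (up to the constant C). The following sketch is my own transcription with arbitrary toy inputs; the RBF kernel K and the link matrix Z here are made up for illustration.

    import numpy as np

    def lwp_objective(B, Z, K, beta=1.0, lam=1e-3):
        """L(B) up to the constant C:
           sum_{i != k} [ z_ik b_i'b_k / 2 - log(1 + exp(b_i'b_k / 2)) ]
           - (1/2) sum_{i,k} sigma_ik b_i'b_k,  with [sigma_ik] = (K + lam I)^{-1} / beta."""
        n = B.shape[0]
        G = B @ B.T                                             # G[i, k] = b_i' b_k
        Sigma_inv = np.linalg.inv(K + lam * np.eye(n)) / beta   # [sigma_ik]
        off = ~np.eye(n, dtype=bool)                            # mask for i != k
        link_term = np.sum(Z[off] * G[off] / 2.0 - np.log1p(np.exp(G[off] / 2.0)))
        prior_term = -0.5 * np.sum(Sigma_inv * G)               # = -(1/2) tr(Sigma^{-1} BB')
        return link_term + prior_term

    # toy usage with arbitrary inputs
    rng = np.random.default_rng(0)
    n, q = 6, 2
    X = rng.standard_normal((n, 3))
    K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :])**2).sum(-1))  # input-attribute kernel
    Z = np.zeros((n, n), dtype=int); Z[0, 1] = Z[1, 0] = Z[2, 3] = Z[3, 2] = 1
    B = rng.standard_normal((n, q))
    print(lwp_objective(B, Z, K))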

  13. Latent Wishart Processes: MAP Estimation (continued)
     Block quasi-Newton method to solve the maximization of L(B) w.r.t. B.
     Fisher score vector and Hessian matrix of L w.r.t. b_i:

       ∂L/∂b_i = ∑_{j ≠ i} (z_ij − s_ij − σ_ij) b_j − σ_ii b_i

       ∂²L/(∂b_i ∂b_i′) = −(1/2) ∑_{j ≠ i} s_ij (1 − s_ij) b_j b_j′ − σ_ii I_q ≜ −H_i.

     Update equations:

       b_i(t+1) = b_i(t) + γ H_i(t)^{−1} [∂L/∂b_i]|_{B = B(t)},   i = 1, ..., n,

     where γ is the step size.
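
A sketch of one sweep of this block update, implementing the gradient, the block matrix H_i, and the damped Newton step above; the step size γ, the toy Z, and the placeholder Σ^{−1} matrix are arbitrary choices for illustration, and the whole sweep would be iterated until L(B) stops improving.

    import numpy as np

    def block_update(B, Z, Sigma_inv, gamma=0.5):
        """One sweep of b_i <- b_i + gamma * H_i^{-1} dL/db_i, for i = 1..n."""
        n, q = B.shape
        B_new = B.copy()
        G = B @ B.T
        S = 1.0 / (1.0 + np.exp(-G / 2.0))            # s_ij = sigmoid(b_i'b_j / 2)
        for i in range(n):
            mask = np.arange(n) != i                  # indices j != i
            # dL/db_i = sum_{j != i} (z_ij - s_ij - sigma_ij) b_j - sigma_ii b_i
            grad = ((Z[i, mask] - S[i, mask] - Sigma_inv[i, mask]) @ B[mask]
                    - Sigma_inv[i, i] * B[i])
            # H_i = (1/2) sum_{j != i} s_ij (1 - s_ij) b_j b_j' + sigma_ii I_q
            w = S[i, mask] * (1.0 - S[i, mask])
            H = 0.5 * (B[mask].T * w) @ B[mask] + Sigma_inv[i, i] * np.eye(q)
            B_new[i] = B[i] + gamma * np.linalg.solve(H, grad)
        return B_new

    # toy usage (shapes as in the objective sketch above)
    rng = np.random.default_rng(0)
    n, q = 6, 2
    Z = np.zeros((n, n), dtype=int); Z[0, 1] = Z[1, 0] = 1
    Sigma_inv = np.linalg.inv(2.0 * np.eye(n))        # placeholder for (K + lam I)^{-1} / beta
    B = rng.standard_normal((n, q))
    B = block_update(B, Z, Sigma_inv)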

  14. Latent Wishart Processes: Out-of-Sample Extension (Embedding for Test Data)
     Let Z = [ Z_11  Z_12 ; Z_21  Z_22 ] and Σ = [ Σ_11  Σ_12 ; Σ_21  Σ_22 ], where Z_11, Σ_11 are n_1 × n_1 matrices and Z_22, Σ_22 are n_2 × n_2 matrices. The n_1 instances corresponding to Z_11, Σ_11 are training data and the n_2 instances corresponding to Z_22, Σ_22 are new test data.
     Similarly, we partition A = [ A_11  A_12 ; A_21  A_22 ] = [ B_1B_1′  B_1B_2′ ; B_2B_1′  B_2B_2′ ] and B = [ B_1 ; B_2 ].
     Because B ∼ N_{n,q}(0, Σ ⊗ I_q), we have B_1 ∼ N_{n_1,q}(0, Σ_11 ⊗ I_q) and

       B_2 | B_1 ∼ N_{n_2,q}( Σ_21 Σ_11^{−1} B_1, Σ_{22·1} ⊗ I_q ),   where Σ_{22·1} = Σ_22 − Σ_21 Σ_11^{−1} Σ_12.
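
In code, the out-of-sample embedding can be taken as the conditional mean Σ_21 Σ_11^{−1} B_1 of B_2 | B_1 (one natural reading of the slide; drawing a conditional sample instead would also be consistent with the model). The sketch below uses an arbitrary positive definite Σ and a made-up learned B_1, then forms the test-to-training kernel block A_21 = B_2 B_1′.

    import numpy as np

    def out_of_sample_embedding(B1, Sigma, n1):
        """Conditional mean of B_2 | B_1 under B ~ N_{n,q}(0, Sigma ⊗ I_q)."""
        Sigma11 = Sigma[:n1, :n1]
        Sigma21 = Sigma[n1:, :n1]
        return Sigma21 @ np.linalg.solve(Sigma11, B1)   # Sigma_21 Sigma_11^{-1} B_1

    # toy usage: 4 training + 2 test instances, arbitrary p.d. Sigma and learned B_1
    rng = np.random.default_rng(0)
    n1, n2, q = 4, 2, 3
    M = rng.standard_normal((n1 + n2, n1 + n2))
    Sigma = M @ M.T + (n1 + n2) * np.eye(n1 + n2)
    B1 = rng.standard_normal((n1, q))
    B2 = out_of_sample_embedding(B1, Sigma, n1)
    A21 = B2 @ B1.T                                     # kernel values between test and training data
    print(A21.shape)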

  15. Relation to Existing Work: Comparison with RGP [Chu et al., 2007] and XGP [Silva et al., 2008]
     RGP and XGP:
     - Learn only one GP; p(B | Z) is itself a prediction function, with B being a vector of function values for all input points.
     - The learned kernel, which is the covariance matrix of the posterior distribution p(B | Z), is (K^{−1} + Π^{−1})^{−1} in RGP and (K + Π) in XGP, where Π is a kernel matrix capturing the link information.
     LWP:
     - Learn multiple (q) GPs.
     - Treat A = BB′ as the learned kernel matrix.
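
For a side-by-side view of the three learned kernels (a toy numerical sketch only; K, Π, and B below are arbitrary placeholders, with Π standing in for whatever link kernel RGP and XGP construct):

    import numpy as np

    rng = np.random.default_rng(0)
    n, q = 5, 2
    M1, M2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    K = M1 @ M1.T + n * np.eye(n)       # input-attribute kernel (arbitrary p.d. matrix here)
    Pi = M2 @ M2.T + n * np.eye(n)      # link kernel Π (placeholder)
    B = rng.standard_normal((n, q))     # latent embedding learned by LWP

    K_rgp = np.linalg.inv(np.linalg.inv(K) + np.linalg.inv(Pi))   # (K^{-1} + Π^{-1})^{-1}
    K_xgp = K + Pi                                                # K + Π
    K_lwp = B @ B.T                                               # A = BB'
    print(K_rgp.shape, K_xgp.shape, K_lwp.shape)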
