

  1. T-122.102 Special Course in Information Technology, 30.03.2004
  Information diffusion kernels
  Based on the technical report by John Lafferty and Guy Lebanon (2004), Diffusion Kernels on Statistical Manifolds (CMU-CS-04-101)
  Sven Laur, Helsinki University of Technology
  swen@math.ut.ee, slaur@tcs.hut.fi

  2. Outline
  • The problem and motivation
  • From data to distribution
  • What is a reasonable geometry over the distributions?
    ⋆ Coordinates, tangent vectors, distances, etc.
  • Why heat diffusion?
    ⋆ Geodesic distance vs. Mercer kernel, Gaussian kernels
  • Building a model
  • Extracting an approximate kernel

  3. How to build kernels for discrete data structures?
  • Simple embedding of discrete vectors into $\mathbb{R}^n$
    ⋆ Works with vectors of fixed length
    ⋆ It is an ad hoc technique
  • Embedding via generative models
    ⋆ Theoretically sound
    ⋆ What should be the right proximity measure?
    ⋆ The proximity measure should be independent of the parameterization!

  4. Parameterization invariant kernel methods
  • Fisher kernels
    $K(x, y) = \langle \nabla \ell(x|\theta), \nabla \ell(y|\theta) \rangle$
  • Information diffusion kernels
    $K(x, y) = ???$
  • Mutual information kernels (Bayesian prediction probability)
    $K(x, y) = \Pr[y \mid x] \propto \int p(y|\theta)\, p(x|\theta)\, p(\theta)\, d\theta$
    integrated over the model class $\mathcal{P}$ with prior probability $p(\theta)$.
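As an illustration, a minimal sketch of the Fisher kernel for the multinomial model, where the log-likelihood is $\ell(x|\theta) = \sum_j x_j \log \theta_j$ and hence $\partial \ell / \partial \theta_j = x_j / \theta_j$. The common base point theta and the plain (unwhitened) inner product are assumptions; the slide leaves both open, and some variants whiten by the inverse Fisher information.

    import numpy as np

    def score(x, theta):
        # Gradient of l(x|theta) = sum_j x_j log theta_j w.r.t. theta:
        # d l / d theta_j = x_j / theta_j (unconstrained parameterization).
        return x / theta

    def fisher_kernel(x, y, theta):
        # K(x, y) = <grad l(x|theta), grad l(y|theta)>, as on the slide.
        return float(score(x, theta) @ score(y, theta))

    x = np.array([3.0, 1.0, 0.0])        # word counts of document x
    y = np.array([2.0, 2.0, 1.0])        # word counts of document y
    theta = np.array([0.5, 0.3, 0.2])    # assumed common base point
    print(fisher_kernel(x, y, theta))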

  5. Text classification
  • The bag-of-words approach produces a count vector $(x_1, \ldots, x_n)$
  • Let the model class be a multinomial distribution.
  • The MLE estimate is
    $\hat{\theta}_{tf}(x) = \frac{1}{x_1 + \cdots + x_n} (x_1, \ldots, x_n)$.
  • The second embedding is inverse document frequency weighting
    $\hat{\theta}_{tfidf}(x) = \frac{1}{x_1 w_1 + \cdots + x_n w_n} (x_1 w_1, \ldots, x_n w_n), \qquad w_i = \log(1/f_i)$
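A minimal sketch of both embeddings; the count vector and the document frequencies $f_i$ below are illustrative assumptions.

    import numpy as np

    def theta_tf(x):
        # Multinomial MLE: normalize the raw count vector.
        return x / x.sum()

    def theta_tfidf(x, f):
        # IDF-weighted embedding with weights w_i = log(1/f_i),
        # renormalized so the result lies on the simplex.
        xw = x * np.log(1.0 / f)
        return xw / xw.sum()

    counts = np.array([3.0, 1.0, 2.0])      # word counts of one document
    doc_freq = np.array([0.9, 0.1, 0.5])    # assumed document frequencies f_i
    print(theta_tf(counts))
    print(theta_tfidf(counts, doc_freq))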

  6. What is a statistical manifold?
  • A statistical manifold is a family of probability distributions
    $\mathcal{P} = \{ p(\cdot|\theta) : \mathcal{X} \to \mathbb{R} \;:\; \theta \in \Theta \}$,
    where $\Theta$ is an open subset of $\mathbb{R}^n$.
  • The parameterization must be unique:
    $p(\cdot|\theta_1) \equiv p(\cdot|\theta_2) \implies \theta_1 = \theta_2$
  • The parameters $\theta$ can be treated as the coordinate vector of $p(\cdot|\theta)$

  7. Set of admissible coordinates and distributions
  • A parameterization $\psi$ is admissible iff $\psi$, as a function of the primary parameters $\theta$, is $C^\infty$ smooth.
  • The set of admissible parameterizations is an invariant.
  • We consider only manifolds where the log-likelihood function $\ell(x|\theta) = \log p(x|\theta)$ is $C^\infty$ differentiable w.r.t. $\theta$.
  • The multinomial family satisfies the $C^\infty$ requirement:
    $\ell(x|\theta) = \log \prod_{j=1}^m \theta_j^{x_j} = \sum_{j=1}^m x_j \log \theta_j$.

  8. Geometry ≈ distance measure
  • A distance measure determines geometry. This can be reversed.
  • Recall the length of a path $\gamma : [0, 1] \to \mathcal{P}$:
    $d(p, q) = \int_0^1 \|\dot{\gamma}(t)\|\, dt = \int_0^1 \sqrt{\langle \dot{\gamma}(t), \dot{\gamma}(t) \rangle}\, dt$,
    where $\dot{\gamma}(t)$ is a tangent vector.
  • But the set $\mathcal{P}$ does not have any geometrical structure!!!
  • We redefine (tangent) vectors: vectors will be operators.
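To make the path-length formula concrete, a small sketch that approximates $\int_0^1 \|\dot{\gamma}(t)\|\, dt$ by summing finite-difference segments for a curve in the Euclidean plane; the particular curve and step count are arbitrary choices for illustration.

    import numpy as np

    def path_length(gamma, n_steps=1000):
        # Approximate integral_0^1 ||gamma'(t)|| dt by chaining the
        # Euclidean norms of small finite-difference segments.
        t = np.linspace(0.0, 1.0, n_steps + 1)
        points = np.array([gamma(ti) for ti in t])
        segments = np.diff(points, axis=0)
        return float(np.linalg.norm(segments, axis=1).sum())

    # Quarter of the unit circle; the exact length is pi/2 ~ 1.5708.
    quarter = lambda t: np.array([np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)])
    print(path_length(quarter))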

  9. What is a vector?
  • A vector will be an operator that maps $C^\infty$ functions $f : \mathcal{P} \to \mathbb{R}$ to reals. For fixed coordinates $\theta$ and a point $p$, natural maps $\left(\frac{\partial}{\partial \theta_i}\right)_p$ emerge:
    $\left(\frac{\partial}{\partial \theta_i}\right)_p (f) = \left.\frac{\partial f}{\partial \theta_i}\right|_p$.
    They will be the basis of the tangent space.
  • For an arbitrary differentiable $\gamma$ we can express
    $f(\gamma(t))' = \left[ \theta_1'(t) \left(\frac{\partial}{\partial \theta_1}\right)_{\gamma(t)} + \cdots + \theta_n'(t) \left(\frac{\partial}{\partial \theta_n}\right)_{\gamma(t)} \right](f)$.
    The operator in the square brackets does not depend on $f$ and has the right type: it will be the speed/tangent vector.

  10. Is this a reasonable definition?
  • The speed vector $\dot{\gamma}(t)$ uniquely characterizes the rate of change of an arbitrary admissible function $f$:
    $\dot{\gamma}(t)(f) = f(\gamma(t))'$
  • There is a one-to-one correspondence
    $\dot{\gamma}(t) \mapsto (\dot{\theta}_1(t), \ldots, \dot{\theta}_n(t)) \in \mathbb{R}^n$.
  • There are coordinate transformation formulas between the different bases
    $\left(\frac{\partial}{\partial \theta_i}\right)_{i=1}^n$ and $\left(\frac{\partial}{\partial \psi_i}\right)_{i=1}^n$
  • We really cannot expect more, if there is no geometrical structure!!!

  11. Kullback-Leibler divergence
  • The most reasonable distance measure between adjacent distributions $p$ and $q$ is the symmetrized Kullback-Leibler divergence
    $J(p, q) = D_{p\|q} + D_{q\|p} = \int p(x) \log \frac{p(x)}{q(x)}\, dx + \int q(x) \log \frac{q(x)}{p(x)}\, dx$,
  • It quantifies the additional cost of using the wrong distribution.
  • In the discrete case this means that encoding requires, on average, $J(p, q)$ extra bits.
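A minimal sketch of the symmetrized divergence for discrete distributions; the two example distributions are arbitrary, and natural logarithms (nats) are used rather than bits.

    import numpy as np

    def kl(p, q):
        # Discrete Kullback-Leibler divergence D(p||q), in nats.
        return float(np.sum(p * np.log(p / q)))

    def j_divergence(p, q):
        # Symmetrized divergence J(p, q) = D(p||q) + D(q||p).
        return kl(p, q) + kl(q, p)

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(j_divergence(p, q))   # small for nearby distributions, 0 iff p == q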

  12. What is a reasonable distance metric?
  Consider an infinitesimal movement along the curve $\gamma(t)$.
  • The corresponding change of coordinates is from $\theta$ to $\theta + \dot{\theta} \Delta t$, and the distance formula gives
    $d(p, q)^2 \approx \Delta t^2 \|\dot{\gamma}(t)\|^2 = \Delta t^2 \sum_{i,j=1}^n \dot{\theta}_i \dot{\theta}_j \left\langle \frac{\partial}{\partial \theta_i}, \frac{\partial}{\partial \theta_j} \right\rangle$
  • Under mild regularity conditions
    $J(p, q) \approx \Delta t^2 \sum_{i,j=1}^n \dot{\theta}_i \dot{\theta}_j\, g_{ij}, \qquad g_{ij} = \int p(x) \cdot \frac{\partial \ell(x|\theta)}{\partial \theta_i} \cdot \frac{\partial \ell(x|\theta)}{\partial \theta_j}\, dx$.
  • Hence, the local requirement $d^2(p, q) \approx J(p, q)$ fixes the geometry:
    $\left\langle \frac{\partial}{\partial \theta_i}, \frac{\partial}{\partial \theta_j} \right\rangle = g_{ij}$.
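A sketch that evaluates the definition of $g_{ij}$ directly for a single categorical draw (the multinomial with one trial). The parameterization by the free coordinates $\theta_1, \ldots, \theta_{n-1}$ with $\theta_n = 1 - \sum_i \theta_i$ is one standard choice, assumed here; the integral becomes a finite sum over outcomes.

    import numpy as np

    def fisher_matrix(theta_free):
        # g_ij = sum_x p(x) * d_i l(x|theta) * d_j l(x|theta) for one
        # categorical draw, with free coordinates theta_1..theta_{n-1}
        # and theta_n = 1 - sum(theta_free).
        theta_n = 1.0 - theta_free.sum()
        m = len(theta_free)
        g = np.zeros((m, m))
        for k in range(m + 1):               # sum over outcomes x = k
            p_k = theta_free[k] if k < m else theta_n
            grad = np.zeros(m)               # gradient of log p_k
            if k < m:
                grad[k] = 1.0 / theta_free[k]
            else:
                grad[:] = -1.0 / theta_n
            g += p_k * np.outer(grad, grad)
        return g

    # Closed form for comparison: diag(1/theta_i) + (1/theta_n) * ones.
    print(fisher_matrix(np.array([0.5, 0.3])))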

  13. Limitations of geodesic distance
  • The geodesic distance $d(p, q)$ is the length of the shortest path between $p$ and $q$.
  • Geodesic distance cannot always be used for SVM kernels
    ⋆ An SVM kernel (Mercer kernel) is a computational shortcut for $K(x, y) = \Psi(x) \cdot \Psi(y)$, where $\Psi : \mathbb{R}^n \to \mathbb{R}^d$ is a smooth enough function.
    ⋆ If the geodesic distance corresponds to a Mercer kernel, then there must be only one shortest path between any two points.

  14. Classification via temperature
  • Consider two classes, "hot" and "cold", i.e. each data point has an initial amount of heat $\lambda_i$ concentrated in a small neighborhood.
  • All other points have zero temperature.
  • Fix a time moment $t$. All points with temperature below zero belong to the class "cold" and the others to the class "hot".
  • Heat gradually diffuses over the manifold. As $t \to \infty$ all points have constant temperature. Varying $t$ gives different levels of smoothing.
  • Large $t$ gives a flatter decision border, i.e. classification is more robust, but also less sensitive.

  15. How to model heat diffusion?
  • Classical heat diffusion is given by the partial differential equation
    $\frac{\partial f}{\partial t} - \Delta f = 0, \qquad f(x, 0) = f(x)$,
    together with Dirichlet or Neumann boundary conditions.
  • In non-Euclidean geometry the Laplace operator has a nasty form:
    $\Delta f = \det(G)^{-1/2} \sum_{i,j=1}^n \frac{\partial}{\partial \theta_j} \left( g^{ij} \det(G)^{1/2} \frac{\partial f}{\partial \theta_i} \right)$,
    where the $g^{ij}$ are the elements of the inverse of the Fisher matrix $G$.

  16. Extracting the kernel
  • In the Euclidean space $\mathbb{R}^n$:
    $\Delta f = \frac{\partial^2 f}{\partial x_1^2} + \cdots + \frac{\partial^2 f}{\partial x_n^2}$.
  • The solution corresponding to the initial condition $f(x)$ is
    $f(x, t) = (4\pi t)^{-n/2} \int \exp\left( \frac{-\|x - y\|^2}{4t} \right) f(y)\, dy$
  • Alternatively,
    $f(x, t) = \int K_t(x, y)\, f(y)\, dy, \qquad K_t(x, y) = (4\pi t)^{-n/2} \exp\left( \frac{-\|x - y\|^2}{4t} \right)$
  • In SVMs $f = \lambda_1 \delta_{x_1} + \cdots + \lambda_k \delta_{x_k}$ and the integral collapses to a sum.
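A minimal sketch of the Euclidean heat kernel and of the collapsed sum from the last bullet, used as the temperature classifier of slide 14; the sample points and the labels $\lambda_i = \pm 1$ are illustrative assumptions.

    import numpy as np

    def heat_kernel(x, y, t, n):
        # Euclidean heat kernel K_t(x, y) = (4 pi t)^{-n/2} exp(-|x-y|^2 / 4t).
        return (4 * np.pi * t) ** (-n / 2) * np.exp(-np.sum((x - y) ** 2) / (4 * t))

    def temperature(x, points, lambdas, t):
        # f(x, t) = sum_i lambda_i K_t(x, x_i): the integral collapsed to a sum.
        n = len(x)
        return sum(l * heat_kernel(x, p, t, n) for p, l in zip(points, lambdas))

    # "Hot" points carry lambda = +1, "cold" points lambda = -1 (slide 14).
    points = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([3.0, 0.0])]
    lambdas = [+1.0, +1.0, -1.0]
    query = np.array([2.0, 0.0])
    print("hot" if temperature(query, points, lambdas, t=0.5) > 0 else "cold")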

  17. Central theoretical result
  Theorem. Let $M$ be a complete Riemannian manifold. Then there exists a kernel function $K$ (the heat kernel), which satisfies the following properties:
  (1) $K(x, y, t) = K(y, x, t)$;
  (2) $\lim_{t \to 0} K(x, y, t) = \delta(x, y)$;
  (3) $\left( \Delta - \frac{\partial}{\partial t} \right) K(x, y, t) = 0$;
  (4) $K(x, y, t) = \int K(x, z, t - s)\, K(z, y, s)\, dz$.
  The assertion means: (1) if $q$ converges parameter-wise to $p$, then $J(p, q) \to 0$;
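As a quick sanity check, a sketch that verifies the semigroup property (4) numerically for the one-dimensional Euclidean heat kernel; the grid, time values, and query points are arbitrary choices for illustration.

    import numpy as np

    def heat_kernel_1d(x, y, t):
        # 1-D Euclidean heat kernel K_t(x, y) = (4 pi t)^{-1/2} exp(-(x-y)^2 / 4t).
        return (4 * np.pi * t) ** -0.5 * np.exp(-(x - y) ** 2 / (4 * t))

    # Property (4): K(x, y, t) = integral over z of K(x, z, t-s) K(z, y, s) dz.
    x, y, t, s = 0.3, -0.7, 1.0, 0.4
    z = np.linspace(-30, 30, 20001)    # wide grid approximating the real line
    dz = z[1] - z[0]
    lhs = heat_kernel_1d(x, y, t)
    rhs = np.sum(heat_kernel_1d(x, z, t - s) * heat_kernel_1d(z, y, s)) * dz
    print(lhs, rhs)                    # the two values should agree closely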
