  1. Estimation of the Kernel Mean Embedding (with uncertainty). Paul Rubenstein, University of Cambridge and Max-Planck Institute for Intelligent Systems, Tübingen. 20th January 2016.

  2. RKHS theory
     A function k : X × X → R is a kernel if, given x_1, ..., x_n ∈ X, the matrix K with K_ij = k(x_i, x_j) is symmetric and positive semi-definite (i.e. is a valid covariance matrix).
     Associated to k are:
     ◮ A Hilbert space H of functions X → R
     ◮ A 'feature map' φ : X → H such that k(x, x′) = ⟨φ(x), φ(x′)⟩
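
     An illustration only (my own, not from the slides, assuming a Gaussian RBF kernel): build the Gram matrix K_ij = k(x_i, x_j) and check it is symmetric and positive semi-definite.

```python
# Illustration only (not from the slides): an assumed Gaussian RBF kernel,
# its Gram matrix K_ij = k(x_i, x_j), and checks that K is symmetric and PSD.
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1))                                  # x_1, ..., x_n
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])     # Gram matrix

print(np.allclose(K, K.T))                          # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))      # eigenvalues >= 0, i.e. PSD
```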

  3. RKHS theory
     Suppose we are given:
     ◮ A random variable X ∼ P taking values in X
     ◮ A function f : X → R
     and that we want to evaluate ∫ f(x) dP(x) = E_X f(X).
     If f ∈ H, then E_X f(X) = E_X ⟨f, φ(X)⟩ = ⟨f, E_X φ(X)⟩.
     So if we know the mean embedding of X, µ_X := E_X φ(X), then we can calculate the expectation of any function in H by taking an inner product.
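
     A small sketch of this point (my own example, assuming an RBF kernel with lengthscale 1 and X ∼ N(0, 1), for which E k(z, X) has a closed form): for f ∈ H of the form f = Σ_j a_j k(z_j, ·), the inner product with the (empirical) mean embedding recovers E_X f(X).

```python
# Sketch (my own example, assuming an RBF kernel with lengthscale 1 and
# X ~ N(0, 1), for which E k(z, X) has a closed form): for f in H written as
# f = sum_j a_j k(z_j, .), the expectation E_X f(X) = <f, mu_X> reduces to
# evaluating the mean embedding at the points z_j.
import numpy as np

def k(x, y, ls=1.0):
    return np.exp(-(x - y) ** 2 / (2 * ls ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=100_000)                                 # samples from P = N(0, 1)

z = np.array([-1.0, 0.5, 2.0])                               # f = sum_j a_j k(z_j, .)
a = np.array([0.3, -0.7, 1.2])

mu_hat = lambda t: np.mean(k(X[None, :], t[:, None]), axis=1)   # empirical mean embedding
mu_true = lambda t: np.sqrt(0.5) * np.exp(-t ** 2 / 4.0)        # E k(t, X) in closed form

print(np.sum(a * mu_hat(z)))    # <f, empirical embedding>
print(np.sum(a * mu_true(z)))   # <f, mu_X> = E_X f(X), exact
```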

  4. RKHS theory
     For certain k, the mapping P ↦ µ_P is injective, i.e. P = Q ⇐⇒ µ_P = µ_Q.
     We can exploit this to construct statistical tests of properties of distributions.
     Two-sample test: Given {X_i} ∼ P, {Y_i} ∼ Q, does P = Q?
     Idea: estimate µ_P and µ_Q and see how different they are.
     Independence testing: Given {(X_i, Y_i)} ∼ P_XY, does P_XY = P_X P_Y?
     Idea: estimate the embeddings of P_XY and P_X P_Y and see how different they are.
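
     A sketch of the two-sample idea (the slide does not spell out an estimator; this uses the standard biased squared-MMD statistic with an assumed RBF kernel): compare the empirical mean embeddings of the two samples in H.

```python
# Sketch of the two-sample idea (the slide gives no estimator; this is the
# standard biased squared-MMD statistic with an assumed RBF kernel):
# compare the empirical mean embeddings of the two samples in H.
import numpy as np

def rbf_gram(A, B, ls=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def mmd2(X, Y, ls=1.0):
    # ||mu_P_hat - mu_Q_hat||_H^2, biased (V-statistic) estimate
    return rbf_gram(X, X, ls).mean() - 2 * rbf_gram(X, Y, ls).mean() + rbf_gram(Y, Y, ls).mean()

rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 1.0, size=500)     # sample from P
X2 = rng.normal(0.0, 1.0, size=500)     # another sample from P
Y  = rng.normal(0.5, 1.0, size=500)     # sample from Q (shifted mean)

print(mmd2(X1, X2))   # close to 0: same distribution
print(mmd2(X1, Y))    # clearly larger: P != Q
```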

  5. Estimating µ_X
     How do we estimate µ_X? µ_X = E φ(X) = ∫ k(x, ·) dP(x).
     If {X_i}_{i=1}^n ∼ P, then the empirical mean embedding
     µ̂ := (1/n) Σ_{i=1}^n k(X_i, ·) → µ_X.
     Writing 1_n = (1/n, ..., 1/n)^⊺ and Φ = (k(X_1, ·), ..., k(X_n, ·))^⊺, this is µ̂ = 1_n^⊺ Φ.
     [Figure: plot of the empirical mean embedding µ̂(x) against x.]
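
     A sketch of the empirical mean embedding evaluated on a grid (the kernel and sample are my assumptions; this reproduces the kind of curve plotted in the slide's figure).

```python
# Sketch of the empirical mean embedding mu_hat = (1/n) sum_i k(X_i, .),
# evaluated on a grid (kernel and sample are my assumptions; this reproduces
# the kind of curve plotted on the slide).
import numpy as np

def k(x, y, ls=1.0):
    return np.exp(-(x - y) ** 2 / (2 * ls ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=50)                                # X_1, ..., X_n ~ P
grid = np.linspace(-5, 5, 201)

mu_hat = k(X[:, None], grid[None, :]).mean(axis=0)     # mu_hat evaluated at each grid point
# Equivalently mu_hat = 1_n^T Phi, with 1_n = (1/n, ..., 1/n) and Phi_i = k(X_i, .).
```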

  6. Estimating µ_X
     In Muandet et al. 2015(?) (Kernel Mean Shrinkage Estimators), the risk of an estimator µ̂ is defined as
     ∆ = E ‖µ̂ − µ‖²_H
     and estimators that minimise ∆ are sought. Two proposals:
     ◮ For a particular α that can be estimated from observations: µ̂_α = (1 − α) µ̂ = (1 − α) 1_n^⊺ Φ
     ◮ For λ estimated (by cross-validation) from observations: µ̂_λ = Φ^⊺ (K + λI)^{-1} K 1_n (this looks like GP regression)
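
     A sketch of evaluating the two shrinkage estimators at test points z (the RBF kernel and the fixed values of α and λ are my assumptions; in the paper they are estimated from the observations).

```python
# Sketch of evaluating the two shrinkage estimators at test points z
# (the RBF kernel and the fixed alpha, lambda are my assumptions; the paper
# estimates them from the observations).
import numpy as np

def k(x, y, ls=1.0):
    return np.exp(-(x - y) ** 2 / (2 * ls ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=50)
z = np.linspace(-3, 3, 7)

K     = k(X[:, None], X[None, :])             # n x n Gram matrix
k_z   = k(X[:, None], z[None, :])             # k(X_i, z_j), shape (n, m)
one_n = np.full(len(X), 1.0 / len(X))         # the vector 1_n = (1/n, ..., 1/n)

alpha, lam = 0.1, 0.5
mu_hat    = k_z.T @ one_n                     # plain empirical embedding at z
mu_alpha  = (1 - alpha) * mu_hat              # shrinkage towards zero
mu_lambda = k_z.T @ np.linalg.solve(K + lam * np.eye(len(X)), K @ one_n)
```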

  7. Bayesian estimation of µ_X
     µ̂_α = (1 − α) 1_n^⊺ Φ        µ̂_λ = Φ^⊺ (K + λI)^{-1} K 1_n
     Kernel ridge regression ⇐⇒ MAP inference in GP regression.
     Can we show that these estimators are the MAP solutions to a Bayesian inference problem?
     Model 1: µ ∼ GP(0, k), µ̂ | µ ∼ GP(µ, γk)
     Model 2: µ ∼ GP(0, k), µ̂ | µ ∼ GP(µ, λ I[x = x′])
     Define 'pseudo-targets' µ̂(x) = K 1_n and then perform Bayesian inference.
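
     For reference, both derivations that follow are instances of the standard Gaussian conditioning identity from GP regression (a textbook result, not written out on the slide); y and Σ below are generic placeholders for the pseudo-targets K 1_n and the likelihood covariance (γK or λI respectively).

```latex
% Standard Gaussian conditioning identity (textbook GP regression result, not
% written on the slide). Here y stands for the pseudo-targets \hat\mu(x) = K 1_n
% and \Sigma for the likelihood covariance: \Sigma = \gamma K on slide 8 and
% \Sigma = \lambda I on slide 9.
\begin{pmatrix} \mu(z) \\ y \end{pmatrix}
  \sim \mathcal{N}\!\left( 0,
  \begin{pmatrix} k_{zz} & k_z^\top \\ k_z & K + \Sigma \end{pmatrix} \right)
\quad\Longrightarrow\quad
\mu(z) \mid y \sim \mathcal{N}\!\left(
  k_z^\top (K + \Sigma)^{-1}\, y,\;
  k_{zz} - k_z^\top (K + \Sigma)^{-1} k_z \right)
```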

  8. Deriving µ̂_α = (1 − α) Φ^⊺ 1_n
     Consider µ ∼ GP(0, k), µ̂ | µ ∼ GP(µ, γk).
     Choose a previously unobserved z and consider the joint distribution of (µ(z), µ̂(x)):
     (µ(z), µ̂(x)) ∼ N( 0, [[k_zz, k_z^⊺], [k_z, (1 + γ)K]] )
     ⇒ µ(z) | µ̂(x) ∼ N( (1/(1 + γ)) k_z^⊺ 1_n, k_zz − (1/(1 + γ)) k_z^⊺ K^{-1} k_z )
     So if 1/(1 + γ) = 1 − α, i.e. γ = α/(1 − α), then the MAP solution is µ̂_α = (1 − α) Φ^⊺ 1_n.
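
     A numerical sanity check of this step (my own, with an assumed RBF kernel and an arbitrary α): the GP posterior mean at z, with pseudo-targets K 1_n and likelihood covariance (1 + γ)K, equals (1 − α) µ̂(z) when γ = α/(1 − α).

```python
# Numerical sanity check of this slide (my own, with an assumed RBF kernel and
# an arbitrary alpha): with pseudo-targets K 1_n and likelihood covariance
# (1 + gamma) K, the GP posterior mean at z equals (1 - alpha) * mu_hat(z)
# exactly when gamma = alpha / (1 - alpha).
import numpy as np

def k(x, y, ls=1.0):
    return np.exp(-(x - y) ** 2 / (2 * ls ** 2))

rng = np.random.default_rng(5)
X, z = rng.normal(size=8), 0.3
K     = k(X[:, None], X[None, :])
k_z   = k(X, z)                                # vector (k(X_1, z), ..., k(X_n, z))
one_n = np.full(len(X), 1.0 / len(X))

alpha = 0.2
gamma = alpha / (1 - alpha)
post_mean = k_z @ np.linalg.solve((1 + gamma) * K, K @ one_n)   # k_z^T ((1+gamma)K)^{-1} K 1_n
print(post_mean, (1 - alpha) * (k_z @ one_n))                   # the two agree
```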

  9. Deriving µ̂_λ = Φ^⊺ (K + λI)^{-1} K 1_n
     Consider next µ ∼ GP(0, k), µ̂ | µ ∼ GP(µ, λ I[x = x′]).
     (µ(z), µ̂(x)) ∼ N( 0, [[k_zz, k_z^⊺], [k_z, K + λI]] )
     ⇒ µ(z) | µ̂(x) ∼ N( k_z^⊺ (K + λI)^{-1} K 1_n, k_zz − k_z^⊺ (K + λI)^{-1} k_z )
     Thus the MAP solution is µ̂_λ = Φ^⊺ (K + λI)^{-1} K 1_n.
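
     The analogous sanity check for this model (again an assumed RBF kernel and an arbitrary λ): the posterior mean and variance at z reproduce the normal distribution above, and the mean is µ̂_λ evaluated at z.

```python
# The analogous check for this model (again an assumed RBF kernel, arbitrary
# lambda): the posterior mean and variance at z reproduce the normal
# distribution above, and the mean is mu_lambda evaluated at z.
import numpy as np

def k(x, y, ls=1.0):
    return np.exp(-(x - y) ** 2 / (2 * ls ** 2))

rng = np.random.default_rng(6)
X, z = rng.normal(size=8), 0.3
K     = k(X[:, None], X[None, :])
k_z   = k(X, z)
one_n = np.full(len(X), 1.0 / len(X))

lam = 0.5
A = np.linalg.solve(K + lam * np.eye(len(X)), np.column_stack([K @ one_n, k_z]))
post_mean = k_z @ A[:, 0]                      # k_z^T (K + lam I)^{-1} K 1_n
post_var  = k(z, z) - k_z @ A[:, 1]            # k_zz - k_z^T (K + lam I)^{-1} k_z
print(post_mean, post_var)
```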

  10. Some problems
     Although we derive the same solutions, most of the approach taken above doesn't really make sense:
     ◮ The prior over µ is not sensible
     ◮ The likelihood for µ̂ is wrong; in fact, for large n, µ̂ ≈ GP(µ, (1/n)[C_XX − µ_X ⊗ µ_X])
     ◮ Uncertainty does not decay far away from observations as n grows

  11. Thanks! Discussion?
