An Introduction to Hilbert Space Embedding of Probability Measures
Krikamol Muandet
Max Planck Institute for Intelligent Systems, Tübingen, Germany
Jeju, South Korea, February 22, 2019
Reference
Kernel Mean Embedding of Distributions: A Review and Beyond. Muandet, Fukumizu, Sriperumbudur, and Schölkopf. Foundations and Trends in Machine Learning (FnT ML), 2017.
From Points to Measures
Embedding of Marginal Distributions
Embedding of Conditional Distributions
Future Directions
Classification Problem
[Figure: "Data in Input Space", a scatter plot of two classes (+1 and -1) over axes $x_1$ and $x_2$.]
Feature Map
$\phi : (x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$
[Figure: left panel "Data in Input Space" (axes $x_1$, $x_2$); right panel "Data in Feature Space" (axes $\phi_1$, $\phi_2$, $\phi_3$); classes +1 and -1.]
$\langle \phi(x), \phi(x') \rangle_{\mathbb{R}^3} = (x \cdot x')^2$
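The identity on this slide can be checked numerically. The following is a minimal sketch in plain Python; the helper names `phi` and `dot` and the two test points are illustrative choices, not part of the slides.

```python
import math

def phi(x):
    """Explicit feature map (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

def dot(u, v):
    """Euclidean inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

# Two arbitrary points in the input space R^2 (illustrative values).
x = (0.5, -1.0)
x_prime = (1.5, 0.25)

lhs = dot(phi(x), phi(x_prime))  # inner product in the feature space R^3
rhs = dot(x, x_prime) ** 2       # kernel evaluated directly in R^2
print(lhs, rhs)
```

The right-hand side never forms the three-dimensional feature vectors, which is the point of working with the kernel instead of the explicit map.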
Our recipe:
1. Construct a non-linear feature map $\phi : \mathcal{X} \to \mathcal{H}$.
2. Evaluate $D_\phi = \{\phi(x_1), \phi(x_2), \ldots, \phi(x_n)\}$.
3. Solve the learning problem in $\mathcal{H}$ using $D_\phi$.
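The three steps can be sketched end to end on a toy problem. The example below is an assumption-laden illustration, not the method used in the talk: the data (an inner disc labelled +1 and an outer ring labelled -1), the quadratic feature map, and the perceptron used as the linear learner in feature space are all chosen here for simplicity.

```python
import math
import random

def phi(x):
    """Step 1: quadratic feature map from R^2 to R^3."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

def sample_ring(rng, r_min, r_max):
    """Draw a point whose radius lies in [r_min, r_max] (hypothetical toy data)."""
    r = rng.uniform(r_min, r_max)
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(theta), r * math.sin(theta))

rng = random.Random(0)
data = [(sample_ring(rng, 0.0, 0.5), +1) for _ in range(20)]
data += [(sample_ring(rng, 1.0, 1.5), -1) for _ in range(20)]

# Step 2: evaluate the feature map on every training input.
features = [(phi(x), y) for x, y in data]

# Step 3: solve a linear problem in feature space (perceptron with bias).
w = [0.0, 0.0, 0.0]
b = 0.0
for _ in range(1000):
    mistakes = 0
    for z, y in features:
        score = sum(wi * zi for wi, zi in zip(w, z)) + b
        if y * score <= 0:
            w = [wi + y * zi for wi, zi in zip(w, z)]
            b += y
            mistakes += 1
    if mistakes == 0:
        break

def predict(x):
    """Classify an input point with the linear rule learned in feature space."""
    score = sum(wi * zi for wi, zi in zip(w, phi(x))) + b
    return 1 if score > 0 else -1

accuracy = sum(predict(x) == y for x, y in data) / len(data)
print(accuracy)
```

The two classes cannot be separated by a line in the input space, but in feature space the circle $x_1^2 + x_2^2 = r^2$ becomes a hyperplane, so a linear learner suffices.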
Kernels
Definition. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a kernel on $\mathcal{X}$ if there exists a Hilbert space $\mathcal{H}$ and a map $\phi : \mathcal{X} \to \mathcal{H}$ such that for all $x, x' \in \mathcal{X}$ we have $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. We call $\phi$ a feature map and $\mathcal{H}$ a feature space of $k$.
Example
1. $k(x, x') = (x \cdot x')^2$ for $x, x' \in \mathbb{R}^2$
◮ $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$
◮ $\mathcal{H} = \mathbb{R}^3$
2. $k(x, x') = (x \cdot x' + c)^m$ for $c > 0$, $x, x' \in \mathbb{R}^d$
◮ $\dim(\mathcal{H}) = \binom{d+m}{m}$
3. $k(x, x') = \exp(-\gamma \|x - x'\|_2^2)$
◮ $\mathcal{H} = \mathbb{R}^\infty$
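The kernels in examples 2 and 3 are cheap to evaluate even when the feature space is large or infinite-dimensional. A minimal sketch in plain Python follows; the function names and default parameter values are illustrative assumptions.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y, c=1.0, m=2):
    """Inhomogeneous polynomial kernel (x . y + c)^m with c > 0."""
    return (dot(x, y) + c) ** m

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||_2^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def polynomial_feature_dim(d, m):
    """Dimension binom(d + m, m) of the polynomial feature space on R^d."""
    return math.comb(d + m, m)

x, y = (1.0, 2.0), (3.0, -1.0)
print(polynomial_kernel(x, y, c=1.0, m=3))
print(gaussian_kernel(x, x))
print(polynomial_feature_dim(2, 2))
```

For $d = 2$ and $m = 2$ the dimension formula counts the six monomials $1, x_1, x_2, x_1^2, x_2^2, x_1 x_2$.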
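Positive Definite Kernels
Definition (Positive definiteness). A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called positive definite if, for all $n \in \mathbb{N}$, $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ and all $x_1, \ldots, x_n \in \mathcal{X}$, we have
$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_j, x_i) \ge 0.$
Equivalently, every Gram matrix $K = (k(x_j, x_i))_{i,j=1}^{n}$ is positive definite in this sense (that is, positive semi-definite).
Example (Any kernel is positive definite). Let $k$ be a kernel with feature map $\phi : \mathcal{X} \to \mathcal{H}$. Then
$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_j, x_i) = \left\langle \sum_{i=1}^{n} \alpha_i \phi(x_i), \sum_{j=1}^{n} \alpha_j \phi(x_j) \right\rangle_{\mathcal{H}} \ge 0.$
Positive definiteness is a necessary (and, for symmetric $k$, sufficient) condition for $k$ to be a kernel.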
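The definition can be probed directly by evaluating the quadratic form on a Gram matrix. The sketch below is a spot check under stated assumptions: it draws random coefficient vectors for a Gaussian kernel, which can only fail to find a violation and does not prove positive definiteness. The function `abs_distance` is a hypothetical non-example added here to show what a violation looks like.

```python
import math
import random

def gram(kernel, points):
    """Gram matrix with entries K[i][j] = kernel(x_j, x_i)."""
    return [[kernel(xj, xi) for xj in points] for xi in points]

def quad_form(K, alpha):
    """Double sum of alpha_i * alpha_j * K[i][j] over all i, j."""
    n = len(alpha)
    return sum(alpha[i] * alpha[j] * K[i][j]
               for i in range(n) for j in range(n))

def gaussian(x, y, gamma=1.0):
    """Gaussian kernel on the real line."""
    return math.exp(-gamma * (x - y) ** 2)

rng = random.Random(0)
points = [rng.uniform(-2.0, 2.0) for _ in range(5)]
K = gram(gaussian, points)

# Smallest value of the quadratic form over 1000 random coefficient vectors.
min_quad = min(
    quad_form(K, [rng.gauss(0.0, 1.0) for _ in points]) for _ in range(1000)
)

def abs_distance(x, y):
    """Symmetric function that is NOT positive definite."""
    return abs(x - y)

# For the points 0 and 1 with coefficients (1, -1) the form is negative.
counterexample = quad_form(gram(abs_distance, [0.0, 1.0]), [1.0, -1.0])
print(min_quad, counterexample)
```

The negative value for `abs_distance` shows that symmetry alone does not make a function a kernel.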
Reproducing Kernel Hilbert Spaces
Let $\mathcal{H}$ be a Hilbert space of functions mapping from $\mathcal{X}$ into $\mathbb{R}$.
1. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$ if we have $k(\cdot, x) \in \mathcal{H}$ for all $x \in \mathcal{X}$ and the reproducing property $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$ holds for all $f \in \mathcal{H}$ and all $x \in \mathcal{X}$.
2. The space $\mathcal{H}$ is called a reproducing kernel Hilbert space (RKHS) over $\mathcal{X}$ if for all $x \in \mathcal{X}$ the Dirac functional $\delta_x : \mathcal{H} \to \mathbb{R}$ defined by $\delta_x(f) := f(x)$, $f \in \mathcal{H}$, is continuous.
Remark: If $\|f_n - f\|_{\mathcal{H}} \to 0$ for $n \to \infty$, then for all $x \in \mathcal{X}$ we have $\lim_{n \to \infty} f_n(x) = f(x)$.
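For functions that are finite kernel expansions, $f = \sum_i \alpha_i k(\cdot, x_i)$, both the reproducing property and the continuity of the Dirac functional can be checked by hand. The sketch below assumes a Gaussian kernel on the real line; the centers, coefficients, and function names are illustrative, and the inner product is computed by bilinearity from $\langle k(\cdot, u), k(\cdot, v) \rangle_{\mathcal{H}} = k(u, v)$.

```python
import math

def k(x, y, gamma=0.5):
    """Gaussian kernel on the real line."""
    return math.exp(-gamma * (x - y) ** 2)

# f = sum_i alpha_i k(., x_i) is an element of the RKHS of k.
centers = [-1.0, 0.0, 2.0]
alphas = [0.5, -1.0, 2.0]

def f(x):
    """Pointwise evaluation of the kernel expansion f."""
    return sum(a * k(x, c) for a, c in zip(alphas, centers))

def inner(alphas_f, centers_f, alphas_g, centers_g):
    """RKHS inner product of two finite kernel expansions."""
    return sum(a * b * k(u, v)
               for a, u in zip(alphas_f, centers_f)
               for b, v in zip(alphas_g, centers_g))

x = 0.7
# k(., x) is the expansion with a single coefficient 1 at center x.
reproduced = inner(alphas, centers, [1.0], [x])

# Cauchy-Schwarz bound |f(x)| <= ||f||_H * sqrt(k(x, x)), which is why
# the Dirac functional is continuous.
norm_f = math.sqrt(inner(alphas, centers, alphas, centers))
bound = norm_f * math.sqrt(k(x, x))
print(reproduced, f(x), bound)
```

The bound also explains the remark: convergence in the RKHS norm forces pointwise convergence at every $x$.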
Reproducing Kernels
Lemma (Reproducing kernels are kernels). Let $\mathcal{H}$ be a Hilbert space over $\mathcal{X}$ with a reproducing kernel $k$. Then $\mathcal{H}$ is an RKHS and is also a feature space of $k$, where the feature map $\phi : \mathcal{X} \to \mathcal{H}$ is given by $\phi(x) = k(\cdot, x)$, $x \in \mathcal{X}$. We call $\phi$ the canonical feature map.
Proof. We fix an $x' \in \mathcal{X}$ and write $f := k(\cdot, x')$. Then, for $x \in \mathcal{X}$, the reproducing property yields
$\langle \phi(x'), \phi(x) \rangle = \langle k(\cdot, x'), k(\cdot, x) \rangle = \langle f, k(\cdot, x) \rangle = f(x) = k(x, x').$
Kernels and RKHSs
Theorem (Every RKHS has a unique reproducing kernel). Let $\mathcal{H}$ be an RKHS over $\mathcal{X}$. Then $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined by
$k(x, x') = \langle \delta_x, \delta_{x'} \rangle_{\mathcal{H}}, \quad x, x' \in \mathcal{X},$
is the only reproducing kernel of $\mathcal{H}$. Furthermore, if $(e_i)_{i \in I}$ is an orthonormal basis of $\mathcal{H}$, then for all $x, x' \in \mathcal{X}$ we have
$k(x, x') = \sum_{i \in I} e_i(x)\, e_i(x').$
Universal kernels. A continuous kernel $k$ on a compact metric space $\mathcal{X}$ is called universal if the RKHS $\mathcal{H}$ of $k$ is dense in $C(\mathcal{X})$, i.e., for every function $g \in C(\mathcal{X})$ and all $\varepsilon > 0$ there exists an $f \in \mathcal{H}$ such that $\|f - g\|_\infty \le \varepsilon$.
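From Points to Measures
[Figure: points $x$ and $y$ in the input space $\mathcal{X}$ are mapped to the functions $k(x, \cdot)$ and $k(y, \cdot)$ in the feature space $\mathcal{H}$, shown alongside a function $f$.]
$x \mapsto k(\cdot, x)$
$\delta_x \mapsto \int k(\cdot, z)\, d\delta_x(z)$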
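Embedding of Marginal Distributions
[Figure: densities $p(x)$ of two distributions $P$ and $Q$ are mapped to their embeddings $\mu_P$ and $\mu_Q$ in the RKHS $\mathcal{H}$, shown alongside a function $f$.]
Definition. Let $\mathcal{P}$ be the space of all probability measures on a measurable space $(\mathcal{X}, \Sigma)$ and $\mathcal{H}$ an RKHS endowed with a reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. A kernel mean embedding is defined by
$\mu : \mathcal{P} \to \mathcal{H}, \quad P \mapsto \int k(\cdot, x)\, dP(x).$
Remark: For a Dirac measure $\delta_x$, $\delta_x \mapsto \mu[\delta_x] \equiv x \mapsto k(\cdot, x)$.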
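When $P$ is only available through a sample $x_1, \ldots, x_n$, replacing $P$ by the empirical measure turns the integral into an average, $\hat{\mu}_P = \frac{1}{n} \sum_{i=1}^{n} k(\cdot, x_i)$. The following is a minimal sketch of that estimator, assuming a Gaussian kernel on the real line; the function name `mean_embedding` and the sample are illustrative.

```python
import math
import random

def k(x, y, gamma=0.5):
    """Gaussian kernel on the real line."""
    return math.exp(-gamma * (x - y) ** 2)

def mean_embedding(sample):
    """Return t -> (1/n) sum_i k(t, x_i), the embedding of the
    empirical measure of `sample`, as a Python function."""
    n = len(sample)
    return lambda t: sum(k(t, x) for x in sample) / n

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
mu_hat = mean_embedding(sample)
print(mu_hat(0.0), mu_hat(3.0))

# A single point corresponds to a Dirac measure, whose embedding is
# the canonical feature map k(., x0).
x0 = 1.3
mu_dirac = mean_embedding([x0])
print(mu_dirac(0.4), k(0.4, x0))
```

The embedding is itself a function on the input space, so evaluating it at a point `t` measures how much probability mass lies near `t` at the scale set by the kernel.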
Embedding of Marginal Distributions
◮ If $\mathbb{E}_{X \sim P}[\sqrt{k(X, X)}] < \infty$, then $\mu_P \in \mathcal{H}$ and $\mathbb{E}_{X \sim P}[f(X)] = \langle f, \mu_P \rangle$ for all $f \in \mathcal{H}$.
◮ The kernel $k$ is said to be characteristic if the map $P \mapsto \mu_P$ is injective. That is, $\|\mu_P - \mu_Q\|_{\mathcal{H}} = 0$ if and only if $P = Q$.
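The distance $\|\mu_P - \mu_Q\|_{\mathcal{H}}$, commonly called the maximum mean discrepancy (MMD), can be estimated from two samples by expanding the squared norm into kernel averages. The sketch below uses the biased (V-statistic) estimator with a Gaussian kernel; the sample sizes, distributions, and function names are assumptions made for illustration.

```python
import math
import random

def k(x, y, gamma=0.5):
    """Gaussian kernel on the real line (a characteristic kernel)."""
    return math.exp(-gamma * (x - y) ** 2)

def mean_k(xs, ys):
    """Average of k(x, y) over all pairs, i.e. <mu_hat_X, mu_hat_Y>_H."""
    return sum(k(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd2(xs, ys):
    """Biased estimate of ||mu_P - mu_Q||_H^2 from two samples."""
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)]   # sample from P
xs2 = [rng.gauss(0.0, 1.0) for _ in range(50)]  # second sample from P
ys = [rng.gauss(3.0, 1.0) for _ in range(50)]   # sample from Q, shifted mean

same = mmd2(xs, xs2)  # small: both samples come from the same distribution
diff = mmd2(xs, ys)   # clearly larger: the distributions differ
print(same, diff)
```

Each kernel average is an inner product of empirical embeddings, which is the reproducing property applied to $\hat{\mu}_P$ and $\hat{\mu}_Q$.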