RKHS as Feature Space

Let H be a Hilbert space of functions on X with reproducing kernel k. Then H is an RKHS with inner product ⟨·,·⟩ and is also a feature space of k, where the feature map φ : X → H is given by φ(x) = k(·, x), so that k(x, x′) = ⟨φ(x), φ(x′)⟩. We call φ the canonical feature map.

Proof. Fix x′ ∈ X and write f := k(·, x′). Then, for every x ∈ X, the reproducing property implies
⟨φ(x′), φ(x)⟩ = ⟨k(·, x′), k(·, x)⟩ = ⟨f, k(·, x)⟩ = f(x) = k(x, x′). 12/53
RKHS as Feature Space

Universal kernels (Steinwart 2002)
A continuous kernel k on a compact metric space X is called universal if the RKHS H of k is dense in C(X), i.e., for every function g ∈ C(X) and all ε > 0 there exists an f ∈ H such that ‖f − g‖_∞ ≤ ε.

Universal approximation theorem (Cybenko 1989)
Given any ε > 0 and f ∈ C(X), there exists
h(x) = Σ_{i=1}^n α_i ϕ(w_i⊤ x + b_i)
such that |f(x) − h(x)| < ε for all x ∈ X. 13/53
Quick Summary

◮ A positive definite kernel k(x, x′) defines an implicit feature map (see the short sketch after this slide):
  k(x, x′) = ⟨φ(x), φ(x′)⟩_H
◮ There exists a unique reproducing kernel Hilbert space (RKHS) H of functions on X for which k is a reproducing kernel:
  f(x) = ⟨f, k(·, x)⟩_H,   k(x, x′) = ⟨k(·, x), k(·, x′)⟩_H.
◮ Implicit representation of data points:
  ◮ Support vector machine (SVM)
  ◮ Gaussian process (GP)
  ◮ Neural tangent kernel (NTK)
◮ Good references on kernel methods:
  ◮ Support Vector Machines (2008), Steinwart and Christmann.
  ◮ Gaussian Processes for Machine Learning (2006), Rasmussen and Williams.
  ◮ Learning with Kernels (2002), Schölkopf and Smola.
14/53
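The "implicit feature map" point can be made concrete with a small example that is not part of the original slides: for a degree-2 polynomial kernel in two dimensions, the explicit feature map and the direct kernel evaluation give the same inner product. All names below are illustrative.

```python
import numpy as np

def poly2_feature_map(x):
    """Explicit feature map of k(x, x') = (<x, x'> + 1)^2 for x in R^2:
    phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly2_kernel(x, y):
    """Implicit evaluation of the same kernel: no explicit features needed."""
    return (np.dot(x, y) + 1.0) ** 2

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
lhs = np.dot(poly2_feature_map(x), poly2_feature_map(y))  # <phi(x), phi(y)>
rhs = poly2_kernel(x, y)                                  # k(x, y)
print(lhs, rhs)  # identical up to floating-point rounding
```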
Kernel Methods · From Points to Probability Measures · Embedding of Marginal Distributions · Embedding of Conditional Distributions · Recent Development 15/53
Probability Measures

[Figure: example problems over probability measures — learning on distributions/point clouds, group anomaly/OOD detection, generalization across environments, and statistical and causal inference — each contrasting the input space (samples x, (X_k, Y_k)) with the distribution space (P, Q, P¹_XY, …, P^N_XY).] 16/53
Embedding of Dirac Measures

[Figure: points x, y in the input space X are mapped to the functions k(x, ·), k(y, ·) in the feature space H.]

x ↦ k(·, x),   δ_x ↦ ∫ k(·, z) dδ_x(z) = k(·, x) 17/53
Kernel Methods · From Points to Probability Measures · Embedding of Marginal Distributions · Embedding of Conditional Distributions · Recent Development 18/53
Embedding of Marginal Distributions

[Figure: densities p(x) of two distributions P and Q are mapped to their mean embeddings µ_P, µ_Q in the RKHS H.]

Probability measure
Let P be a probability measure defined on a measurable space (X, Σ) with σ-algebra Σ.

Kernel mean embedding
Let 𝒫 be the space of all probability measures on (X, Σ). The kernel mean embedding is defined by
µ : 𝒫 → H,   P ↦ ∫ k(·, x) dP(x).

Remark: if the kernel k is bounded, then k(·, x) is Bochner integrable with respect to any P, so the embedding is well defined. 19/53
Embedding of Marginal Distributions

◮ If E_{X∼P}[√k(X, X)] < ∞, then µ_P ∈ H and, for all f ∈ H,
  ⟨f, µ_P⟩ = ⟨f, E_{X∼P}[k(·, X)]⟩ = E_{X∼P}[⟨f, k(·, X)⟩] = E_{X∼P}[f(X)].
◮ The kernel k is said to be characteristic if the map P ↦ µ_P is injective, i.e., ‖µ_P − µ_Q‖_H = 0 if and only if P = Q.
20/53
Interpretation of Kernel Mean Representation

What properties are captured by µ_P?
◮ k(x, x′) = ⟨x, x′⟩ → the first moment of P
◮ k(x, x′) = (⟨x, x′⟩ + 1)^p → moments of P up to order p ∈ ℕ
◮ k(x, x′) universal/characteristic → all information about P

Moment-generating function
Consider k(x, x′) = exp(⟨x, x′⟩). Then µ_P = E_{X∼P}[e^{⟨X, ·⟩}].

Characteristic function
If k(x, y) = ψ(x − y) where ψ is a positive definite function, then
µ_P(y) = ∫ ψ(x − y) dP(x),
which is determined by Λ_k · ϕ_P, i.e., the characteristic function ϕ_P of P weighted by the positive finite measure Λ_k given by Bochner's theorem. 21/53
Characteristic Kernels

◮ All universal kernels are characteristic, but characteristic kernels may not be universal.
◮ Important characterizations:
  ◮ Discrete kernels on a discrete space
  ◮ Shift-invariant kernels on ℝ^d whose Fourier transform has full support
  ◮ Integrally strictly positive definite (ISPD) kernels
  ◮ Characteristic kernels on groups
◮ Examples of characteristic kernels (see the sketch after this slide):
  Gaussian RBF kernel: k(x, x′) = exp(−‖x − x′‖₂² / (2σ²))
  Laplacian kernel: k(x, x′) = exp(−‖x − x′‖₁ / σ)
◮ Kernel choice vs. parametric assumption:
  ◮ A parametric assumption is susceptible to model misspecification.
  ◮ But the choice of kernel matters in practice.
  ◮ We can optimize the kernel to maximize performance on the downstream task.
22/53
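As a concrete companion to the two example kernels (my addition, not from the slides), here is a minimal NumPy sketch of their Gram matrices; the bandwidth σ is an arbitrary illustrative choice.

```python
import numpy as np

def gaussian_rbf_gram(X, Y, sigma=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * sigma**2))

def laplacian_gram(X, Y, sigma=1.0):
    """Laplacian kernel k(x, x') = exp(-||x - x'||_1 / sigma)."""
    l1_dists = np.abs(X[:, None, :] - Y[None, :, :]).sum(-1)
    return np.exp(-l1_dists / sigma)

X = np.random.randn(5, 3)
K = gaussian_rbf_gram(X, X)                      # 5x5 symmetric Gram matrix
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # True: positive semi-definite
```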
Kernel Mean Estimation

◮ Given an i.i.d. sample x₁, x₂, …, xₙ from P, we can estimate µ_P by
  µ̂_P := (1/n) Σ_{i=1}^n k(x_i, ·) ∈ H,   corresponding to P̂ = (1/n) Σ_{i=1}^n δ_{x_i}.
◮ For each f ∈ H, we have E_{X∼P̂}[f(X)] = ⟨f, µ̂_P⟩ (see the sketch after this slide).
◮ Consistency: with probability at least 1 − δ,
  ‖µ̂_P − µ_P‖_H ≤ 2√(E_{X∼P}[k(X, X)] / n) + √(2 log(1/δ) / n).
◮ The rate O_p(n^{−1/2}) was shown to be minimax optimal.³
◮ Similar to James–Stein estimators, the estimate can be improved by shrinkage estimators:⁴
  µ̂_α := α f* + (1 − α) µ̂_P,   f* ∈ H.

³ Tolstikhin et al. Minimax Estimation of Kernel Mean Embeddings. JMLR, 2017.
⁴ Muandet et al. Kernel Mean Shrinkage Estimators. JMLR, 2016.
23/53
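A minimal NumPy sketch (my addition, not from the slides) of the empirical embedding µ̂_P, together with a numerical check that ⟨f, µ̂_P⟩ equals the empirical mean of f for an f in the span of kernel functions; the expansion points and coefficients of f are arbitrary.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # i.i.d. sample x_1, ..., x_n from P

def mu_hat(t, X, sigma=1.0):
    """mu_hat_P(t) = (1/n) sum_i k(x_i, t): the embedding evaluated at t."""
    return np.mean([rbf(x, t, sigma) for x in X])

# For f = sum_j c_j k(., z_j) in H, <f, mu_hat_P> = (1/n) sum_i f(x_i):
# the embedding turns RKHS inner products into empirical expectations.
Z = rng.normal(size=(3, 2))            # expansion points of f (arbitrary)
c = np.array([0.5, -1.0, 2.0])         # expansion coefficients (arbitrary)
f = lambda t: sum(cj * rbf(zj, t) for cj, zj in zip(c, Z))

inner = sum(cj * mu_hat(zj, X) for cj, zj in zip(c, Z))   # <f, mu_hat_P>_H
emp_mean = np.mean([f(x) for x in X])                      # E_{X~P_hat}[f(X)]
print(np.isclose(inner, emp_mean))                         # True
```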
Recovering Samples/Distributions

◮ An approximate pre-image problem:
  θ* = arg min_θ ‖µ̂ − µ_{P_θ}‖²_H.
◮ The distribution P_θ is assumed to lie in a certain class, e.g., a Gaussian mixture
  P_θ(x) = Σ_{k=1}^K π_k N(x; µ_k, Σ_k),   Σ_{k=1}^K π_k = 1.
◮ Kernel herding generates deterministic pseudo-samples by greedily minimizing the squared error (see the sketch after this slide)
  E²_T = ‖µ_P − (1/T) Σ_{t=1}^T k(·, x_t)‖²_H.
◮ Negative autocorrelation: O(1/T) rate of convergence.
◮ Deep generative models (see the following slides).
24/53
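Kernel herding can be sketched as a greedy loop. The version below is a simplified illustration (my assumptions, not the slides' algorithm): it searches over a finite candidate pool and uses an empirical estimate of µ_P from a large sample.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
sample = rng.normal(size=(2000, 2))        # large sample standing in for P
candidates = rng.normal(size=(500, 2))     # finite candidate pool (simplification)

mu_P = rbf_gram(candidates, sample).mean(axis=1)   # mu_P(x) estimated on candidates
K_cc = rbf_gram(candidates, candidates)

T, selected = 50, []
running = np.zeros(len(candidates))        # sum_t k(candidate, x_t)
for t in range(T):
    # Greedy herding step: maximize mu_P(x) - (1/(t+1)) * sum_{s<=t} k(x, x_s),
    # which greedily reduces the squared error E_T^2 on the slide.
    scores = mu_P - running / (t + 1)
    j = int(np.argmax(scores))
    selected.append(candidates[j])
    running += K_cc[:, j]

pseudo = np.array(selected)                # T deterministic pseudo-samples
print(pseudo.mean(axis=0), sample.mean(axis=0))  # herded mean tracks the sample mean
```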
Quick Summary

◮ A kernel mean embedding of a distribution P:
  µ_P := ∫ k(·, x) dP(x),   µ̂_P := (1/n) Σ_{i=1}^n k(x_i, ·).
◮ If k is characteristic, µ_P captures all information about P.
◮ All universal kernels are characteristic, but not vice versa.
◮ The empirical µ̂_P requires no parametric assumption about P.
◮ It can be estimated consistently, i.e., with probability at least 1 − δ,
  ‖µ̂_P − µ_P‖_H ≤ 2√(E_{X∼P}[k(X, X)] / n) + √(2 log(1/δ) / n).
◮ Given the embedding µ̂_P, it is possible to reconstruct the distribution or generate samples from it.
25/53
Application: High-Level Generalization

[Figure: four application panels.]
◮ Learning from distributions — KM, Fukumizu, Dinuzzo, Schölkopf. NIPS 2012.
◮ Group anomaly detection — KM and Schölkopf. UAI 2013.
◮ Domain generalization — KM et al. ICML 2013; Zhang, KM, et al. ICML 2013.
◮ Cause-effect inference — Lopez-Paz, KM, et al. JMLR 2015, ICML 2015.
26/53
Support Measure Machine (SMM)
KM, K. Fukumizu, F. Dinuzzo, and B. Schölkopf (NeurIPS 2012)

x ↦ k(·, x),   δ_x ↦ ∫ k(·, z) dδ_x(z),   P ↦ ∫ k(·, z) dP(z)

Training data: (P₁, y₁), (P₂, y₂), …, (Pₙ, yₙ) ∼ 𝒫 × Y

Theorem (Distributional representer theorem)
Under technical assumptions on Ω : [0, +∞) → ℝ and a loss function ℓ : (𝒫 × ℝ²)^m → ℝ ∪ {+∞}, any f ∈ H minimizing
ℓ(P₁, y₁, E_{P₁}[f], …, P_m, y_m, E_{P_m}[f]) + Ω(‖f‖_H)
admits a representation of the form
f = Σ_{i=1}^m α_i E_{x∼P_i}[k(x, ·)] = Σ_{i=1}^m α_i µ_{P_i}.
27/53
Supervised Learning on Point Clouds

Training set (S₁, y₁), …, (Sₙ, yₙ) with S_i = {x_j^(i)} ∼ P_i(X).

◮ Causal prediction: is X → Y or X ← Y?
  Lopez-Paz, KM, B. Schölkopf, I. Tolstikhin. JMLR 2015, ICML 2015.
◮ Topological data analysis
  G. Kusano, K. Fukumizu, and Y. Hiraoka. JMLR 2018.
28/53
Domain Generalization
Blanchard et al., NeurIPS 2012; KM, D. Balduzzi, B. Schölkopf, ICML 2013

[Figure: training data (X_k, Y_k) drawn from N domains P¹_XY, …, P^N_XY sampled from 𝒫, and unseen test data X_k from a new domain P_X.]

K((P_i, x), (P_j, x̃)) = k₁(P_i, P_j) k₂(x, x̃) = k₁(µ_{P_i}, µ_{P_j}) k₂(x, x̃)
(a sketch of this product kernel follows below). 29/53
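One way to instantiate the product kernel above — a hedged sketch, not necessarily the construction used in the cited papers — is to take k₁ as a Gaussian kernel on the RKHS distance between the empirical mean embeddings (i.e., on the biased MMD between domains). All bandwidths and names are illustrative.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased MMD^2 between the empirical embeddings of two domains."""
    return (rbf_gram(X, X, sigma).mean() + rbf_gram(Y, Y, sigma).mean()
            - 2 * rbf_gram(X, Y, sigma).mean())

def domain_kernel(Xi, xi, Xj, xj, sigma=1.0, sigma_p=1.0):
    """K((P_i, x), (P_j, x~)) = k1(mu_Pi, mu_Pj) * k2(x, x~), with k1 a Gaussian
    kernel on the RKHS distance between the empirical mean embeddings."""
    k1 = np.exp(-mmd2_biased(Xi, Xj, sigma) / (2 * sigma_p**2))
    k2 = np.exp(-np.sum((xi - xj)**2) / (2 * sigma**2))
    return k1 * k2

rng = np.random.default_rng(0)
Xi, Xj = rng.normal(0, 1, (100, 2)), rng.normal(0.5, 1, (100, 2))  # two domains
print(domain_kernel(Xi, Xi[0], Xj, Xj[0]))
```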
Comparing Distributions

◮ Maximum mean discrepancy (MMD) corresponds to the RKHS distance between mean embeddings:
  MMD²(P, Q, H) = ‖µ_P − µ_Q‖²_H = ‖µ_P‖²_H − 2⟨µ_P, µ_Q⟩_H + ‖µ_Q‖²_H.
◮ MMD is an integral probability metric (IPM):
  MMD(P, Q, H) := sup_{h∈H, ‖h‖≤1} | ∫ h(x) dP(x) − ∫ h(x) dQ(x) |.
◮ If k is characteristic, then ‖µ_P − µ_Q‖_H = 0 if and only if P = Q.
◮ Given {x_i}_{i=1}^n ∼ P and {y_j}_{j=1}^m ∼ Q, the (unbiased) empirical MMD is
  MMD²_u(P, Q, H) = 1/(n(n−1)) Σ_{i=1}^n Σ_{j≠i} k(x_i, x_j) + 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} k(y_i, y_j) − 2/(nm) Σ_{i=1}^n Σ_{j=1}^m k(x_i, y_j)
  (see the sketch after this slide).
30/53
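The unbiased estimator translates directly into code. A short NumPy sketch (my addition) with a Gaussian RBF kernel; the bandwidth and sample sizes are arbitrary.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased MMD^2_u: diagonal terms are excluded from the within-sample sums."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf_gram(X, X, sigma), rbf_gram(Y, Y, sigma), rbf_gram(X, Y, sigma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))   # sample from P
Y = rng.normal(0.5, 1.0, size=(500, 1))   # sample from Q (shifted mean)
print(mmd2_unbiased(X, Y))                                     # noticeably > 0
print(mmd2_unbiased(X, rng.normal(0.0, 1.0, size=(500, 1))))   # close to 0
```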
Kernel Two-Sample Testing
Gretton et al., JMLR 2012

[Figure: samples drawn from P and Q.]

Question: Given {x_i}_{i=1}^n ∼ P and {y_j}_{j=1}^n ∼ Q, check whether P = Q.
H₀ : P = Q,   H₁ : P ≠ Q

◮ MMD test statistic:
  t² = MMD²_u(P, Q, H) = 1/(n(n−1)) Σ_{1≤i≠j≤n} h((x_i, y_i), (x_j, y_j)),
  where h((x_i, y_i), (x_j, y_j)) = k(x_i, x_j) + k(y_i, y_j) − k(x_i, y_j) − k(x_j, y_i)
  (a permutation-test sketch follows below).
31/53
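In practice the null distribution of MMD²_u is often approximated by permuting the pooled sample. The sketch below is a simplified illustration (my addition, not the exact procedure of Gretton et al.); the number of permutations, bandwidth, and level α are arbitrary.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf_gram(X, X, sigma), rbf_gram(Y, Y, sigma), rbf_gram(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

def mmd_permutation_test(X, Y, num_perms=500, alpha=0.05, sigma=1.0, seed=0):
    """Approximate the null distribution of MMD^2_u by permuting the pooled sample."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, sigma)
    pooled, n = np.vstack([X, Y]), len(X)
    null_stats = []
    for _ in range(num_perms):
        perm = rng.permutation(len(pooled))
        null_stats.append(mmd2_unbiased(pooled[perm[:n]], pooled[perm[n:]], sigma))
    p_value = np.mean(np.array(null_stats) >= observed)
    return observed, p_value, p_value < alpha   # reject H0 if p < alpha

rng = np.random.default_rng(1)
X, Y = rng.normal(0, 1, (200, 1)), rng.normal(0.3, 1, (200, 1))
print(mmd_permutation_test(X, Y))
```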
Generative Adversarial Networks

Learn a deep generative model G via a minimax optimization
min_G max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))]
where D is a discriminator and z ∼ N(0, σ²I).

[Figure: the generator G_θ maps random noise z to synthetic data {G_θ(z_i)}; the discriminator D_φ receives x or G_θ(z) and decides "real or synthetic?". Replacing the discriminator with an MMD test asks whether ‖µ̂_X − µ̂_{G_θ(Z)}‖_H is zero.]
32/53
Generative Moment Matching Network

◮ The GAN aims to match the two distributions P(X) and G_θ.
◮ The generative moment matching network (GMMN), proposed by Dziugaite et al. (2015) and Li et al. (2015), considers (see the sketch after this slide)
  min_θ ‖µ_X − µ_{G_θ(Z)}‖²_H = min_θ ‖ ∫ φ(X) dP(X) − ∫ φ(X̃) dG_θ(X̃) ‖²_H = min_θ ( sup_{h∈H, ‖h‖≤1} | ∫ h dP − ∫ h dG_θ | )².
◮ Many tricks have been proposed to improve the GMMN:
  ◮ Optimized kernels and feature extractors (Sutherland et al., 2017; Li et al., 2017a)
  ◮ Gradient regularization (Binkowski et al., 2018; Arbel et al., 2018)
  ◮ Repulsive loss (Wang et al., 2019)
  ◮ Optimized witness points (Mehrjou et al., 2019)
  ◮ Etc.
33/53
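A minimal PyTorch sketch of the GMMN idea (my addition): a toy generator is trained by minimizing a minibatch MMD with a Gaussian kernel. The network size, learning rate, bandwidth, and synthetic target data are arbitrary choices, not the architecture or kernel of the cited papers.

```python
import torch
import torch.nn as nn

def rbf_gram(X, Y, sigma=1.0):
    d2 = torch.cdist(X, Y) ** 2
    return torch.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased minibatch MMD^2 between real and generated batches."""
    return (rbf_gram(X, X, sigma).mean() + rbf_gram(Y, Y, sigma).mean()
            - 2 * rbf_gram(X, Y, sigma).mean())

# Toy generator: maps 8-dim Gaussian noise to 2-dim samples.
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(G.parameters(), lr=1e-3)

real_data = torch.randn(2000, 2) * 0.5 + torch.tensor([2.0, -1.0])  # stand-in for P(X)

for step in range(2000):
    x = real_data[torch.randint(0, len(real_data), (256,))]  # minibatch from P
    z = torch.randn(256, 8)                                   # noise batch
    loss = mmd2_biased(x, G(z))     # match embeddings: ||mu_X - mu_G(Z)||_H^2
    opt.zero_grad()
    loss.backward()
    opt.step()

print(G(torch.randn(1000, 8)).mean(0))  # should move toward the real mean [2, -1]
```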
Kernel Methods · From Points to Probability Measures · Embedding of Marginal Distributions · Embedding of Conditional Distributions · Recent Development 34/53
Conditional Distribution P(Y | X)

[Figure: a joint sample of (X, Y) with the conditional distributions P(Y | X = x) at different values of x.]

A collection of distributions 𝒫_Y := {P(Y | X = x) : x ∈ X}.

◮ For each x ∈ X, we can define an embedding of P(Y | X = x) as
  µ_{Y|x} := ∫_Y ϕ(y) dP(y | X = x) = E_{Y|x}[ϕ(Y)],
  where ϕ : Y → G is a feature map of Y.
35/53
Embedding of Conditional Distributions

[Figure: the conditional density p(y | x) of P(Y | X = x) in the input space is mapped to the embedding µ_{Y|X=x} = C_YX C_XX^{−1} k(x, ·) in the RKHS G.]

The conditional mean embedding of P(Y | X) can be defined as the operator
U_{Y|X} : H → G,   U_{Y|X} := C_YX C_XX^{−1}.
36/53
Conditional Mean Embedding

◮ To fully represent P(Y | X), we need to perform conditioning and take conditional expectations.
◮ To represent P(Y | X = x) for x ∈ X, it follows that
  E_{Y|x}[ϕ(Y) | X = x] = U_{Y|X} k(x, ·) = C_YX C_XX^{−1} k(x, ·) =: µ_{Y|x}.
◮ It follows from the reproducing property of G that
  E_{Y|x}[g(Y) | X = x] = ⟨µ_{Y|x}, g⟩_G,   ∀g ∈ G.
◮ In an infinite-dimensional RKHS, C_XX^{−1} does not exist. Hence, we often use the regularized operator
  U_{Y|X} := C_YX (C_XX + εI)^{−1}.
◮ Conditional mean estimator (see the sketch after this slide):
  µ̂_{Y|x} = Σ_{i=1}^n β_i(x) ϕ(y_i),   β(x) := (K + nεI)^{−1} k_x.
37/53
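A NumPy sketch of the conditional mean estimator above on a toy regression problem (my example, not from the slides): with g(y) = y, the weights β(x) = (K + nεI)^{−1} k_x yield a kernel-ridge-style estimate of E[Y | X = x]. The data, bandwidth, and regularization ε are arbitrary.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n, eps = 500, 1e-3
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(n, 1))    # toy joint distribution of (X, Y)

K = rbf_gram(X, X)                               # Gram matrix on the inputs

def conditional_mean(g_values, x_query, sigma=1.0):
    """E[g(Y) | X = x] ~ sum_i beta_i(x) g(y_i), beta(x) = (K + n*eps*I)^{-1} k_x."""
    k_x = rbf_gram(X, np.atleast_2d(x_query), sigma)            # k_x = (k(x_i, x))_i
    beta = np.linalg.solve(K + n * eps * np.eye(n), k_x)        # weights beta(x)
    return (g_values.T @ beta).item()

g = Y    # g(y) = y, so we estimate the conditional mean E[Y | X = x]
print(conditional_mean(g, np.array([1.0])), np.sin(1.0))        # estimate vs ground truth
```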
Counterfactual Mean Embedding
KM, Kanagawa, Saengkyongam, Marukatat, JMLR 2020 (accepted)

In economics, social science, and public policy, we need to evaluate the distributional treatment effect (DTE)
P_{Y*₀}(·) − P_{Y*₁}(·)
where Y*₀ and Y*₁ are potential outcomes of a treatment policy T.

◮ We can only observe either P_{Y*₀} or P_{Y*₁}.
◮ Counterfactual distribution:
  P_{Y⟨0|1⟩}(y) = ∫ P_{Y₀|X₀}(y | x) dP_{X₁}(x).
◮ The counterfactual distribution P_{Y⟨0|1⟩}(y) can be estimated using the kernel mean embedding (see the sketch after this slide).
38/53
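A rough sketch of how such an estimate can be formed; heavily hedged: the weighting below simply combines the conditional-mean construction from the previous slides with synthetic data, and is an illustration rather than the exact estimator of the paper. Control outcomes are re-weighted toward the treated covariate distribution and compared, in the RKHS, with the observed treated outcomes.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n, m, eps = 400, 300, 1e-3
X0 = rng.normal(0.0, 1.0, (n, 2))                                    # control covariates
Y0 = X0 @ np.array([[1.0], [0.5]]) + 0.1 * rng.normal(size=(n, 1))   # control outcomes
X1 = rng.normal(0.5, 1.0, (m, 2))                                    # treated covariates
Y1 = X1 @ np.array([[1.0], [0.5]]) + 1.0 + 0.1 * rng.normal(size=(m, 1))  # treated outcomes

# Counterfactual embedding mu_{Y<0|1>} = sum_i beta_i phi(y_i^0), with
# beta = (K00 + n*eps*I)^{-1} K01 1_m / m: control outcomes re-weighted
# toward the treated covariate distribution via the conditional mean map.
K00, K01 = rbf_gram(X0, X0), rbf_gram(X0, X1)
beta = np.linalg.solve(K00 + n * eps * np.eye(n), K01 @ np.ones((m, 1)) / m)

# Squared RKHS distance between the counterfactual embedding and the embedding
# of the observed treated outcomes: a plug-in DTE-type statistic.
L00, L01, L11 = rbf_gram(Y0, Y0), rbf_gram(Y0, Y1), rbf_gram(Y1, Y1)
dist2 = (beta.T @ L00 @ beta
         - 2 * beta.T @ L01.mean(axis=1, keepdims=True)
         + L11.mean()).item()
print(dist2)   # > 0 here because the treatment shifts the outcome by +1
```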
Quantum Mean Embedding 39/53
Quick Summary

◮ Many applications require information in P(Y | X).
◮ The Hilbert space embedding of P(Y | X) is not a single element but an operator U_{Y|X} mapping from H to G:
  µ_{Y|x} = U_{Y|X} k(x, ·) = C_YX C_XX^{−1} k(x, ·),
  ⟨µ_{Y|x}, g⟩_G = E_{Y|x}[g(Y) | X = x].
40/53