RKHS as Feature Space

Let H be a Hilbert space of functions on X with reproducing kernel k. Then H is an RKHS with inner product ⟨·,·⟩ and is also a feature space of k, where the feature map φ : X → H is given by φ(x) = k(·, x), so that k(x, x′) = ⟨φ(x), φ(x′)⟩. We call φ the canonical feature map.

Proof. Fix x′ ∈ X and write f := k(·, x′). Then, for every x ∈ X, the reproducing property implies
⟨φ(x′), φ(x)⟩ = ⟨k(·, x′), k(·, x)⟩ = ⟨f, k(·, x)⟩ = f(x) = k(x, x′). 12/53
RKHS as Feature Space

Universal kernels (Steinwart 2002)
A continuous kernel k on a compact metric space X is called universal if the RKHS H of k is dense in C(X), i.e., for every function g ∈ C(X) and all ε > 0 there exists an f ∈ H such that ‖f − g‖_∞ ≤ ε.

Universal approximation theorem (Cybenko 1989)
Given any ε > 0 and f ∈ C(X), there exists
h(x) = Σ_{i=1}^n α_i ϕ(w_i⊤ x + b_i)
such that |f(x) − h(x)| < ε for all x ∈ X. 13/53
Quick Summary

◮ A positive definite kernel k(x, x′) defines an implicit feature map (see the short sketch after this slide):
  k(x, x′) = ⟨φ(x), φ(x′)⟩_H
◮ There exists a unique reproducing kernel Hilbert space (RKHS) H of functions on X for which k is a reproducing kernel:
  f(x) = ⟨f, k(·, x)⟩_H,   k(x, x′) = ⟨k(·, x), k(·, x′)⟩_H.
◮ Implicit representation of data points:
  ◮ Support vector machine (SVM)
  ◮ Gaussian process (GP)
  ◮ Neural tangent kernel (NTK)
◮ Good references on kernel methods:
  ◮ Support Vector Machines (2008), Steinwart and Christmann.
  ◮ Gaussian Processes for Machine Learning (2006), Rasmussen and Williams.
  ◮ Learning with Kernels (2002), Schölkopf and Smola.
14/53
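The "implicit feature map" point can be made concrete with a small example that is not part of the original slides: for a degree-2 polynomial kernel in two dimensions, the explicit feature map and the direct kernel evaluation give the same inner product. All names below are illustrative.

```python
import numpy as np

def poly2_feature_map(x):
    """Explicit feature map of k(x, x') = (<x, x'> + 1)^2 for x in R^2:
    phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly2_kernel(x, y):
    """Implicit evaluation of the same kernel: no explicit features needed."""
    return (np.dot(x, y) + 1.0) ** 2

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
lhs = np.dot(poly2_feature_map(x), poly2_feature_map(y))  # <phi(x), phi(y)>
rhs = poly2_kernel(x, y)                                  # k(x, y)
print(lhs, rhs)  # identical up to floating-point rounding
```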
Kernel Methods · From Points to Probability Measures · Embedding of Marginal Distributions · Embedding of Conditional Distributions · Recent Development 15/53
Probability Measures

[Figure: example problems over probability measures — learning on distributions/point clouds, group anomaly/OOD detection, generalization across environments, and statistical and causal inference — each contrasting the input space (samples x, (X_k, Y_k)) with the distribution space (P, Q, P¹_XY, …, P^N_XY).] 16/53
Embedding of Dirac Measures

[Figure: points x, y in the input space X are mapped to the functions k(x, ·), k(y, ·) in the feature space H.]

x ↦ k(·, x),   δ_x ↦ ∫ k(·, z) dδ_x(z) = k(·, x) 17/53
Kernel Methods · From Points to Probability Measures · Embedding of Marginal Distributions · Embedding of Conditional Distributions · Recent Development 18/53
Embedding of Marginal Distributions

[Figure: densities p(x) of two distributions P and Q are mapped to their mean embeddings µ_P, µ_Q in the RKHS H.]

Probability measure
Let P be a probability measure defined on a measurable space (X, Σ) with σ-algebra Σ.

Kernel mean embedding
Let 𝒫 be the space of all probability measures on (X, Σ). The kernel mean embedding is defined by
µ : 𝒫 → H,   P ↦ ∫ k(·, x) dP(x).

Remark: if the kernel k is bounded, then k(·, x) is Bochner integrable with respect to any P, so the embedding is well defined. 19/53
Embedding of Marginal Distributions

◮ If E_{X∼P}[√k(X, X)] < ∞, then µ_P ∈ H and, for all f ∈ H,
  ⟨f, µ_P⟩ = ⟨f, E_{X∼P}[k(·, X)]⟩ = E_{X∼P}[⟨f, k(·, X)⟩] = E_{X∼P}[f(X)].
◮ The kernel k is said to be characteristic if the map P ↦ µ_P is injective, i.e., ‖µ_P − µ_Q‖_H = 0 if and only if P = Q.
20/53
Interpretation of Kernel Mean Representation

What properties are captured by µ_P?
◮ k(x, x′) = ⟨x, x′⟩ → the first moment of P
◮ k(x, x′) = (⟨x, x′⟩ + 1)^p → moments of P up to order p ∈ ℕ
◮ k(x, x′) universal/characteristic → all information about P

Moment-generating function
Consider k(x, x′) = exp(⟨x, x′⟩). Then µ_P = E_{X∼P}[e^{⟨X, ·⟩}].

Characteristic function
If k(x, y) = ψ(x − y) where ψ is a positive definite function, then
µ_P(y) = ∫ ψ(x − y) dP(x),
which is determined by Λ_k · ϕ_P, i.e., the characteristic function ϕ_P of P weighted by the positive finite measure Λ_k given by Bochner's theorem. 21/53
Characteristic Kernels

◮ All universal kernels are characteristic, but characteristic kernels may not be universal.
◮ Important characterizations:
  ◮ Discrete kernels on a discrete space
  ◮ Shift-invariant kernels on ℝ^d whose Fourier transform has full support
  ◮ Integrally strictly positive definite (ISPD) kernels
  ◮ Characteristic kernels on groups
◮ Examples of characteristic kernels (see the sketch after this slide):
  Gaussian RBF kernel: k(x, x′) = exp(−‖x − x′‖₂² / (2σ²))
  Laplacian kernel: k(x, x′) = exp(−‖x − x′‖₁ / σ)
◮ Kernel choice vs. parametric assumption:
  ◮ A parametric assumption is susceptible to model misspecification.
  ◮ But the choice of kernel matters in practice.
  ◮ We can optimize the kernel to maximize performance on the downstream task.
22/53
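As a concrete companion to the two example kernels (my addition, not from the slides), here is a minimal NumPy sketch of their Gram matrices; the bandwidth σ is an arbitrary illustrative choice.

```python
import numpy as np

def gaussian_rbf_gram(X, Y, sigma=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * sigma**2))

def laplacian_gram(X, Y, sigma=1.0):
    """Laplacian kernel k(x, x') = exp(-||x - x'||_1 / sigma)."""
    l1_dists = np.abs(X[:, None, :] - Y[None, :, :]).sum(-1)
    return np.exp(-l1_dists / sigma)

X = np.random.randn(5, 3)
K = gaussian_rbf_gram(X, X)                      # 5x5 symmetric Gram matrix
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # True: positive semi-definite
```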
Kernel Mean Estimation

◮ Given an i.i.d. sample x₁, x₂, …, xₙ from P, we can estimate µ_P by
  µ̂_P := (1/n) Σ_{i=1}^n k(x_i, ·) ∈ H,   corresponding to P̂ = (1/n) Σ_{i=1}^n δ_{x_i}.
◮ For each f ∈ H, we have E_{X∼P̂}[f(X)] = ⟨f, µ̂_P⟩ (see the sketch after this slide).
◮ Consistency: with probability at least 1 − δ,
  ‖µ̂_P − µ_P‖_H ≤ 2√(E_{X∼P}[k(X, X)] / n) + √(2 log(1/δ) / n).
◮ The rate O_p(n^{−1/2}) was shown to be minimax optimal.³
◮ Similar to James–Stein estimators, the estimate can be improved by shrinkage estimators:⁴
  µ̂_α := α f* + (1 − α) µ̂_P,   f* ∈ H.

³ Tolstikhin et al. Minimax Estimation of Kernel Mean Embeddings. JMLR, 2017.
⁴ Muandet et al. Kernel Mean Shrinkage Estimators. JMLR, 2016.
23/53
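A minimal NumPy sketch (my addition, not from the slides) of the empirical embedding µ̂_P, together with a numerical check that ⟨f, µ̂_P⟩ equals the empirical mean of f for an f in the span of kernel functions; the expansion points and coefficients of f are arbitrary.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # i.i.d. sample x_1, ..., x_n from P

def mu_hat(t, X, sigma=1.0):
    """mu_hat_P(t) = (1/n) sum_i k(x_i, t): the embedding evaluated at t."""
    return np.mean([rbf(x, t, sigma) for x in X])

# For f = sum_j c_j k(., z_j) in H, <f, mu_hat_P> = (1/n) sum_i f(x_i):
# the embedding turns RKHS inner products into empirical expectations.
Z = rng.normal(size=(3, 2))            # expansion points of f (arbitrary)
c = np.array([0.5, -1.0, 2.0])         # expansion coefficients (arbitrary)
f = lambda t: sum(cj * rbf(zj, t) for cj, zj in zip(c, Z))

inner = sum(cj * mu_hat(zj, X) for cj, zj in zip(c, Z))   # <f, mu_hat_P>_H
emp_mean = np.mean([f(x) for x in X])                      # E_{X~P_hat}[f(X)]
print(np.isclose(inner, emp_mean))                         # True
```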
Recovering Samples/Distributions

◮ An approximate pre-image problem:
  θ* = arg min_θ ‖µ̂ − µ_{P_θ}‖²_H.
◮ The distribution P_θ is assumed to lie in a certain class, e.g., a Gaussian mixture
  P_θ(x) = Σ_{k=1}^K π_k N(x; µ_k, Σ_k),   Σ_{k=1}^K π_k = 1.
◮ Kernel herding generates deterministic pseudo-samples by greedily minimizing the squared error (see the sketch after this slide)
  E²_T = ‖µ_P − (1/T) Σ_{t=1}^T k(·, x_t)‖²_H.
◮ Negative autocorrelation: O(1/T) rate of convergence.
◮ Deep generative models (see the following slides).
24/53
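Kernel herding can be sketched as a greedy loop. The version below is a simplified illustration (my assumptions, not the slides' algorithm): it searches over a finite candidate pool and uses an empirical estimate of µ_P from a large sample.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
sample = rng.normal(size=(2000, 2))        # large sample standing in for P
candidates = rng.normal(size=(500, 2))     # finite candidate pool (simplification)

mu_P = rbf_gram(candidates, sample).mean(axis=1)   # mu_P(x) estimated on candidates
K_cc = rbf_gram(candidates, candidates)

T, selected = 50, []
running = np.zeros(len(candidates))        # sum_t k(candidate, x_t)
for t in range(T):
    # Greedy herding step: maximize mu_P(x) - (1/(t+1)) * sum_{s<=t} k(x, x_s),
    # which greedily reduces the squared error E_T^2 on the slide.
    scores = mu_P - running / (t + 1)
    j = int(np.argmax(scores))
    selected.append(candidates[j])
    running += K_cc[:, j]

pseudo = np.array(selected)                # T deterministic pseudo-samples
print(pseudo.mean(axis=0), sample.mean(axis=0))  # herded mean tracks the sample mean
```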
Quick Summary

◮ A kernel mean embedding of a distribution P:
  µ_P := ∫ k(·, x) dP(x),   µ̂_P := (1/n) Σ_{i=1}^n k(x_i, ·).
◮ If k is characteristic, µ_P captures all information about P.
◮ All universal kernels are characteristic, but not vice versa.
◮ The empirical µ̂_P requires no parametric assumption about P.
◮ It can be estimated consistently, i.e., with probability at least 1 − δ,
  ‖µ̂_P − µ_P‖_H ≤ 2√(E_{X∼P}[k(X, X)] / n) + √(2 log(1/δ) / n).
◮ Given the embedding µ̂_P, it is possible to reconstruct the distribution or generate samples from it.
25/53
Application: High-Level Generalization

[Figure: four application panels.]
◮ Learning from distributions — KM, Fukumizu, Dinuzzo, Schölkopf. NIPS 2012.
◮ Group anomaly detection — KM and Schölkopf. UAI 2013.
◮ Domain generalization — KM et al. ICML 2013; Zhang, KM, et al. ICML 2013.
◮ Cause-effect inference — Lopez-Paz, KM, et al. JMLR 2015, ICML 2015.
26/53
Support Measure Machine (SMM)
KM, K. Fukumizu, F. Dinuzzo, and B. Schölkopf (NeurIPS 2012)

x ↦ k(·, x),   δ_x ↦ ∫ k(·, z) dδ_x(z),   P ↦ ∫ k(·, z) dP(z)

Training data: (P₁, y₁), (P₂, y₂), …, (Pₙ, yₙ) ∼ 𝒫 × Y

Theorem (Distributional representer theorem)
Under technical assumptions on Ω : [0, +∞) → ℝ and a loss function ℓ : (𝒫 × ℝ²)^m → ℝ ∪ {+∞}, any f ∈ H minimizing
ℓ(P₁, y₁, E_{P₁}[f], …, P_m, y_m, E_{P_m}[f]) + Ω(‖f‖_H)
admits a representation of the form
f = Σ_{i=1}^m α_i E_{x∼P_i}[k(x, ·)] = Σ_{i=1}^m α_i µ_{P_i}.
27/53
Supervised Learning on Point Clouds

Training set (S₁, y₁), …, (Sₙ, yₙ) with S_i = {x_j^(i)} ∼ P_i(X).

◮ Causal prediction: is X → Y or X ← Y?
  Lopez-Paz, KM, B. Schölkopf, I. Tolstikhin. JMLR 2015, ICML 2015.
◮ Topological data analysis
  G. Kusano, K. Fukumizu, and Y. Hiraoka. JMLR 2018.
28/53
Domain Generalization
Blanchard et al., NeurIPS 2012; KM, D. Balduzzi, B. Schölkopf, ICML 2013

[Figure: training data (X_k, Y_k) drawn from N domains P¹_XY, …, P^N_XY sampled from 𝒫, and unseen test data X_k from a new domain P_X.]

K((P_i, x), (P_j, x̃)) = k₁(P_i, P_j) k₂(x, x̃) = k₁(µ_{P_i}, µ_{P_j}) k₂(x, x̃)
(a sketch of this product kernel follows below). 29/53
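One way to instantiate the product kernel above — a hedged sketch, not necessarily the construction used in the cited papers — is to take k₁ as a Gaussian kernel on the RKHS distance between the empirical mean embeddings (i.e., on the biased MMD between domains). All bandwidths and names are illustrative.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased MMD^2 between the empirical embeddings of two domains."""
    return (rbf_gram(X, X, sigma).mean() + rbf_gram(Y, Y, sigma).mean()
            - 2 * rbf_gram(X, Y, sigma).mean())

def domain_kernel(Xi, xi, Xj, xj, sigma=1.0, sigma_p=1.0):
    """K((P_i, x), (P_j, x~)) = k1(mu_Pi, mu_Pj) * k2(x, x~), with k1 a Gaussian
    kernel on the RKHS distance between the empirical mean embeddings."""
    k1 = np.exp(-mmd2_biased(Xi, Xj, sigma) / (2 * sigma_p**2))
    k2 = np.exp(-np.sum((xi - xj)**2) / (2 * sigma**2))
    return k1 * k2

rng = np.random.default_rng(0)
Xi, Xj = rng.normal(0, 1, (100, 2)), rng.normal(0.5, 1, (100, 2))  # two domains
print(domain_kernel(Xi, Xi[0], Xj, Xj[0]))
```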
Comparing Distributions

◮ Maximum mean discrepancy (MMD) corresponds to the RKHS distance between mean embeddings:
  MMD²(P, Q, H) = ‖µ_P − µ_Q‖²_H = ‖µ_P‖²_H − 2⟨µ_P, µ_Q⟩_H + ‖µ_Q‖²_H.
◮ MMD is an integral probability metric (IPM):
  MMD(P, Q, H) := sup_{h∈H, ‖h‖≤1} | ∫ h(x) dP(x) − ∫ h(x) dQ(x) |.
◮ If k is characteristic, then ‖µ_P − µ_Q‖_H = 0 if and only if P = Q.
◮ Given {x_i}_{i=1}^n ∼ P and {y_j}_{j=1}^m ∼ Q, the (unbiased) empirical MMD is
  MMD²_u(P, Q, H) = 1/(n(n−1)) Σ_{i=1}^n Σ_{j≠i} k(x_i, x_j) + 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} k(y_i, y_j) − 2/(nm) Σ_{i=1}^n Σ_{j=1}^m k(x_i, y_j)
  (see the sketch after this slide).
30/53
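The unbiased estimator translates directly into code. A short NumPy sketch (my addition) with a Gaussian RBF kernel; the bandwidth and sample sizes are arbitrary.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased MMD^2_u: diagonal terms are excluded from the within-sample sums."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf_gram(X, X, sigma), rbf_gram(Y, Y, sigma), rbf_gram(X, Y, sigma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))   # sample from P
Y = rng.normal(0.5, 1.0, size=(500, 1))   # sample from Q (shifted mean)
print(mmd2_unbiased(X, Y))                                     # noticeably > 0
print(mmd2_unbiased(X, rng.normal(0.0, 1.0, size=(500, 1))))   # close to 0
```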
Kernel Two-Sample Testing
Gretton et al., JMLR 2012

[Figure: samples drawn from P and Q.]

Question: Given {x_i}_{i=1}^n ∼ P and {y_j}_{j=1}^n ∼ Q, check whether P = Q.
H₀ : P = Q,   H₁ : P ≠ Q

◮ MMD test statistic:
  t² = MMD²_u(P, Q, H) = 1/(n(n−1)) Σ_{1≤i≠j≤n} h((x_i, y_i), (x_j, y_j)),
  where h((x_i, y_i), (x_j, y_j)) = k(x_i, x_j) + k(y_i, y_j) − k(x_i, y_j) − k(x_j, y_i)
  (a permutation-test sketch follows below).
31/53
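In practice the null distribution of MMD²_u is often approximated by permuting the pooled sample. The sketch below is a simplified illustration (my addition, not the exact procedure of Gretton et al.); the number of permutations, bandwidth, and level α are arbitrary.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf_gram(X, X, sigma), rbf_gram(Y, Y, sigma), rbf_gram(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

def mmd_permutation_test(X, Y, num_perms=500, alpha=0.05, sigma=1.0, seed=0):
    """Approximate the null distribution of MMD^2_u by permuting the pooled sample."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, sigma)
    pooled, n = np.vstack([X, Y]), len(X)
    null_stats = []
    for _ in range(num_perms):
        perm = rng.permutation(len(pooled))
        null_stats.append(mmd2_unbiased(pooled[perm[:n]], pooled[perm[n:]], sigma))
    p_value = np.mean(np.array(null_stats) >= observed)
    return observed, p_value, p_value < alpha   # reject H0 if p < alpha

rng = np.random.default_rng(1)
X, Y = rng.normal(0, 1, (200, 1)), rng.normal(0.3, 1, (200, 1))
print(mmd_permutation_test(X, Y))
```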
Generative Adversarial Networks

Learn a deep generative model G via a minimax optimization
min_G max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))]
where D is a discriminator and z ∼ N(0, σ²I).

[Figure: the generator G_θ maps random noise z to synthetic data {G_θ(z_i)}; the discriminator D_φ receives x or G_θ(z) and decides "real or synthetic?". Replacing the discriminator with an MMD test asks whether ‖µ̂_X − µ̂_{G_θ(Z)}‖_H is zero.]
32/53
Generative Moment Matching Network

◮ The GAN aims to match the two distributions P(X) and G_θ.
◮ The generative moment matching network (GMMN), proposed by Dziugaite et al. (2015) and Li et al. (2015), considers (see the sketch after this slide)
  min_θ ‖µ_X − µ_{G_θ(Z)}‖²_H = min_θ ‖ ∫ φ(X) dP(X) − ∫ φ(X̃) dG_θ(X̃) ‖²_H = min_θ ( sup_{h∈H, ‖h‖≤1} | ∫ h dP − ∫ h dG_θ | )².
◮ Many tricks have been proposed to improve the GMMN:
  ◮ Optimized kernels and feature extractors (Sutherland et al., 2017; Li et al., 2017a)
  ◮ Gradient regularization (Binkowski et al., 2018; Arbel et al., 2018)
  ◮ Repulsive loss (Wang et al., 2019)
  ◮ Optimized witness points (Mehrjou et al., 2019)
  ◮ Etc.
33/53
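A minimal PyTorch sketch of the GMMN idea (my addition): a toy generator is trained by minimizing a minibatch MMD with a Gaussian kernel. The network size, learning rate, bandwidth, and synthetic target data are arbitrary choices, not the architecture or kernel of the cited papers.

```python
import torch
import torch.nn as nn

def rbf_gram(X, Y, sigma=1.0):
    d2 = torch.cdist(X, Y) ** 2
    return torch.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased minibatch MMD^2 between real and generated batches."""
    return (rbf_gram(X, X, sigma).mean() + rbf_gram(Y, Y, sigma).mean()
            - 2 * rbf_gram(X, Y, sigma).mean())

# Toy generator: maps 8-dim Gaussian noise to 2-dim samples.
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(G.parameters(), lr=1e-3)

real_data = torch.randn(2000, 2) * 0.5 + torch.tensor([2.0, -1.0])  # stand-in for P(X)

for step in range(2000):
    x = real_data[torch.randint(0, len(real_data), (256,))]  # minibatch from P
    z = torch.randn(256, 8)                                   # noise batch
    loss = mmd2_biased(x, G(z))     # match embeddings: ||mu_X - mu_G(Z)||_H^2
    opt.zero_grad()
    loss.backward()
    opt.step()

print(G(torch.randn(1000, 8)).mean(0))  # should move toward the real mean [2, -1]
```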
Kernel Methods · From Points to Probability Measures · Embedding of Marginal Distributions · Embedding of Conditional Distributions · Recent Development 34/53
Conditional Distribution P(Y | X)

[Figure: a joint sample of (X, Y) with the conditional distributions P(Y | X = x) at different values of x.]

A collection of distributions 𝒫_Y := {P(Y | X = x) : x ∈ X}.

◮ For each x ∈ X, we can define an embedding of P(Y | X = x) as
  µ_{Y|x} := ∫_Y ϕ(y) dP(y | X = x) = E_{Y|x}[ϕ(Y)],
  where ϕ : Y → G is a feature map of Y.
35/53
Embedding of Conditional Distributions

[Figure: the conditional density p(y | x) of P(Y | X = x) in the input space is mapped to the embedding µ_{Y|X=x} = C_YX C_XX^{−1} k(x, ·) in the RKHS G.]

The conditional mean embedding of P(Y | X) can be defined as the operator
U_{Y|X} : H → G,   U_{Y|X} := C_YX C_XX^{−1}.
36/53
Conditional Mean Embedding

◮ To fully represent P(Y | X), we need to perform conditioning and take conditional expectations.
◮ To represent P(Y | X = x) for x ∈ X, it follows that
  E_{Y|x}[ϕ(Y) | X = x] = U_{Y|X} k(x, ·) = C_YX C_XX^{−1} k(x, ·) =: µ_{Y|x}.
◮ It follows from the reproducing property of G that
  E_{Y|x}[g(Y) | X = x] = ⟨µ_{Y|x}, g⟩_G,   ∀g ∈ G.
◮ In an infinite-dimensional RKHS, C_XX^{−1} does not exist. Hence, we often use the regularized operator
  U_{Y|X} := C_YX (C_XX + εI)^{−1}.
◮ Conditional mean estimator (see the sketch after this slide):
  µ̂_{Y|x} = Σ_{i=1}^n β_i(x) ϕ(y_i),   β(x) := (K + nεI)^{−1} k_x.
37/53
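A NumPy sketch of the conditional mean estimator above on a toy regression problem (my example, not from the slides): with g(y) = y, the weights β(x) = (K + nεI)^{−1} k_x yield a kernel-ridge-style estimate of E[Y | X = x]. The data, bandwidth, and regularization ε are arbitrary.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n, eps = 500, 1e-3
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(n, 1))    # toy joint distribution of (X, Y)

K = rbf_gram(X, X)                               # Gram matrix on the inputs

def conditional_mean(g_values, x_query, sigma=1.0):
    """E[g(Y) | X = x] ~ sum_i beta_i(x) g(y_i), beta(x) = (K + n*eps*I)^{-1} k_x."""
    k_x = rbf_gram(X, np.atleast_2d(x_query), sigma)            # k_x = (k(x_i, x))_i
    beta = np.linalg.solve(K + n * eps * np.eye(n), k_x)        # weights beta(x)
    return (g_values.T @ beta).item()

g = Y    # g(y) = y, so we estimate the conditional mean E[Y | X = x]
print(conditional_mean(g, np.array([1.0])), np.sin(1.0))        # estimate vs ground truth
```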
Counterfactual Mean Embedding
KM, Kanagawa, Saengkyongam, Marukatat, JMLR 2020 (accepted)

In economics, social science, and public policy, we need to evaluate the distributional treatment effect (DTE)
P_{Y*₀}(·) − P_{Y*₁}(·)
where Y*₀ and Y*₁ are potential outcomes of a treatment policy T.

◮ We can only observe either P_{Y*₀} or P_{Y*₁}.
◮ Counterfactual distribution:
  P_{Y⟨0|1⟩}(y) = ∫ P_{Y₀|X₀}(y | x) dP_{X₁}(x).
◮ The counterfactual distribution P_{Y⟨0|1⟩}(y) can be estimated using the kernel mean embedding (see the sketch after this slide).
38/53
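A rough sketch of how such an estimate can be formed; heavily hedged: the weighting below simply combines the conditional-mean construction from the previous slides with synthetic data, and is an illustration rather than the exact estimator of the paper. Control outcomes are re-weighted toward the treated covariate distribution and compared, in the RKHS, with the observed treated outcomes.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n, m, eps = 400, 300, 1e-3
X0 = rng.normal(0.0, 1.0, (n, 2))                                    # control covariates
Y0 = X0 @ np.array([[1.0], [0.5]]) + 0.1 * rng.normal(size=(n, 1))   # control outcomes
X1 = rng.normal(0.5, 1.0, (m, 2))                                    # treated covariates
Y1 = X1 @ np.array([[1.0], [0.5]]) + 1.0 + 0.1 * rng.normal(size=(m, 1))  # treated outcomes

# Counterfactual embedding mu_{Y<0|1>} = sum_i beta_i phi(y_i^0), with
# beta = (K00 + n*eps*I)^{-1} K01 1_m / m: control outcomes re-weighted
# toward the treated covariate distribution via the conditional mean map.
K00, K01 = rbf_gram(X0, X0), rbf_gram(X0, X1)
beta = np.linalg.solve(K00 + n * eps * np.eye(n), K01 @ np.ones((m, 1)) / m)

# Squared RKHS distance between the counterfactual embedding and the embedding
# of the observed treated outcomes: a plug-in DTE-type statistic.
L00, L01, L11 = rbf_gram(Y0, Y0), rbf_gram(Y0, Y1), rbf_gram(Y1, Y1)
dist2 = (beta.T @ L00 @ beta
         - 2 * beta.T @ L01.mean(axis=1, keepdims=True)
         + L11.mean()).item()
print(dist2)   # > 0 here because the treatment shifts the outcome by +1
```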
Quantum Mean Embedding 39/53
Quick Summary

◮ Many applications require information in P(Y | X).
◮ The Hilbert space embedding of P(Y | X) is not a single element but an operator U_{Y|X} mapping from H to G:
  µ_{Y|x} = U_{Y|X} k(x, ·) = C_YX C_XX^{−1} k(x, ·),
  ⟨µ_{Y|x}, g⟩_G = E_{Y|x}[g(Y) | X = x].
40/53