Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)
◮ is based on kernel mean embeddings,
◮ is a combination of kernel ABC and kernel herding, and
◮ recursively applies Bayes' rule to the same observed data.

It should be useful when point estimation is more desirable than a fully Bayesian approach, for instance:
◮ when the prior distribution π(θ) is not fully reliable,
◮ when a single simulation is computationally very expensive, and
◮ when the goal is prediction based on simulations.
Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions
Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

Let k : X × X → R be a symmetric function on a set X.

The function k(x, x′) is called a positive definite kernel if

    Σ_{i=1}^n Σ_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0

holds for all n ∈ N, c_1, . . . , c_n ∈ R, x_1, . . . , x_n ∈ X.

Examples of positive definite kernels on X = R^d:

    Gaussian:          k(x, x′) = exp(−‖x − x′‖² / γ²)
    Laplace (Matérn):  k(x, x′) = exp(−‖x − x′‖ / γ)
    Linear:            k(x, x′) = ⟨x, x′⟩
    Polynomial:        k(x, x′) = (⟨x, x′⟩ + c)^m

In this talk, I will simply call k a kernel.
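For concreteness, here is a short Python sketch (my own illustration, not part of the talk; the NumPy helper names are assumptions) that builds the Gram matrix of the Gaussian kernel for a handful of points and checks numerically that it is positive semi-definite:

```python
import numpy as np

def gaussian_kernel(x, x2, gamma=1.0):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / gamma^2)."""
    return np.exp(-np.sum((x - x2) ** 2) / gamma ** 2)

def kernel_matrix(points, kernel):
    """Gram matrix K with K[i, j] = kernel(x_i, x_j)."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # five points in R^2
K = kernel_matrix(X, gaussian_kernel)

# Positive definiteness: all eigenvalues of the Gram matrix are >= 0
# (up to numerical precision), for any choice of points.
print(np.linalg.eigvalsh(K) >= -1e-10)
```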
Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

For any kernel k, there is a uniquely associated Hilbert space H consisting of functions on X such that

(i) k(·, x) ∈ H for all x ∈ X,
    where k(·, x) is the function of the first argument with x fixed: x′ ∈ X ↦ k(x′, x);

(ii) f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H and x ∈ X,
    which is called the reproducing property.

– H is called the RKHS of k.
– H can be written as the closure of span{ k(·, x) | x ∈ X }.
Kernel Mean Embeddings [Smola et al., 2007]

A framework for representing distributions in an RKHS.

– Let P be the set of all probability distributions on X.
– Let k be a kernel on X, and H be its RKHS.

For each distribution P ∈ P, define the kernel mean:

    μ_P := ∫ k(·, x) dP(x) ∈ H,

which is a representation of P in H.

A key concept: characteristic kernels [Fukumizu et al., 2008].
– The kernel k is called characteristic if, for any P, Q ∈ P, μ_P = μ_Q if and only if P = Q.
– In other words, k is characteristic if the mapping P ∈ P ↦ μ_P ∈ H is injective.
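A minimal Python sketch (my own illustration, not from the slides) of why being characteristic matters: for samples from two distributions with the same mean but different spread, the RKHS distance between the two empirical kernel means (the maximum mean discrepancy, MMD) is clearly nonzero for a Gaussian kernel, but nearly zero for a linear kernel, which only distinguishes first moments.

```python
import numpy as np

def mmd2(x, y, kernel):
    """Squared MMD estimate ||mu_P - mu_Q||_H^2 from samples x ~ P, y ~ Q."""
    kxx = np.mean([kernel(a, b) for a in x for b in x])
    kyy = np.mean([kernel(a, b) for a in y for b in y])
    kxy = np.mean([kernel(a, b) for a in x for b in y])
    return kxx + kyy - 2 * kxy

gauss = lambda a, b: np.exp(-(a - b) ** 2)   # characteristic kernel
linear = lambda a, b: a * b                  # non-characteristic kernel

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)   # P = N(0, 1)
y = rng.normal(0.0, 3.0, size=200)   # Q = N(0, 9): same mean, different variance

print(mmd2(x, y, gauss))   # clearly > 0: the Gaussian kernel separates P and Q
print(mmd2(x, y, linear))  # close to 0: the linear kernel only matches means
```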
Kernel Mean Embeddings [Smola et al., 2007]

Intuitively, k being characteristic implies that H is large enough.

Figure 2: Injective embedding of distributions into the RKHS [Muandet et al., 2017, Figure 2.3].

Examples of characteristic kernels on X = R^d:
– Gaussian and Matérn kernels [Sriperumbudur et al., 2010].

Examples of non-characteristic kernels on X = R^d:
– Linear and polynomial kernels.
Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions
Recursive Bayes Updates and Power Posteriors

Given observed data y*, Bayes' rule yields a posterior distribution:

    p(θ | y*) ∝ p(y* | θ) π(θ)
    (posterior ∝ likelihood × prior).

Recursive Bayes updates: apply Bayes' rule recursively to the same observed data y*.

    1st recursion:   π_1(θ) := p_1(θ | y*) ∝ p(y* | θ) π(θ)
    2nd recursion:   π_2(θ) := p_2(θ | y*) ∝ p(y* | θ) π_1(θ) ∝ p(y* | θ)² π(θ)
    3rd recursion:   π_3(θ) := p_3(θ | y*) ∝ p(y* | θ) π_2(θ) ∝ p(y* | θ)³ π(θ)
    · · ·
    N-th recursion:  π_N(θ) := p_N(θ | y*) ∝ p(y* | θ) π_{N−1}(θ) ∝ p(y* | θ)^N π(θ)
Power Posteriors and Maximum Likelihood Estimation

N recursive Bayes updates yield the power posterior

    p_N(θ | y*) ∝ p(y* | θ)^N π(θ).

Theorem [Lele et al., 2010]. Assume that p(y* | θ) has a unique global maximizer:

    θ* := argmax_{θ ∈ Θ} p(y* | θ).

Then, if π(θ*) > 0, under mild conditions on π(θ) and p(y | θ),

    p_N(θ | y*) → δ_{θ*} as N → ∞ (weak convergence),

where δ_{θ*} is the Dirac measure at θ*.

This implies that recursive Bayes updates provide a way of performing maximum likelihood estimation.
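The following Python sketch (my own toy illustration; the Gaussian model and the grid discretization are assumptions, not from the talk) shows this concentration numerically: on a parameter grid, the power posterior p(y*|θ)^N π(θ) puts essentially all of its mass near the maximum likelihood estimate as N grows.

```python
import numpy as np

# Toy model: y ~ N(theta, 1), observed data y* = 1.7, prior N(0, 4) on theta.
y_star = 1.7
theta = np.linspace(-5, 5, 2001)              # grid over the parameter space
log_lik = -0.5 * (y_star - theta) ** 2        # log p(y* | theta), up to a constant
log_prior = -0.5 * theta ** 2 / 4.0           # log pi(theta), up to a constant

for N in [1, 5, 50, 500]:
    log_post = N * log_lik + log_prior        # log of p(y* | theta)^N * pi(theta)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    mean = np.sum(theta * post)
    std = np.sqrt(np.sum((theta - mean) ** 2 * post))
    # The posterior mean approaches the MLE (theta = y*) and the spread shrinks.
    print(f"N={N:4d}  posterior mean={mean:.3f}  posterior std={std:.3f}")
```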
Proposed Method: Kernel Recursive ABC (Sketch)

– Recursive application of 1. Bayes' rule and 2. sampling.

For N = 1, 2, . . . , N_iter, iterate the following procedures:

1. Kernel ABC:
   If N = 1: generate θ_1, . . . , θ_n ∼ π(θ), i.i.d.
   – Simulate pseudo-data for each θ_i: y_i ∼ p(y | θ_i) (i = 1, . . . , n).
   – Estimate the kernel mean of the power posterior using (θ_i, y_i)_{i=1}^n:

        μ_{P_N} := ∫ k_Θ(·, θ) p_N(θ | y*) dθ,        (1)

     where k_Θ is a kernel on Θ and p_N(θ | y*) ∝ p(y* | θ)^N π(θ).

2. Kernel herding: sample θ′_1, . . . , θ′_n from the estimate of (1).

Set N ← N + 1 and (θ_1, . . . , θ_n) ← (θ′_1, . . . , θ′_n).
Kernel ABC [Nakagome et al., 2013]

– Define
  ◮ a kernel k_Y(y, y′) on the data space Y,
  ◮ a kernel k_Θ(θ, θ′) on the parameter space Θ, and
  ◮ a regularisation constant λ > 0.

1. Sampling: generate parameter–data pairs from the model:

    (θ_1, y_1), . . . , (θ_n, y_n) ∼ p(y | θ) π(θ), i.i.d.

2. Weight computation: given observed data y*, compute

    k_Y(y*) := (k_Y(y*, y_1), . . . , k_Y(y*, y_n))ᵀ ∈ R^n,
    (w_1(y*), . . . , w_n(y*))ᵀ := (G_Y + nλ I_n)⁻¹ k_Y(y*) ∈ R^n,

   where G_Y := (k_Y(y_i, y_j)) ∈ R^{n×n} is the kernel matrix.

Output: an estimate of the posterior kernel mean:

    ∫ k_Θ(·, θ) p(θ | y*) dθ ≈ Σ_{i=1}^n w_i(y*) k_Θ(·, θ_i),

where p(θ | y*) ∝ p(y* | θ) π(θ).
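A minimal NumPy sketch of the kernel ABC weight computation above (my own implementation for illustration; the Gaussian kernels, the toy simulator, and the choices of γ and λ are assumptions):

```python
import numpy as np

def gauss_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-||A_i - B_j||^2 / gamma^2)."""
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / gamma ** 2)

def kernel_abc_weights(y_sim, y_star, gamma=1.0, lam=0.1):
    """Weights w(y*) = (G_Y + n*lam*I_n)^(-1) k_Y(y*)."""
    n = len(y_sim)
    G = gauss_kernel_matrix(y_sim, y_sim, gamma)               # kernel matrix G_Y
    k_star = gauss_kernel_matrix(y_sim, y_star[None, :], gamma)[:, 0]
    return np.linalg.solve(G + n * lam * np.eye(n), k_star)

# Toy simulator (assumption): y | theta ~ N(theta, 0.5^2), 1-D parameter and data.
rng = np.random.default_rng(0)
n = 500
theta = rng.normal(0.0, 2.0, size=(n, 1))                      # theta_i ~ prior
y_sim = theta + 0.5 * rng.normal(size=(n, 1))                  # y_i ~ p(y | theta_i)
y_star = np.array([1.0])                                       # observed data

w = kernel_abc_weights(y_sim, y_star)
# The weighted sum sum_i w_i f(theta_i) estimates posterior expectations of
# RKHS functions f; with f(theta) = theta it approximates E[theta | y*].
print("posterior mean estimate:", float(w @ theta))
```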
Kernel ABC: The Sampling Step

1. Sampling: generate parameter–data pairs from the model:

    (θ_1, y_1), . . . , (θ_n, y_n) ∼ p(y | θ) π(θ), i.i.d.

[Diagram: parameters θ_1, . . . , θ_n are sampled from the prior π(θ) on the parameter space Θ; each θ_i is pushed through the simulator to produce pseudo-data y_i in the data space Y, alongside the observed data y*.]
Kernel ABC: The Weight Computation Step

2. Weight computation: given observed data y*, compute

    1. Similarities:  k_Y(y*) = (k_Y(y*, y_1), . . . , k_Y(y*, y_n))ᵀ,
    2. Weights:       (w_1(y*), . . . , w_n(y*))ᵀ = (G_Y + nλ I_n)⁻¹ k_Y(y*).

[Diagram: each pseudo-data point y_i is compared with the observed data y* in the data space; the resulting similarities are converted into weights on the corresponding parameters θ_i.]

    ∫ k_Θ(·, θ) p(θ | y*) dθ ≈ Σ_{i=1}^n w_i(y*) k_Θ(·, θ_i).
Kernel Herding [Chen et al., 2010]

Let
– P be a known probability distribution on Θ; and
– μ_P = ∫ k_Θ(·, θ) dP(θ) be its kernel mean.

Kernel herding is a deterministic sampling method that

– sequentially generates sample points θ′_1, . . . , θ′_n from P as

    θ′_1 := argmax_{θ ∈ Θ} μ_P(θ),
    θ′_T := argmax_{θ ∈ Θ} [ μ_P(θ) − (1/T) Σ_{ℓ=1}^{T−1} k_Θ(θ, θ′_ℓ) ]   (T = 2, . . . , n),

  where the first term is mode seeking and the second term acts as a repulsive force;

– is equivalent to greedily approximating the kernel mean μ_P:

    θ′_T = argmin_{θ ∈ Θ} ‖ μ_P − (1/T) [ k_Θ(·, θ) + Σ_{i=1}^{T−1} k_Θ(·, θ′_i) ] ‖_{H_Θ},

  if k_Θ is shift-invariant. (H_Θ is the RKHS of k_Θ.)
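A minimal Python sketch of kernel herding from an empirical kernel mean (my own illustration; the Gaussian kernel, the candidate-grid maximization in place of a continuous argmax, and the target distribution are assumptions):

```python
import numpy as np

def herding(mu_eval, kernel, candidates, n_points):
    """Greedy kernel herding: pick points maximizing mu_P(theta) minus a repulsion term.

    mu_eval(theta): evaluates the (estimated) kernel mean mu_P at theta.
    candidates:     finite grid over Theta, used in place of a continuous argmax.
    """
    selected = []
    for T in range(1, n_points + 1):
        scores = np.array([
            mu_eval(c) - (1.0 / T) * sum(kernel(c, s) for s in selected)
            for c in candidates
        ])
        selected.append(candidates[int(np.argmax(scores))])
    return np.array(selected)

kernel = lambda a, b: np.exp(-(a - b) ** 2)

# Target P: standard normal on Theta = R, represented by its empirical kernel mean.
rng = np.random.default_rng(0)
support = rng.normal(size=2000)
mu_eval = lambda t: np.mean(kernel(t, support))     # mu_P(t) ≈ (1/m) sum_j k(t, x_j)

grid = np.linspace(-4, 4, 401)
pts = herding(mu_eval, kernel, grid, n_points=10)
print(np.round(np.sort(pts), 2))   # well-spread points covering the bulk of N(0, 1)
```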
Kernel Herding [Chen et al., 2010]

Figure 3 [Chen et al., 2010, Fig. 1]: red squares are sample points generated by kernel herding; purple circles are randomly generated i.i.d. sample points.
Proposed Method: Kernel Recursive ABC (Algorithm)

For N = 1, 2, . . . , N_iter, iterate the following procedure:

1. Kernel ABC:
   If N = 1: generate θ_1, . . . , θ_n ∼ π(θ), i.i.d.
   – Generate pseudo-data from each θ_i: y_i ∼ p(y | θ_i) (i = 1, . . . , n).
   – Compute weights for θ_1, . . . , θ_n:

        k_Y(y*) = (k_Y(y*, y_1), . . . , k_Y(y*, y_n))ᵀ,
        (w_1(y*), . . . , w_n(y*))ᵀ = (G_Y + nλ I_n)⁻¹ k_Y(y*).

2. Kernel herding: sample from μ̂_{P_N} := Σ_{i=1}^n w_i(y*) k_Θ(·, θ_i):

        θ′_T := argmax_{θ ∈ Θ} [ μ̂_{P_N}(θ) − (1/T) Σ_{ℓ=1}^{T−1} k_Θ(θ, θ′_ℓ) ]   (T = 1, . . . , n).

Set N ← N + 1 and (θ_1, . . . , θ_n) ← (θ′_1, . . . , θ′_n).
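Putting the pieces together, here is a compact Python sketch of the whole loop (my own reading of the algorithm above, with a toy 1-D Gaussian simulator, grid-based herding, and ad hoc kernel and regularization choices as assumptions; not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
k = lambda a, b: np.exp(-np.subtract.outer(a, b) ** 2)   # Gaussian kernel on both spaces

def simulate(theta):
    """Toy simulator (assumption): y | theta ~ N(theta, 0.3^2)."""
    return theta + 0.3 * rng.normal(size=theta.shape)

def kernel_abc_weights(y_sim, y_star, lam=0.01):
    n = len(y_sim)
    G = k(y_sim, y_sim)
    return np.linalg.solve(G + n * lam * np.eye(n), k(y_sim, np.array([y_star]))[:, 0])

def herd(w, thetas, grid, n_points):
    """Sample from the estimated kernel mean  sum_i w_i k(., theta_i)  by kernel herding."""
    mu = k(grid, thetas) @ w                    # \hat{mu}_{P_N} evaluated on the grid
    selected, repulse = [], np.zeros_like(grid)
    for T in range(1, n_points + 1):
        idx = int(np.argmax(mu - repulse / T))  # mode seeking minus repulsive force
        selected.append(grid[idx])
        repulse += k(grid, np.array([grid[idx]]))[:, 0]
    return np.array(selected)

theta_true, n, grid = 2.0, 200, np.linspace(-10, 10, 2001)
y_star = float(simulate(np.array([theta_true]))[0])      # observed data
thetas = rng.normal(-5.0, 1.0, size=n)                   # initial draws from a (poor) prior

for N in range(10):                                      # recursive Bayes updates
    y_sim = simulate(thetas)                             # 1. simulate pseudo-data
    w = kernel_abc_weights(y_sim, y_star)                #    kernel ABC weights
    thetas = herd(w, thetas, grid, n)                    # 2. kernel herding resampling

# The point estimate should land near the true parameter despite the poor prior.
print("true theta:", theta_true, " estimate:", float(np.mean(thetas)))
```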
Why Kernels?

– The combination of kernel ABC and kernel herding leads to robustness against misspecification of the prior π(θ).
Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions
Prior Misspecification

Assume that
◮ there is a "true" parameter θ* such that y* ∼ p(y | θ*),
◮ but you don't know much about θ*.

In such a case, you may misspecify the prior π(θ).
◮ e.g., the support of π(θ) may not contain θ*.

As a result, the simulated data

    y_i ∼ p(y | θ_i), θ_i ∼ π(θ) (i = 1, . . . , n)

may end up far from the observed data y*.
Prior Misspecification

[Diagram: the support of the prior π(θ) lies away from the true parameter θ* in the parameter space Θ; consequently, the simulated data y_1, . . . , y_n are dissimilar to the observed data y* in the data space Y.]
Auto-Correction Mechanism: The Kernel ABC Step

[Diagram: the observed data y* is dissimilar to every simulated data point y_i.]

– Recall that k_Y(y*, y_i) quantifies the similarity between y* and y_i,
  e.g., for a Gaussian kernel, k_Y(y*, y_i) = exp(−dist²(y*, y_i)/γ²).

– Therefore, if y* and each y_i are dissimilar, we have

    k_Y(y*) = (k_Y(y*, y_1), . . . , k_Y(y*, y_n))ᵀ ≈ 0.

– As a result, the weights from kernel ABC become

    (w_1(y*), . . . , w_n(y*))ᵀ = (G_Y + nλ I_n)⁻¹ k_Y(y*) ≈ 0.
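A tiny numeric check of this effect (my own toy example; the kernel, γ, λ, and the data values are assumptions): when the simulated data are far from y*, the similarity vector and hence the kernel ABC weights collapse towards zero.

```python
import numpy as np

gamma, lam, n = 1.0, 0.1, 5
y_star = 2.0
y_far = np.array([-7.1, -6.8, -7.5, -6.9, -7.3])   # pseudo-data from a misspecified prior
y_near = np.array([1.8, 2.3, 1.9, 2.1, 2.2])       # pseudo-data from a well-specified prior

k = lambda a, b: np.exp(-np.subtract.outer(a, b) ** 2 / gamma ** 2)

for name, y_sim in [("misspecified", y_far), ("well-specified", y_near)]:
    k_star = k(y_sim, np.array([y_star]))[:, 0]     # similarity vector k_Y(y*)
    G = k(y_sim, y_sim)                             # kernel matrix G_Y
    w = np.linalg.solve(G + n * lam * np.eye(n), k_star)
    print(f"{name:15s} ||k_Y(y*)|| = {np.linalg.norm(k_star):.2e}  "
          f"||w|| = {np.linalg.norm(w):.2e}")

# Under misspecification both norms are essentially zero; the estimated kernel mean is
# then nearly flat, so (per the herding step) the repulsive term dominates and the next
# batch of parameters spreads out over Theta -- the auto-correction this section describes.
```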