Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)
◮ is based on kernel mean embeddings,
◮ is a combination of kernel ABC and kernel herding, and
◮ recursively applies Bayes' rule to the same observed data.

It should be useful when point estimation is more desirable than a fully Bayesian approach, for instance:
◮ when the prior distribution π(θ) is not fully reliable,
◮ when a single simulation is computationally very expensive, and
◮ when the goal is prediction based on simulations.
Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions
Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

Let k : X × X → R be a symmetric function on a set X.

The function k(x, x′) is called a positive definite kernel if

    Σ_{i=1}^n Σ_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0

holds for all n ∈ N, c_1, . . . , c_n ∈ R, x_1, . . . , x_n ∈ X.

Examples of positive definite kernels on X = R^d:

    Gaussian:          k(x, x′) = exp(−‖x − x′‖² / γ²)
    Laplace (Matérn):  k(x, x′) = exp(−‖x − x′‖ / γ)
    Linear:            k(x, x′) = ⟨x, x′⟩
    Polynomial:        k(x, x′) = (⟨x, x′⟩ + c)^m

In this talk, I will simply call k a kernel.
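For concreteness, here is a short Python sketch (my own illustration, not part of the talk; the NumPy helper names are assumptions) that builds the Gram matrix of the Gaussian kernel for a handful of points and checks numerically that it is positive semi-definite:

```python
import numpy as np

def gaussian_kernel(x, x2, gamma=1.0):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / gamma^2)."""
    return np.exp(-np.sum((x - x2) ** 2) / gamma ** 2)

def kernel_matrix(points, kernel):
    """Gram matrix K with K[i, j] = kernel(x_i, x_j)."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # five points in R^2
K = kernel_matrix(X, gaussian_kernel)

# Positive definiteness: all eigenvalues of the Gram matrix are >= 0
# (up to numerical precision), for any choice of points.
print(np.linalg.eigvalsh(K) >= -1e-10)
```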
Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

For any kernel k, there is a uniquely associated Hilbert space H consisting of functions on X such that

(i) k(·, x) ∈ H for all x ∈ X,
    where k(·, x) is the function of the first argument with x fixed: x′ ∈ X ↦ k(x′, x);

(ii) f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H and x ∈ X,
    which is called the reproducing property.

– H is called the RKHS of k.
– H can be written as the closure of span{ k(·, x) | x ∈ X }.
Kernel Mean Embeddings [Smola et al., 2007]

A framework for representing distributions in an RKHS.

– Let P be the set of all probability distributions on X.
– Let k be a kernel on X, and H be its RKHS.

For each distribution P ∈ P, define the kernel mean:

    μ_P := ∫ k(·, x) dP(x) ∈ H,

which is a representation of P in H.

A key concept: characteristic kernels [Fukumizu et al., 2008].
– The kernel k is called characteristic if, for any P, Q ∈ P, μ_P = μ_Q if and only if P = Q.
– In other words, k is characteristic if the mapping P ∈ P ↦ μ_P ∈ H is injective.
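A minimal Python sketch (my own illustration, not from the slides) of why being characteristic matters: for samples from two distributions with the same mean but different spread, the RKHS distance between the two empirical kernel means (the maximum mean discrepancy, MMD) is clearly nonzero for a Gaussian kernel, but nearly zero for a linear kernel, which only distinguishes first moments.

```python
import numpy as np

def mmd2(x, y, kernel):
    """Squared MMD estimate ||mu_P - mu_Q||_H^2 from samples x ~ P, y ~ Q."""
    kxx = np.mean([kernel(a, b) for a in x for b in x])
    kyy = np.mean([kernel(a, b) for a in y for b in y])
    kxy = np.mean([kernel(a, b) for a in x for b in y])
    return kxx + kyy - 2 * kxy

gauss = lambda a, b: np.exp(-(a - b) ** 2)   # characteristic kernel
linear = lambda a, b: a * b                  # non-characteristic kernel

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)   # P = N(0, 1)
y = rng.normal(0.0, 3.0, size=200)   # Q = N(0, 9): same mean, different variance

print(mmd2(x, y, gauss))   # clearly > 0: the Gaussian kernel separates P and Q
print(mmd2(x, y, linear))  # close to 0: the linear kernel only matches means
```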
Kernel Mean Embeddings [Smola et al., 2007]

Intuitively, k being characteristic implies that H is large enough.

Figure 2: Injective embedding of distributions into the RKHS [Muandet et al., 2017, Figure 2.3].

Examples of characteristic kernels on X = R^d:
– Gaussian and Matérn kernels [Sriperumbudur et al., 2010].

Examples of non-characteristic kernels on X = R^d:
– Linear and polynomial kernels.
Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions
Recursive Bayes Updates and Power Posteriors

Given observed data y*, Bayes' rule yields a posterior distribution:

    p(θ | y*) ∝ p(y* | θ) π(θ)
    (posterior ∝ likelihood × prior).

Recursive Bayes updates: apply Bayes' rule recursively to the same observed data y*.

    1st recursion:   π_1(θ) := p_1(θ | y*) ∝ p(y* | θ) π(θ)
    2nd recursion:   π_2(θ) := p_2(θ | y*) ∝ p(y* | θ) π_1(θ) ∝ p(y* | θ)² π(θ)
    3rd recursion:   π_3(θ) := p_3(θ | y*) ∝ p(y* | θ) π_2(θ) ∝ p(y* | θ)³ π(θ)
    · · ·
    N-th recursion:  π_N(θ) := p_N(θ | y*) ∝ p(y* | θ) π_{N−1}(θ) ∝ p(y* | θ)^N π(θ)
Power Posteriors and Maximum Likelihood Estimation

N recursive Bayes updates yield the power posterior

    p_N(θ | y*) ∝ p(y* | θ)^N π(θ).

Theorem [Lele et al., 2010]. Assume that p(y* | θ) has a unique global maximizer:

    θ* := argmax_{θ ∈ Θ} p(y* | θ).

Then, if π(θ*) > 0, under mild conditions on π(θ) and p(y | θ),

    p_N(θ | y*) → δ_{θ*} as N → ∞ (weak convergence),

where δ_{θ*} is the Dirac measure at θ*.

This implies that recursive Bayes updates provide a way of performing maximum likelihood estimation.
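The following Python sketch (my own toy illustration; the Gaussian model and the grid discretization are assumptions, not from the talk) shows this concentration numerically: on a parameter grid, the power posterior p(y*|θ)^N π(θ) puts essentially all of its mass near the maximum likelihood estimate as N grows.

```python
import numpy as np

# Toy model: y ~ N(theta, 1), observed data y* = 1.7, prior N(0, 4) on theta.
y_star = 1.7
theta = np.linspace(-5, 5, 2001)              # grid over the parameter space
log_lik = -0.5 * (y_star - theta) ** 2        # log p(y* | theta), up to a constant
log_prior = -0.5 * theta ** 2 / 4.0           # log pi(theta), up to a constant

for N in [1, 5, 50, 500]:
    log_post = N * log_lik + log_prior        # log of p(y* | theta)^N * pi(theta)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    mean = np.sum(theta * post)
    std = np.sqrt(np.sum((theta - mean) ** 2 * post))
    # The posterior mean approaches the MLE (theta = y*) and the spread shrinks.
    print(f"N={N:4d}  posterior mean={mean:.3f}  posterior std={std:.3f}")
```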
Proposed Method: Kernel Recursive ABC (Sketch)

– Recursive application of 1. Bayes' rule and 2. sampling.

For N = 1, 2, . . . , N_iter, iterate the following procedures:

1. Kernel ABC:
   If N = 1: generate θ_1, . . . , θ_n ∼ π(θ), i.i.d.
   – Simulate pseudo-data for each θ_i: y_i ∼ p(y | θ_i) (i = 1, . . . , n).
   – Estimate the kernel mean of the power posterior using (θ_i, y_i)_{i=1}^n:

        μ_{P_N} := ∫ k_Θ(·, θ) p_N(θ | y*) dθ,        (1)

     where k_Θ is a kernel on Θ and p_N(θ | y*) ∝ p(y* | θ)^N π(θ).

2. Kernel herding: sample θ′_1, . . . , θ′_n from the estimate of (1).

Set N ← N + 1 and (θ_1, . . . , θ_n) ← (θ′_1, . . . , θ′_n).
Kernel ABC [Nakagome et al., 2013]

– Define
  ◮ a kernel k_Y(y, y′) on the data space Y,
  ◮ a kernel k_Θ(θ, θ′) on the parameter space Θ, and
  ◮ a regularisation constant λ > 0.

1. Sampling: generate parameter–data pairs from the model:

    (θ_1, y_1), . . . , (θ_n, y_n) ∼ p(y | θ) π(θ), i.i.d.

2. Weight computation: given observed data y*, compute

    k_Y(y*) := (k_Y(y*, y_1), . . . , k_Y(y*, y_n))ᵀ ∈ R^n,
    (w_1(y*), . . . , w_n(y*))ᵀ := (G_Y + nλ I_n)⁻¹ k_Y(y*) ∈ R^n,

   where G_Y := (k_Y(y_i, y_j)) ∈ R^{n×n} is the kernel matrix.

Output: an estimate of the posterior kernel mean:

    ∫ k_Θ(·, θ) p(θ | y*) dθ ≈ Σ_{i=1}^n w_i(y*) k_Θ(·, θ_i),

where p(θ | y*) ∝ p(y* | θ) π(θ).
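A minimal NumPy sketch of the kernel ABC weight computation above (my own implementation for illustration; the Gaussian kernels, the toy simulator, and the choices of γ and λ are assumptions):

```python
import numpy as np

def gauss_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-||A_i - B_j||^2 / gamma^2)."""
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / gamma ** 2)

def kernel_abc_weights(y_sim, y_star, gamma=1.0, lam=0.1):
    """Weights w(y*) = (G_Y + n*lam*I_n)^(-1) k_Y(y*)."""
    n = len(y_sim)
    G = gauss_kernel_matrix(y_sim, y_sim, gamma)               # kernel matrix G_Y
    k_star = gauss_kernel_matrix(y_sim, y_star[None, :], gamma)[:, 0]
    return np.linalg.solve(G + n * lam * np.eye(n), k_star)

# Toy simulator (assumption): y | theta ~ N(theta, 0.5^2), 1-D parameter and data.
rng = np.random.default_rng(0)
n = 500
theta = rng.normal(0.0, 2.0, size=(n, 1))                      # theta_i ~ prior
y_sim = theta + 0.5 * rng.normal(size=(n, 1))                  # y_i ~ p(y | theta_i)
y_star = np.array([1.0])                                       # observed data

w = kernel_abc_weights(y_sim, y_star)
# The weighted sum sum_i w_i f(theta_i) estimates posterior expectations of
# RKHS functions f; with f(theta) = theta it approximates E[theta | y*].
print("posterior mean estimate:", float(w @ theta))
```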
Kernel ABC: The Sampling Step

1. Sampling: generate parameter–data pairs from the model:

    (θ_1, y_1), . . . , (θ_n, y_n) ∼ p(y | θ) π(θ), i.i.d.

[Diagram: parameters θ_1, . . . , θ_n are sampled from the prior π(θ) on the parameter space Θ; each θ_i is pushed through the simulator to produce pseudo-data y_i in the data space Y, alongside the observed data y*.]
Kernel ABC: The Weight Computation Step

2. Weight computation: given observed data y*, compute

    1. Similarities:  k_Y(y*) = (k_Y(y*, y_1), . . . , k_Y(y*, y_n))ᵀ,
    2. Weights:       (w_1(y*), . . . , w_n(y*))ᵀ = (G_Y + nλ I_n)⁻¹ k_Y(y*).

[Diagram: each pseudo-data point y_i is compared with the observed data y* in the data space; the resulting similarities are converted into weights on the corresponding parameters θ_i.]

    ∫ k_Θ(·, θ) p(θ | y*) dθ ≈ Σ_{i=1}^n w_i(y*) k_Θ(·, θ_i).
Kernel Herding [Chen et al., 2010]

Let
– P be a known probability distribution on Θ; and
– μ_P = ∫ k_Θ(·, θ) dP(θ) be its kernel mean.

Kernel herding is a deterministic sampling method that

– sequentially generates sample points θ′_1, . . . , θ′_n from P as

    θ′_1 := argmax_{θ ∈ Θ} μ_P(θ),
    θ′_T := argmax_{θ ∈ Θ} [ μ_P(θ) − (1/T) Σ_{ℓ=1}^{T−1} k_Θ(θ, θ′_ℓ) ]   (T = 2, . . . , n),

  where the first term is mode seeking and the second term acts as a repulsive force;

– is equivalent to greedily approximating the kernel mean μ_P:

    θ′_T = argmin_{θ ∈ Θ} ‖ μ_P − (1/T) [ k_Θ(·, θ) + Σ_{i=1}^{T−1} k_Θ(·, θ′_i) ] ‖_{H_Θ},

  if k_Θ is shift-invariant. (H_Θ is the RKHS of k_Θ.)
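A minimal Python sketch of kernel herding from an empirical kernel mean (my own illustration; the Gaussian kernel, the candidate-grid maximization in place of a continuous argmax, and the target distribution are assumptions):

```python
import numpy as np

def herding(mu_eval, kernel, candidates, n_points):
    """Greedy kernel herding: pick points maximizing mu_P(theta) minus a repulsion term.

    mu_eval(theta): evaluates the (estimated) kernel mean mu_P at theta.
    candidates:     finite grid over Theta, used in place of a continuous argmax.
    """
    selected = []
    for T in range(1, n_points + 1):
        scores = np.array([
            mu_eval(c) - (1.0 / T) * sum(kernel(c, s) for s in selected)
            for c in candidates
        ])
        selected.append(candidates[int(np.argmax(scores))])
    return np.array(selected)

kernel = lambda a, b: np.exp(-(a - b) ** 2)

# Target P: standard normal on Theta = R, represented by its empirical kernel mean.
rng = np.random.default_rng(0)
support = rng.normal(size=2000)
mu_eval = lambda t: np.mean(kernel(t, support))     # mu_P(t) ≈ (1/m) sum_j k(t, x_j)

grid = np.linspace(-4, 4, 401)
pts = herding(mu_eval, kernel, grid, n_points=10)
print(np.round(np.sort(pts), 2))   # well-spread points covering the bulk of N(0, 1)
```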
Kernel Herding [Chen et al., 2010]

Figure 3 [Chen et al., 2010, Fig. 1]: red squares are sample points generated by kernel herding; purple circles are randomly generated i.i.d. sample points.
Proposed Method: Kernel Recursive ABC (Algorithm)

For N = 1, 2, . . . , N_iter, iterate the following procedure:

1. Kernel ABC:
   If N = 1: generate θ_1, . . . , θ_n ∼ π(θ), i.i.d.
   – Generate pseudo-data from each θ_i: y_i ∼ p(y | θ_i) (i = 1, . . . , n).
   – Compute weights for θ_1, . . . , θ_n:

        k_Y(y*) = (k_Y(y*, y_1), . . . , k_Y(y*, y_n))ᵀ,
        (w_1(y*), . . . , w_n(y*))ᵀ = (G_Y + nλ I_n)⁻¹ k_Y(y*).

2. Kernel herding: sample from μ̂_{P_N} := Σ_{i=1}^n w_i(y*) k_Θ(·, θ_i):

        θ′_T := argmax_{θ ∈ Θ} [ μ̂_{P_N}(θ) − (1/T) Σ_{ℓ=1}^{T−1} k_Θ(θ, θ′_ℓ) ]   (T = 1, . . . , n).

Set N ← N + 1 and (θ_1, . . . , θ_n) ← (θ′_1, . . . , θ′_n).
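Putting the pieces together, here is a compact Python sketch of the whole loop (my own reading of the algorithm above, with a toy 1-D Gaussian simulator, grid-based herding, and ad hoc kernel and regularization choices as assumptions; not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
k = lambda a, b: np.exp(-np.subtract.outer(a, b) ** 2)   # Gaussian kernel on both spaces

def simulate(theta):
    """Toy simulator (assumption): y | theta ~ N(theta, 0.3^2)."""
    return theta + 0.3 * rng.normal(size=theta.shape)

def kernel_abc_weights(y_sim, y_star, lam=0.01):
    n = len(y_sim)
    G = k(y_sim, y_sim)
    return np.linalg.solve(G + n * lam * np.eye(n), k(y_sim, np.array([y_star]))[:, 0])

def herd(w, thetas, grid, n_points):
    """Sample from the estimated kernel mean  sum_i w_i k(., theta_i)  by kernel herding."""
    mu = k(grid, thetas) @ w                    # \hat{mu}_{P_N} evaluated on the grid
    selected, repulse = [], np.zeros_like(grid)
    for T in range(1, n_points + 1):
        idx = int(np.argmax(mu - repulse / T))  # mode seeking minus repulsive force
        selected.append(grid[idx])
        repulse += k(grid, np.array([grid[idx]]))[:, 0]
    return np.array(selected)

theta_true, n, grid = 2.0, 200, np.linspace(-10, 10, 2001)
y_star = float(simulate(np.array([theta_true]))[0])      # observed data
thetas = rng.normal(-5.0, 1.0, size=n)                   # initial draws from a (poor) prior

for N in range(10):                                      # recursive Bayes updates
    y_sim = simulate(thetas)                             # 1. simulate pseudo-data
    w = kernel_abc_weights(y_sim, y_star)                #    kernel ABC weights
    thetas = herd(w, thetas, grid, n)                    # 2. kernel herding resampling

# The point estimate should land near the true parameter despite the poor prior.
print("true theta:", theta_true, " estimate:", float(np.mean(thetas)))
```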
Why Kernels?

– The combination of kernel ABC and kernel herding leads to robustness against misspecification of the prior π(θ).
Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions
Prior Misspecification

Assume that
◮ there is a "true" parameter θ* such that y* ∼ p(y | θ*),
◮ but you don't know much about θ*.

In such a case, you may misspecify the prior π(θ).
◮ e.g., the support of π(θ) may not contain θ*.

As a result, the simulated data

    y_i ∼ p(y | θ_i), θ_i ∼ π(θ) (i = 1, . . . , n)

may end up far from the observed data y*.
Prior Misspecification

[Diagram: the support of the prior π(θ) lies away from the true parameter θ* in the parameter space Θ; consequently, the simulated data y_1, . . . , y_n are dissimilar to the observed data y* in the data space Y.]
Auto-Correction Mechanism: The Kernel ABC Step

[Diagram: the observed data y* is dissimilar to every simulated data point y_i.]

– Recall that k_Y(y*, y_i) quantifies the similarity between y* and y_i,
  e.g., for a Gaussian kernel, k_Y(y*, y_i) = exp(−dist²(y*, y_i)/γ²).

– Therefore, if y* and each y_i are dissimilar, we have

    k_Y(y*) = (k_Y(y*, y_1), . . . , k_Y(y*, y_n))ᵀ ≈ 0.

– As a result, the weights from kernel ABC become

    (w_1(y*), . . . , w_n(y*))ᵀ = (G_Y + nλ I_n)⁻¹ k_Y(y*) ≈ 0.
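A tiny numeric check of this effect (my own toy example; the kernel, γ, λ, and the data values are assumptions): when the simulated data are far from y*, the similarity vector and hence the kernel ABC weights collapse towards zero.

```python
import numpy as np

gamma, lam, n = 1.0, 0.1, 5
y_star = 2.0
y_far = np.array([-7.1, -6.8, -7.5, -6.9, -7.3])   # pseudo-data from a misspecified prior
y_near = np.array([1.8, 2.3, 1.9, 2.1, 2.2])       # pseudo-data from a well-specified prior

k = lambda a, b: np.exp(-np.subtract.outer(a, b) ** 2 / gamma ** 2)

for name, y_sim in [("misspecified", y_far), ("well-specified", y_near)]:
    k_star = k(y_sim, np.array([y_star]))[:, 0]     # similarity vector k_Y(y*)
    G = k(y_sim, y_sim)                             # kernel matrix G_Y
    w = np.linalg.solve(G + n * lam * np.eye(n), k_star)
    print(f"{name:15s} ||k_Y(y*)|| = {np.linalg.norm(k_star):.2e}  "
          f"||w|| = {np.linalg.norm(w):.2e}")

# Under misspecification both norms are essentially zero; the estimated kernel mean is
# then nearly flat, so (per the herding step) the repulsive term dominates and the next
# batch of parameters spreads out over Theta -- the auto-correction this section describes.
```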