Calibrating misspecified ERGMs for Bayesian inference
Nial Friel, University College Dublin
nial.friel@ucd.ie
December 2015
Joint work with Lampros Bouranis and Florian Maire.
Motivation
◮ Many statistical models have intractable (or difficult to evaluate) likelihood functions.
◮ Composite likelihoods provide a generic approach to overcoming this computational difficulty.
◮ A natural idea in a Bayesian context is to consider the approximate posterior distribution

  π_cl(θ | y) ∝ f_cl(y | θ) π(θ).

◮ Surprisingly, there has been very little study of such a misspecified posterior distribution.
Motivation
[Figure: marginal posterior densities of θ_1 (edges) and θ_2 (2-stars), comparing the target posterior, the pseudoposterior, and the calibrated pseudoposterior.]
Introduction
◮ We focus on the exponential random graph model (ERGM) – widely used in statistical network analysis.
◮ The pseudolikelihood function provides a low-dimensional approximation of the ERG likelihood.
◮ We provide a framework which allows one to calibrate the pseudo-posterior distribution.
◮ In experiments our approach gave improved statistical efficiency with respect to more computationally demanding Monte Carlo approaches.
Exponential random graph model

  f(y | θ) = exp{θ^T s(y)} / z(θ) = q_θ(y) / z(θ).

◮ y is the observed adjacency matrix on n nodes, where y_ij = 1 if there is an edge connecting nodes i and j, and y_ij = 0 otherwise.
◮ s(y) ∈ R^k is a known vector of sufficient statistics.
◮ θ ∈ R^k is a vector of parameters.
◮ z(θ) is the normalizing constant,

  z(θ) = Σ_{all possible graphs} exp{θ^T s(y)}.

◮ There are 2^(n choose 2) possible undirected graphs on n nodes.
◮ Calculation of z(θ) is infeasible for all but trivially small graphs, as the sketch below illustrates.
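A minimal sketch, not from the talk, of brute-force evaluation of z(θ) for a tiny graph; the choice of statistics (edge and triangle counts) and all function names are illustrative assumptions. It makes the combinatorial explosion concrete: the sum has 2^(n choose 2) terms.

```python
import itertools
import numpy as np

def suff_stats(adj):
    """Sufficient statistics s(y) = (#edges, #triangles) of an undirected graph."""
    edges = adj.sum() // 2
    triangles = np.trace(adj @ adj @ adj) // 6
    return np.array([edges, triangles])

def z_brute_force(theta, n):
    """Sum exp{theta^T s(y)} over all 2^(n choose 2) undirected graphs on n nodes."""
    pairs = list(itertools.combinations(range(n), 2))
    total = 0.0
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        adj = np.zeros((n, n), dtype=int)
        for (i, j), b in zip(pairs, bits):
            adj[i, j] = adj[j, i] = b
        total += np.exp(theta @ suff_stats(adj))
    return total

# n = 5 means 2^10 = 1024 graphs; n = 10 would already require 2^45 terms.
print(z_brute_force(np.array([-1.0, 0.5]), n=5))
```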
Model specification: Network statistics
[Figure: configurations used as network statistics. Directed graphs: edge, mutual edge, 2-in-star, 2-out-star, 2-mixed-star, transitive triad, cyclic triad. Undirected graphs: edge, 2-star, 3-star, triangle.]
Pseudolikelihood approximation (Besag, '74; Strauss and Ikeda, '90)

  f_pl(y | θ) = ∏_{i≠j} p(y_ij | y_{-ij}, θ)
              = ∏_{i≠j} p(y_ij = 1 | y_{-ij}, θ)^{y_ij} · {1 − p(y_ij = 1 | y_{-ij}, θ)}^{1 − y_ij},

where y_{-ij} denotes y \ y_ij.
◮ Each factor in the product is a Bernoulli probability.
◮ Estimation is equivalent to logistic regression (see the sketch below).
◮ Assumes the collection {y_ij | y_{-ij}} are mutually independent.
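A minimal sketch, assuming a two-statistic model (edges and triangles), of maximum pseudolikelihood estimation as logistic regression: the response for dyad (i, j) is y_ij and the covariate vector is the change statistic δs(y)_ij = s(y with y_ij = 1) − s(y with y_ij = 0). The helper names are illustrative, and the Newton (IRLS) loop stands in for any logistic regression routine.

```python
import numpy as np

def change_stats(adj, i, j):
    """Change in (#edges, #triangles) when toggling dyad (i, j) from 0 to 1."""
    common = np.sum(adj[i] * adj[j])   # each common neighbour closes a triangle
    return np.array([1.0, float(common)])

def mple(adj, n_iter=25):
    """Maximum pseudolikelihood estimate via Newton-Raphson (IRLS)."""
    n = adj.shape[0]
    X, y = [], []
    for i in range(n):
        for j in range(i + 1, n):      # undirected graph: one term per dyad
            X.append(change_stats(adj, i, j))
            y.append(adj[i, j])
    X, y = np.array(X), np.array(y, dtype=float)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p)                          # score of the logistic model
        hess = -(X * (p * (1 - p))[:, None]).T @ X    # Hessian (negative definite)
        theta -= np.linalg.solve(hess, grad)          # Newton step
    return theta
```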
Bayesian inference

  π(θ | y) = [q(y | θ) / z(θ)] · π(θ) / π(y)

◮ Challenging to sample from the posterior distribution: both z(θ) and the evidence π(y) are intractable.
◮ π(θ | y) is often called a doubly-intractable distribution.
1. Approximate exchange algorithm (AEA) (Caimo and Friel, 2011).
2. Bottleneck: requires a sample from f(y | θ).
Exchange algorithm (Murray et al., 2006)
◮ An auxiliary variable scheme to sample from the augmented distribution:

  π(θ′, y′, θ | y) ∝ f(y | θ) · π(θ) · h(θ′ | θ) · f(y′ | θ′),   (1)

◮ f(y′ | θ′): the same distribution as the original distribution on which the data y are defined.
◮ h(θ′ | θ): an arbitrary proposal distribution for the augmented variable θ′.
◮ Crucially, this requires a draw from f(y′ | θ′) at each iteration. Perfect sampling is not feasible for ERGMs.
◮ Pragmatic solution: run M transitions of a Markov chain targeting f(y | θ′).
Algorithm 1: Approximate exchange algorithm (AEA)
Input: initial setting θ, number of iterations T.
Output: a realization of length T from π(θ | y).
for t = 1, ..., T do
  Propose θ′ ∼ h(· | θ^(t));
  Propose y′ ∼ R_M(· | θ′) ["tie-no-tie" (TNT) sampler];
  Exchange move from (θ^(t), y), (θ′, y′) to (θ′, y), (θ^(t), y′) with probability

    α = min{ 1, [q(y′ | θ^(t)) · h(θ^(t) | θ′) · π(θ′) · q(y | θ′)] / [q(y | θ^(t)) · π(θ^(t)) · h(θ′ | θ^(t)) · q(y′ | θ′)] },

  in which the normalizing constants z(θ^(t)) and z(θ′) cancel;
  If accepted, set θ^(t+1) ← θ′; otherwise θ^(t+1) ← θ^(t);
end
The Bergm package in R implements the AEA (Caimo and Friel, 2014). (See Anto's tutorial for more details.) A schematic sketch follows.
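A minimal schematic sketch of the AEA, not the Bergm implementation: `simulate_network(theta, M)` (M transitions of a TNT-style chain targeting f(· | θ)) and `suff_stats` are assumed helpers. Since log q_θ(y) = θ^T s(y), the normalizing constants cancel and the log acceptance ratio has closed form.

```python
import numpy as np

def aea(y, log_prior, simulate_network, suff_stats, T=5000, M=1000, step=0.1):
    s_obs = suff_stats(y)
    d = s_obs.size
    theta = np.zeros(d)
    samples = np.empty((T, d))
    for t in range(T):
        theta_prop = theta + step * np.random.randn(d)   # symmetric h(.|theta)
        y_aux = simulate_network(theta_prop, M)          # M auxiliary transitions
        s_aux = suff_stats(y_aux)
        # log alpha: the z(theta) and z(theta_prop) terms have cancelled
        log_alpha = ((theta - theta_prop) @ s_aux
                     + (theta_prop - theta) @ s_obs
                     + log_prior(theta_prop) - log_prior(theta))
        if np.log(np.random.rand()) < log_alpha:
            theta = theta_prop
        samples[t] = theta
    return samples
```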
◮ Intuitively one expects that the number of auxiliary iterations, M, should be proportional to the number of dyads of the graph, (n choose 2).
◮ This is supported by:
  – the invariant distribution of the approximate exchange algorithm converging to the true target as the number of auxiliary iterations, M, increases (Everitt, 2012);
  – the exponentially slow convergence of TNT sampling from an ERG model (Bhamidi et al., 2011).
◮ Conservative approach: choose a large M...
◮ This yields a computationally intensive procedure for larger graphs, due to the exponentially long mixing time of the auxiliary draw from the likelihood.
Pseudo-posterior distribution
◮ Replace the true likelihood f(y | θ) with the misspecified pseudolikelihood:

  π_pl(θ | y) ∝ f_pl(y | θ) · π(θ)

◮ Straightforward to sample from π_pl(θ | y) using an MH sampler (see the sketch below).
Calibration approach
1. Mode adjustment of π_pl(θ | y).
2. Curvature adjustment of π_pl(θ | y).
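A minimal sketch of random-walk Metropolis-Hastings targeting the pseudo-posterior. The helper `log_pseudolik` is an assumption; it could be built from the logistic-regression representation sketched earlier.

```python
import numpy as np

def mh_pseudo_posterior(log_pseudolik, log_prior, theta0, T=20000, step=0.05):
    theta = np.asarray(theta0, dtype=float)
    logp = log_pseudolik(theta) + log_prior(theta)   # unnormalized log target
    samples = np.empty((T, theta.size))
    for t in range(T):
        prop = theta + step * np.random.randn(theta.size)
        logp_prop = log_pseudolik(prop) + log_prior(prop)
        if np.log(np.random.rand()) < logp_prop - logp:   # MH accept/reject
            theta, logp = prop, logp_prop
        samples[t] = theta
    return samples
```

Each evaluation of f_pl is cheap (one logistic-regression likelihood), which is precisely why sampling this misspecified target is so much faster than the AEA.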
Calibration approach
Notation:
◮ π: the target distribution.
◮ ν(θ) = π_pl(θ | y): the misspecified target.
◮ ν_1(θ) = π_pl^(1)(θ | y): the mean-adjusted target.
◮ ν_2(θ) = π_pl^(2)(θ | y): the fully calibrated target, after curvature adjustment.

  arg max_Θ π = θ*,   H_π(θ)|_{θ*} = H*,
  arg max_Θ ν = θ̂_PL,   H_ν(θ)|_{θ̂_PL} = Ĥ_PL.

Objective
Given a sample from ν, find a mapping φ: Θ → Θ such that the corrected samples φ(θ) = (φ(θ_1), φ(θ_2), ...) satisfy:

  arg max_θ ν_2 = θ*,   H_{ν_2}(θ)|_{θ*} = H*.
Our approach requires estimation of the MAP and Hessian of π(θ | y). Two key facts:

1. ∇_θ log π(θ | y) = s(y) − E_{y|θ}[s(y)] + ∇_θ log π(θ).
2. ∇²_θ log π(θ | y) = −V_{y|θ}[s(y)] + ∇²_θ log π(θ).

Both the expectation and the variance of s(y) can be estimated by simulating networks from f(· | θ); see the sketch below.
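A minimal sketch, reusing the assumed `simulate_network` and `suff_stats` helpers, of Monte Carlo estimates of the score and Hessian from the two identities above, under a flat prior so the prior terms vanish.

```python
import numpy as np

def score_and_hessian(theta, s_obs, simulate_network, suff_stats,
                      n_sims=500, M=1000):
    sims = np.array([suff_stats(simulate_network(theta, M))
                     for _ in range(n_sims)])
    grad = s_obs - sims.mean(axis=0)       # s(y) - E_{y|theta}[s(y)]
    hess = -np.cov(sims, rowvar=False)     # -V_{y|theta}[s(y)]
    return grad, hess
```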
Mean Adjustment (correct the mode of ν)
◮ Need ν_1(θ) = ν(θ − τ_0), with τ_0 = θ* − θ̂_pl, so that ν_1 admits θ* as its mode.
◮ Mapping: φ_1: θ → θ + θ* − θ̂_pl.
◮ Denote ξ = (ξ_1 = φ_1(θ_1), ξ_2 = φ_1(θ_2), ...).
◮ θ* via stochastic optimization: solve ∇_θ log π(θ | y) = 0, i.e. s(y) = E_{y|θ}[s(y)] under a flat prior, using the gradient identity above (a Robbins-Monro sketch follows below).
◮ θ̂_pl via BFGS: standard logistic regression theory.
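A minimal sketch, again using the assumed helpers, of Robbins-Monro stochastic approximation for θ*: the noisy score estimate s(y) − s(y_sim) is driven to zero with a decaying step size (flat prior assumed).

```python
import numpy as np

def robbins_monro(theta0, s_obs, simulate_network, suff_stats,
                  n_iter=200, M=1000, a0=0.01):
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        s_sim = suff_stats(simulate_network(theta, M))  # one-sample score estimate
        theta = theta + (a0 / k) * (s_obs - s_sim)      # decaying step size a0/k
    return theta
```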
Curvature Adjustment (match H_{ν_1}(θ)|_{θ*} with H*)
◮ Obtain ν_2 via ν_2(θ) = ν_1(W(θ − θ*) + θ*), for some W ∈ M_d(R), so that:

  H_{ν_2}(θ)|_{θ*} = W^T H_{ν_1}(θ)|_{θ*} W = W^T Ĥ_PL W = H*.

◮ It is sufficient to choose W = M^{-1} N, where −H* = N^T N and −Ĥ_PL = M^T M.
◮ Samples ζ_i = φ(θ_i) = φ_2 ∘ φ_1(θ_i) are obtained through the composed affine map φ: θ → V_0(θ − θ̂_PL) + θ*, where V_0 = W^{-1}, so that the pseudo mode maps to θ* and H_{ν_2}(θ)|_{θ*} = V_0^{-T} Ĥ_PL V_0^{-1} = H* (see the sketch below).
See Ribatet et al. (2012) for a similar approach.
Note:
◮ φ_1 and φ_2 are non-commutative operators.
◮ Samples ζ_i = φ_2 ∘ φ_1(θ_i) ≠ ζ′_i = φ_1 ∘ φ_2(θ_i).
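A minimal sketch of the full calibration step applied to pseudo-posterior samples, assuming the modes θ*, θ̂_PL and (negative-definite) Hessians H*, Ĥ_PL have already been estimated, e.g. by the sketches above. Cholesky factors play the roles of N and M, and V_0 = W^{-1} is taken as one convenient choice of the map.

```python
import numpy as np

def calibrate(samples, theta_star, theta_pl, H_star, H_pl):
    """Apply phi = phi_2 o phi_1 to a (T, d) array of pseudo-posterior samples."""
    N = np.linalg.cholesky(-H_star).T   # upper factor: -H*   = N^T N
    M = np.linalg.cholesky(-H_pl).T     # upper factor: -H_PL = M^T M
    W = np.linalg.solve(M, N)           # W = M^{-1} N
    V0 = np.linalg.inv(W)               # V0 = W^{-1}, so V0^{-T} H_PL V0^{-1} = H*
    # Affine map: theta -> V0 (theta - theta_pl) + theta_star
    return (samples - theta_pl) @ V0.T + theta_star
```

Because the map is affine, it preserves the Monte Carlo sample's structure; only the location and curvature of the pseudo-posterior are corrected to match the target.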