Approximate inference: Sampling methods
Probabilistic Graphical Models
Sharif University of Technology
Spring 2018
Soleymani
Approximate inference

Approximate inference techniques:
- Deterministic approximation: variational algorithms
- Stochastic simulation / sampling methods
Sampling-based estimation

Assume that $\mathcal{D} = \{x^{(1)}, \dots, x^{(M)}\}$ is a set of i.i.d. samples drawn from the desired distribution $p$.

For any distribution $p$ and function $f$, we can estimate $\mathbb{E}_p[f]$ by the empirical expectation:
$$\mathbb{E}_p[f] \approx \frac{1}{M} \sum_{m=1}^{M} f(x^{(m)})$$

Expectations reveal interesting properties of the distribution $p$:
- Mean and variance of $p$
- Probability of events: e.g., we can find $p(x = a)$ by estimating $\mathbb{E}_p[f]$ where $f(x) = \mathbb{1}[x = a]$

Thus a set of samples gives us a stochastic representation of a complex distribution.
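As a concrete illustration (not from the slides), the sketch below estimates an event probability as the empirical expectation of an indicator function; the choices of $p = \mathcal{N}(0,1)$ and the event $x > 1$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)  # i.i.d. samples x^(m) ~ p, here p = N(0, 1)

# Estimate P(x > 1) = E_p[f] with the indicator f(x) = 1[x > 1]
estimate = (x > 1.0).mean()
print(estimate)  # close to the true value 1 - Phi(1) ≈ 0.1587
```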
Bounds on error

Let $\hat{\mu} = \frac{1}{M} \sum_{m=1}^{M} f(\boldsymbol{x}^{(m)})$ with $\boldsymbol{x}^{(m)} \sim p(\boldsymbol{x})$, and let $\mu = \mathbb{E}_p[f(\boldsymbol{x})]$.

Hoeffding bound (additive bound on error):
$$P\left(\hat{\mu} \notin [\mu - \epsilon, \mu + \epsilon]\right) \le 2 e^{-2M\epsilon^2}$$

Chernoff bound (multiplicative bound on error):
$$P\left(\hat{\mu} \notin [\mu(1 - \epsilon), \mu(1 + \epsilon)]\right) \le 2 e^{-M\mu\epsilon^2/3}$$
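As a worked consequence of the Hoeffding bound (a rearrangement, not stated on the slide): to guarantee an additive error of at most $\epsilon$ with probability at least $1 - \delta$, it suffices to take

$$M \ge \frac{\ln(2/\delta)}{2\epsilon^2},$$

e.g., $\epsilon = 0.01$ and $\delta = 0.05$ give $M \ge \ln(40)/(2 \times 10^{-4}) \approx 18{,}445$ samples.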
The mean and variance of the estimator

For samples drawn independently from the distribution $p$:
$$\hat{f} = \frac{1}{M} \sum_{m=1}^{M} f(x^{(m)})$$

$$\mathbb{E}[\hat{f}] = \mathbb{E}[f] \quad \text{(the estimator is unbiased)}$$
$$\mathrm{Var}[\hat{f}] = \frac{1}{M} \, \mathbb{E}\left[\left(f - \mathbb{E}[f]\right)^2\right]$$
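A quick numerical check of the $1/M$ variance decay; a minimal sketch assuming $p = \mathcal{N}(0,1)$ and $f(x) = x^2$, for which $\mathbb{E}[f] = 1$ and $\mathrm{Var}[f] = 2$:

```python
import numpy as np

rng = np.random.default_rng(0)
for M in [10, 100, 1000]:
    # Empirical variance of f_hat across many independent repetitions
    estimates = [np.mean(rng.normal(size=M) ** 2) for _ in range(2000)]
    print(M, np.var(estimates))  # roughly Var[f] / M = 2 / M
```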
Monte Carlo methods

Use a set of samples to answer an inference query:
- Expectations can be approximated using sample-based averages
- Asymptotically exact and easy to apply to arbitrary problems

Challenges:
- Drawing samples from many distributions is not trivial
- Are the gathered samples enough?
- Are all samples useful, or equally useful?
Generating samples from a distribution

Assume that we have an algorithm that generates (pseudo-)random numbers distributed uniformly over (0,1).

How do we generate samples from other distributions using these? First, we consider simple cases:
- Bernoulli
- Multinomial
- Other standard distributions
Transformation technique

We intend to generate samples from standard distributions by mapping the values produced by the uniform random number generator such that the mapped samples have the desired distribution.

Choose a function $f(\cdot)$ such that the resulting values $z = f(y)$ have some specific desired distribution $p(z)$:
$$p(z) = p(y) \left| \frac{dy}{dz} \right|$$

Since $p(y) = 1$ over $(0,1)$, we have:
$$y = \int_{-\infty}^{z} p(z') \, dz'$$

If we define $h(z) \equiv \int_{-\infty}^{z} p(z') \, dz'$, then $z = h^{-1}(y)$.
Transformation technique

Sampling via the cumulative distribution function (CDF): if $y \sim U(0,1)$ and $h(\cdot)$ is the CDF of $p$, then $h^{-1}(y) \sim p$.

Since we need to calculate and then invert the indefinite integral of $p$, this is feasible only for a limited number of simple distributions.

Thus, we will first see rejection sampling and importance sampling (in the next slides), which serve as important components in more general sampling techniques.
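A minimal sketch of this inverse-CDF technique for a distribution whose CDF is invertible in closed form; the exponential distribution is an illustrative choice, not one from the slides:

```python
import numpy as np

def sample_exponential(rate, n, rng):
    """Inverse-CDF sampling: h(z) = 1 - exp(-rate*z), so h^{-1}(y) = -ln(1 - y)/rate."""
    y = rng.uniform(0.0, 1.0, size=n)  # y ~ U(0, 1)
    return -np.log(1.0 - y) / rate     # z = h^{-1}(y) ~ Exponential(rate)

rng = np.random.default_rng(0)
z = sample_exponential(2.0, 100_000, rng)
print(z.mean())  # ≈ 1 / rate = 0.5
```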
Rejection sampling

Suppose we wish to sample from $p(\boldsymbol{x}) = \tilde{p}(\boldsymbol{x})/Z$:
- $p(\boldsymbol{x})$ is difficult to sample from, but $\tilde{p}(\boldsymbol{x})$ is easy to evaluate.

We choose a simpler (proposal) distribution $q(\boldsymbol{x})$ that we can sample from more easily, where $\exists k$ such that $k q(\boldsymbol{x}) \ge \tilde{p}(\boldsymbol{x})$ for all $\boldsymbol{x}$:
- Sample from $q(\boldsymbol{x})$: $\boldsymbol{x}^* \sim q(\boldsymbol{x})$
- Accept $\boldsymbol{x}^*$ with probability $\dfrac{\tilde{p}(\boldsymbol{x}^*)}{k q(\boldsymbol{x}^*)}$

[Figure: the envelope $kq(x)$ lying above $\tilde{p}(x)$, with a proposed sample $x^*$.]
Rejection sampling

Correctness: the density of the accepted samples is
$$\frac{\dfrac{\tilde{p}(\boldsymbol{x})}{k q(\boldsymbol{x})} \, q(\boldsymbol{x})}{\displaystyle\int \frac{\tilde{p}(\boldsymbol{x})}{k q(\boldsymbol{x})} \, q(\boldsymbol{x}) \, d\boldsymbol{x}} = \frac{\tilde{p}(\boldsymbol{x})}{\displaystyle\int \tilde{p}(\boldsymbol{x}) \, d\boldsymbol{x}} = p(\boldsymbol{x})$$

Probability of acceptance:
$$P(\text{accept}) = \int \frac{\tilde{p}(\boldsymbol{x})}{k q(\boldsymbol{x})} \, q(\boldsymbol{x}) \, d\boldsymbol{x} = \frac{Z}{k}$$
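A minimal sketch of the accept/reject loop, assuming an unnormalized standard normal target $\tilde{p}(x) = e^{-x^2/2}$ and a Laplace(0, 1) proposal; $k = 2e^{1/2}$ is the maximum of $\tilde{p}/q$, attained at $|x| = 1$:

```python
import numpy as np

def rejection_sample(p_tilde, q_sample, q_pdf, k, n, rng):
    """Draw n samples from p = p_tilde / Z, given an envelope k*q(x) >= p_tilde(x)."""
    out = []
    while len(out) < n:
        x = q_sample(rng)                   # x* ~ q(x)
        u = rng.uniform(0.0, k * q_pdf(x))  # uniform height under the envelope at x*
        if u <= p_tilde(x):                 # accept with probability p_tilde / (k*q)
            out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
p_tilde = lambda x: np.exp(-0.5 * x**2)      # unnormalized N(0, 1)
q_pdf = lambda x: 0.5 * np.exp(-np.abs(x))   # Laplace(0, 1) density
q_sample = lambda rng: rng.laplace(0.0, 1.0)
k = 2.0 * np.exp(0.5)                        # max_x p_tilde(x) / q_pdf(x)
samples = rejection_sample(p_tilde, q_sample, q_pdf, k, 10_000, rng)
print(samples.mean(), samples.var())         # ≈ 0 and ≈ 1
```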
Adaptive rejection sampling

It can be difficult to determine a suitable analytic form for $q$. When $p(x)$ is log-concave, we can instead use envelope functions to define $q$:
- Intersections of tangent lines to $\ln p(x)$ are used to construct $q$.
- Initially, gradients are evaluated at an initial set of grid points, and the corresponding tangent lines are found.
- In each iteration, a sample is drawn from the envelope distribution; since the envelope is a piecewise exponential distribution, drawing a sample from it is straightforward.
- If the sample is rejected, it is incorporated into the set of grid points, a new tangent line is computed, and $q$ is thereby refined.

[Figure: tangent lines to $\ln p(x)$ at grid points $x_1, x_2, x_3$ forming the envelope.]
High-dimensional rejection sampling

Problem: low acceptance rate
- Rejection sampling suffers in high-dimensional spaces: the acceptance rate decreases exponentially with dimensionality.

Example: using $q = \mathcal{N}(\boldsymbol{\mu}, \sigma_q^2 \boldsymbol{I})$ to sample from $p = \mathcal{N}(\boldsymbol{\mu}, \sigma_p^2 \boldsymbol{I})$:
- The optimal constant is $k = (\sigma_q/\sigma_p)^d$.
- If $\sigma_q$ exceeds $\sigma_p$ by just 1% and $d = 1000$, then $k \approx 20{,}000$, so the optimal acceptance rate is $1/20{,}000$, which is far too small.
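The numbers on this slide can be reproduced directly (a sketch of the arithmetic, not of rejection sampling itself):

```python
sigma_ratio, d = 1.01, 1000
k = sigma_ratio ** d  # optimal envelope constant (sigma_q / sigma_p)^d
print(f"k = {k:,.0f}, optimal acceptance rate = {1 / k:.1e}")
# k ≈ 20,959, acceptance rate ≈ 4.8e-05
```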
Importance sampling

Suppose sampling from $p$ is hard, so a simpler proposal distribution $q$ is used instead. If $q$ dominates $p$ (i.e., $q(\boldsymbol{x}) > 0$ whenever $p(\boldsymbol{x}) > 0$), we can sample from $q$ and reweight the obtained samples:
$$\mathbb{E}_p[f(\boldsymbol{x})] = \int f(\boldsymbol{x}) \, p(\boldsymbol{x}) \, d\boldsymbol{x} = \int f(\boldsymbol{x}) \, \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})} \, q(\boldsymbol{x}) \, d\boldsymbol{x}$$
$$\mathbb{E}_p[f(\boldsymbol{x})] \approx \frac{1}{M} \sum_{m=1}^{M} f(\boldsymbol{x}^{(m)}) \, w^{(m)}, \qquad \boldsymbol{x}^{(m)} \sim q(\boldsymbol{x}), \quad w^{(m)} = \frac{p(\boldsymbol{x}^{(m)})}{q(\boldsymbol{x}^{(m)})}$$
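A minimal sketch of this reweighting with fully known densities; the target $\mathcal{N}(0,1)$, proposal $\mathcal{N}(0,4)$, and $f(x) = x^2$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=100_000)    # x^(m) ~ q = N(0, 4)

# w^(m) = p(x)/q(x) for p = N(0,1), q = N(0,4); the normalizers contribute the factor 2
w = 2.0 * np.exp(-0.5 * x**2 + 0.125 * x**2)
print(np.mean(x**2 * w))                  # ≈ E_p[x^2] = 1
```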
Normalized importance sampling

Suppose that we can only evaluate $\tilde{p}(\boldsymbol{x})$, where $p(\boldsymbol{x}) = \tilde{p}(\boldsymbol{x})/Z_p$, and likewise $q(\boldsymbol{x}) = \tilde{q}(\boldsymbol{x})/Z_q$:
$$\mathbb{E}_p[f(\boldsymbol{x})] = \int f(\boldsymbol{x}) \, p(\boldsymbol{x}) \, d\boldsymbol{x} = \frac{Z_q}{Z_p} \int f(\boldsymbol{x}) \, r(\boldsymbol{x}) \, q(\boldsymbol{x}) \, d\boldsymbol{x}, \qquad r(\boldsymbol{x}) = \frac{\tilde{p}(\boldsymbol{x})}{\tilde{q}(\boldsymbol{x})}$$

The unknown ratio of normalizers can be estimated from the same samples:
$$\frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(\boldsymbol{x}) \, d\boldsymbol{x} = \int r(\boldsymbol{x}) \, q(\boldsymbol{x}) \, d\boldsymbol{x}$$

Therefore:
$$\mathbb{E}_p[f(\boldsymbol{x})] = \frac{\int f(\boldsymbol{x}) \, r(\boldsymbol{x}) \, q(\boldsymbol{x}) \, d\boldsymbol{x}}{\int r(\boldsymbol{x}) \, q(\boldsymbol{x}) \, d\boldsymbol{x}} \approx \sum_{m=1}^{M} f(\boldsymbol{x}^{(m)}) \, w^{(m)}, \qquad \boldsymbol{x}^{(m)} \sim q, \quad w^{(m)} = \frac{r^{(m)}}{\sum_{m'=1}^{M} r^{(m')}}$$
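The same estimate when only $\tilde{p}$ is available; a sketch under the same illustrative target and proposal as before, where self-normalizing the weights cancels the unknown $Z_p$:

```python
import numpy as np

rng = np.random.default_rng(0)
p_tilde = lambda x: np.exp(-0.5 * x**2)  # unnormalized target; Z_p is never needed
q_pdf = lambda x: np.exp(-x**2 / 8.0) / (2.0 * np.sqrt(2.0 * np.pi))  # q = N(0, 4)

x = rng.normal(0.0, 2.0, size=100_000)   # x^(m) ~ q
r = p_tilde(x) / q_pdf(x)                # unnormalized ratios r^(m)
w = r / r.sum()                          # self-normalized weights w^(m)
print(np.sum(w * x**2))                  # ≈ E_p[x^2] = 1 despite unknown Z_p
```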
Importance sampling: problem

Importance sampling depends on how well $q$ matches $p$:
- For mismatched distributions, the weight sum may be dominated by a few samples with large weights, the remaining weights being relatively insignificant.
- It is common that $p(\boldsymbol{x}) f(\boldsymbol{x})$ is strongly varying and has a significant proportion of its mass concentrated in a small region.
- The problem is severe if none of the samples falls in the regions where $p(\boldsymbol{x}) f(\boldsymbol{x})$ is large: the estimate of the expectation may be severely wrong, while the variance of the $r^{(m)}$ can still be small.

A key requirement for $q(\boldsymbol{x})$ is that it should not be small or zero in regions where $p(\boldsymbol{x})$ may be significant.

[Figure: $p(x)$, $q(x)$, and $f(x)$, illustrating a proposal that misses the region where $pf$ is large. From the Bishop book.]
Sampling methods for graphical models

- DGMs: forward (or ancestral) sampling; likelihood-weighted sampling.
- For UGMs, there is no one-pass sampling strategy that can sample even from the prior distribution with no observed variables. Instead, computationally more expensive techniques such as Gibbs sampling are needed; these will be introduced in the next slides.
Sampling the joint distribution represented by a BN

Sample the joint distribution by ancestral sampling. Example:
- Sample from $P(D)$ ⇒ $D = d^1$
- Sample from $P(I)$ ⇒ $I = i^0$
- Sample from $P(G \mid i^0, d^1)$ ⇒ $G = g^3$
- Sample from $P(S \mid i^0)$ ⇒ $S = s^0$
- Sample from $P(L \mid g^3)$ ⇒ $L = l^0$

One sample $(d^1, i^0, g^3, s^0, l^0)$ was generated.
Forward sampling in a BN

Given a BN and the number of samples $M$ (see the sketch below):
- Choose a topological ordering of the variables, e.g., $X_1, \dots, X_n$
- For $m = 1$ to $M$:
  - For $i = 1$ to $n$: sample $x_i^{(m)}$ from the distribution $P(X_i \mid \boldsymbol{x}_{Pa(X_i)}^{(m)})$
  - Add $\{x_1^{(m)}, \dots, x_n^{(m)}\}$ to the sample set
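A sketch of this procedure on the example network from the previous slide; the CPD numbers below are hypothetical placeholders, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_sample(rng):
    """One ancestral sample: parents (D, I) before G, then S and L."""
    d = int(rng.random() < 0.4)                       # P(d1), hypothetical
    i = int(rng.random() < 0.3)                       # P(i1), hypothetical
    g_cpd = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
             (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}
    g = rng.choice([1, 2, 3], p=g_cpd[(i, d)])        # P(G | I, D)
    s = int(rng.random() < (0.8 if i else 0.05))      # P(s1 | I)
    l = int(rng.random() < {1: 0.9, 2: 0.6, 3: 0.01}[g])  # P(l1 | G)
    return {"D": d, "I": i, "G": int(g), "S": s, "L": l}

samples = [forward_sample(rng) for _ in range(10_000)]
```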
Sampling for conditional probability query

$P(i^1 \mid l^0, s^0) = ?$

Looking at the samples, we can count:
- $M$: total number of samples
- $M_e$: number of samples in which the evidence holds ($L = l^0$, $S = s^0$)
- $M_I$: number of samples in which the joint event is true ($L = l^0$, $S = s^0$, $I = i^1$)

For a large enough $M$:
- $M_e / M \approx P(l^0, s^0)$
- $M_I / M \approx P(i^1, l^0, s^0)$

And so we can set:
$$P(i^1 \mid l^0, s^0) = \frac{P(i^1, l^0, s^0)}{P(l^0, s^0)} \approx \frac{M_I}{M_e}$$
Using rejection sampling to compute $P(\boldsymbol{X} \mid \boldsymbol{e})$

Given a BN, a query $P(\boldsymbol{X} \mid \boldsymbol{e})$, and the number of samples $M$:
- Choose a topological ordering of the variables, e.g., $X_1, \dots, X_n$
- $m = 1$
- While $m \le M$:
  - For $i = 1$ to $n$: sample $x_i^{(m)}$ from the distribution $P(X_i \mid \boldsymbol{x}_{Pa(X_i)}^{(m)})$
  - If $\{x_1^{(m)}, \dots, x_n^{(m)}\}$ is consistent with the evidence $\boldsymbol{e}$, add it to the sample set and set $m = m + 1$
- Use the samples to compute $P(\boldsymbol{X} \mid \boldsymbol{e})$ as on the previous slide (a code sketch follows below)
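A sketch of this rejection loop for the query $P(i^1 \mid l^0, s^0)$, reusing the hypothetical forward_sample from the earlier sketch:

```python
def estimate_conditional(n_accepted, rng):
    """Estimate P(i1 | L = l0, S = s0): keep only evidence-consistent samples."""
    kept = []
    while len(kept) < n_accepted:        # collect M_e accepted samples
        s = forward_sample(rng)
        if s["L"] == 0 and s["S"] == 0:  # evidence: L = l0, S = s0
            kept.append(s["I"])
    return sum(kept) / n_accepted        # M_I / M_e

print(estimate_conditional(5_000, rng))
```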