School of Computer Science
Approximate Inference: Monte Carlo Inference
Probabilistic Graphical Models (10-708)
Lecture 18, Nov 19, 2007
Eric Xing
Reading: J-Chap. 1, KF-Chap. 11
[Figure: example gene-regulation network — Receptor A (X1), Receptor B (X2), Kinase C (X3), Kinase D (X4), Kinase E (X5), TF F (X6), Gene G (X7), Gene H (X8)]

Monte Carlo methods
- Draw random samples from the desired distribution
- Yield a stochastic representation of a complex distribution
  - marginals and other expectations can be approximated using sample-based averages:

        E[f(x)] \approx \frac{1}{N} \sum_{t=1}^{N} f(x^{(t)})

- Asymptotically exact and easy to apply to arbitrary models
- Challenges:
  - how to draw samples from a given distribution (not all distributions can be trivially sampled)?
  - how to make better use of the samples (not all samples are useful, or equally useful; see an example later)?
  - how to know we have sampled enough?
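For concreteness, here is a minimal Python/NumPy sketch of the sample-average estimator above; the particular distribution (a standard normal) and test function f(x) = x^2 are arbitrary illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, n_samples):
    """Approximate E[f(X)] by averaging f over samples drawn from X's distribution."""
    samples = sampler(n_samples)
    return np.mean(f(samples))

# Example: X ~ N(0, 1) and f(x) = x^2, so the true expectation is Var(X) = 1.
estimate = mc_expectation(f=lambda x: x ** 2,
                          sampler=lambda n: rng.normal(0.0, 1.0, size=n),
                          n_samples=100_000)
print(estimate)  # close to 1.0, and converging as n_samples grows
```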
Example: naive sampling
- Construct samples according to the probabilities given in a BN.

Alarm example (choose the right sampling sequence). Ten samples drawn from the network:

  E0 B0 A0 M0 J0
  E0 B0 A0 M0 J0
  E0 B0 A0 M0 J1
  E0 B0 A0 M0 J0
  E0 B0 A0 M0 J0
  E0 B0 A0 M0 J0
  E1 B0 A1 M1 J1
  E0 B0 A0 M0 J0
  E0 B0 A0 M0 J0
  E0 B0 A0 M0 J0

1) Sampling: P(B) = <0.001, 0.999>; suppose it comes out false, B0. Same for E0. P(A | B0, E0) = <0.001, 0.999>; suppose it comes out false, and so on down the network.
2) Frequency counting: from the samples above, P(J | A0) = P(J, A0) / P(A0) = <1/9, 8/9>.

Example: naive sampling (cont'd)
- Construct samples according to the probabilities given in a BN (the same ten samples as above).

3) What if we want to compute P(J | A1)? We have only one such sample, so P(J | A1) = P(J, A1) / P(A1) = <1, 0>.
4) What if we want to compute P(J | B1)? No such sample is available! P(J | B1) = P(J, B1) / P(B1) cannot be estimated.
For a model with hundreds or more variables, rare events make it very hard to gather enough samples, even after a long time of sampling...
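Below is a minimal sketch of naive (ancestral) sampling and frequency counting for the alarm network. Only P(B) = 0.001 and P(A | B0, E0) = 0.001 appear in the slides; the remaining CPT values are illustrative placeholders, and the variable ordering B, E, A, J, M follows the usual alarm-network structure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Alarm-network CPTs.  Only P(B)=0.001 and P(A | B0, E0)=0.001 come from the
# slides; the other entries are illustrative placeholder values.
P_B = 0.001
P_E = 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}  # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}  # P(M=1 | A)

def sample_once():
    """Ancestral sampling: sample parents before children (B and E, then A, then J and M)."""
    b = rng.random() < P_B
    e = rng.random() < P_E
    a = rng.random() < P_A[(int(b), int(e))]
    j = rng.random() < P_J[int(a)]
    m = rng.random() < P_M[int(a)]
    return dict(B=b, E=e, A=a, J=j, M=m)

samples = [sample_once() for _ in range(10_000)]

# Frequency counting: P(J=1 | A=0) = #(J=1, A=0) / #(A=0)
a0 = [s for s in samples if not s["A"]]
print(sum(s["J"] for s in a0) / len(a0))  # close to P(J=1 | A=0) = 0.05

# The rare-event problem: B=1 almost never occurs, so P(J | B=1) cannot be
# estimated reliably (or at all) from these samples.
b1 = [s for s in samples if s["B"]]
print(len(b1), "samples with B=1 out of", len(samples))
```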
Monte Carlo methods (cont'd)
- Direct sampling
  - We have seen it.
  - Very difficult to populate a high-dimensional state space
- Rejection sampling
  - Create samples as in direct sampling, but only count the samples that are consistent with the given evidence.
- Likelihood weighting, ...
  - Sample variables and calculate an evidence weight; only create samples that support the evidence.
- Markov chain Monte Carlo (MCMC)
  - Metropolis-Hastings
  - Gibbs

Rejection sampling
- Suppose we wish to sample from a distribution \Pi(X) = \Pi'(X)/Z, where \Pi(X) is difficult to sample from but \Pi'(X) is easy to evaluate.
- Sample from a simpler distribution Q(X).
- Rejection sampling: draw x^* \sim Q(X), and accept x^* with probability \Pi'(x^*) / (k Q(x^*)), where k is chosen so that k Q(x) \ge \Pi'(x) for all x.
- Correctness:

      p(x) = \frac{[\Pi'(x)/(k Q(x))] \, Q(x)}{\int [\Pi'(x)/(k Q(x))] \, Q(x) \, dx}
           = \frac{\Pi'(x)}{\int \Pi'(x) \, dx} = \Pi(x)

- Pitfall ...
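A minimal 1-D rejection-sampling sketch, assuming an unnormalized two-bump target Π'(x) and a wide Gaussian proposal Q of my own choosing; the envelope constant K is picked conservatively so that K·Q(x) ≥ Π'(x) everywhere.

```python
import numpy as np

rng = np.random.default_rng(2)
SIGMA_Q = 3.0

def pi_unnorm(x):
    """Unnormalized target Pi'(x): a two-bump density (illustrative choice)."""
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def q_pdf(x):
    """Proposal Q(x) = N(0, SIGMA_Q^2): easy to sample and to evaluate."""
    return np.exp(-0.5 * (x / SIGMA_Q) ** 2) / (SIGMA_Q * np.sqrt(2 * np.pi))

K = 15.0  # envelope constant, chosen conservatively so that K * Q(x) >= Pi'(x) for all x

def rejection_sample(n):
    accepted = []
    while len(accepted) < n:
        x = rng.normal(0.0, SIGMA_Q)
        if rng.random() < pi_unnorm(x) / (K * q_pdf(x)):  # accept w.p. Pi'(x) / (K Q(x))
            accepted.append(x)
    return np.array(accepted)

xs = rejection_sample(5000)
print(xs.mean(), xs.std())  # accepted draws are distributed as Pi(x) = Pi'(x) / Z
```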
Rejection sampling
- Pitfall: use Q = N(\mu, \sigma_q^2 I) to sample from P = N(\mu, \sigma_p^2 I).
  - If \sigma_q exceeds \sigma_p by 1% and the dimension is d = 1000, the envelope constant must be k = (\sigma_q/\sigma_p)^d, so the optimal acceptance rate is 1/k \approx 1/20,000.
  - Big waste of samples!
- Adaptive rejection sampling
  - Use envelope functions to define Q.

Unnormalized importance sampling
- Suppose sampling from P(·) is hard.
- Suppose we can sample from a "simpler" proposal distribution Q(·) instead.
- If Q dominates P (i.e., Q(x) > 0 whenever P(x) > 0), we can sample from Q and reweight:

      E_P[f(X)] = \int f(x) P(x) \, dx = \int f(x) \frac{P(x)}{Q(x)} Q(x) \, dx
                \approx \frac{1}{M} \sum_{m=1}^{M} f(x^m) \frac{P(x^m)}{Q(x^m)}, \quad x^m \sim Q(X)
                = \frac{1}{M} \sum_{m=1}^{M} f(x^m) w^m
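A small sketch of the unnormalized importance-sampling estimator above; the target P = N(1, 1), proposal Q = N(0, 2²), and test function are illustrative choices of my own.

```python
import numpy as np

rng = np.random.default_rng(3)

def p_pdf(x):  # target P = N(1, 1); here we can evaluate the normalized density
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):  # proposal Q = N(0, 2^2), which dominates P
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

M = 100_000
x = rng.normal(0.0, 2.0, size=M)   # x^m ~ Q
w = p_pdf(x) / q_pdf(x)            # importance weights w^m = P(x^m) / Q(x^m)

f = lambda v: v                    # estimate E_P[X]; the true value is 1
print(np.mean(f(x) * w))           # (1/M) sum_m f(x^m) w^m, close to 1.0
```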
Normalized importance sampling
- Suppose we can only evaluate P'(x) = \alpha P(x) (e.g., for an MRF).
- We can get around the nasty normalization constant \alpha as follows:

      Let r(X) = \frac{P'(X)}{Q(X)}; then E_Q[r(X)] = \int \frac{P'(x)}{Q(x)} Q(x) \, dx = \int P'(x) \, dx = \alpha

- Now

      E_P[f(X)] = \int f(x) P(x) \, dx = \frac{1}{\alpha} \int f(x) \frac{P'(x)}{Q(x)} Q(x) \, dx
                = \frac{\int f(x) r(x) Q(x) \, dx}{\int r(x) Q(x) \, dx}
                \approx \frac{\sum_m f(x^m) r^m}{\sum_m r^m}, \quad x^m \sim Q(X)
                = \sum_m f(x^m) w^m, \quad \text{where } w^m = \frac{r^m}{\sum_m r^m}

Normalized vs. unnormalized importance sampling
- Unnormalized importance sampling is unbiased:

      E_Q[f(X) w(X)] = E_P[f(X)]

- Normalized importance sampling is biased; e.g., for M = 1 the estimate is

      \frac{f(x^1) w(x^1)}{w(x^1)} = f(x^1), \quad \text{so } E_Q\!\left[\frac{f(x^1) w(x^1)}{w(x^1)}\right] = E_Q[f(X)] \ne E_P[f(X)]

- However, the variance of the normalized importance sampler is usually lower in practice.
- Also, it is common that we can evaluate P'(x) but not P(x), e.g., P(x | e) = P(x, e)/P(e) for a Bayes net (take P'(x) = P(x, e)), or P(x) = P'(x)/Z for an MRF.
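The same idea when only an unnormalized target is available, as in the derivation above: a sketch of the self-normalized estimator, again with illustrative choices of P'(x) and Q.

```python
import numpy as np

rng = np.random.default_rng(4)

def p_unnorm(x):
    """Unnormalized target P'(x) = alpha * P(x); here exp(-(x-1)^2 / 2),
    so the constant alpha = sqrt(2*pi) is never needed."""
    return np.exp(-0.5 * (x - 1.0) ** 2)

def q_pdf(x):  # proposal Q = N(0, 2^2)
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

M = 100_000
x = rng.normal(0.0, 2.0, size=M)   # x^m ~ Q
r = p_unnorm(x) / q_pdf(x)         # r^m = P'(x^m) / Q(x^m)
w = r / r.sum()                    # normalized weights w^m = r^m / sum_l r^l

f = lambda v: v                    # estimate E_P[X]; the true value is 1
print(np.sum(w * f(x)))            # self-normalized estimate, close to 1.0
```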
Likelihood weighting
- We now apply normalized importance sampling to a Bayes net.
- The proposal Q is obtained from the "mutilated" BN in which we clamp the evidence nodes and cut their incoming arcs. Call this P_M.
- The unnormalized posterior is P'(x) = P(x, e).
- So for f(X_i) = \delta(X_i = x_i), we get

      \hat{P}(X_i = x_i \mid e) = \frac{\sum_m w^m \, \delta(x_i^m = x_i)}{\sum_m w^m},
      \quad \text{where } w^m = \frac{P'(x^m)}{P_M(x^m)} = \frac{P(x^m, e)}{P_M(x^m)}

Likelihood weighting algorithm
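As a concrete sketch of the likelihood-weighting procedure, the code below reuses the illustrative alarm-network CPTs assumed earlier: evidence nodes are clamped (i.e., sampled from the mutilated network P_M) and each sample's weight accumulates the probability of the clamped evidence values given their parents. The helper structure is my own, not the lecture's pseudocode.

```python
import numpy as np

rng = np.random.default_rng(5)

# The same illustrative alarm-network CPTs as in the naive-sampling example.
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}  # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}  # P(M=1 | A)

def _node(name, p_true, evidence, weight):
    """Clamp the node if it is observed (folding P(value | parents) into the
    weight); otherwise sample it from the mutilated proposal."""
    if name in evidence:
        value = evidence[name]
        weight *= p_true if value else 1.0 - p_true
    else:
        value = rng.random() < p_true
    return value, weight

def lw_sample(evidence):
    """One likelihood-weighted sample: w = P(x, e) / P_M(x) = prod_i P(e_i | parents)."""
    w = 1.0
    b, w = _node("B", P_B, evidence, w)
    e, w = _node("E", P_E, evidence, w)
    a, w = _node("A", P_A[(int(b), int(e))], evidence, w)
    j, w = _node("J", P_J[int(a)], evidence, w)
    m, w = _node("M", P_M[int(a)], evidence, w)
    return dict(B=b, E=e, A=a, J=j, M=m), w

def lw_query(query_var, evidence, n=50_000):
    """Estimate P(query_var = True | evidence) as a ratio of weighted counts."""
    num = den = 0.0
    for _ in range(n):
        x, w = lw_sample(evidence)
        den += w
        num += w * x[query_var]
    return num / den

# Posterior probability of a burglary given that both neighbors call.
print(lw_query("B", {"J": True, "M": True}))
```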
Efficiency of likelihood weighting
- The efficiency of importance sampling depends on how close the proposal Q is to the target P.
- Suppose all the evidence is at the roots. Then Q = P(X | e), and all samples have weight 1.
- Suppose all the evidence is at the leaves. Then Q is the prior, so many samples might get small weight if the evidence is unlikely.
- We can use arc reversal to make some of the evidence nodes be roots instead of leaves, but the resulting network can be much more densely connected.

Weighted resampling
- Problem with importance sampling: it depends on how well Q matches P.
  - If P(x) f(x) is strongly varying and has a significant proportion of its mass concentrated in a small region, r^m will be dominated by a few samples.
  - [Figure: samples drawn from Q plotted against the target P]
  - Note that if the high-probability-mass region of Q falls into the low-probability-mass region of P, the empirical variance of r^m = P(x^m)/Q(x^m) can be small even though the samples come from a low-probability region of P and the estimate is potentially erroneous.
- Solutions:
  - Use a heavy-tailed Q.
  - Weighted resampling, with weights

        w^m = \frac{P(x^m)/Q(x^m)}{\sum_l P(x^l)/Q(x^l)} = \frac{r^m}{\sum_l r^l}
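To see the mismatch problem numerically, here is a small illustrative experiment (my own setup, not from the slides): when the proposal is narrow and poorly matched to the target, a handful of samples carry most of the normalized weight, whereas a broader, heavier-tailed proposal spreads the weight much more evenly.

```python
import numpy as np

rng = np.random.default_rng(6)

def norm_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def top10_weight_share(mu_q, sigma_q, m=10_000):
    """Sample from Q = N(mu_q, sigma_q^2) for the target P = N(4, 1) and report
    what fraction of the total normalized weight the 10 largest weights carry."""
    x = rng.normal(mu_q, sigma_q, size=m)
    r = norm_pdf(x, 4.0, 1.0) / norm_pdf(x, mu_q, sigma_q)  # r^m = P(x^m)/Q(x^m)
    w = np.sort(r / r.sum())[::-1]                          # normalized weights, descending
    return w[:10].sum()

print(top10_weight_share(0.0, 1.0))  # narrow, poorly matched Q: a few samples dominate
print(top10_weight_share(0.0, 4.0))  # broad, heavier-tailed Q covering P: weights far more even
```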
Weighted resampling (cont'd)
- Sampling importance resampling (SIR):
  1. Draw N samples from Q: x^1, ..., x^N.
  2. Construct weights w^1, ..., w^N, where

         w^m = \frac{P(x^m)/Q(x^m)}{\sum_l P(x^l)/Q(x^l)} = \frac{r^m}{\sum_l r^l}

  3. Sub-sample x from {x^1, ..., x^N} with probabilities (w^1, ..., w^N).
- Particle filtering
  - A special weighted resampler
  - Yields samples from the posterior p(X_t | Y_{1:t})
  - [Figure: state-space model with hidden chain X_1, ..., X_t, X_{t+1} and observations Y_1, ..., Y_t, Y_{t+1}]

Sketch of particle filters
- The starting point:

      p(X_t \mid Y_{1:t}) = \frac{p(X_t \mid Y_{1:t-1}) \, p(Y_t \mid X_t)}{\int p(X_t \mid Y_{1:t-1}) \, p(Y_t \mid X_t) \, dX_t}

- Thus p(X_t | Y_{1:t}) is represented by the weighted particle set

      \left\{ X_t^m \sim p(X_t \mid Y_{1:t-1}), \quad w_t^m = \frac{p(Y_t \mid X_t^m)}{\sum_{m=1}^{M} p(Y_t \mid X_t^m)} \right\}

- A sequential weighted resampler:
  - Time update:

        p(X_{t+1} \mid Y_{1:t}) = \int p(X_{t+1} \mid X_t) \, p(X_t \mid Y_{1:t}) \, dX_t
                                \approx \sum_m w_t^m \, p(X_{t+1} \mid X_t^m)    (sample from a mixture model)

  - Measurement update:

        p(X_{t+1} \mid Y_{1:t+1}) = \frac{p(X_{t+1} \mid Y_{1:t}) \, p(Y_{t+1} \mid X_{t+1})}{\int p(X_{t+1} \mid Y_{1:t}) \, p(Y_{t+1} \mid X_{t+1}) \, dX_{t+1}}

        \Rightarrow \left\{ X_{t+1}^m \sim p(X_{t+1} \mid Y_{1:t}), \quad w_{t+1}^m = \frac{p(Y_{t+1} \mid X_{t+1}^m)}{\sum_{m=1}^{M} p(Y_{t+1} \mid X_{t+1}^m)} \right\}    (reweight)
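Below is a hedged sketch of a bootstrap particle filter (sequential importance resampling) for a simple 1-D linear-Gaussian state-space model; the model, its parameters, and the particle count are my own illustrative choices. Each iteration performs the time update (propagate particles through the transition model), the measurement update (reweight by the likelihood), and an SIR resampling step.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative 1-D linear-Gaussian state-space model:
#   X_t = A * X_{t-1} + process noise,    Y_t = X_t + observation noise.
A, SIGMA_X, SIGMA_Y, T, M = 0.9, 1.0, 0.5, 50, 1000

# Simulate a ground-truth trajectory and its observations.
x_true = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x_true[t] = A * x_true[t - 1] + rng.normal(0.0, SIGMA_X)
    y[t] = x_true[t] + rng.normal(0.0, SIGMA_Y)

def likelihood(y_t, particles):
    """p(Y_t | X_t) for each particle, under the Gaussian observation model."""
    return np.exp(-0.5 * ((y_t - particles) / SIGMA_Y) ** 2)

particles = rng.normal(0.0, 1.0, size=M)  # initial particle cloud for X_0
estimates = []
for t in range(1, T):
    # Time update: propagate each particle through p(X_t | X_{t-1}^m).
    particles = A * particles + rng.normal(0.0, SIGMA_X, size=M)
    # Measurement update: reweight by p(Y_t | X_t^m) and normalize.
    w = likelihood(y[t], particles)
    w /= w.sum()
    estimates.append(np.sum(w * particles))  # posterior-mean estimate of X_t
    # SIR resampling: draw M particles with probabilities (w^1, ..., w^M).
    particles = rng.choice(particles, size=M, replace=True, p=w)

print(np.mean(np.abs(np.array(estimates) - x_true[1:])))  # mean tracking error vs. truth
```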