Sampling and Monte Carlo Integration
Michael Gutmann
Probabilistic Modelling and Reasoning (INFR11134)
School of Informatics, University of Edinburgh
Spring semester 2018
Recap

Learning and inference often involve intractable integrals, e.g.

◮ Marginalisation
  $p(x) = \int_y p(x, y) \, dy$

◮ Expectations
  $\mathbb{E}[g(x) \mid y_o] = \int g(x) \, p(x \mid y_o) \, dx$
  for some function $g$.

◮ For unobserved variables, the likelihood and the gradient of the log-likelihood
  $L(\theta) = p(\mathcal{D}; \theta) = \int_u p(u, \mathcal{D}; \theta) \, du$
  $\nabla_\theta \ell(\theta) = \mathbb{E}_{p(u \mid \mathcal{D}; \theta)}\left[\nabla_\theta \log p(u, \mathcal{D}; \theta)\right]$

Notation: $\mathbb{E}_{p(x)}$ is sometimes used to indicate that the expectation is taken with respect to $p(x)$.
Recap

Learning and inference often involve intractable integrals, e.g.

◮ For unnormalised models with intractable partition functions
  $L(\theta) = p(\mathcal{D}; \theta) = \frac{\tilde{p}(\mathcal{D}; \theta)}{\int_x \tilde{p}(x; \theta) \, dx}$
  $\nabla_\theta \ell(\theta) \propto m(\mathcal{D}; \theta) - \mathbb{E}_{p(x; \theta)}\left[m(x; \theta)\right]$

◮ Combined case of unnormalised models with intractable partition functions and unobserved variables.

◮ Evaluation of intractable integrals can sometimes be avoided by using other learning criteria (e.g. score matching).

◮ Here: methods to approximate integrals like those above using sampling.
Program

1. Monte Carlo integration
2. Sampling
Program

1. Monte Carlo integration
   ◮ Approximating expectations by averages
   ◮ Importance sampling
2. Sampling
Averages with iid samples

◮ Tutorial 7: for Gaussians, the sample average is an estimate (MLE) of the mean (expectation) $\mathbb{E}[x]$:
  $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \approx \mathbb{E}[x]$

◮ Gaussianity is not needed: assume the $x_i$ are iid observations of $x \sim p(x)$. Then
  $\mathbb{E}[x] = \int x \, p(x) \, dx \approx \bar{x}_n, \qquad \bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$

◮ The subscript $n$ reminds us that we used $n$ samples to compute the average.

◮ Approximating integrals by means of sample averages is called Monte Carlo integration.
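To make this concrete, here is a minimal sketch of Monte Carlo integration with iid samples (not from the slides; the choice of a standard normal as a stand-in for $p(x)$ is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw n iid samples x_i ~ p(x); here p(x) is a standard normal, chosen for illustration
n = 10_000
x = rng.standard_normal(n)

# The sample average is the Monte Carlo estimate of E[x] (true value 0 for this p)
x_bar = x.mean()
print(x_bar)
```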
Averages with iid samples

◮ The sample average is unbiased:
  $\mathbb{E}[\bar{x}_n] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[x_i] \overset{*}{=} \frac{n}{n} \mathbb{E}[x] = \mathbb{E}[x]$
  ($*$: the "identically distributed" assumption is used, not independence)

◮ Variability:
  $\mathbb{V}[\bar{x}_n] = \frac{1}{n^2} \mathbb{V}\left[\sum_{i=1}^n x_i\right] \overset{*}{=} \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}[x_i] = \frac{1}{n} \mathbb{V}[x]$
  ($*$: the independence assumption is used)

◮ The squared error decreases as $1/n$:
  $\mathbb{E}\left[(\bar{x}_n - \mathbb{E}[x])^2\right] = \mathbb{V}[\bar{x}_n] = \frac{1}{n} \mathbb{V}[x]$
Averages with iid samples

◮ Weak law of large numbers:
  $\Pr\left(|\bar{x}_n - \mathbb{E}[x]| \ge \epsilon\right) \le \frac{\mathbb{V}[x]}{n \epsilon^2}$

◮ As $n \to \infty$, the probability that the sample average deviates from the expected value goes to zero.

◮ We say that the sample average converges in probability to the expected value.

◮ The speed of convergence depends on the variance $\mathbb{V}[x]$.

◮ Different "laws of large numbers" exist that make different assumptions.
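As an illustrative check (a sketch under the assumption that $p$ is a standard normal, not part of the original slides), one can compare the empirical probability of deviating from the mean by more than $\epsilon$ with the bound $\mathbb{V}[x]/(n\epsilon^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.1
true_mean, var = 0.0, 1.0            # standard normal: E[x] = 0, V[x] = 1

for n in [10, 100, 1000, 10_000]:
    # 1000 independent sample averages, each based on n samples
    averages = rng.standard_normal((1000, n)).mean(axis=1)
    p_deviate = np.mean(np.abs(averages - true_mean) >= eps)  # empirical deviation probability
    bound = var / (n * eps**2)                                # weak-LLN / Chebyshev bound
    print(n, p_deviate, min(bound, 1.0))
```

Both the empirical probability and the bound shrink as $n$ grows; the bound is typically much looser than what is actually observed.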
Chebyshev's inequality

◮ The weak law of large numbers is a direct consequence of Chebyshev's inequality.

◮ Chebyshev's inequality: let $s$ be a random variable with mean $\mathbb{E}[s]$ and variance $\mathbb{V}[s]$. Then
  $\Pr\left(|s - \mathbb{E}[s]| \ge \epsilon\right) \le \frac{\mathbb{V}[s]}{\epsilon^2}$

◮ This means that for all random variables:
  ◮ the probability of deviating more than three standard deviations from the mean is less than $1/9 \approx 0.11$ (set $\epsilon = 3\sqrt{\mathbb{V}[s]}$);
  ◮ the probability of deviating more than six standard deviations is less than $1/36 \approx 0.03$.

These are conservative values; for many distributions, the probabilities will be smaller.
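A quick numerical comparison (an illustrative sketch using SciPy, not from the slides) shows how conservative the bound is for a Gaussian:

```python
from scipy import stats

# Exact probability that a Gaussian deviates from its mean by at least 3 standard deviations
p_exact = 2 * stats.norm.sf(3.0)   # about 0.0027
p_chebyshev = 1.0 / 9              # Chebyshev bound, about 0.11
print(p_exact, p_chebyshev)        # the bound holds, but is far from tight here
```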
Proofs (not examinable)

◮ Chebyshev's inequality follows from Markov's inequality.

◮ Markov's inequality: for a random variable $y \ge 0$,
  $\Pr(y \ge t) \le \frac{\mathbb{E}[y]}{t} \qquad (t > 0)$

◮ Chebyshev's inequality is obtained by setting $y = |s - \mathbb{E}[s]|$:
  $\Pr\left(|s - \mathbb{E}[s]| \ge t\right) = \Pr\left((s - \mathbb{E}[s])^2 \ge t^2\right) \le \frac{\mathbb{E}\left[(s - \mathbb{E}[s])^2\right]}{t^2}.$
  Chebyshev's inequality follows with $t = \epsilon$, and because $\mathbb{E}[(s - \mathbb{E}[s])^2]$ is the variance $\mathbb{V}[s]$ of $s$.
Proofs (not examinable)

Proof of Markov's inequality: let $t$ be an arbitrary positive number and $y$ a one-dimensional non-negative random variable with pdf $p$. We can decompose the expectation of $y$ using $t$ as split-point,
$\mathbb{E}[y] = \int_0^\infty u \, p(u) \, du = \int_0^t u \, p(u) \, du + \int_t^\infty u \, p(u) \, du.$
Since $u \ge t$ in the second term, we obtain the inequality
$\mathbb{E}[y] \ge \int_0^t u \, p(u) \, du + \int_t^\infty t \, p(u) \, du.$
The second term is $t$ times the probability that $y \ge t$, so that
$\mathbb{E}[y] \ge \int_0^t u \, p(u) \, du + t \Pr(y \ge t) \ge t \Pr(y \ge t),$
where the last inequality holds because the first term is non-negative. This gives Markov's inequality
$\Pr(y \ge t) \le \frac{\mathbb{E}[y]}{t} \qquad (t > 0).$
Averages with correlated samples

◮ When computing the variance of the sample average,
  $\mathbb{V}[\bar{x}_n] = \frac{\mathbb{V}[x]}{n},$
  we assumed that the samples are independently and identically distributed.

◮ The variance shrinks with increasing $n$, and the average becomes more and more concentrated around $\mathbb{E}[x]$.

◮ Corresponding results exist for the case of statistically dependent samples $x_i$; they are known as "ergodic theorems".

◮ They are important for the theory of Markov chain Monte Carlo methods but require advanced mathematical theory.
More general expectations

◮ So far, we have considered
  $\mathbb{E}[x] = \int x \, p(x) \, dx \approx \frac{1}{n} \sum_{i=1}^n x_i, \qquad x_i \sim p(x)$

◮ This generalises to
  $\mathbb{E}[g(x)] = \int g(x) \, p(x) \, dx \approx \frac{1}{n} \sum_{i=1}^n g(x_i), \qquad x_i \sim p(x)$

◮ The variance of the approximation, if the $x_i$ are iid, is $\frac{1}{n} \mathbb{V}[g(x)]$.
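A minimal sketch of the generalised estimator (the choice $g(x) = x^2$ and $p = \mathcal{N}(0, 1)$ is an illustrative assumption, not from the slides), for which the true value is $\mathbb{E}[g(x)] = 1$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.standard_normal(n)               # x_i ~ N(0, 1)

g = x**2                                  # g(x) = x^2, so E[g(x)] = 1
estimate = g.mean()
std_error = g.std(ddof=1) / np.sqrt(n)    # estimate of sqrt(V[g(x)] / n)
print(estimate, std_error)
```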
Example (Based on a slide from Amos Storkey)

$\mathbb{E}[g(x)] = \int g(x) \, \mathcal{N}(x; 0, 1) \, dx \approx \frac{1}{n} \sum_{i=1}^n g(x_i), \qquad x_i \sim \mathcal{N}(x; 0, 1)$

for $g(x) = x$ and $g(x) = x^2$.

[Figure: left panel shows the sample average as a function of the number of samples $n$; right panel shows the variability of the distribution of the average (0.5 quantile: solid, 0.1 and 0.9 quantiles: dashed) as a function of $n$.]
Example (Based on a slide from Amos Storkey)

$\mathbb{E}[g(x)] = \int g(x) \, \mathcal{N}(x; 0, 1) \, dx \approx \frac{1}{n} \sum_{i=1}^n g(x_i), \qquad x_i \sim \mathcal{N}(x; 0, 1)$

for $g(x) = \exp(0.6 x^2)$.

[Figure: left panel shows the sample average as a function of the number of samples $n$; right panel shows the variability of the distribution of the average (0.5 quantile: solid, 0.1 and 0.9 quantiles: dashed) as a function of $n$.]
Example

◮ Indicators that something is wrong:
  ◮ strong fluctuations in the sample average as $n$ increases;
  ◮ large, non-declining variability.

◮ Note: the integral is not finite:
  $\int \exp(0.6 x^2) \, \mathcal{N}(x; 0, 1) \, dx = \frac{1}{\sqrt{2\pi}} \int \exp(0.6 x^2) \exp(-0.5 x^2) \, dx = \frac{1}{\sqrt{2\pi}} \int \exp(0.1 x^2) \, dx = \infty$
  but for any $n$, the sample average is finite and may be mistaken for a good approximation.

◮ Check the variability when approximating the expected value by a sample average!
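The failure mode above can be reproduced numerically; the following sketch (illustrative, not from the slides) tracks the running sample average for $g(x) = \exp(0.6 x^2)$, whose expectation under $\mathcal{N}(0, 1)$ is infinite:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(100_000)
g = np.exp(0.6 * x**2)                    # E[g(x)] under N(0, 1) does not exist (is infinite)

running_avg = np.cumsum(g) / np.arange(1, g.size + 1)
# The running average jumps whenever a sample with large |x_i| comes in, and rerunning
# with a different seed gives wildly different values -- the warning signs listed above.
print(running_avg[[99, 999, 9_999, 99_999]])
```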
Approximating general integrals

◮ If the integral does not correspond to an expectation, we can smuggle in a pdf $q$ to rewrite it as an expected value with respect to $q$:
  $I = \int g(x) \, dx = \int g(x) \frac{q(x)}{q(x)} \, dx = \int \frac{g(x)}{q(x)} q(x) \, dx = \mathbb{E}_{q(x)}\left[\frac{g(x)}{q(x)}\right] \approx \frac{1}{n} \sum_{i=1}^n \frac{g(x_i)}{q(x_i)},$
  with $x_i \sim q(x)$ (iid).

◮ This is the basic idea of importance sampling.

◮ $q$ is called the importance (or proposal) distribution.
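A minimal importance-sampling sketch (the integrand and the proposal are illustrative assumptions, not from the slides): estimate $I = \int \exp(-x^2/2)\,dx = \sqrt{2\pi}$ with a Gaussian proposal of standard deviation 2:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def g(x):
    return np.exp(-0.5 * x**2)            # integrates to sqrt(2*pi) over the real line

n = 100_000
q = stats.norm(loc=0.0, scale=2.0)        # importance / proposal distribution q
x = q.rvs(size=n, random_state=rng)
weights = g(x) / q.pdf(x)                 # g(x_i) / q(x_i)

I_hat = weights.mean()
print(I_hat, np.sqrt(2 * np.pi))          # estimate vs exact value
```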
Choice of the importance distribution

◮ Call the approximation $\hat{I}$:
  $\hat{I} = \frac{1}{n} \sum_{i=1}^n \frac{g(x_i)}{q(x_i)}$

◮ $\hat{I}$ is unbiased by construction:
  $\mathbb{E}[\hat{I}] = \mathbb{E}_{q(x)}\left[\frac{g(x)}{q(x)}\right] = \int \frac{g(x)}{q(x)} q(x) \, dx = \int g(x) \, dx = I$

◮ Variance:
  $\mathbb{V}[\hat{I}] = \frac{1}{n} \mathbb{V}_{q(x)}\left[\frac{g(x)}{q(x)}\right] = \frac{1}{n} \mathbb{E}_{q(x)}\left[\left(\frac{g(x)}{q(x)}\right)^2\right] - \frac{1}{n} \underbrace{\left(\mathbb{E}_{q(x)}\left[\frac{g(x)}{q(x)}\right]\right)^2}_{I^2}$
  It depends on the second moment.
Choice of the importance distribution

◮ The second moment is
  $\mathbb{E}_{q(x)}\left[\left(\frac{g(x)}{q(x)}\right)^2\right] = \int \left(\frac{g(x)}{q(x)}\right)^2 q(x) \, dx = \int \frac{g(x)^2}{q(x)} \, dx = \int |g(x)| \frac{|g(x)|}{q(x)} \, dx$

◮ Bad: $q(x)$ is small when $|g(x)|$ is large. This gives a large variance.

◮ Good: $q(x)$ is large when $|g(x)|$ is large.

◮ The optimal $q$ equals
  $q^*(x) = \frac{|g(x)|}{\int |g(x)| \, dx}$

◮ The optimal $q$ cannot be computed, but it justifies the heuristic that $q(x)$ should be large when $|g(x)|$ is large, or that the ratio $|g(x)| / q(x)$ should be approximately constant.
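The heuristic can be seen numerically in a small sketch (illustrative, not from the slides): for the same integrand $g(x) = \exp(-x^2/2)$ as above, proposals $q = \mathcal{N}(0, \sigma^2)$ of different widths give very different weight variability, and $\sigma = 1$ makes $|g(x)|/q(x)$ exactly constant, which is the optimal case $q^* \propto |g|$ here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def g(x):
    return np.exp(-0.5 * x**2)            # exact integral: sqrt(2*pi)

n = 100_000
for scale in [0.5, 1.0, 2.0, 5.0]:        # proposal widths, q = N(0, scale^2)
    q = stats.norm(loc=0.0, scale=scale)
    x = q.rvs(size=n, random_state=rng)
    w = g(x) / q.pdf(x)
    # scale = 1.0: |g|/q is constant, so the weights have (essentially) zero variance.
    # scale = 0.5: q is small where |g| is still large; for this g and q the weight
    # variance is actually infinite, so the printed standard error is unreliable.
    print(scale, w.mean(), w.std(ddof=1) / np.sqrt(n))
```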