Monte Carlo

— Monte Carlo basics and motivation
— Rejection sampling
— Importance sampling
— Next time: Markov chain Monte Carlo

Iain Murray
http://iainmurray.net/


Monte Carlo and Insomnia

Enrico Fermi (1901–1954) took great delight in astonishing his colleagues
with his remarkably accurate predictions of experimental results … he
revealed that his “guesses” were really derived from the statistical
sampling techniques that he used to calculate with whenever insomnia
struck in the wee morning hours!

— “The beginning of the Monte Carlo method”, N. Metropolis


Linear Regression: Prior

[Figure: functions drawn from the prior P(θ)]

Input → output mappings considered plausible before seeing data.


Linear Regression: Posterior

[Figure: functions drawn from the posterior, P(θ | Data) ∝ P(Data | θ) P(θ)]

Posterior much more compact than prior.
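For concreteness, here is a minimal Octave sketch of the kind of model behind plots like these: Bayesian linear regression with a Gaussian prior on the weights and Gaussian observation noise. All the numbers (prior scale sigma_w, noise scale sigma_n, the toy data) are assumed for illustration; they are not taken from the slides.

    % Minimal sketch: draws from the prior and posterior of Bayesian
    % linear regression, y = Phi*theta + Gaussian noise. All constants
    % and data below are assumed for illustration.
    sigma_w = 2;  sigma_n = 0.1;                 % assumed prior/noise std devs
    x = [-1; 0.5; 1.2];  y = [-0.8; 0.3; 1.1];   % made-up observations
    Phi = [ones(size(x)) x];                     % basis: bias + slope
    xg = linspace(-2, 2, 100)';
    Phig = [ones(size(xg)) xg];

    theta_prior = sigma_w * randn(2, 12);        % 12 draws: theta ~ N(0, sigma_w^2 I)
    plot(xg, Phig*theta_prior, 'b'); hold on;

    V = inv(eye(2)/sigma_w^2 + Phi'*Phi/sigma_n^2);  % posterior covariance
    m = V*Phi'*y/sigma_n^2;                          % posterior mean
    theta_post = m + chol(V)'*randn(2, 12);      % 12 draws from N(m, V)
    plot(xg, Phig*theta_post, 'r');
    plot(x, y, 'k+', 'markersize', 10);

The prior draws fan out across the whole plot; the posterior draws collapse onto the few lines consistent with the data.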
Model mismatch

What will Bayesian linear regression do?

[Figure: curved data with draws from the posterior over linear fits —
a non-linear error envelope, but every possible explanation is linear]


Quiz: ‘Underfitting’

Given a (wrong) linear assumption, which explanations are typical of the
posterior distribution?

[Figure: panels A, B and C, three sets of candidate lines through the
same curved data]

D  All of the above
E  None of the above
Z  Not sure

Posterior very certain despite blatant misfit. Prior ruled out truth.
Microsoft Kinect (Shotton et al., 2011)

— Eyeball modelling assumptions
— Generate training data
— Random forest applied to fantasies

[Figure: depth image x* mapped to body-part predictions p(y* | x*, D)]


The need for integrals

p(y^* \mid x^*, \mathcal{D}) = \int \mathrm{d}\theta \; p(y^*, \theta \mid x^*, \mathcal{D})
                             = \int \mathrm{d}\theta \; p(y^* \mid x^*, \theta) \, p(\theta \mid \mathcal{D})

(Crossed out on the slide: y* is independent of D given θ, and θ does not
depend on the test input x*.)


A statistical problem

What is the average height of the people in this room?
Method: measure our heights, add them up and divide by N.

What is the average height f of people p in Edinburgh E?

\mathbb{E}_{p \in \mathcal{E}}[f(p)] \equiv \frac{1}{|\mathcal{E}|} \sum_{p \in \mathcal{E}} f(p) \qquad \text{“intractable”?}

\approx \frac{1}{S} \sum_{s=1}^{S} f(p^{(s)}), \qquad \text{for a random survey of } S \text{ people } \{p^{(s)}\} \in \mathcal{E}

Surveying works for large and notionally infinite populations.


Simple Monte Carlo

\int f(x) \, P(x) \, \mathrm{d}x \approx \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}), \qquad x^{(s)} \sim P(x)

Example: making predictions

P(y^* \mid x^*, \mathcal{D}) = \int P(y^* \mid x^*, \theta) \, p(\theta \mid \mathcal{D}) \, \mathrm{d}\theta
                             \approx \frac{1}{S} \sum_{s=1}^{S} P(y^* \mid x^*, \theta^{(s)}), \qquad \theta^{(s)} \sim p(\theta \mid \mathcal{D})

Many other integrals appear throughout statistical machine learning.
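A minimal Octave sketch of the simple Monte Carlo estimator, using an assumed toy integrand f(x) = x² under P(x) = N(0, 1), chosen because the exact expectation is 1 (the example is mine, not the slide's):

    % Simple Monte Carlo: approximate an integral by a sample average.
    % Toy choice (assumed): f(x) = x.^2, P(x) = N(0,1), so the truth is 1.
    S = 1e5;
    x = randn(S, 1);        % x^(s) ~ P(x)
    f_hat = mean(x.^2)      % approximates \int f(x) P(x) dx = E[x^2] = 1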
Properties of Monte Carlo

Estimator:

\int f(x) \, P(x) \, \mathrm{d}x \approx \hat{f} \equiv \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}), \qquad x^{(s)} \sim P(x)

Estimator is unbiased:

\mathbb{E}_{P(\{x^{(s)}\})}\big[\hat{f}\big] = \frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(x)}[f(x)] = \mathbb{E}_{P(x)}[f(x)]

Variance shrinks ∝ 1/S:

\mathrm{var}_{P(\{x^{(s)}\})}\big[\hat{f}\big] = \frac{1}{S^2} \sum_{s=1}^{S} \mathrm{var}_{P(x)}[f(x)] = \mathrm{var}_{P(x)}[f(x)] / S

“Error bars” shrink like 1/\sqrt{S}.


Aside: don’t always sample!

“Monte Carlo is an extremely bad method; it should be used only when all
alternative methods are worse.”
— Alan Sokal, 1996


A dumb approximation of π

P(x, y) = \begin{cases} 1 & 0 < x < 1 \text{ and } 0 < y < 1 \\ 0 & \text{otherwise} \end{cases}

\pi = 4 \iint \mathbb{I}\big((x^2 + y^2) < 1\big) \, P(x, y) \, \mathrm{d}x \, \mathrm{d}y

octave:1> S=12; a=rand(S,2); 4*mean(sum(a.*a,2)<1)
ans = 3.3333
octave:2> S=1e7; a=rand(S,2); 4*mean(sum(a.*a,2)<1)
ans = 3.1418


Alternatives to Monte Carlo

There are other methods of numerical integration!
Example: (nice) 1D integrals are easy:

octave:1> 4 * quadl(@(x) sqrt(1-x.^2), 0, 1, tolerance)

Gives π to 6 dp’s in 108 evaluations, machine precision in 2598.
(NB Matlab’s quadl fails at tolerance=0, but Octave works.)

In higher dimensions sometimes deterministic approximations work:
Variational Bayes, Laplace, … (covered later)
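Returning to the Monte Carlo error bars: a quick empirical check of the 1/√S behaviour, rerunning the π estimator above at several sample sizes (the repeat count of 100 is an arbitrary choice):

    % Rerun the pi estimator many times at each S and look at the spread:
    % the standard deviation should drop ~10x for every 100x increase in S.
    for S = [1e2 1e4 1e6]
      est = zeros(100, 1);
      for r = 1:100
        a = rand(S, 2);
        est(r) = 4*mean(sum(a.*a, 2) < 1);
      end
      printf('S = %7d   std of estimates = %.4f\n', S, std(est));
    end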
Reminder

Want to sample to approximate expectations:

\int f(x) \, P(x) \, \mathrm{d}x \approx \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}), \qquad x^{(s)} \sim P(x)

How do we get the samples?


Sampling simple distributions

Use library routines for univariate distributions
(and some other special cases).

This book (free online) explains how some of them work:
http://luc.devroye.org/rnbookindex.html


Sampling discrete values

[Figure: the unit interval split into segments with lengths equal to the
outcome probabilities; u = 0.4 falls in the segment for x = b]

u \sim \text{Uniform}[0, 1], \qquad u = 0.4 \;\Rightarrow\; x = b

There are more efficient ways for large numbers of values and samples —
see the Devroye book.


Sampling from densities

How to convert samples from a Uniform[0,1] generator:

h(y) = \int_{-\infty}^{y} p(y') \, \mathrm{d}y'

u \sim \text{Uniform}[0, 1], \qquad y(u) = h^{-1}(u)

[Figure from PRML, Bishop (2006): the cumulative distribution h(y) drawn
above the density p(y)]

Although we can’t always compute and invert h(y).
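When h(y) is available in closed form, the recipe is two lines. An Octave sketch of both tricks, assuming an Exponential(1) target (whose CDF h(y) = 1 − e^{−y} inverts analytically) and made-up discrete probabilities:

    % Inverse-CDF sampling, assuming target p(y) = exp(-y) on y > 0:
    % h(y) = 1 - exp(-y), so y = h^{-1}(u) = -log(1 - u).
    S = 1e5;
    u = rand(S, 1);                   % u ~ Uniform[0,1]
    y = -log(1 - u);                  % y ~ Exponential(1)
    mean(y)                           % close to the true mean, 1

    % Discrete case: find which segment of the cumulative sum u lands in.
    p = [0.2 0.5 0.3];                % assumed probabilities of outcomes 1,2,3
    [counts, xs] = histc(rand(S, 1), [0 cumsum(p)]);  % xs(s) in {1,2,3}
    mean(xs == 2)                     % close to p(2) = 0.5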
Sampling from densities

Draw points uniformly under the curve:

[Figure credit: Ryan P. Adams — points x^(1), …, x^(4) scattered
uniformly under the curve P(x)]

Probability mass to the left of a point ∼ Uniform[0,1]


Rejection sampling

Sampling from π(x) using a tractable q(x):

[Figure: a scaled proposal k q(x) enveloping π(x); uniform points under
k q(x) are kept only if they also fall under π(x)]


Importance sampling

Rewrite the integral as an expectation under a simple distribution Q:

\int f(x) \, P(x) \, \mathrm{d}x = \int f(x) \frac{P(x)}{Q(x)} Q(x) \, \mathrm{d}x
\approx \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}) \frac{P(x^{(s)})}{Q(x^{(s)})}, \qquad x^{(s)} \sim Q(x)

Simple Monte Carlo applied to any integral.
Unbiased and independent of dimension?


Importance sampling (2)

If we only know P(x) = P^*(x) / Z_P up to a constant:

\int f(x) \, P(x) \, \mathrm{d}x \approx \frac{Z_Q}{Z_P} \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}) \underbrace{\frac{P^*(x^{(s)})}{Q^*(x^{(s)})}}_{w^{*(s)}}, \qquad x^{(s)} \sim Q(x)

\approx \sum_{s=1}^{S} f(x^{(s)}) \frac{w^{*(s)}}{\sum_{s'} w^{*(s')}} \qquad \text{(substituting the estimate of } Z_P/Z_Q \text{ from the exercise below)}

This estimator is consistent but biased.

Exercise: prove that Z_P / Z_Q \approx \frac{1}{S} \sum_{s} w^{*(s)}
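A runnable Octave sketch of both estimators, assuming a standard normal target and a Cauchy proposal — a pair chosen so the envelope constant and the truth E[x²] = 1 are known; none of these choices are specified on the slides:

    % Rejection sampling: target pi(x) = N(0,1), proposal q(x) = Cauchy.
    % k = 1.6 satisfies k*q(x) >= pi(x) everywhere (the max ratio is ~1.52).
    S = 1e5;
    x = tan(pi*(rand(S,1) - 0.5));            % exact draws from Cauchy q(x)
    q = 1 ./ (pi*(1 + x.^2));                 % proposal density at x
    target = exp(-x.^2/2) / sqrt(2*pi);       % pi(x), fully normalized here
    k = 1.6;
    accept = rand(S,1) < target ./ (k*q);     % keep with prob pi(x)/(k q(x))
    samples = x(accept);
    [mean(samples), var(samples), mean(accept)]   % ~[0, 1, 1/k]

    % Self-normalized importance sampling with the same proposal, using
    % only the unnormalized target P*(x) = exp(-x.^2/2):
    w = exp(-x.^2/2) ./ q;                    % unnormalized weights w*(s)
    f_hat = sum(x.^2 .* w) / sum(w)           % estimate of E[x^2] = 1
    mean(w)                                   % ~sqrt(2*pi) = Z_P/Z_Q, as in the exercise

The heavy-tailed Cauchy proposal keeps the importance weights bounded; the reverse pairing (light-tailed proposal, heavy-tailed target) can give infinite-variance weights.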
Application to large problems

Approximations scale badly with dimensionality.

Example: P(x) = \mathcal{N}(0, \mathbb{I}), \quad Q(x) = \mathcal{N}(0, \sigma^2 \mathbb{I})

Rejection sampling: requires σ ≥ 1. Fraction of proposals accepted = σ^{−D}.

Importance sampling:

\mathrm{Var}[P(x)/Q(x)] = \left( \frac{\sigma^2}{2 - 1/\sigma^2} \right)^{D/2} - 1

Infinite / undefined variance if \sigma \le 1/\sqrt{2}.


Summary so far

• Monte Carlo: approximate expectations with a sample average
• Rejection sampling: draw samples from complex distributions
• Importance sampling: apply Monte Carlo to ‘any’ sum/integral

Next: high-dimensional problems — MCMC
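As an appendix, an empirical check of the σ^{−D} acceptance fraction from the "Application to large problems" slide, for the same Gaussian pair (σ = 1.5 and the sample size are arbitrary choices of mine):

    % Rejection sampling with P = N(0,I), Q = N(0, sigma^2 I), sigma >= 1.
    % The tightest envelope is k = sigma^D, giving per-proposal acceptance
    % probability P(x)/(k Q(x)) = exp(-0.5*(1 - 1/sigma^2)*||x||^2).
    sigma = 1.5;  S = 1e5;
    for D = [1 5 10 20]
      x = sigma * randn(S, D);                      % proposals from Q
      log_r = -0.5*(1 - 1/sigma^2) * sum(x.^2, 2);  % log[P(x)/(k Q(x))]
      accept = rand(S, 1) < exp(log_r);
      printf('D = %2d   accepted: %.5f   sigma^-D: %.5f\n', ...
             D, mean(accept), sigma^-D);
    end

Even for this mild mismatch, the acceptance rate falls from ~2/3 at D = 1 to ~0.0003 at D = 20, which is why high-dimensional problems call for MCMC.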