Monte Carlo


  1. Monte Carlo
Iain Murray, http://iainmurray.net/

Outline:
- Motivation
- Monte Carlo basics
- Rejection sampling
- Importance sampling
- Next time: Markov chain Monte Carlo

Monte Carlo and Insomnia. Enrico Fermi (1901–1954) took great delight in astonishing his colleagues with his remarkably accurate predictions of experimental results . . . he revealed that his "guesses" were really derived from the statistical sampling techniques that he used to calculate with whenever insomnia struck in the wee morning hours! (from "The beginning of the Monte Carlo method", N. Metropolis)

Linear Regression: Prior / Linear Regression: Posterior
[Figure: two panels of function draws, one from the prior P(θ), one from the posterior P(θ | Data) ∝ P(Data | θ) P(θ).]
The prior shows the input → output mappings considered plausible before seeing data. The posterior is much more compact than the prior.
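To make the prior/posterior pictures concrete, here is a minimal Octave sketch that draws lines from a Gaussian prior on the weights and from the conjugate posterior. The basis, prior covariance, noise variance and toy data points below are illustrative assumptions, not the settings behind the actual figures:

% Draws from a Bayesian linear regression prior and posterior (toy settings).
x   = linspace(-2, 4, 50)';            % plotting grid
Phi = [ones(50,1) x];                  % basis: bias + slope
S0  = 4*eye(2);                        % assumed prior covariance on weights
w_prior = chol(S0)' * randn(2, 12);    % 12 draws from the prior P(theta)
plot(x, Phi*w_prior);                  % mappings plausible before seeing data

xd = [0; 1; 2]; yd = [0.3; 1.1; 1.8]; s2 = 0.1;   % assumed toy data and noise
Phid = [ones(3,1) xd];
SN = inv(inv(S0) + Phid'*Phid/s2);     % posterior covariance (conjugate Gaussian)
mN = SN * (Phid'*yd/s2);               % posterior mean
w_post = bsxfun(@plus, mN, chol(SN)'*randn(2, 12));   % 12 posterior draws
plot(x, Phi*w_post);                   % visibly more compact than the prior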

  2. Linear Regression: Posterior / Model mismatch
[Figure: draws from the posterior, P(θ | Data) ∝ P(Data | θ) P(θ). The data's error envelope is non-linear, but every possible explanation is linear.]
What will Bayesian linear regression do?

Quiz: 'Underfitting'. Given a (wrong) linear assumption, which explanations are typical of the posterior distribution?
[Figure: three panels of candidate posterior draws, labelled A, B and C.]
A / B / C
D: All of the above
E: None of the above
Z: Not sure

The posterior is very certain despite the blatant misfit: the prior ruled out the truth.

  3. Microsoft Kinect (Shotton et al., 2011)
Eyeball modelling assumptions. Generate training data. Random forest applied to fantasies.

The need for integrals:
p(y* | x*, D) = ∫ p(y*, θ | x*, D) dθ = ∫ p(y* | x*, θ) p(θ | D) dθ
(y* no longer depends on D once θ is known, and θ does not depend on the test input x*.)

A statistical problem. What is the average height of the people in this room? Method: measure our heights, add them up and divide by N. What is the average height f of people p in Edinburgh E?
E_{p∈E}[f(p)] ≡ (1/|E|) Σ_{p∈E} f(p),   "intractable"?
            ≈ (1/S) Σ_{s=1}^S f(p^{(s)}),   for a random survey of S people {p^{(s)}} ∈ E
Surveying works for large and notionally infinite populations.

Simple Monte Carlo. In general:
∫ f(x) P(x) dx ≈ (1/S) Σ_{s=1}^S f(x^{(s)}),   x^{(s)} ∼ P(x)

Example: making predictions:
P(y* | x*, D) = ∫ P(y* | x*, θ) p(θ | D) dθ ≈ (1/S) Σ_{s=1}^S P(y* | x*, θ^{(s)}),   θ^{(s)} ∼ p(θ | D)

Many other integrals appear throughout statistical machine learning.
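As a minimal sketch of simple Monte Carlo (the choices of f and P here are placeholder assumptions): estimate E[f(x)] for f(x) = x² under P(x) = N(0, 1), whose exact value is 1:

% Simple Monte Carlo: E[x.^2] under P = N(0,1); exact answer is 1.
S = 1e5;
x = randn(S, 1);        % x^(s) ~ P(x)
f_hat = mean(x.^2)      % (1/S) * sum_s f(x^(s)); prints roughly 1.00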

  4. Properties of Monte Carlo
Estimator:  ∫ f(x) P(x) dx ≈ f̂ ≡ (1/S) Σ_{s=1}^S f(x^{(s)}),   x^{(s)} ∼ P(x)

The estimator is unbiased:
E[f̂] = (1/S) Σ_{s=1}^S E_{P(x)}[f(x)] = E_{P(x)}[f(x)]

Variance shrinks ∝ 1/S:
var[f̂] = (1/S²) Σ_{s=1}^S var_{P(x)}[f(x)] = var_{P(x)}[f(x)] / S

"Error bars" shrink like 1/√S.

Aside: don't always sample! "Monte Carlo is an extremely bad method; it should be used only when all alternative methods are worse." (Alan Sokal, 1996)

A dumb approximation of π:
P(x, y) = 1 if 0 < x < 1 and 0 < y < 1, and 0 otherwise
π = 4 ∫∫ I((x² + y²) < 1) P(x, y) dx dy

octave:1> S=12; a=rand(S,2); 4*mean(sum(a.*a,2)<1)
ans = 3.3333
octave:1> S=1e7; a=rand(S,2); 4*mean(sum(a.*a,2)<1)
ans = 3.1418

Alternatives to Monte Carlo. There are other methods of numerical integration! Example: (nice) 1D integrals are easy:
octave:1> 4 * quadl(@(x) sqrt(1-x.^2), 0, 1, tolerance)
Gives π to 6 dp's in 108 evaluations, machine precision in 2598. (NB Matlab's quadl fails at tolerance=0, but Octave works.)
In higher dimensions deterministic approximations sometimes work: variational Bayes, Laplace, . . . (covered later)
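The 1/√S behaviour can be checked empirically by repeating the π estimator above at a few sample sizes; the repeat count and sizes here are arbitrary choices:

% Empirical check that Monte Carlo error bars shrink like 1/sqrt(S).
for S = [1e2 1e4 1e6]
  est = zeros(50, 1);
  for r = 1:50
    a = rand(S, 2);
    est(r) = 4*mean(sum(a.*a, 2) < 1);
  end
  printf('S = %8d   std of estimate = %.5f\n', S, std(est));
end
% Each 100x increase in S cuts the standard deviation by about 10x.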

  5. Reminder. We want to sample to approximate expectations:
∫ f(x) P(x) dx ≈ (1/S) Σ_{s=1}^S f(x^{(s)}),   x^{(s)} ∼ P(x)
How do we get the samples?

Sampling simple distributions: use library routines for univariate distributions (and some other special cases). This book (free online) explains how some of them work: http://luc.devroye.org/rnbookindex.html

Sampling discrete values:
[Figure: the unit interval partitioned in proportion to the outcome probabilities; u ∼ Uniform[0,1], e.g. u = 0.4 ⇒ x = b.]
There are more efficient ways for large numbers of values and samples; see the Devroye book.

Sampling from densities: how to convert samples from a Uniform[0,1] generator:
h(y) = ∫_{−∞}^{y} p(y′) dy′
u ∼ Uniform[0,1]
Sample: y(u) = h^{−1}(u)
(Figure from PRML, Bishop, 2006.) Although we can't always compute and invert h(y).
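When h(y) is invertible in closed form the recipe is two lines. A minimal sketch for the Exponential(λ) density, where h(y) = 1 − exp(−λy); the value λ = 2 is an arbitrary choice:

% Inverse-CDF sampling from Exponential(lambda).
lambda = 2;
u = rand(1e6, 1);            % u ~ Uniform[0,1]
y = -log(1 - u) / lambda;    % y = h^{-1}(u)
mean(y)                      % close to the exact mean 1/lambda = 0.5

The discrete case works the same way, with the cumulative sum playing the role of h; the probabilities below are made up:

% Sampling a discrete value: u = 0.4 falls in the second bin, i.e. x = b.
p = [0.3 0.4 0.3];           % probabilities of outcomes a, b, c
u = rand();
x = find(u <= cumsum(p), 1); % index of the sampled value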

  6. Sampling from densities
Draw points uniformly under the curve P(x):
[Figure: points x^(1), x^(2), x^(3), x^(4) under a density curve P(x); the probability mass to the left of a point is ∼ Uniform[0,1]. Figure credit: Ryan P. Adams.]

Rejection sampling: sampling from π(x) using a tractable q(x).

Importance sampling. Rewrite the integral as an expectation under a simple distribution Q:
∫ f(x) P(x) dx = ∫ f(x) (P(x)/Q(x)) Q(x) dx
              ≈ (1/S) Σ_{s=1}^S f(x^{(s)}) P(x^{(s)})/Q(x^{(s)}),   x^{(s)} ∼ Q(x)
Simple Monte Carlo applied to any integral. Unbiased and independent of dimension?

Importance sampling (2). If we only know P(x) = P*(x)/Z_P up to a constant, define w*^{(s)} ≡ P*(x^{(s)})/Q*(x^{(s)}):
∫ f(x) P(x) dx ≈ (Z_Q/Z_P) (1/S) Σ_{s=1}^S f(x^{(s)}) w*^{(s)},   x^{(s)} ∼ Q(x)
              ≈ Σ_{s=1}^S f(x^{(s)}) w*^{(s)} / Σ_{s′} w*^{(s′)}
This estimator is consistent but biased.
Exercise: prove that Z_P/Z_Q ≈ (1/S) Σ_s w*^{(s)}.
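Minimal sketches of both methods, with target, proposal and constants all chosen for illustration. For rejection sampling, take the unnormalized target P*(x) = x²(1−x)² on [0,1], proposal Q = Uniform[0,1], and k = 0.07 so that kQ(x) ≥ P*(x) everywhere (the maximum of P* is 0.0625):

% Rejection sampling: accept x ~ Q with probability P*(x)/(k*Q(x)).
pstar = @(x) x.^2 .* (1-x).^2;        % unnormalized target
k = 0.07;
x = rand(1e6, 1);                     % proposals from Q = Uniform[0,1]
keep = rand(1e6, 1) < pstar(x) / k;   % accept test
samples = x(keep);                    % exact samples from P(x) ∝ P*(x)

% Self-normalized importance sampling for E[x] under the same target;
% by symmetry the exact answer is 0.5.
xq = rand(1e6, 1);                    % x^(s) ~ Q
w = pstar(xq);                        % w*^(s) = P*(x)/Q*(x), with Q* = 1 here
f_hat = sum(w .* xq) / sum(w)         % consistent but biased; about 0.5

Note that every proposal contributes to the importance-sampling estimate, while rejection sampling throws most of its draws away here (it accepts a fraction Z_P/k, roughly 48%).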

  7. Application to large problems
Approximations scale badly with dimensionality.
Example: P(x) = N(0, I),  Q(x) = N(0, σ²I)
Rejection sampling: requires σ ≥ 1; the fraction of proposals accepted is σ^{−D}
Importance sampling: Var[P(x)/Q(x)] = (σ²/(2 − 1/σ²))^{D/2} − 1
Infinite/undefined variance if σ ≤ 1/√2

Summary so far:
• Monte Carlo: approximate expectations with a sample average
• Rejection sampling: draw samples from complex distributions
• Importance sampling: apply Monte Carlo to 'any' sum/integral
Next: high-dimensional problems: MCMC
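The importance-sampling blow-up is easy to reproduce. A small sketch (σ = 1.5 and the dimensions below are arbitrary choices) comparing the empirical weight variance against the formula above:

% Variance of P(x)/Q(x) for P = N(0, I), Q = N(0, sigma^2 I), vs dimension D.
sigma = 1.5;
for D = [1 5 10 20]
  x = sigma * randn(1e5, D);                                % x^(s) ~ Q
  logw = 0.5*sum((x/sigma).^2 - x.^2, 2) + D*log(sigma);    % log P(x)/Q(x)
  v_hat  = var(exp(logw));
  v_true = (sigma^2 / (2 - 1/sigma^2))^(D/2) - 1;
  printf('D = %2d   empirical %8.2f   formula %8.2f\n', D, v_hat, v_true);
end
% The empirical estimate is itself noisy at large D: the weights are
% heavy-tailed, which is exactly the problem being illustrated.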
