
  1. Approximate Bayesian Computation for Astrostatistics. Jessi Cisewski, Department of Statistics, Yale University. October 24, 2016, SAMSI Undergraduate Workshop

  2. Our goals: Introduction to Bayesian methods (likelihoods, priors, posteriors); learn our ABCs... Approximate Bayesian Computation. Image from http://livingstontownship.org/lycsblog/?p=241

  3. Suppose we randomly select a sample of n women and measure their heights: x_1, x_2, x_3, ..., x_n. Let’s assume this follows a Normal distribution: X ∼ N(µ, σ), with density f(x; µ, σ) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))

  4. Stellar Initial Mass Function

  5. Probability density function (pdf) → the function f(x, µ, σ), with µ fixed and x variable: f(x, µ, σ) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²)). Each observation comes from this density. [Figure: density as a function of height (in), centered at µ]

  6. Probability density function (pdf) → the function f(x, µ, σ), with µ fixed and x variable: f(x, µ, σ) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²)). Each observation comes from this density. Likelihood → the same function f(x, µ, σ), with µ variable and x fixed. [Figures: density as a function of height (in) for fixed µ; likelihood as a function of µ for fixed x]

  7. Consider a random sample of size n (assuming independence, and a known σ): X_1, ..., X_n ∼ N(3, 2). The joint density is f(x_1, ..., x_n; µ, σ) = ∏_{i=1}^n 1/√(2πσ²) · exp(−(x_i − µ)²/(2σ²)). [Figure: normalized likelihood curves in µ for n = 50, 100, 250, 500]
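The curves on this slide can be reproduced in a few lines. The sketch below is a hypothetical reconstruction, not the slide’s own code: it simulates a sample with the stated true values (µ = 3, σ = 2; the sample size, grid, and seed are assumptions) and evaluates the normalized likelihood over a grid of µ.

```python
# Minimal sketch: normalized likelihood of mu for a simulated Normal sample.
import numpy as np

rng = np.random.default_rng(1)
sigma, n = 2.0, 100                      # known sigma; sample size is an assumption
x = rng.normal(loc=3.0, scale=sigma, size=n)

mu_grid = np.linspace(1.0, 4.0, 500)
# work on the log scale for numerical stability, then normalize the maximum to 1
loglik = np.array([np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                          - (x - m) ** 2 / (2 * sigma**2)) for m in mu_grid])
lik = np.exp(loglik - loglik.max())
print(mu_grid[np.argmax(lik)])           # peaks near the sample mean of x
```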

  8. Likelihood Principle: all of the information in a sample is contained in the likelihood function. The data are modeled by a likelihood function (a density or distribution function). How do we infer µ? Or any unknown parameter(s) θ?

  9. Maximum likelihood estimation: the parameter value θ that maximizes the likelihood, θ̂_MLE = argmax_θ f(x_1, ..., x_n; θ). For the Normal model, max_µ f(x_1, ..., x_n; µ, σ) = max_µ ∏_{i=1}^n 1/√(2πσ²) · exp(−(x_i − µ)²/(2σ²)). Hence µ̂ = (∑_{i=1}^n x_i)/n = x̄. [Figure: likelihood as a function of µ, maximized at the sample mean]
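As a quick check of the closed-form result, the sketch below (an illustration, not slide material; the simulated data and optimizer bounds are assumptions) maximizes the Normal log-likelihood numerically and compares the answer to the sample mean.

```python
# Numerical check that the sample mean maximizes the Normal likelihood (known sigma).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
sigma = 2.0
x = rng.normal(loc=3.0, scale=sigma, size=50)

def neg_loglik(mu):
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2))

res = minimize_scalar(neg_loglik, bounds=(1.0, 5.0), method="bounded")
print(res.x, x.mean())   # the numerical maximizer agrees with the sample mean
```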

  10. Bayesian framework. Suppose θ is our parameter(s) of interest. Classical or frequentist methods for inference consider θ to be fixed and unknown → performance of methods is evaluated by repeated sampling → consider all possible data sets. Bayesian methods consider θ to be random → consider only the observed data set and prior information

  11. Posterior distribution: π(θ | x) = f(x | θ) π(θ) / f(x) = f(x | θ) π(θ) / ∫_Θ f(x | θ) π(θ) dθ ∝ f(x | θ) π(θ), where f(x | θ) is the likelihood of the data x and π(θ) is the prior. The prior distribution allows you to “easily” incorporate your beliefs about the parameter(s) of interest. The posterior is a distribution on the parameter space given the observed data

  12. Data: y_1, ..., y_4 ∼ N(µ = 3, σ = 2), ȳ = 1.278. Prior: N(µ_0 = −3, σ_0 = 5). Posterior: N(µ_1 = 1.114, σ_1 = 0.981). [Figure: likelihood, prior, and posterior densities over µ]
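The posterior parameters quoted on slides 12-14 follow from the standard conjugate Normal-Normal update with known σ; the slide does not show the formulas, so the sketch below supplies them (a standard result, stated here as an aid rather than slide content).

```python
# Conjugate Normal–Normal update (known sigma), reproducing slides 12–14.
import numpy as np

def normal_normal_posterior(ybar, n, sigma, mu0, sigma0):
    """Posterior mean and sd of mu given ybar, with prior mu ~ N(mu0, sigma0)."""
    post_var = 1.0 / (1.0 / sigma0**2 + n / sigma**2)
    post_mean = post_var * (mu0 / sigma0**2 + n * ybar / sigma**2)
    return post_mean, np.sqrt(post_var)

print(normal_normal_posterior(ybar=1.278, n=4,   sigma=2, mu0=-3, sigma0=5))  # ~ (1.114, 0.981)
print(normal_normal_posterior(ybar=1.278, n=4,   sigma=2, mu0=-3, sigma0=1))  # ~ (-0.861, 0.707)
print(normal_normal_posterior(ybar=2.999, n=200, sigma=2, mu0=-3, sigma0=1))  # ~ (2.881, 0.140)
```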

  13. Data: y_1, ..., y_4 ∼ N(µ = 3, σ = 2), ȳ = 1.278. Prior: N(µ_0 = −3, σ_0 = 1). Posterior: N(µ_1 = −0.861, σ_1 = 0.707). [Figure: likelihood, prior, and posterior densities over µ]

  14. Data: y_1, ..., y_200 ∼ N(µ = 3, σ = 2), ȳ = 2.999. Prior: N(µ_0 = −3, σ_0 = 1). Posterior: N(µ_1 = 2.881, σ_1 = 0.140). [Figure: likelihood, prior, and posterior densities over µ]

  15. Prior distribution. The prior distribution allows you to “easily” incorporate your beliefs about the parameter(s) of interest. If one has a specific prior in mind, then it fits nicely into the definition of the posterior. But how do you go from prior information to a prior distribution? And what if you don’t actually have prior information?

  16. Prior distribution. The prior distribution allows you to “easily” incorporate your beliefs about the parameter(s) of interest. If one has a specific prior in mind, then it fits nicely into the definition of the posterior. But how do you go from prior information to a prior distribution? And what if you don’t actually have prior information? And what if you don’t know the likelihood??

  17. Approximate Bayesian Computation: a “likelihood-free” approach to approximating π(θ | x_obs), i.e. f(x_obs | θ) is not specified. Proceeds via simulation of the forward process. Why would we not know f(x_obs | θ)? (1) The physical model is too complex to write as a closed-form function of θ; (2) strong dependency in the data; (3) observational limitations. ABC in astronomy: Ishida et al. (2015); Akeret et al. (2015); Weyant et al. (2013); Schafer and Freeman (2012); Cameron and Pettitt (2012)

  18. [Image-only slide]

  19. Basic ABC algorithm. For the observed data x_obs and prior π(θ): (1) sample θ_prop from the prior π(θ); (2) generate x_prop from the forward process f(x | θ_prop); (3) accept θ_prop if x_obs = x_prop; (4) return to step 1. (Introduced in Pritchard et al. (1999), population genetics.)
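A minimal sketch of this exact-match version, using a toy discrete (Binomial) forward model so that x_obs = x_prop can actually occur; the model, prior, and numbers are illustrative assumptions, not from the slides.

```python
# Exact-match rejection ABC for a toy Binomial forward model.
import numpy as np

rng = np.random.default_rng(0)
n_trials, x_obs = 20, 13                              # hypothetical observed count

accepted = []
while len(accepted) < 1000:
    theta_prop = rng.uniform(0, 1)                    # 1. sample from the prior (Uniform(0,1))
    x_prop = rng.binomial(n_trials, theta_prop)       # 2. simulate from the forward process
    if x_prop == x_obs:                               # 3. accept only on an exact match
        accepted.append(theta_prop)                   # 4. repeat until enough draws are kept

print(np.mean(accepted))                              # approximate posterior mean of theta
```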

  20. Step 3: accept θ_prop if x_obs = x_prop. Waiting for proposals such that x_obs = x_prop would be computationally prohibitive. Instead, accept proposals with ∆(x_obs, x_prop) ≤ ε for some distance ∆ and some tolerance threshold ε. When x is high-dimensional, ε would have to be made too large to keep the acceptance probability reasonable. Instead, reduce the dimension by comparing summaries S(x_prop) and S(x_obs)

  21. Gaussian illustration. Data x_obs consist of 25 iid draws from Normal(µ, 1). Summary statistic: S(x) = x̄. Distance function: ∆(S(x_prop), S(x_obs)) = |x̄_prop − x̄_obs|. Tolerance: ε = 0.50 and 0.10. Prior: π(µ) = Normal(0, 10)
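A sketch of rejection ABC for exactly this setup; the true µ used to simulate x_obs and the random seed are assumptions (the slide does not state the input value).

```python
# Rejection ABC for the Gaussian illustration: 25 draws from N(mu, 1),
# summary = sample mean, distance = |mean difference|, prior mu ~ N(0, 10).
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 25, 1.0
x_obs = rng.normal(loc=0.0, scale=sigma, size=n)       # assumed input value mu = 0
s_obs = x_obs.mean()

def abc_rejection(eps, n_keep=1000):
    kept = []
    while len(kept) < n_keep:
        mu_prop = rng.normal(0.0, 10.0)                # draw from the prior
        x_prop = rng.normal(mu_prop, sigma, size=n)    # forward simulation
        if abs(x_prop.mean() - s_obs) <= eps:          # compare summaries
            kept.append(mu_prop)
    return np.array(kept)

for eps in (0.5, 0.1):
    post = abc_rejection(eps)
    print(eps, post.mean(), post.std())                # smaller eps -> closer to the true posterior
```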

  22. Gaussian illustration: posteriors for µ. [Figure: two panels showing the ABC posterior against the true posterior, with the input value marked; left panel tolerance 0.1, right panel tolerance 0.5 (N = 1000 accepted draws, n = 25 observations); density vs. µ]

  23. How to pick a tolerance ε? Instead of restarting the ABC algorithm with a smaller tolerance ε, decrease the tolerance and use the already-sampled particle system as a proposal distribution rather than drawing from the prior distribution. Particle system: (1) retained sampled values, (2) importance weights. Beaumont et al. (2009); Del Moral et al. (2011); Bonassi and West (2004)
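A compact sketch of this sequential (population Monte Carlo) scheme for the Gaussian illustration, loosely following Beaumont et al. (2009): each generation perturbs weighted particles from the previous one and reweights by prior over kernel mixture. The kernel variance, seed, and simulated input value are illustrative assumptions.

```python
# Sequential ABC (ABC-PMC) sketch, reusing the Gaussian illustration above.
import numpy as np

rng = np.random.default_rng(4)
n, sigma, prior_sd, N = 25, 1.0, 10.0, 1000
x_obs = rng.normal(0.0, sigma, size=n)                 # assumed input value mu = 0
s_obs = x_obs.mean()
eps_seq = [1.00, 0.75, 0.53, 0.38, 0.27, 0.19, 0.15, 0.11, 0.08, 0.06]  # slide 24's schedule

def prior_pdf(mu):
    return np.exp(-0.5 * (mu / prior_sd) ** 2) / np.sqrt(2 * np.pi * prior_sd**2)

particles, weights = None, None
for eps in eps_seq:
    new_particles, new_weights = np.empty(N), np.empty(N)
    # kernel sd: sqrt of twice the weighted variance of the previous particle system
    tau = None if particles is None else np.sqrt(2 * np.cov(particles, aweights=weights))
    for i in range(N):
        while True:
            if particles is None:                      # generation 0: draw from the prior
                mu_prop = rng.normal(0.0, prior_sd)
            else:                                      # later: perturb a weighted particle
                mu_prop = rng.normal(rng.choice(particles, p=weights), tau)
            x_prop = rng.normal(mu_prop, sigma, size=n)
            if abs(x_prop.mean() - s_obs) <= eps:
                break
        new_particles[i] = mu_prop
        if particles is None:
            new_weights[i] = 1.0
        else:                                          # importance weight: prior / kernel mixture
            kern = np.exp(-0.5 * ((mu_prop - particles) / tau) ** 2) / (tau * np.sqrt(2 * np.pi))
            new_weights[i] = prior_pdf(mu_prop) / np.sum(weights * kern)
    particles, weights = new_particles, new_weights / new_weights.sum()

print(np.sum(weights * particles))                     # weighted ABC posterior mean at the final tolerance
```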

  24. Gaussian illustration: sequential posteriors. [Figure: ABC posterior vs. true posterior for µ, with the input value marked (N = 1000, n = 25)]. Tolerance sequence ε_1:10: 1.00, 0.75, 0.53, 0.38, 0.27, 0.19, 0.15, 0.11, 0.08, 0.06

  25. Stellar Initial Mass Function

  26. Stellar Initial Mass Function: the distribution of star masses after a star formation event within a specified volume of space. Molecular cloud → protostars → stars. Image: adapted from http://www.astro.ljmu.ac.uk

  27. Broken power law (Kroupa, 2001): Φ(M) ∝ M^(−α_i) for M_1i ≤ M ≤ M_2i (the segment boundaries), with α_1 = 0.3 for 0.01 ≤ M/M_Sun* ≤ 0.08 [sub-stellar], α_2 = 1.3 for 0.08 ≤ M/M_Sun ≤ 0.50, and α_3 = 2.3 for 0.50 ≤ M/M_Sun ≤ M_max. Many other models exist, e.g. Salpeter (1955); Chabrier (2003). (* 1 M_Sun = 1 solar mass, the mass of our Sun)
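A sketch of this broken power law as a piecewise function (a reconstruction, not the slide’s code); the continuity constants and the upper cutoff M_max = 100 M_Sun are assumptions made only so the pieces join smoothly.

```python
# Unnormalized Kroupa broken power law, continuous across the break masses.
import numpy as np

breaks = np.array([0.01, 0.08, 0.50, 100.0])   # M/M_Sun; upper cutoff is an assumption
alphas = np.array([0.3, 1.3, 2.3])             # slide 27's exponents

def kroupa_unnorm(m):
    """Unnormalized Phi(M) ∝ M^(-alpha_i), with segments matched at each break."""
    m = np.asarray(m, dtype=float)
    c = np.ones(len(alphas))
    for i in range(1, len(alphas)):            # continuity: c_i = c_{i-1} * b^(alpha_i - alpha_{i-1})
        c[i] = c[i - 1] * breaks[i] ** (alphas[i] - alphas[i - 1])
    out = np.zeros_like(m)
    for i in range(len(alphas)):
        in_seg = (m >= breaks[i]) & (m <= breaks[i + 1])
        out[in_seg] = c[i] * m[in_seg] ** (-alphas[i])
    return out

print(kroupa_unnorm([0.05, 0.3, 1.0, 10.0]))   # relative number density at a few masses
```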

  28. ABC for the Stellar Initial Mass Function. [Figure: histogram of star masses (frequency vs. mass)]. Image (left): adapted from http://www.astro.ljmu.ac.uk

  29. IMF likelihood. Start with a power-law distribution: each star’s mass is independently drawn from a power law with density f(m) = [(1 − α) / (M_max^(1−α) − M_min^(1−α))] · m^(−α), m ∈ (M_min, M_max). Then the likelihood is L(α | m_1:n_tot) = [(1 − α) / (M_max^(1−α) − M_min^(1−α))]^(n_tot) × ∏_{i=1}^{n_tot} m_i^(−α), where n_tot = total number of stars in the cluster
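The corresponding log-likelihood is easy to evaluate directly; in the sketch below the truncation limits and test masses are illustrative assumptions, not values from the slides.

```python
# Log-likelihood of the truncated power-law IMF on slide 29.
import numpy as np

def powerlaw_loglik(alpha, m, m_min=0.08, m_max=100.0):
    """log L(alpha | m_1:n) for f(m) ∝ m^(-alpha) truncated to (m_min, m_max), alpha != 1."""
    m = np.asarray(m, dtype=float)
    log_norm = np.log((1.0 - alpha) / (m_max ** (1.0 - alpha) - m_min ** (1.0 - alpha)))
    return m.size * log_norm - alpha * np.sum(np.log(m))

masses = np.array([0.2, 0.5, 1.1, 2.4, 8.0])   # hypothetical cluster masses (M_Sun)
for a in (1.3, 2.3, 3.0):
    print(a, powerlaw_loglik(a, masses))        # compare candidate slopes
```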

  30. Observational limitations: aging. The lifecycle of a star depends on its mass → more massive stars die faster. A cluster age of τ Myr → only observe stars with masses < T_age ≈ τ^(−2/5) × 10^(8/5). If the age is 30 Myr, the aging cutoff is T_age ≈ 10 M_Sun. Then the likelihood is L(α | m_1:n_obs, n_tot) = [(1 − α) / (T_age^(1−α) − M_min^(1−α))]^(n_obs) × ∏_{i=1}^{n_obs} m_i^(−α) × P(M > T_age)^(n_tot − n_obs), where n_tot = # of stars in the cluster and n_obs = # of stars observed in the cluster. Image: http://scioly.org
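A sketch of this aged likelihood in code. The slide does not spell out how P(M > T_age) is normalized; below it is computed under the full power law on (M_min, M_max), and that choice, together with M_min, M_max, T_age, and the test masses, is an assumption.

```python
# Aging-truncated IMF log-likelihood, following the form on slide 30.
import numpy as np

def aged_loglik(alpha, m_obs, n_tot, t_age=10.0, m_min=0.08, m_max=100.0):
    m_obs = np.asarray(m_obs, dtype=float)
    n_obs = m_obs.size
    # observed-star term, normalized up to the aging cutoff T_age (as on the slide)
    log_norm = np.log((1.0 - alpha) / (t_age ** (1.0 - alpha) - m_min ** (1.0 - alpha)))
    obs_term = n_obs * log_norm - alpha * np.sum(np.log(m_obs))
    # censoring term for the n_tot - n_obs stars too massive to still be observable
    # (assumption: P(M > T_age) taken under the full power law on (M_min, M_max))
    p_above = (m_max ** (1.0 - alpha) - t_age ** (1.0 - alpha)) / \
              (m_max ** (1.0 - alpha) - m_min ** (1.0 - alpha))
    return obs_term + (n_tot - n_obs) * np.log(p_above)

masses = np.array([0.2, 0.5, 1.1, 2.4, 8.0])   # hypothetical observed masses (M_Sun)
print(aged_loglik(alpha=2.3, m_obs=masses, n_tot=8))
```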

  31. Observational limitations: completeness. Completeness function: P(observing star | m) = 0 for m < C_min; (m − C_min)/(C_max − C_min) for m ∈ [C_min, C_max]; 1 for m > C_max. This is the probability of observing a particular star given its mass; it depends on the flux limit, stellar crowding, etc. Image: NASA, J. Trauger (JPL), J. Westphal (Caltech)
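The completeness function is a simple ramp and can be written in one line; the C_min and C_max values below are illustrative assumptions.

```python
# Completeness function from slide 31: 0 below C_min, linear on [C_min, C_max], 1 above.
import numpy as np

def completeness(m, c_min=0.2, c_max=0.5):
    """P(observing a star | mass m), as a clipped linear ramp."""
    m = np.asarray(m, dtype=float)
    return np.clip((m - c_min) / (c_max - c_min), 0.0, 1.0)

print(completeness([0.1, 0.35, 0.8]))   # -> [0.0, 0.5, 1.0]
```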
