Bayesian statistics


  1. Bayesian statistics DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17 Carlos Fernandez-Granda

  2. Frequentist vs Bayesian statistics. In frequentist statistics the data are modeled as realizations from a distribution that depends on deterministic parameters. In Bayesian statistics the parameters are modeled as random variables. This allows us to quantify our prior uncertainty and to incorporate additional information.

  3. Outline: ◮ Learning Bayesian models ◮ Conjugate priors ◮ Bayesian estimators

  4. Prior distribution and likelihood. The data $\vec{x} \in \mathbb{R}^n$ are a realization of a random vector $\vec{X}$, which depends on a vector of parameters $\vec{\Theta}$. Modeling choices: ◮ Prior distribution: the distribution of $\vec{\Theta}$, encoding our uncertainty about the model before seeing the data. ◮ Likelihood: the conditional distribution of $\vec{X}$ given $\vec{\Theta}$.

  5. Posterior distribution. The posterior distribution is the conditional distribution of $\vec{\Theta}$ given $\vec{X}$. Evaluating the posterior at the data $\vec{x}$ allows us to update our uncertainty about $\vec{\Theta}$ using the data.

  6. Bernoulli distribution. Goal: estimating the Bernoulli parameter from iid data. We consider two different Bayesian estimators, $\Theta_1$ and $\Theta_2$: (1) $\Theta_1$ is a conservative estimator with a uniform prior pdf, $f_{\Theta_1}(\theta) = 1$ for $0 \le \theta \le 1$ and $0$ otherwise; (2) $\Theta_2$ has a prior pdf skewed towards 1, $f_{\Theta_2}(\theta) = 2\theta$ for $0 \le \theta \le 1$ and $0$ otherwise.

  7. Prior distributions. [Figure: the two prior pdfs $f_{\Theta_1}$ (uniform) and $f_{\Theta_2}$ (skewed towards 1) plotted over $[0, 1]$.]
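  As a quick illustration, here is a minimal sketch that evaluates and plots the two prior pdfs (assuming NumPy and Matplotlib are available; the variable names are ours, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 200)

# Uniform prior for the conservative estimator Theta_1: f(theta) = 1 on [0, 1]
prior_1 = np.ones_like(theta)

# Prior skewed towards 1 for Theta_2: f(theta) = 2 * theta on [0, 1]
prior_2 = 2 * theta

plt.plot(theta, prior_1, label=r"$f_{\Theta_1}$ (uniform)")
plt.plot(theta, prior_2, label=r"$f_{\Theta_2}$ (skewed)")
plt.xlabel(r"$\theta$")
plt.legend()
plt.show()
```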

  8-9. Bernoulli distribution: likelihood. The data are assumed to be iid, so the likelihood is
  $$p_{\vec{X} \mid \Theta}(\vec{x} \mid \theta) = \theta^{n_1} (1 - \theta)^{n_0},$$
  where $n_0$ is the number of zeros and $n_1$ the number of ones.
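  Because the data are iid, the likelihood depends on the data only through the counts $n_0$ and $n_1$. A short sketch makes this concrete (hypothetical helper name, assuming NumPy):

```python
import numpy as np

def bernoulli_likelihood(theta, x):
    """Likelihood of iid Bernoulli data x at parameter theta:
    theta**n1 * (1 - theta)**n0, where n1 = #ones and n0 = #zeros."""
    x = np.asarray(x)
    n1 = int(x.sum())
    n0 = x.size - n1
    return theta**n1 * (1 - theta)**n0

# Example: three ones and one zero, evaluated at theta = 0.75
print(bernoulli_likelihood(0.75, [1, 1, 0, 1]))  # 0.75**3 * 0.25
```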

  10-14. Bernoulli distribution: posterior distribution (uniform prior).
  $$f_{\Theta_1 \mid \vec{X}}(\theta \mid \vec{x}) = \frac{f_{\Theta_1}(\theta)\, p_{\vec{X} \mid \Theta_1}(\vec{x} \mid \theta)}{p_{\vec{X}}(\vec{x})} = \frac{f_{\Theta_1}(\theta)\, p_{\vec{X} \mid \Theta_1}(\vec{x} \mid \theta)}{\int_u f_{\Theta_1}(u)\, p_{\vec{X} \mid \Theta_1}(\vec{x} \mid u)\, \mathrm{d}u} = \frac{\theta^{n_1} (1 - \theta)^{n_0}}{\int_u u^{n_1} (1 - u)^{n_0}\, \mathrm{d}u} = \frac{\theta^{n_1} (1 - \theta)^{n_0}}{\beta(n_1 + 1, n_0 + 1)},$$
  where $\beta(a, b) := \int_u u^{a-1} (1 - u)^{b-1}\, \mathrm{d}u$.

  15-18. Bernoulli distribution: posterior distribution (skewed prior).
  $$f_{\Theta_2 \mid \vec{X}}(\theta \mid \vec{x}) = \frac{f_{\Theta_2}(\theta)\, p_{\vec{X} \mid \Theta_2}(\vec{x} \mid \theta)}{p_{\vec{X}}(\vec{x})} = \frac{\theta^{n_1 + 1} (1 - \theta)^{n_0}}{\int_u u^{n_1 + 1} (1 - u)^{n_0}\, \mathrm{d}u} = \frac{\theta^{n_1 + 1} (1 - \theta)^{n_0}}{\beta(n_1 + 2, n_0 + 1)}.$$
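  The normalization constants derived above show that both posteriors are beta pdfs, so they can be evaluated directly with scipy.stats.beta. A sketch, with the counts chosen arbitrarily for illustration:

```python
import numpy as np
from scipy.stats import beta

n0, n1 = 1, 3  # counts of zeros and ones in the data (arbitrary example)
theta = np.linspace(0, 1, 200)

# Posterior under the uniform prior: beta with a = n1 + 1, b = n0 + 1
posterior_1 = beta.pdf(theta, n1 + 1, n0 + 1)

# Posterior under the prior skewed towards 1: beta with a = n1 + 2, b = n0 + 1
posterior_2 = beta.pdf(theta, n1 + 2, n0 + 1)
```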

  19. Bernoulli distribution: $n_0 = 1$, $n_1 = 3$. [Figure: posterior pdfs under the two priors over $[0, 1]$.]

  20. Bernoulli distribution: $n_0 = 3$, $n_1 = 1$. [Figure: posterior pdfs under the two priors over $[0, 1]$.]

  21. Bernoulli distribution: $n_0 = 91$, $n_1 = 9$. [Figure: posterior pdfs, posterior means under the uniform and skewed priors, and the ML estimator.]
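  The three estimates marked in the figure have closed forms, since the mean of a beta($a$, $b$) distribution is $a / (a + b)$. A quick check:

```python
n0, n1 = 91, 9

ml = n1 / (n0 + n1)                      # ML estimate: 0.09
mean_uniform = (n1 + 1) / (n0 + n1 + 2)  # posterior mean, uniform prior: ~0.098
mean_skewed = (n1 + 2) / (n0 + n1 + 3)   # posterior mean, skewed prior: ~0.107
```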

  22. Outline: ◮ Learning Bayesian models ◮ Conjugate priors ◮ Bayesian estimators

  23. Beta random variable. Useful in Bayesian statistics: a unimodal continuous distribution on the unit interval. The pdf of a beta distribution with parameters $a$ and $b$ is defined as
  $$f_\beta(\theta; a, b) := \begin{cases} \dfrac{\theta^{a-1} (1 - \theta)^{b-1}}{\beta(a, b)} & \text{if } 0 \le \theta \le 1, \\ 0 & \text{otherwise}, \end{cases}$$
  where $\beta(a, b) := \int_u u^{a-1} (1 - u)^{b-1}\, \mathrm{d}u$.

  24. Beta random variables. [Figure: beta pdfs $f_X(x)$ over $[0, 1]$ for $(a, b) = (1, 1), (1, 2), (3, 3), (6, 2), (3, 15)$.]
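  A sketch reproducing the pdfs in the figure with scipy.stats.beta (assuming Matplotlib for the plot):

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 300)
# Parameter pairs shown in the figure above
for a, b in [(1, 1), (1, 2), (3, 3), (6, 2), (3, 15)]:
    plt.plot(x, beta.pdf(x, a, b), label=f"a = {a}, b = {b}")
plt.xlabel("x")
plt.ylabel(r"$f_X(x)$")
plt.legend()
plt.show()
```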

  25. Learning a Bernoulli distribution. The first prior is beta with parameters $a = 1$ and $b = 1$. The second prior is beta with parameters $a = 2$ and $b = 1$. The posteriors are beta with parameters $a = n_1 + 1$, $b = n_0 + 1$ and $a = n_1 + 2$, $b = n_0 + 1$, respectively.

  26. Conjugate priors. A conjugate family of distributions for a certain likelihood satisfies the following property: if the prior belongs to the family, the posterior also belongs to the family. Beta distributions are conjugate priors when the likelihood is binomial.

  27-32. The beta distribution is conjugate to the binomial likelihood. Suppose $\Theta$ is beta with parameters $a$ and $b$, and $X$ is binomial with parameters $n$ and $\Theta$. Then
  $$f_{\Theta \mid X}(\theta \mid x) = \frac{f_\Theta(\theta)\, p_{X \mid \Theta}(x \mid \theta)}{p_X(x)} = \frac{f_\Theta(\theta)\, p_{X \mid \Theta}(x \mid \theta)}{\int_u f_\Theta(u)\, p_{X \mid \Theta}(x \mid u)\, \mathrm{d}u} = \frac{\frac{\theta^{a-1} (1 - \theta)^{b-1}}{\beta(a, b)} \binom{n}{x} \theta^x (1 - \theta)^{n-x}}{\int_u \frac{u^{a-1} (1 - u)^{b-1}}{\beta(a, b)} \binom{n}{x} u^x (1 - u)^{n-x}\, \mathrm{d}u} = \frac{\theta^{x + a - 1} (1 - \theta)^{n - x + b - 1}}{\int_u u^{x + a - 1} (1 - u)^{n - x + b - 1}\, \mathrm{d}u} = f_\beta(\theta; x + a, n - x + b).$$
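  One way to sanity-check this conjugacy result is numerical: normalize the product of prior and likelihood on a grid and compare it to the predicted beta pdf. A sketch, with arbitrarily chosen parameters:

```python
import numpy as np
from scipy.stats import beta, binom
from scipy.integrate import trapezoid

a, b, n, x = 3.0, 5.0, 20, 7  # hypothetical prior parameters and observation
theta = np.linspace(0.001, 0.999, 500)

# Unnormalized posterior: beta prior pdf times binomial likelihood
unnorm = beta.pdf(theta, a, b) * binom.pmf(x, n, theta)
posterior = unnorm / trapezoid(unnorm, theta)  # normalize numerically

# Conjugacy predicts a beta(x + a, n - x + b) posterior
predicted = beta.pdf(theta, x + a, n - x + b)
print(np.max(np.abs(posterior - predicted)))  # ~0, up to quadrature error
```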

  33. Poll in New Mexico. 429 participants: 227 people intend to vote for Clinton and 202 for Trump. What is the probability that Trump wins in New Mexico? Assumptions: ◮ The fraction of Trump voters is modeled as a random variable $\Theta$. ◮ Poll participants are selected uniformly at random with replacement. ◮ The number of Trump voters in the poll is binomial with parameters $n = 429$ and $p = \Theta$.

  34. Poll in New Mexico. ◮ The prior is uniform, i.e. beta with parameters $a = 1$ and $b = 1$. ◮ The likelihood is binomial. ◮ The posterior is beta with parameters $a = 202 + 1$ and $b = 227 + 1$. ◮ The probability that Trump wins in New Mexico is the posterior probability that $\Theta$ is greater than 0.5.
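  This posterior tail probability is one line with scipy.stats.beta; a sketch:

```python
from scipy.stats import beta

# Posterior over the fraction of Trump voters: beta(202 + 1, 227 + 1)
posterior = beta(203, 228)

# Probability that Trump wins, i.e. that Theta exceeds 0.5
print(posterior.sf(0.5))  # ~0.114
```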

  35. Poll in New Mexico. [Figure: posterior pdf of $\Theta$ over $[0.35, 0.60]$; 88.6% of the posterior probability lies below 0.5 and 11.4% above.]

  36. Outline: ◮ Learning Bayesian models ◮ Conjugate priors ◮ Bayesian estimators

  37. Bayesian estimators. What estimator should we use? Two main options: ◮ the posterior mean ◮ the posterior mode

  38. Posterior mean. The posterior mean is the mean of the posterior distribution:
  $$\theta_{\mathrm{MMSE}}(\vec{x}) := \mathrm{E}\!\left[\Theta \mid \vec{X} = \vec{x}\right].$$
  This is the minimum mean-square-error (MMSE) estimate: for any arbitrary estimator $\theta_{\mathrm{other}}(\vec{x})$,
  $$\mathrm{E}\!\left[\left(\theta_{\mathrm{other}}(\vec{X}) - \Theta\right)^2\right] \ge \mathrm{E}\!\left[\left(\theta_{\mathrm{MMSE}}(\vec{X}) - \Theta\right)^2\right].$$

  39-41. Posterior mean: proof of MMSE optimality. Conditioning on $\vec{X} = \vec{x}$,
  $$\mathrm{E}\!\left[\left(\theta_{\mathrm{other}}(\vec{X}) - \Theta\right)^2 \,\middle|\, \vec{X} = \vec{x}\right] = \mathrm{E}\!\left[\left(\theta_{\mathrm{other}}(\vec{X}) - \theta_{\mathrm{MMSE}}(\vec{X}) + \theta_{\mathrm{MMSE}}(\vec{X}) - \Theta\right)^2 \,\middle|\, \vec{X} = \vec{x}\right]$$
  $$= \left(\theta_{\mathrm{other}}(\vec{x}) - \theta_{\mathrm{MMSE}}(\vec{x})\right)^2 + \mathrm{E}\!\left[\left(\theta_{\mathrm{MMSE}}(\vec{X}) - \Theta\right)^2 \,\middle|\, \vec{X} = \vec{x}\right] + 2\left(\theta_{\mathrm{other}}(\vec{x}) - \theta_{\mathrm{MMSE}}(\vec{x})\right)\left(\theta_{\mathrm{MMSE}}(\vec{x}) - \mathrm{E}\!\left[\Theta \mid \vec{X} = \vec{x}\right]\right).$$
  The cross term vanishes because $\theta_{\mathrm{MMSE}}(\vec{x}) = \mathrm{E}[\Theta \mid \vec{X} = \vec{x}]$, so the conditional mean-square error of $\theta_{\mathrm{other}}$ exceeds that of $\theta_{\mathrm{MMSE}}$ by a nonnegative term; taking the expectation over $\vec{X}$ proves the claim.
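  The optimality claim can also be checked by Monte Carlo: draw $\Theta$ from the prior, simulate data, and compare the mean-square errors of the posterior mean and a competing estimator (here the ML estimate). A sketch assuming NumPy, with a uniform prior:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000

# Draw Theta from the uniform prior, then n iid Bernoulli(Theta) observations
theta = rng.uniform(size=trials)
n1 = rng.binomial(n, theta)  # number of ones in each trial

mmse = (n1 + 1) / (n + 2)    # posterior mean under the uniform prior
ml = n1 / n                  # ML estimate, a competing estimator

print(np.mean((mmse - theta) ** 2))  # smaller mean-square error
print(np.mean((ml - theta) ** 2))
```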
