Section 4: Statistics and Inference


  1. Mathematical Tools for Neural and Cognitive Science, Fall semester, 2016. Section 4: Statistics and Inference. Probability: an abstract mathematical framework for describing random quantities (e.g. measurements). Statistics: use of probability to summarize, analyze, and interpret data. Fundamental to all experimental science.

  2. Statistics as a form of summary. Given data such as 0, 1, 0, 0, 0, 1, 0, 1, ... : "The purpose of statistics is to replace a quantity of data by relatively few quantities which shall ... contain as much as possible, ideally the whole, of the relevant information contained in the original data." - R.A. Fisher, 1934. Statistics for data summary:
  • Sample average (minimizes mean squared error)
  • Sample median (minimizes mean absolute deviation)
  • Least-squares regression - summarizes relationships between controlled and measured quantities
  • TLS regression - summarizes relationships between measured quantities
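A minimal numerical check of the two claims above (not from the slides; the exponential sample is an arbitrary illustrative choice): the sample average minimizes mean squared error, and the sample median minimizes mean absolute deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)   # any skewed sample works

# scan candidate summary values and measure both error criteria
candidates = np.linspace(data.min(), data.max(), 2001)
mse = [np.mean((data - c) ** 2) for c in candidates]
mad = [np.mean(np.abs(data - c)) for c in candidates]

print("argmin MSE:", candidates[np.argmin(mse)], " sample mean  :", data.mean())
print("argmin MAD:", candidates[np.argmin(mad)], " sample median:", np.median(data))
```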

  3. - Efron & Tibshirani, Introduction to the Bootstrap Scientific process Observe / Measure Generate predictions, Summarize, and Design experiment compare with expectations Create/modify Hypothesis/model

  4. Probability basics
  • discrete probability distributions
  • continuous probability densities
  • cumulative distributions
  • translation and scaling of distributions (adding or multiplying by a constant)
  • monotonic nonlinear transformations
  • drawing samples from a distribution via inverse cumulative mapping (see the sketch after this list)
  • example densities/distributions [on board]
  Example distributions [figures]: a not-quite-fair coin; roll of a fair die; sum of two rolled fair dice; clicks of a Geiger counter in a fixed time interval; time between clicks; horizontal velocity of gas molecules exiting a fan.
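A minimal sketch (not from the slides) of sampling via inverse cumulative mapping: if U is uniform on [0, 1] and F is a CDF, then F⁻¹(U) is distributed according to F. Here F is the exponential CDF, F(x) = 1 − exp(−x/τ), so F⁻¹(u) = −τ log(1 − u); the choice of the exponential is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 2.0
u = rng.uniform(size=100_000)          # uniform samples on [0, 1]
samples = -tau * np.log(1.0 - u)       # pass through the inverse CDF

print("sample mean (should be near tau):", samples.mean())
```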

  5. Multi-dimensional random variables
  • Joint distributions
  • Marginals (integrating)
  • Conditionals (slicing)
  • Bayes' Rule (inverting)
  • Statistical independence
  Joint distribution p(x, y) [figure].

  6. Marginal distribution of the joint p(x, y):
  p(x) = ∫ p(x, y) dy
  Generalized marginal distribution, using vector notation: the marginal of p(x⃗) along a unit direction û is
  p(z) = ∫_{x⃗·û = z} p(x⃗) dx⃗

  7. Conditional distribution p(x | y = 68): slice the joint distribution, then normalize (by the marginal):
  p(x | y = 68) = p(x, y = 68) / ∫ p(x, y = 68) dx = p(x, y = 68) / p(y = 68)
  More generally: p(x | y) = p(x, y) / p(y)
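A minimal numerical sketch (not from the slides) of both operations, on a discretized joint p(x, y) stored as a 2-D array: marginalize by summing over y, condition by slicing at a value of y and renormalizing. The particular joint used here is a hypothetical correlated-Gaussian-shaped surface, chosen only for illustration.

```python
import numpy as np

x = np.linspace(-3, 3, 121)
y = np.linspace(-3, 3, 121)
X, Y = np.meshgrid(x, y, indexing="ij")

joint = np.exp(-(X**2 - 1.2 * X * Y + Y**2))   # unnormalized joint density
joint /= joint.sum()                           # normalize so it sums to 1

p_x = joint.sum(axis=1)                        # marginal: integrate (sum) over y

iy = np.argmin(np.abs(y - 0.5))                # condition on y ≈ 0.5
slice_xy = joint[:, iy]                        # slice the joint at that y
p_x_given_y = slice_xy / slice_xy.sum()        # normalize by the marginal p(y)

print(p_x.sum(), p_x_given_y.sum())            # both should be 1
```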

  8. Bayes' Rule:
  p(x | y) = p(y | x) p(x) / p(y)
  (a direct consequence of the definition of conditional probability)
  Conditional vs. marginal: p(x | Y = 120) vs. p(x). In general, these differ. When are they the same? In particular, when are all conditionals equal to the marginal?

  9. Statistical independence. Variables x and y are statistically independent if (and only if):
  p(x, y) = p(x) p(y)
  Independence implies that all conditionals are equal to the corresponding marginal:
  p(y | x) = p(y, x) / p(x) = p(y), for all x
  Uncorrelated doesn't mean independent: statistical independence is a stronger condition than uncorrelatedness. All independent variables are uncorrelated, but not all uncorrelated variables are independent - samples can have correlation r = 0 yet exhibit a clear nonlinear dependence (see the sketch below).
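A minimal sketch (not from the slides) of "uncorrelated does not imply independent": with x symmetric about 0 and y = x², the correlation is approximately 0 even though y is a deterministic function of x, so the two variables are clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)
y = x**2                               # fully determined by x

r = np.corrcoef(x, y)[0, 1]
print("correlation r ≈", round(r, 3))  # ≈ 0, despite full dependence
```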

  10. Expected value:
  E(x) = ∫ x p(x) dx   [the mean, μ]
  E(x²) = ∫ x² p(x) dx   [the "second moment"]
  E((x − μ)²) = ∫ (x − μ)² p(x) dx = ∫ x² p(x) dx − μ²   [the variance, σ²]
  E(f(x)) = ∫ f(x) p(x) dx   [note: an inner product, and thus linear!]
  Mean and (co)variance
  • One-D: mean and variance summarize centroid/width
  • translation and rescaling of random variables
  • nonlinear transformations - "warping"
  • Multi-D: vector mean and covariance matrix, elliptical geometry
  • Mean/variance of weighted sums of random variables
  • The sample average ...
    - ... converges to the true mean (except for bizarre distributions)
    - ... with variance σ²/N for N independent samples (see the sketch after this list)
    - ... most common choice for an estimate of the mean
  • Correlation
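A minimal sketch (not from the slides) of the sample-average claims above: the average of N independent draws has mean μ and variance σ²/N, checked by repeating the experiment many times and looking at the spread of the resulting averages.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, N, repeats = 5.0, 2.0, 100, 20_000

averages = rng.normal(mu, sigma, size=(repeats, N)).mean(axis=1)

print("mean of averages      :", averages.mean())    # ≈ mu
print("variance of averages  :", averages.var())     # ≈ sigma**2 / N
print("predicted sigma**2 / N:", sigma**2 / N)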

  11. Distribution of a sum of independent R.V.'s - the return of convolution. The Central Limit Theorem [on board].
  Central limit for a uniform distribution [figures]: histograms of 10^4 samples of a uniform density (σ = 1); of (u + u)/√2; of (u + u + u + u)/√4; and of 10 u's divided by √10 - the histograms look progressively more Gaussian.
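A minimal sketch (not from the slides) of this central-limit demo: normalized sums of n uniform variables keep unit variance while their excess kurtosis shrinks toward 0, the Gaussian value.

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(10_000, 10))   # each column has sigma = 1

for n in (1, 2, 4, 10):
    s = u[:, :n].sum(axis=1) / np.sqrt(n)          # normalized sum of n uniforms
    kurt = np.mean(s**4) / np.mean(s**2)**2 - 3    # excess kurtosis → 0 for a Gaussian
    print(f"n = {n:2d}  std = {s.std():.3f}  excess kurtosis = {kurt:+.3f}")
```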

  12. Central limit for a binary distribution [figures]: histograms of one coin, and of the average of 4, 16, 64, and 256 coins - again the histograms approach a Gaussian shape.
  The Gaussian
  • parameterized by mean and stdev (position / width)
  • joint density of two indep Gaussian RVs is circular! [easy]
  • product of two Gaussians is Gaussian! [easy]
  • conditionals of a Gaussian are Gaussian! [easy]
  • sum of Gaussian RVs is Gaussian! [moderate]
  • marginals of a Gaussian are Gaussian! [moderate]
  • central limit theorem: sum of many RVs is Gaussian! [hard]
  • most random (max entropy) density with this variance! [moderate]

  13. Gaussian densities [figure]: a two-dimensional Gaussian density with mean [0.2, 0.8] and covariance [1.0, -0.3; -0.3, 0.4].

  14. Product of Gaussians is Gaussian:
  p(x | y) ∝ p(y | x) p(x)
          ∝ exp(−(x − y)² / (2σ_n²)) · exp(−(x − μ_x)² / (2σ_x²))
          = exp(−½ [ (1/σ_n² + 1/σ_x²) x² − 2 (y/σ_n² + μ_x/σ_x²) x + ... ])
  Completing the square shows that this posterior is also Gaussian, with mean a weighted average of y and μ_x (weighted by the inverse variances!) and with variance 1 / (1/σ_n² + 1/σ_x²).
  Multivariate case: let P = C⁻¹ (known as the "precision" matrix), with x⃗ ∼ N(μ⃗, C), partitioned into (x₁, x₂).
  Conditional: p(x₁ | x₂) is Gaussian, with mean μ₁ − P₁₁⁻¹ P₁₂ (x₂ − μ₂) and covariance P₁₁⁻¹.
  Marginal: p(x₁) = ∫ p(x⃗) dx₂ is Gaussian, with mean μ₁ and covariance C₁₁.
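A minimal numerical sketch (not from the slides) of the scalar case above: combining a Gaussian prior N(μ_x, σ_x²) with a Gaussian likelihood centered on a measurement y (noise variance σ_n²). The posterior precision is the sum of the precisions, and the posterior mean is the inverse-variance-weighted average; the specific numbers are arbitrary.

```python
import numpy as np

mu_x, sigma_x = 0.0, 2.0      # prior
y, sigma_n = 3.0, 1.0         # measurement and noise std

post_prec = 1.0 / sigma_n**2 + 1.0 / sigma_x**2   # precisions add
post_var = 1.0 / post_prec
post_mean = post_var * (y / sigma_n**2 + mu_x / sigma_x**2)   # inverse-variance weighting

print("posterior mean:", post_mean)        # pulled from y toward the prior mean
print("posterior std :", np.sqrt(post_var))
```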

  15. Generalized marginals of a Gaussian: project onto a unit direction û, z = û^T x⃗. Then p(z) is Gaussian, with:
  μ_z = û^T μ⃗,   σ_z² = û^T C û
  Measurement (sampling) and inference [figure: true density vs. 700 samples]:
  true mean: [0, 0.8]   sample mean: [-0.05, 0.83]
  true cov: [1.0, -0.25; -0.25, 0.3]   sample cov: [0.95, -0.23; -0.23, 0.29]
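A minimal sketch (not from the slides) reproducing this comparison: draw 700 samples from a 2-D Gaussian with the true mean and covariance above, compare sample estimates with the true values, and check the generalized-marginal variance û^T C û against the sample variance of the projected data.

```python
import numpy as np

rng = np.random.default_rng(5)
true_mean = np.array([0.0, 0.8])
true_cov = np.array([[1.0, -0.25],
                     [-0.25, 0.3]])

samples = rng.multivariate_normal(true_mean, true_cov, size=700)
print("sample mean:", samples.mean(axis=0))
print("sample cov :\n", np.cov(samples, rowvar=False))

# generalized marginal along a unit direction u_hat: z = u_hat . x is Gaussian
u_hat = np.array([1.0, 1.0]) / np.sqrt(2)
z = samples @ u_hat
print("predicted var of z:", u_hat @ true_cov @ u_hat, " sample var:", z.var(ddof=1))
```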

  16. Point Estimates
  • Estimator: any function of the data, intended to represent the best approximation of the true value of a parameter
  • Most common estimator is the sample average
  • Statistically-motivated examples (see the sketch after this list):
    - Maximum likelihood (ML): x̂ = argmax_x p(y | x)
    - Maximum a posteriori (MAP): x̂ = argmax_x p(x | y), with p(x | y) proportional to p(x) p(y | x)
    - Minimum mean squared error (MMSE): x̂ = E(x | y)
  • Why must both prior and likelihood be taken into account?
  • Why doesn't the data dominate? When would it? When would the prior dominate?
  • What if prior and likelihood are incompatible?
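A minimal sketch (not from the slides) comparing ML, MAP, and MMSE estimates of a coin's heads probability x after observing H heads and T tails, computed on a grid over x. The prior used here (peaked near x = 0.5, a "suspect fair" prior) and the counts H = 7, T = 3 are hypothetical illustrative choices.

```python
import numpy as np

x = np.linspace(0.001, 0.999, 999)
dx = x[1] - x[0]
H, T = 7, 3

likelihood = x**H * (1 - x)**T
prior = np.exp(-0.5 * ((x - 0.5) / 0.1)**2)   # hypothetical "suspect fair" prior
posterior = prior * likelihood
posterior /= posterior.sum() * dx             # normalize on the grid

ml = x[np.argmax(likelihood)]                 # argmax of the likelihood
map_ = x[np.argmax(posterior)]                # argmax of the posterior
mmse = (x * posterior).sum() * dx             # posterior mean

print(f"ML = {ml:.3f}   MAP = {map_:.3f}   MMSE = {mmse:.3f}")
```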

  17. [Figure: grid of coin-flip likelihoods and posteriors.] Likelihood for 1 head; likelihood for 1 tail. Posteriors p(x | H, T), assuming a flat prior p(x) = 1, arranged in a grid: more tails (T = 0, 1, 2, 3) along one axis, more heads (H = 0, 1, 2, 3) along the other.

  18. Example: infer whether a coin is fair by flipping it repeatedly. Here, x is the probability of heads (50% is fair), and y_1, ..., y_n are the outcomes of the flips. Consider three different priors: "suspect fair", "suspect biased", and "no idea" (uncertain). For each: prior × likelihood (heads) = posterior.

  19. Updating continues flip by flip: previous posterior × likelihood (heads) = new posterior; previous posterior × likelihood (tails) = new posterior. (A sketch of this sequential updating follows.)
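A minimal sketch (not from the slides) of this sequential updating: three hypothetical priors over the heads probability x ("suspect fair", "suspect biased", "no idea"), updated one flip at a time; after each flip the posterior becomes the prior for the next. The flip sequence is made up for illustration.

```python
import numpy as np

x = np.linspace(0.001, 0.999, 999)
dx = x[1] - x[0]

priors = {
    "suspect fair":   np.exp(-0.5 * ((x - 0.5) / 0.05)**2),
    "suspect biased": np.exp(-0.5 * ((x - 0.9) / 0.05)**2),
    "no idea":        np.ones_like(x),
}
flips = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]      # 1 = heads, 0 = tails (hypothetical data)

for name, p in priors.items():
    post = p / (p.sum() * dx)
    for f in flips:
        post = post * (x if f == 1 else (1 - x))   # multiply by the per-flip likelihood
        post /= post.sum() * dx                    # renormalize
    print(f"{name:15s} posterior mean = {(x * post).sum() * dx:.3f}")
```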

  20. Posteriors after observing 75 heads and 25 tails → prior differences are ultimately overwhelmed by the data.
  Confidence [figures]: posterior PDFs and CDFs for 2H/1T, 10H/5T, and 20H/10T; the CDFs are used to read off central 95% intervals (between the .025 and .975 quantiles), e.g. [.19, .93] and [.49, .80], which narrow as data accumulate.

  21. Bias & Variance • MSE = bias^2 + variance • Bias is difficult to assess (since requires knowing the “true” value). But variance is easier. • Classical statistics generally aims for an unbiased estimator, with minimal variance • The MLE is asymptotically unbiased (under fairly general conditions), but this is only useful if - the likelihood model is correct - the optimum can be computed - you have enough data • More general/modern view: estimation is about trading off bias and variance, through model selection, “regularization”, or Bayesian priors. Optimization... Heuristics, exhaustive search, (pain & suffering) Smooth (C 2 ) Iterative descent, Convex (possibly) nonunique Quadratic Iterative descent, unique Closed-form, and unique statAnMod - 9/12/07 - E.P. Simoncelli
