II.2 Statistical Inference: Sampling and Estimation

A statistical model M is a set of distributions (or regression functions), e.g., all uni-modal, smooth distributions. M is called a parametric model if it can be completely described by a finite number of parameters, e.g., the family of Normal distributions with parameters μ and σ²:

$f_X(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$  for  $\mu \in \mathbb{R}, \; \sigma > 0$
Statistical Inference

Given a parametric model M and a sample X_1,...,X_n, how do we infer (learn) the parameters of M?

For multivariate models with observed variable X and "outcome (response)" variable Y, this is called prediction or regression; for a discrete outcome variable it is also called classification.

r(x) = E[Y | X=x] is called the regression function.
Idea of Sampling

Distribution X (e.g., a population, objects of interest)  →  samples X_1,…,X_n drawn from X (e.g., people, objects)  →  statistical inference: what can we say about X based on X_1,…,X_n?

Distribution parameters: mean μ_X, variance σ²_X, size N.  Sample parameters: mean X̄, variance S²_X, size n.

Example: Suppose we want to estimate the average salary of employees in German companies.
Sample 1: Suppose we look at the n=200 top-paid CEOs of major banks.
Sample 2: Suppose we look at n=100 employees across all kinds of companies.
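A minimal simulation sketch (not from the slides; hypothetical salary figures, NumPy assumed available) illustrating why Sample 1 is misleading: a sample restricted to top earners estimates a very different mean than a random sample from the whole population.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 1,000,000 annual salaries (log-normal, heavy right tail).
population = rng.lognormal(mean=10.5, sigma=0.5, size=1_000_000)

# Sample 1: only the 200 highest-paid individuals (a biased sample).
sample1 = np.sort(population)[-200:]

# Sample 2: 100 individuals drawn uniformly at random from the population.
sample2 = rng.choice(population, size=100, replace=False)

print("true population mean:", population.mean())
print("estimate from biased sample 1:", sample1.mean())   # far too high
print("estimate from random sample 2:", sample2.mean())   # close to the truth
```

The point is selection bias: how the sample is drawn matters at least as much as its size.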
Basic Types of Statistical Inference

Given a set of iid. samples X_1,...,X_n ~ X of an unknown distribution X,
e.g., n single-coin-toss experiments X_1,...,X_n ~ X: Bernoulli(p).

• Parameter Estimation
  e.g.: what is the parameter p of X: Bernoulli(p)?
        what is E[X], the cdf F_X of X, the pdf f_X of X, etc.?
• Confidence Intervals
  e.g.: give me all values C=(a,b) such that P(p ∈ C) ≥ 0.95,
        where a and b are derived from the samples X_1,...,X_n.
• Hypothesis Testing
  e.g.: H_0: p = 1/2  vs.  H_1: p ≠ 1/2
Statistical Estimators

A point estimator $\hat{\theta}_n$ for a parameter θ of a prob. distribution X is a random variable derived from an iid. sample X_1,...,X_n.

Examples:
Sample mean: $\bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_i$
Sample variance: $S_X^2 := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

An estimator $\hat{\theta}_n$ for parameter θ is unbiased if $E[\hat{\theta}_n] = \theta$; otherwise the estimator has bias $E[\hat{\theta}_n] - \theta$.

An estimator on a sample of size n is consistent if $\lim_{n \to \infty} P[\,|\hat{\theta}_n - \theta| \le \epsilon\,] = 1$ for any ε > 0.

Sample mean and sample variance are unbiased and consistent estimators of μ_X and σ²_X.
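A short NumPy sketch (illustrative, not part of the slides) computing the two estimators and checking unbiasedness empirically by averaging them over many independent samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n = 5.0, 4.0, 30          # true parameters of X ~ N(5, 4)

def sample_mean(x):
    return x.sum() / len(x)            # X_bar = (1/n) * sum of X_i

def sample_variance(x):
    xbar = sample_mean(x)
    return ((x - xbar) ** 2).sum() / (len(x) - 1)   # factor 1/(n-1) makes it unbiased

# Average the estimators over many independent samples of size n:
means, variances = [], []
for _ in range(10_000):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    means.append(sample_mean(x))
    variances.append(sample_variance(x))

print(np.mean(means))      # ~5.0  ->  E[X_bar] = mu
print(np.mean(variances))  # ~4.0  ->  E[S^2]  = sigma^2
```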
Estimator Error

Let $\hat{\theta}_n$ be an estimator for parameter θ over iid. samples X_1,...,X_n.
The distribution of $\hat{\theta}_n$ is called the sampling distribution.

The standard error for $\hat{\theta}_n$ is: $se(\hat{\theta}_n) = \sqrt{Var[\hat{\theta}_n]}$

The mean squared error (MSE) for $\hat{\theta}_n$ is:
$MSE(\hat{\theta}_n) = E[(\hat{\theta}_n - \theta)^2] = bias^2(\hat{\theta}_n) + Var[\hat{\theta}_n]$

Theorem: If bias → 0 and se → 0, then the estimator is consistent.

The estimator $\hat{\theta}_n$ is asymptotically Normal if $(\hat{\theta}_n - \theta)/se$ converges in distribution to the standard Normal N(0,1).
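For the sample mean, for instance, the standard error is σ/√n. A small sketch (NumPy assumed) comparing this formula to the empirically observed spread of the sampling distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n = 2.0, 50

# Draw the sample mean many times to approximate its sampling distribution.
sample_means = rng.normal(0.0, sigma, size=(20_000, n)).mean(axis=1)

print("empirical se of X_bar:   ", sample_means.std(ddof=1))
print("theoretical sigma/sqrt(n):", sigma / np.sqrt(n))
```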
Types of Estimation

• Nonparametric Estimation
  Makes no assumptions about the model M or the parameters θ of the underlying distribution X.
  "Plug-in estimators" (e.g., histograms) to approximate X.

• Parametric Estimation (Inference)
  Requires assumptions about the model M and the parameters θ of the underlying distribution X.
  Analytical or numerical methods for estimating θ:
  - Method-of-Moments estimator
  - Maximum Likelihood estimator and Expectation Maximization (EM)
Nonparametric Estimation

The empirical distribution function $\hat{F}_n$ is the cdf that puts probability mass 1/n at each data point X_i:

$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$  with  $I(X_i \le x) = 1$ if $X_i \le x$, and 0 otherwise.

A statistical functional ("statistic") T(F) is any function of F, e.g., mean, variance, skewness, median, quantiles, correlation.

The plug-in estimator of θ = T(F) is: $\hat{\theta}_n = T(\hat{F}_n)$
Simply use $\hat{F}_n$ instead of F to calculate the statistic T of interest.
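A minimal sketch (NumPy assumed; the data values are illustrative) of the empirical cdf and a few plug-in estimators:

```python
import numpy as np

def ecdf(data, x):
    """F_hat_n(x) = (1/n) * #{i : X_i <= x}."""
    data = np.asarray(data)
    return np.count_nonzero(data <= x) / len(data)

X = np.array([1, 1, 2, 2, 2, 3, 4, 5, 6, 7], dtype=float)

print(ecdf(X, 2.0))     # 0.5 -- half of the sample is <= 2
print(ecdf(X, 6.5))     # 0.9

# Plug-in estimators: evaluate the statistic on F_hat_n, i.e. on the sample itself.
mean_hat = X.mean()                       # T(F) = mean
var_hat = ((X - mean_hat) ** 2).mean()    # T(F) = variance (plug-in, 1/n version)
median_hat = np.median(X)                 # T(F) = median
print(mean_hat, var_hat, median_hat)
```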
Histograms as Density Estimators

Instead of the full empirical distribution, compact data synopses may often be used, such as histograms where X_1,...,X_n are grouped into m cells (buckets) c_1,...,c_m with bucket boundaries lb(c_i) and ub(c_i) s.t. lb(c_1) = -∞, ub(c_m) = ∞, and ub(c_i) = lb(c_{i+1}) for 1 ≤ i < m, and

$freq_f(c_i) = \hat{f}_n(x) = \frac{1}{n} \sum_{v=1}^{n} I(lb(c_i) \le X_v \le ub(c_i))$
$freq_F(c_i) = \hat{F}_n(x) = \frac{1}{n} \sum_{v=1}^{n} I(X_v \le ub(c_i))$

Example: a sample X_1=1, X_2=1, X_3=2, X_4=2, X_5=2, X_6=3, …, X_20=7 of n=20 values grouped into the buckets 1,…,7 has bucket frequencies 2/20, 3/20, 5/20, 4/20, 3/20, 2/20, 1/20 (figure: bar chart of $\hat{f}_n(x)$ over x = 1,…,7; sample mean 3.65).

Histograms provide a (discontinuous) density estimator.
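A small sketch (NumPy assumed) of an equi-width histogram as a density estimator. The slide only lists X_1..X_6 and X_20 explicitly, so the remaining values below are filled in as one hypothetical assignment consistent with the stated bucket frequencies:

```python
import numpy as np

# One data set consistent with the bucket frequencies 2/20, 3/20, 5/20, 4/20, 3/20, 2/20, 1/20.
X = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7], dtype=float)
n = len(X)

# m = 7 equi-width buckets with boundaries 0.5, 1.5, ..., 7.5
edges = np.arange(0.5, 8.0, 1.0)
counts, _ = np.histogram(X, bins=edges)

freq_f = counts / n               # f_hat per bucket: 2/20, 3/20, 5/20, 4/20, 3/20, 2/20, 1/20
freq_F = np.cumsum(counts) / n    # F_hat at the bucket upper bounds

print(freq_f)
print(freq_F)
print(X.mean())                   # 3.65
```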
Parametric Inference (1): Method of Moments

Suppose parameter θ = (θ_1,…,θ_k) has k components.

j-th moment: $\alpha_j = \alpha_j(\theta) = E_\theta[X^j] = \int x^j f_X(x) \, dx$
j-th sample moment: $\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$  for 1 ≤ j ≤ k

Estimate parameter θ by the method-of-moments estimator $\hat{\theta}_n$ s.t.
$\alpha_1(\hat{\theta}_n) = \hat{\alpha}_1$  and  $\alpha_2(\hat{\theta}_n) = \hat{\alpha}_2$  and  …  and  $\alpha_k(\hat{\theta}_n) = \hat{\alpha}_k$  (for the first k moments).

Solve this equation system with k equations and k unknowns.

Method-of-moments estimators are usually consistent and asymptotically Normal, but may be biased.
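As an illustration (not on the slide): for X ~ N(μ, σ²) the first two moments are α₁ = μ and α₂ = μ² + σ², so matching them to the sample moments gives μ̂ = α̂₁ and σ̂² = α̂₂ − α̂₁². A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)   # true mu = 3, sigma^2 = 4

# Sample moments
a1 = np.mean(x)        # alpha_hat_1
a2 = np.mean(x ** 2)   # alpha_hat_2

# Solve alpha_1(theta) = a1 and alpha_2(theta) = a2 for theta = (mu, sigma^2)
mu_hat = a1
sigma2_hat = a2 - a1 ** 2

print(mu_hat, sigma2_hat)   # approximately 3 and 4
```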
Parametric Inference (2): Maximum Likelihood Estimators (MLE)

Let X_1,...,X_n be iid. with pdf f(x; θ).
Estimate the parameter θ of a postulated distribution f(x; θ) such that the likelihood that the sample values x_1,...,x_n are generated by this distribution is maximized.

Maximum likelihood estimation:
Maximize L(x_1,...,x_n; θ) ≈ P[x_1,...,x_n originate from f(x; θ)],
usually formulated as $L_n(\theta) = \prod_i f(X_i; \theta)$,
or (alternatively) maximize $l_n(\theta) = \log L_n(\theta)$.

The value $\hat{\theta}_n$ that maximizes $L_n(\theta)$ is the MLE of θ.

If analytically intractable, use numerical iteration methods.
Simple Example for Maximum Likelihood Estimator

Given:
• Coin toss experiment (Bernoulli distribution) with unknown parameter p for seeing heads, 1-p for tails
• Sample (data): h heads in n coin tosses

Want: maximum likelihood estimate of p

$L(h, n, p) = \prod_{i=1}^{n} f(X_i; p) = \prod_{i=1}^{n} p^{X_i} (1-p)^{1-X_i} = p^h (1-p)^{n-h}$  with  $h = \sum_i X_i$

Maximize the log-likelihood function:
$\log L(h, n, p) = h \log p + (n-h) \log(1-p)$
$\frac{\partial \log L}{\partial p} = \frac{h}{p} - \frac{n-h}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{h}{n}$
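A small sketch (NumPy assumed) checking that the closed-form MLE p̂ = h/n agrees with a direct numerical maximization of the log-likelihood over a grid of candidate values for p:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p_true = 1000, 0.3
tosses = rng.binomial(1, p_true, size=n)   # n Bernoulli(p) coin tosses
h = tosses.sum()                           # number of heads

# Closed-form MLE
p_mle = h / n

# Numerical check: maximize log L(p) = h*log(p) + (n-h)*log(1-p) over a grid
grid = np.linspace(0.001, 0.999, 999)
log_lik = h * np.log(grid) + (n - h) * np.log(1 - grid)
p_numeric = grid[np.argmax(log_lik)]

print(p_mle, p_numeric)   # both close to p_true = 0.3
```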
MLE for Parameters of Normal Distributions

$L(x_1,...,x_n, \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

$\frac{\partial \ln L}{\partial \mu} = \frac{1}{2\sigma^2} \sum_{i=1}^{n} 2(x_i - \mu) = 0$
$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0$

$\Rightarrow \; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \, , \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$
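A short sketch (NumPy, illustrative) of the closed-form MLE; note that σ̂² uses the factor 1/n and therefore differs from the unbiased sample variance S² by the factor (n-1)/n:

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(loc=1.5, scale=0.5, size=500)

n = len(x)
mu_hat = x.sum() / n                          # MLE of mu
sigma2_hat = ((x - mu_hat) ** 2).sum() / n    # MLE of sigma^2 (biased, factor 1/n)
s2 = ((x - mu_hat) ** 2).sum() / (n - 1)      # unbiased sample variance, for comparison

print(mu_hat, sigma2_hat, s2)                 # sigma2_hat = (n-1)/n * s2
```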
MLE Properties

Maximum Likelihood estimators are consistent, asymptotically Normal, and asymptotically optimal (i.e., efficient) in the following sense:

Consider two estimators U and T which are asymptotically Normal. Let u² and t² denote the variances of the two Normal distributions to which U and T converge in distribution. The asymptotic relative efficiency of U to T is ARE(U,T) := t²/u².

Theorem: For an MLE $\hat{\theta}_n$ and any other estimator $\tilde{\theta}_n$ the following inequality holds: $ARE(\tilde{\theta}_n, \hat{\theta}_n) \le 1$.
That is, among all estimators the MLE has the smallest (asymptotic) variance.
Bayesian Viewpoint of Parameter Estimation

• Assume a prior distribution g(θ) of parameter θ
• Choose a statistical model (generative model) f(x | θ) that reflects our beliefs about RV X
• Given RVs X_1,...,X_n for the observed data, the posterior distribution is h(θ | x_1,...,x_n)

For X_1 = x_1, ..., X_n = x_n the likelihood is
$L(x_1 ... x_n, \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$
and the posterior is
$h(\theta \mid x_1 ... x_n) = \frac{L(x_1 ... x_n, \theta) \, g(\theta)}{\int L(x_1 ... x_n, \theta') \, g(\theta') \, d\theta'}$
which implies $h(\theta \mid x_1 ... x_n) \sim L(x_1 ... x_n, \theta) \, g(\theta)$  (posterior is proportional to likelihood times prior).

MAP estimator (maximum a posteriori): compute the θ that maximizes h(θ | x_1,…,x_n) given a prior for θ.
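As a hedged example (not from the slide): with a Beta(α, β) prior g(p) for the Bernoulli parameter p, the posterior is Beta(α+h, β+n−h), and its mode, the MAP estimate, is (h+α−1)/(n+α+β−2) for α, β > 1. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p_true = 50, 0.7
tosses = rng.binomial(1, p_true, size=n)
h = tosses.sum()

alpha, beta = 2.0, 2.0                             # Beta(2,2) prior: mild pull towards 0.5

p_mle = h / n                                      # maximum likelihood estimate
p_map = (h + alpha - 1) / (n + alpha + beta - 2)   # mode of the Beta(alpha+h, beta+n-h) posterior

print(p_mle, p_map)   # MAP is shrunk slightly towards the prior mean 0.5
```

With a uniform prior (α = β = 1) the MAP estimate coincides with the MLE; stronger priors pull the estimate further towards the prior mode.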
Analytically Intractable MLE: Parameters of a Multivariate Normal Mixture

Consider samples from a k-mixture of m-dimensional Normal distributions with the density (e.g., height and weight of males and females):

$f(x; \pi_1,...,\pi_k, \mu_1,...,\mu_k, \Sigma_1,...,\Sigma_k) = \sum_{j=1}^{k} \pi_j \, n(x, \mu_j, \Sigma_j)$
with $n(x, \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m \, |\Sigma_j|}} \, e^{-\frac{1}{2} (x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j)}$,

with expectation values μ_j and invertible, positive definite, symmetric m×m covariance matrices Σ_j.

Maximize the log-likelihood function:
$\log L(x_1,...,x_n, \theta) := \sum_{i=1}^{n} \log P[x_i \mid \theta] = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, n(x_i, \mu_j, \Sigma_j)$
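This log-likelihood has no closed-form maximizer; EM alternates between soft assignments of points to components (E-step) and weighted maximum-likelihood parameter updates (M-step). A simplified sketch for the one-dimensional, two-component case (NumPy assumed; the slide's setting is m-dimensional, and the synthetic data here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(13)

# Synthetic 1-D data from two Normal components (e.g., two subpopulations).
x = np.concatenate([rng.normal(165, 6, 400), rng.normal(178, 7, 600)])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initial guesses for the weights, means, and variances of the k=2 components.
pi = np.array([0.5, 0.5])
mu = np.array([160.0, 180.0])
var = np.array([25.0, 25.0])

for _ in range(100):
    # E-step: posterior probability that point i belongs to component j.
    dens = np.stack([pi[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted ML updates of the parameters.
    nj = resp.sum(axis=0)
    pi = nj / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nj
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj

# Log-likelihood at the final parameters.
dens = np.stack([pi[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)], axis=1)
print("weights:", pi, "means:", mu, "variances:", var)
print("log-likelihood:", np.log(dens.sum(axis=1)).sum())
```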