Data Analysis and Uncertainty, Part 2: Estimation
Instructor: Sargur N. Srihari, University at Buffalo, The State University of New York
srihari@cedar.buffalo.edu
Topics in Estimation
1. Estimation
2. Desirable Properties of Estimators
3. Maximum Likelihood Estimation (examples: Binomial, Normal)
4. Bayesian Estimation (examples: Binomial, Normal; Jeffreys Prior)
Estimation
• In inference we want to make statements about the entire population from which the sample is drawn
• The two most important methods for estimating the parameters of a model are:
  1. Maximum Likelihood Estimation
  2. Bayesian Estimation
Desirable Properties of Estimators
• Let θ̂ be an estimate of parameter θ
• Two measures of estimator quality:
  1. Expected value of the estimate (bias): the difference between the expected and true value, where the expectation is over all possible data sets of size n
     Bias(θ̂) = E[θ̂] − θ
     – Measures systematic departure from the true value
  2. Variance of the estimate:
     Var(θ̂) = E[(θ̂ − E[θ̂])²]
     – The data-driven component of error in the estimation procedure
     – E.g., always saying θ̂ = 1 has a variance of zero but high bias
• The mean squared error E[(θ̂ − θ)²] can be partitioned as the sum of bias² and variance
Bias-Variance in a Point Estimate
• True height of the Chinese emperor: 200 cm, about 6′6″
• Poll a random American and ask “How tall is the emperor?” We want to determine how wrong they are, on average
• Each scenario below has expected value 180 (bias error = −20), but increasing variance in the estimate
  [Figure: three belief distributions centered at 180 — no variance, some variance, more variance — with the bias shown as the gap between 180 and 200]
• Scenario 1: everyone believes it is 180 (variance = 0)
  – The answer is always 180, so the error is always −20; average error is −20
  – Squared errors: 400 and 400; average squared error is 400 = 400 + 0
• Scenario 2: normally distributed beliefs with mean 180 and std dev 10 (variance = 100)
  – Poll two people: one says 190, the other 170; bias errors are −10 and −30; average error is −20
  – Squared errors: 100 and 900; average squared error is 500 = 400 + 100
• Scenario 3: normally distributed beliefs with mean 180 and std dev 20 (variance = 400)
  – Poll two people: one says 200, the other 160; errors are 0 and −40; average error is −20
  – Squared errors: 0 and 1600; average squared error is 800 = 400 + 400
• Squared error = square of bias error + variance; as variance increases, error increases
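As a quick check of the arithmetic above, here is a minimal Python sketch (not part of the original slides, assuming NumPy; the polled answers are the illustrative values from the slide) verifying that in each scenario the average squared error equals squared bias plus variance.

```python
import numpy as np

true_height = 200.0  # true height of the emperor (cm)

# Polled answers in each scenario (illustrative numbers from the slide)
scenarios = {
    "no variance":   np.array([180.0, 180.0]),
    "some variance": np.array([190.0, 170.0]),
    "more variance": np.array([200.0, 160.0]),
}

for name, answers in scenarios.items():
    bias = answers.mean() - true_height         # systematic error (-20 in every scenario)
    variance = answers.var()                    # spread of the answers around their mean
    mse = np.mean((answers - true_height) ** 2) # average squared error
    # mse equals bias**2 + variance: 400, 500, 800
    print(f"{name}: bias={bias:.0f}, var={variance:.0f}, mse={mse:.0f}")
```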
Mean Squared Error as a Criterion for θ̂
• Natural decomposition as the sum of squared bias and variance:
  E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²]
              = (E[θ̂] − θ)² + E[(θ̂ − E[θ̂])²]
              = Bias(θ̂)² + Var(θ̂)
• Mean squared error (over data sets) is a useful criterion since it incorporates both bias and variance
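The same decomposition can also be checked by simulation. A small sketch, assuming NumPy, comparing the sample mean with a hypothetical shrunk (biased) estimator over many simulated data sets; up to Monte Carlo noise, the mean squared error of each matches bias² + variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 5.0, 20, 100_000

# Two estimators of the mean: the sample mean (unbiased) and a shrunk version (biased, lower variance)
data = rng.normal(theta, 1.0, size=(trials, n))
sample_mean = data.mean(axis=1)
shrunk = 0.8 * sample_mean   # hypothetical biased estimator, for illustration only

for name, est in [("sample mean", sample_mean), ("shrunk mean", shrunk)]:
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta) ** 2)
    # Up to Monte Carlo noise, mse is approximately bias**2 + var
    print(f"{name}: bias^2 + var = {bias**2 + var:.4f}, mse = {mse:.4f}")
```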
Maximum Likelihood Estimation
• Most widely used method for parameter estimation
• The likelihood function is the probability that the data D would have arisen for a given value of θ:
  L(θ | D) = L(θ | x(1), ..., x(n)) = p(x(1), ..., x(n) | θ) = ∏_{i=1..n} p(x(i) | θ)
• A scalar function of θ
• The value of θ for which the data has the highest probability is the MLE
Example of MLE for the Binomial
• Customers either purchase or do not purchase milk
• We want an estimate of the proportion purchasing
• Model as Binomial with unknown parameter θ
• The Binomial is a generalization of the Bernoulli
• Bernoulli: probability of a binary variable x = 1 or 0. Denoting p(x = 1) = θ, the probability mass function is
  Bern(x | θ) = θ^x (1 − θ)^(1−x)
  – Mean = θ, variance = θ(1 − θ)
• Binomial: probability of r successes in n trials
  Bin(r | n, θ) = C(n, r) θ^r (1 − θ)^(n−r)
  – Mean = nθ, variance = nθ(1 − θ)
Likelihood Function: Binomial/Bernoulli
• Samples x(1), ..., x(1000), of which r purchase milk
• Assuming conditional independence, the likelihood function is
  L(θ | x(1), ..., x(1000)) = ∏_i θ^x(i) (1 − θ)^(1−x(i)) = θ^r (1 − θ)^(1000−r)
• The Binomial pmf additionally counts every possible ordering of the r successes, i.e. it sums C(1000, r) identical terms; this constant factor does not affect the maximization
• Log-likelihood function:
  l(θ) = log L(θ) = r log θ + (1000 − r) log(1 − θ)
• Differentiating and setting equal to zero gives
  θ̂_ML = r / 1000
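A brief sketch of the milk-purchase example in Python (assuming NumPy; the counts n = 1000, r = 700 are illustrative): maximizing the log-likelihood numerically over a grid agrees with the closed-form MLE r/n.

```python
import numpy as np

n, r = 1000, 700          # 700 of 1000 customers purchased milk (illustrative)

# Log-likelihood l(theta) = r*log(theta) + (n - r)*log(1 - theta)
theta_grid = np.linspace(0.001, 0.999, 999)
loglik = r * np.log(theta_grid) + (n - r) * np.log(1 - theta_grid)

theta_numeric = theta_grid[np.argmax(loglik)]   # numerical maximizer over the grid
theta_closed = r / n                            # closed-form MLE from setting dl/dtheta = 0
print(theta_numeric, theta_closed)              # both approximately 0.7
```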
Binomial: Likelihood Functions
[Figure: likelihood function L(θ) for three data sets]
• Binomial setting: r milk purchases out of n customers, where θ is the probability that milk is purchased by a random customer
• Data sets: r = 7, n = 10; r = 70, n = 100; r = 700, n = 1000
• Uncertainty becomes smaller as n increases: the likelihood becomes more sharply peaked around 0.7
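A small sketch quantifying how the likelihood concentrates as n grows (assuming NumPy; normalizing the likelihood to a density over [0, 1] is equivalent to assuming a flat prior).

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)
d = theta[1] - theta[0]

for r, n in [(7, 10), (70, 100), (700, 1000)]:
    loglik = r * np.log(theta) + (n - r) * np.log(1 - theta)
    lik = np.exp(loglik - loglik.max())        # rescale for numerical stability
    density = lik / (lik.sum() * d)            # normalize so it integrates to about 1
    mean = (theta * density).sum() * d
    sd = np.sqrt(((theta - mean) ** 2 * density).sum() * d)
    print(f"n={n}: peak near 0.7, spread = {sd:.3f}")   # spread shrinks roughly as 1/sqrt(n)
```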
Likelihood under a Normal Distribution
• Unit variance, unknown mean θ
• Likelihood function:
  L(θ | x(1), ..., x(n)) = ∏_{i=1..n} (2π)^(−1/2) exp( −½ (x(i) − θ)² )
                        = (2π)^(−n/2) exp( −½ Σ_{i=1..n} (x(i) − θ)² )
• Log-likelihood function:
  l(θ | x(1), ..., x(n)) = −(n/2) log 2π − ½ Σ_{i=1..n} (x(i) − θ)²
• To find the MLE, set the derivative d/dθ to zero:
  θ̂_ML = Σ_i x(i) / n
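A minimal sketch in Python (assuming NumPy and simulated data) confirming that the numerical maximizer of the unit-variance normal log-likelihood is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=20)      # 20 points from a unit-variance normal, mean to be estimated

# Log-likelihood (up to additive constants) for a grid of candidate means
theta_grid = np.linspace(-2, 2, 4001)
loglik = -0.5 * ((x[:, None] - theta_grid[None, :]) ** 2).sum(axis=0)

print(theta_grid[np.argmax(loglik)])   # numerical maximizer
print(x.mean())                        # closed-form MLE: the sample mean
```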
Normal: Histogram, Likelihood, Log-Likelihood
• Estimate the unknown mean θ
• [Figure: histogram of 20 data points drawn from a zero-mean, unit-variance normal, with the corresponding likelihood and log-likelihood functions]
Normal: More Data Points
• [Figure: histogram of 200 data points drawn from a zero-mean, unit-variance normal, with the corresponding likelihood and log-likelihood functions]
Sufficient Statistics
• A useful general concept in statistical estimation
• A quantity s(D) is a sufficient statistic for θ if the likelihood l(θ) depends on the data only through s(D)
• Examples:
  – For the Binomial parameter θ, the number of successes r is sufficient
  – For the mean of a normal distribution, the sum of the observations Σ_i x(i) is sufficient for the likelihood function of the mean (which is a function of θ)
Interval Estimates
• A point estimate does not convey the uncertainty associated with it
• Interval estimates provide a confidence interval
• Example:
  – 100 observations from N(unknown µ, known σ²)
  – We want a 95% confidence interval for the estimate of µ
  – The distribution of the sample mean x̄ is N(µ, σ²/100)
  – 95% of a normal distribution lies within 1.96 standard deviations of its mean:
    P(µ − 1.96σ/10 < x̄ < µ + 1.96σ/10) = 0.95
  – Rewritten as
    P(x̄ − 1.96σ/10 < µ < x̄ + 1.96σ/10) = 0.95
  – So l(x) = x̄ − 1.96σ/10 and u(x) = x̄ + 1.96σ/10 form a 95% confidence interval
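A short sketch of this interval in Python (assuming NumPy, simulated data, and a known σ = 1; the true mean of 5.0 is illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n = 1.0, 100
x = rng.normal(5.0, sigma, size=n)    # 100 observations with known sigma, unknown mean

xbar = x.mean()
half_width = 1.96 * sigma / np.sqrt(n)        # 1.96 * sigma / 10 for n = 100
lower, upper = xbar - half_width, xbar + half_width
print(f"95% confidence interval for the mean: ({lower:.3f}, {upper:.3f})")
```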
Bayesian Approach
• Frequentist approach:
  – Parameters are fixed but unknown
  – The data is a random sample; the intrinsic variability lies in the data D = {x(1), ..., x(n)}
• Bayesian statistics:
  – The data are known
  – Parameters θ are random variables: θ has a distribution of values
  – p(θ) reflects the degree of belief about where the true parameters θ may be
Bayesian Estimation
• The distribution of probabilities for θ is the prior p(θ)
• Analysis of the data leads to a modified distribution, called the posterior p(θ | D)
• The modification is done by Bayes rule:
  p(θ | D) = p(D | θ) p(θ) / p(D) = p(D | θ) p(θ) / ∫ p(D | ψ) p(ψ) dψ
• This leads to a distribution rather than a single value
• A single value is obtainable: the mean or the mode (the latter known as the maximum a posteriori (MAP) method)
• The MAP and ML estimates of θ may well coincide
  – They coincide when the prior is flat, preferring no single value
  – MLE can be viewed as a special case of the MAP procedure, which in turn is a restricted form of Bayesian estimation
Summary of the Bayesian Approach
• For a given data set D and a particular model (model = distributions for prior and likelihood):
  p(θ | D) ∝ p(D | θ) p(θ)
• In words: the posterior distribution given D (the distribution conditioned on having observed the data) is proportional to the product of the prior p(θ) and the likelihood p(D | θ)
• If we have a weak belief about the parameter before collecting data, choose a wide prior (e.g., a normal with large variance)
• The larger the data set, the more dominant the likelihood becomes
Bayesian Estimation for the Binomial
• Single binary variable X; we wish to estimate θ = p(X = 1)
• The prior for a parameter in [0, 1] is the Beta distribution:
  Beta(θ | α, β) = Γ(α + β) / (Γ(α) Γ(β)) · θ^(α−1) (1 − θ)^(β−1), where α > 0, β > 0 are the two parameters of this model
  – p(θ) ∝ θ^(α−1) (1 − θ)^(β−1)
  – E[θ] = α / (α + β), mode[θ] = (α − 1) / (α + β − 2), var[θ] = αβ / ((α + β)² (α + β + 1))
• Likelihood (same as for MLE): L(θ | D) = θ^r (1 − θ)^(n−r)
• Combining likelihood and prior:
  p(θ | D) ∝ p(D | θ) p(θ) = θ^r (1 − θ)^(n−r) · θ^(α−1) (1 − θ)^(β−1) = θ^(r+α−1) (1 − θ)^(n−r+β−1)
• We get another Beta distribution, with parameters r + α and n − r + β and mean E[θ | D] = (r + α) / (n + α + β)
• If α = β = 0 (an improper prior) we recover the standard MLE of r/n
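A brief sketch of this conjugate update in Python (assuming SciPy; the prior hyperparameters α = β = 2 and the counts n = 1000, r = 700 are illustrative).

```python
from scipy.stats import beta

# Prior Beta(alpha, beta0); data: r purchases out of n customers
alpha, beta0 = 2.0, 2.0
n, r = 1000, 700

# By conjugacy the posterior is Beta(r + alpha, n - r + beta0)
posterior = beta(r + alpha, n - r + beta0)
print("posterior mean:", posterior.mean())          # (r + alpha) / (n + alpha + beta0)
print("MLE for comparison:", r / n)
print("95% credible interval:", posterior.interval(0.95))
```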
Advantages of the Bayesian Approach
• Retains full knowledge of all problem uncertainty
  – E.g., calculating the full posterior distribution on θ
  – E.g., prediction of a new point x(n+1) not in the training set D is done by averaging over all possible θ:
    p(x(n+1) | D) = ∫ p(x(n+1), θ | D) dθ = ∫ p(x(n+1) | θ) p(θ | D) dθ
    since x(n+1) is conditionally independent of the training data D given θ
• Can average over all possible models
• Requires considerably more computation than maximum likelihood
• Natural sequential updating of the distribution:
  p(θ | D1, D2) ∝ p(D2 | θ) p(D1 | θ) p(θ)
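A minimal sketch (plain Python, illustrative counts) of the sequential-updating property for the Beta-Binomial model: updating on D1 and then on D2 gives the same posterior as updating on all the data at once.

```python
# Sequential Bayesian updating with a Beta prior on theta (conjugate to the Bernoulli likelihood)
alpha, beta0 = 1.0, 1.0            # flat prior

def update(alpha, beta0, successes, failures):
    """Posterior hyperparameters after observing the given counts."""
    return alpha + successes, beta0 + failures

# Batch: all 1000 observations at once (700 purchases, 300 non-purchases)
batch = update(alpha, beta0, 700, 300)

# Sequential: first 400 observations, then the remaining 600 (counts are illustrative)
step1 = update(alpha, beta0, 280, 120)
step2 = update(*step1, 420, 180)

print(batch, step2)   # identical: the posterior after D1 serves as the prior for D2
```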