Data Analysis and Uncertainty, Part 2: Estimation
Instructor: Sargur N. Srihari, University at Buffalo, The State University of New York
srihari@cedar.buffalo.edu
Topics in Estimation
1. Estimation
2. Desirable Properties of Estimators
3. Maximum Likelihood Estimation (examples: Binomial, Normal)
4. Bayesian Estimation (examples: Binomial, Normal; Jeffreys Prior)
Estimation
• In inference we want to make statements about the entire population from which the sample is drawn
• The two most important methods for estimating the parameters of a model are:
  1. Maximum Likelihood Estimation
  2. Bayesian Estimation
Desirable Properties of Estimators
• Let θ̂ be an estimate of parameter θ
• Two measures of estimator quality:
  1. Expected value of the estimate (bias): the difference between the expected and true value, where the expectation is over all possible data sets of size n
     Bias(θ̂) = E[θ̂] − θ
     – Measures systematic departure from the true value
  2. Variance of the estimate:
     Var(θ̂) = E[(θ̂ − E[θ̂])²]
     – The data-driven component of error in the estimation procedure
     – E.g., always saying θ̂ = 1 has a variance of zero but high bias
• The mean squared error E[(θ̂ − θ)²] can be partitioned as the sum of bias² and variance
Bias-Variance in a Point Estimate
• True height of the Chinese emperor: 200 cm, about 6′6″
• Poll a random American and ask “How tall is the emperor?” We want to determine how wrong they are, on average
• Each scenario below has expected value 180 (bias error = −20), but increasing variance in the estimate
  [Figure: three belief distributions centered at 180 — no variance, some variance, more variance — with the bias shown as the gap between 180 and 200]
• Scenario 1: everyone believes it is 180 (variance = 0)
  – The answer is always 180, so the error is always −20; average error is −20
  – Squared errors: 400 and 400; average squared error is 400 = 400 + 0
• Scenario 2: normally distributed beliefs with mean 180 and std dev 10 (variance = 100)
  – Poll two people: one says 190, the other 170; bias errors are −10 and −30; average error is −20
  – Squared errors: 100 and 900; average squared error is 500 = 400 + 100
• Scenario 3: normally distributed beliefs with mean 180 and std dev 20 (variance = 400)
  – Poll two people: one says 200, the other 160; errors are 0 and −40; average error is −20
  – Squared errors: 0 and 1600; average squared error is 800 = 400 + 400
• Squared error = square of bias error + variance; as variance increases, error increases
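As a quick check of the arithmetic above, here is a minimal Python sketch (not part of the original slides, assuming NumPy; the polled answers are the illustrative values from the slide) verifying that in each scenario the average squared error equals squared bias plus variance.

```python
import numpy as np

true_height = 200.0  # true height of the emperor (cm)

# Polled answers in each scenario (illustrative numbers from the slide)
scenarios = {
    "no variance":   np.array([180.0, 180.0]),
    "some variance": np.array([190.0, 170.0]),
    "more variance": np.array([200.0, 160.0]),
}

for name, answers in scenarios.items():
    bias = answers.mean() - true_height         # systematic error (-20 in every scenario)
    variance = answers.var()                    # spread of the answers around their mean
    mse = np.mean((answers - true_height) ** 2) # average squared error
    # mse equals bias**2 + variance: 400, 500, 800
    print(f"{name}: bias={bias:.0f}, var={variance:.0f}, mse={mse:.0f}")
```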
Mean Squared Error as a Criterion for θ̂
• Natural decomposition as the sum of squared bias and variance:
  E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²]
              = (E[θ̂] − θ)² + E[(θ̂ − E[θ̂])²]
              = Bias(θ̂)² + Var(θ̂)
• Mean squared error (over data sets) is a useful criterion since it incorporates both bias and variance
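The same decomposition can also be checked by simulation. A small sketch, assuming NumPy, comparing the sample mean with a hypothetical shrunk (biased) estimator over many simulated data sets; up to Monte Carlo noise, the mean squared error of each matches bias² + variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 5.0, 20, 100_000

# Two estimators of the mean: the sample mean (unbiased) and a shrunk version (biased, lower variance)
data = rng.normal(theta, 1.0, size=(trials, n))
sample_mean = data.mean(axis=1)
shrunk = 0.8 * sample_mean   # hypothetical biased estimator, for illustration only

for name, est in [("sample mean", sample_mean), ("shrunk mean", shrunk)]:
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta) ** 2)
    # Up to Monte Carlo noise, mse is approximately bias**2 + var
    print(f"{name}: bias^2 + var = {bias**2 + var:.4f}, mse = {mse:.4f}")
```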
Maximum Likelihood Estimation
• Most widely used method for parameter estimation
• The likelihood function is the probability that the data D would have arisen for a given value of θ:
  L(θ | D) = L(θ | x(1), ..., x(n)) = p(x(1), ..., x(n) | θ) = ∏_{i=1..n} p(x(i) | θ)
• A scalar function of θ
• The value of θ for which the data has the highest probability is the MLE
Example of MLE for the Binomial
• Customers either purchase or do not purchase milk
• We want an estimate of the proportion purchasing
• Model as Binomial with unknown parameter θ
• The Binomial is a generalization of the Bernoulli
• Bernoulli: probability of a binary variable x = 1 or 0. Denoting p(x = 1) = θ, the probability mass function is
  Bern(x | θ) = θ^x (1 − θ)^(1−x)
  – Mean = θ, variance = θ(1 − θ)
• Binomial: probability of r successes in n trials
  Bin(r | n, θ) = C(n, r) θ^r (1 − θ)^(n−r)
  – Mean = nθ, variance = nθ(1 − θ)
Likelihood Function: Binomial/Bernoulli
• Samples x(1), ..., x(1000), of which r purchase milk
• Assuming conditional independence, the likelihood function is
  L(θ | x(1), ..., x(1000)) = ∏_i θ^x(i) (1 − θ)^(1−x(i)) = θ^r (1 − θ)^(1000−r)
• The Binomial pmf additionally counts every possible ordering of the r successes, i.e. it sums C(1000, r) identical terms; this constant factor does not affect the maximization
• Log-likelihood function:
  l(θ) = log L(θ) = r log θ + (1000 − r) log(1 − θ)
• Differentiating and setting equal to zero gives
  θ̂_ML = r / 1000
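A brief sketch of the milk-purchase example in Python (assuming NumPy; the counts n = 1000, r = 700 are illustrative): maximizing the log-likelihood numerically over a grid agrees with the closed-form MLE r/n.

```python
import numpy as np

n, r = 1000, 700          # 700 of 1000 customers purchased milk (illustrative)

# Log-likelihood l(theta) = r*log(theta) + (n - r)*log(1 - theta)
theta_grid = np.linspace(0.001, 0.999, 999)
loglik = r * np.log(theta_grid) + (n - r) * np.log(1 - theta_grid)

theta_numeric = theta_grid[np.argmax(loglik)]   # numerical maximizer over the grid
theta_closed = r / n                            # closed-form MLE from setting dl/dtheta = 0
print(theta_numeric, theta_closed)              # both approximately 0.7
```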
Binomial: Likelihood Functions
[Figure: likelihood function L(θ) for three data sets]
• Binomial setting: r milk purchases out of n customers, where θ is the probability that milk is purchased by a random customer
• Data sets: r = 7, n = 10; r = 70, n = 100; r = 700, n = 1000
• Uncertainty becomes smaller as n increases: the likelihood becomes more sharply peaked around 0.7
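A small sketch quantifying how the likelihood concentrates as n grows (assuming NumPy; normalizing the likelihood to a density over [0, 1] is equivalent to assuming a flat prior).

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)
d = theta[1] - theta[0]

for r, n in [(7, 10), (70, 100), (700, 1000)]:
    loglik = r * np.log(theta) + (n - r) * np.log(1 - theta)
    lik = np.exp(loglik - loglik.max())        # rescale for numerical stability
    density = lik / (lik.sum() * d)            # normalize so it integrates to about 1
    mean = (theta * density).sum() * d
    sd = np.sqrt(((theta - mean) ** 2 * density).sum() * d)
    print(f"n={n}: peak near 0.7, spread = {sd:.3f}")   # spread shrinks roughly as 1/sqrt(n)
```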
Likelihood under a Normal Distribution
• Unit variance, unknown mean θ
• Likelihood function:
  L(θ | x(1), ..., x(n)) = ∏_{i=1..n} (2π)^(−1/2) exp( −½ (x(i) − θ)² )
                        = (2π)^(−n/2) exp( −½ Σ_{i=1..n} (x(i) − θ)² )
• Log-likelihood function:
  l(θ | x(1), ..., x(n)) = −(n/2) log 2π − ½ Σ_{i=1..n} (x(i) − θ)²
• To find the MLE, set the derivative d/dθ to zero:
  θ̂_ML = Σ_i x(i) / n
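A minimal sketch in Python (assuming NumPy and simulated data) confirming that the numerical maximizer of the unit-variance normal log-likelihood is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=20)      # 20 points from a unit-variance normal, mean to be estimated

# Log-likelihood (up to additive constants) for a grid of candidate means
theta_grid = np.linspace(-2, 2, 4001)
loglik = -0.5 * ((x[:, None] - theta_grid[None, :]) ** 2).sum(axis=0)

print(theta_grid[np.argmax(loglik)])   # numerical maximizer
print(x.mean())                        # closed-form MLE: the sample mean
```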
Normal: Histogram, Likelihood, Log-Likelihood
• Estimate the unknown mean θ
• [Figure: histogram of 20 data points drawn from a zero-mean, unit-variance normal, with the corresponding likelihood and log-likelihood functions]
Normal: More Data Points
• [Figure: histogram of 200 data points drawn from a zero-mean, unit-variance normal, with the corresponding likelihood and log-likelihood functions]
Sufficient Statistics
• A useful general concept in statistical estimation
• A quantity s(D) is a sufficient statistic for θ if the likelihood l(θ) depends on the data only through s(D)
• Examples:
  – For the Binomial parameter θ, the number of successes r is sufficient
  – For the mean of a normal distribution, the sum of the observations Σ_i x(i) is sufficient for the likelihood function of the mean (which is a function of θ)
Interval Estimates
• A point estimate does not convey the uncertainty associated with it
• Interval estimates provide a confidence interval
• Example:
  – 100 observations from N(unknown µ, known σ²)
  – We want a 95% confidence interval for the estimate of µ
  – The distribution of the sample mean x̄ is N(µ, σ²/100)
  – 95% of a normal distribution lies within 1.96 standard deviations of its mean:
    P(µ − 1.96σ/10 < x̄ < µ + 1.96σ/10) = 0.95
  – Rewritten as
    P(x̄ − 1.96σ/10 < µ < x̄ + 1.96σ/10) = 0.95
  – So l(x) = x̄ − 1.96σ/10 and u(x) = x̄ + 1.96σ/10 form a 95% confidence interval
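A short sketch of this interval in Python (assuming NumPy, simulated data, and a known σ = 1; the true mean of 5.0 is illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n = 1.0, 100
x = rng.normal(5.0, sigma, size=n)    # 100 observations with known sigma, unknown mean

xbar = x.mean()
half_width = 1.96 * sigma / np.sqrt(n)        # 1.96 * sigma / 10 for n = 100
lower, upper = xbar - half_width, xbar + half_width
print(f"95% confidence interval for the mean: ({lower:.3f}, {upper:.3f})")
```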
Bayesian Approach
• Frequentist approach:
  – Parameters are fixed but unknown
  – The data is a random sample; the intrinsic variability lies in the data D = {x(1), ..., x(n)}
• Bayesian statistics:
  – The data are known
  – Parameters θ are random variables: θ has a distribution of values
  – p(θ) reflects the degree of belief about where the true parameters θ may be
Bayesian Estimation
• The distribution of probabilities for θ is the prior p(θ)
• Analysis of the data leads to a modified distribution, called the posterior p(θ | D)
• The modification is done by Bayes rule:
  p(θ | D) = p(D | θ) p(θ) / p(D) = p(D | θ) p(θ) / ∫ p(D | ψ) p(ψ) dψ
• This leads to a distribution rather than a single value
• A single value is obtainable: the mean or the mode (the latter known as the maximum a posteriori (MAP) method)
• The MAP and ML estimates of θ may well coincide
  – They coincide when the prior is flat, preferring no single value
  – MLE can be viewed as a special case of the MAP procedure, which in turn is a restricted form of Bayesian estimation
Summary of the Bayesian Approach
• For a given data set D and a particular model (model = distributions for prior and likelihood):
  p(θ | D) ∝ p(D | θ) p(θ)
• In words: the posterior distribution given D (the distribution conditioned on having observed the data) is proportional to the product of the prior p(θ) and the likelihood p(D | θ)
• If we have a weak belief about the parameter before collecting data, choose a wide prior (e.g., a normal with large variance)
• The larger the data set, the more dominant the likelihood becomes
Bayesian Estimation for the Binomial
• Single binary variable X; we wish to estimate θ = p(X = 1)
• The prior for a parameter in [0, 1] is the Beta distribution:
  Beta(θ | α, β) = Γ(α + β) / (Γ(α) Γ(β)) · θ^(α−1) (1 − θ)^(β−1), where α > 0, β > 0 are the two parameters of this model
  – p(θ) ∝ θ^(α−1) (1 − θ)^(β−1)
  – E[θ] = α / (α + β), mode[θ] = (α − 1) / (α + β − 2), var[θ] = αβ / ((α + β)² (α + β + 1))
• Likelihood (same as for MLE): L(θ | D) = θ^r (1 − θ)^(n−r)
• Combining likelihood and prior:
  p(θ | D) ∝ p(D | θ) p(θ) = θ^r (1 − θ)^(n−r) · θ^(α−1) (1 − θ)^(β−1) = θ^(r+α−1) (1 − θ)^(n−r+β−1)
• We get another Beta distribution, with parameters r + α and n − r + β and mean E[θ | D] = (r + α) / (n + α + β)
• If α = β = 0 (an improper prior) we recover the standard MLE of r/n
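A brief sketch of this conjugate update in Python (assuming SciPy; the prior hyperparameters α = β = 2 and the counts n = 1000, r = 700 are illustrative).

```python
from scipy.stats import beta

# Prior Beta(alpha, beta0); data: r purchases out of n customers
alpha, beta0 = 2.0, 2.0
n, r = 1000, 700

# By conjugacy the posterior is Beta(r + alpha, n - r + beta0)
posterior = beta(r + alpha, n - r + beta0)
print("posterior mean:", posterior.mean())          # (r + alpha) / (n + alpha + beta0)
print("MLE for comparison:", r / n)
print("95% credible interval:", posterior.interval(0.95))
```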
Advantages of the Bayesian Approach
• Retains full knowledge of all problem uncertainty
  – E.g., calculating the full posterior distribution on θ
  – E.g., prediction of a new point x(n+1) not in the training set D is done by averaging over all possible θ:
    p(x(n+1) | D) = ∫ p(x(n+1), θ | D) dθ = ∫ p(x(n+1) | θ) p(θ | D) dθ
    since x(n+1) is conditionally independent of the training data D given θ
• Can average over all possible models
• Requires considerably more computation than maximum likelihood
• Natural sequential updating of the distribution:
  p(θ | D1, D2) ∝ p(D2 | θ) p(D1 | θ) p(θ)
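A minimal sketch (plain Python, illustrative counts) of the sequential-updating property for the Beta-Binomial model: updating on D1 and then on D2 gives the same posterior as updating on all the data at once.

```python
# Sequential Bayesian updating with a Beta prior on theta (conjugate to the Bernoulli likelihood)
alpha, beta0 = 1.0, 1.0            # flat prior

def update(alpha, beta0, successes, failures):
    """Posterior hyperparameters after observing the given counts."""
    return alpha + successes, beta0 + failures

# Batch: all 1000 observations at once (700 purchases, 300 non-purchases)
batch = update(alpha, beta0, 700, 300)

# Sequential: first 400 observations, then the remaining 600 (counts are illustrative)
step1 = update(alpha, beta0, 280, 120)
step2 = update(*step1, 420, 180)

print(batch, step2)   # identical: the posterior after D1 serves as the prior for D2
```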