Estimation: Sample Averages, Bias, and Concentration Inequalities
CMPUT 296: Basics of Machine Learning
Textbook §3.1-3.3
Logistics
Reminders:
• Thought Question 1 (due Thursday, September 17)
• Assignment 1 (due Thursday, September 24)
New:
• Group Slack channel: #cmput296-fall20 (on the Amii workspace)
Recap
• Random variables are functions from the sample space to some value
• Upshot: A random variable takes different values with some probability
• The value of one variable can be informative about the value of another (because they are both functions of the same sample)
• Distributions of multiple random variables are described by the joint probability distribution (joint PMF or joint PDF)
• Conditioning on a random variable gives a new distribution over the others
• X is independent of Y: conditioning on Y does not give a new distribution over X
• X is conditionally independent of Y given Z: P(Y | X, Z) = P(Y | Z)
• The expected value of a random variable is an average over its values, weighted by the probability of each value
Outline 1. Recap & Logistics 2. Variance and Correlation 3. Estimators 4. Concentration Inequalities 5. Consistency
Variance
Definition: The variance of a random variable X is Var(X) = 𝔼[(X − 𝔼[X])²], i.e., 𝔼[f(X)] where f(x) = (x − 𝔼[X])².
Equivalently, Var(X) = 𝔼[X²] − (𝔼[X])² (why?)
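A quick numerical check of the two variance formulas, as a minimal sketch in Python; the particular three-point distribution below is an illustrative assumption, not one from the slides:

```python
import numpy as np

# A small discrete distribution: values and their probabilities (illustrative choice).
values = np.array([1.0, 2.0, 4.0])
probs = np.array([0.5, 0.3, 0.2])

mean = np.sum(probs * values)                      # E[X]
var_def = np.sum(probs * (values - mean) ** 2)     # E[(X - E[X])^2]
var_alt = np.sum(probs * values ** 2) - mean ** 2  # E[X^2] - (E[X])^2

print(var_def, var_alt)  # the two formulas agree
```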
Covariance
Definition: The covariance of two random variables is Cov(X, Y) = 𝔼[(X − 𝔼[X])(Y − 𝔼[Y])] = 𝔼[XY] − 𝔼[X]𝔼[Y].
Question: What is the range of Cov(X, Y)?
Correlation
Definition: The correlation of two random variables is Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).
Question: What is the range of Corr(X, Y)? (hint: Var(X) = Cov(X, X))
Independence and Decorrelation
• Independent RVs have zero correlation (why? hint: Cov(X, Y) = 𝔼[XY] − 𝔼[X]𝔼[Y])
• Uncorrelated RVs (i.e., Cov(X, Y) = 0) might still be dependent (i.e., p(x, y) ≠ p(x)p(y))
• Correlation (Pearson's correlation coefficient) captures linear relationships, but can miss nonlinear relationships
• Example: X ~ Uniform{−2, −1, 0, 1, 2}, Y = X²
  • 𝔼[XY] = 0.2(−2 × 4) + 0.2(2 × 4) + 0.2(−1 × 1) + 0.2(1 × 1) + 0.2(0 × 0) = 0
  • 𝔼[X] = 0
  • So Cov(X, Y) = 𝔼[XY] − 𝔼[X]𝔼[Y] = 0 − 0 · 𝔼[Y] = 0, even though Y is a deterministic function of X (a numerical check appears in the sketch below)
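A short sketch that reproduces the slide's example exactly, computing Cov(X, Y) and Corr(X, Y) for X ~ Uniform{−2, −1, 0, 1, 2} and Y = X²:

```python
import numpy as np

# X ~ Uniform{-2, -1, 0, 1, 2} with probability 0.2 each, and Y = X^2.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
probs = np.full(5, 0.2)
ys = xs ** 2

e_x, e_y = np.sum(probs * xs), np.sum(probs * ys)  # E[X] = 0, E[Y] = 2
e_xy = np.sum(probs * xs * ys)                     # E[XY] = E[X^3] = 0

cov = e_xy - e_x * e_y                             # 0: X and Y are uncorrelated
var_x = np.sum(probs * xs ** 2) - e_x ** 2
var_y = np.sum(probs * ys ** 2) - e_y ** 2
corr = cov / np.sqrt(var_x * var_y)                # also 0

print(cov, corr)  # yet Y = X^2 is fully determined by X, so they are dependent
```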
Properties of Variances
• Var[c] = 0 for constant c
• Var[cX] = c² Var[X] for constant c
• Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]
• For independent X, Y: Var[X + Y] = Var[X] + Var[Y] (why?)
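A simulation sketch of the sum rule; the correlated bivariate normal below is an arbitrary illustrative choice, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw correlated (X, Y) pairs from a bivariate normal (illustrative choice).
cov_matrix = [[1.0, 0.6], [0.6, 2.0]]
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_matrix, size=500_000)
x, y = samples[:, 0], samples[:, 1]

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y)[0, 1]
print(lhs, rhs)  # approximately equal: Var[X+Y] = Var[X] + Var[Y] + 2 Cov[X, Y]
```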
Estimators
Definition: An estimator is a procedure for estimating an unobserved quantity based on data.
Example: Estimating 𝔼[X] for a random variable X ∈ ℝ.
Questions: Suppose we can observe a different random variable Y. Is Y a good estimator of 𝔼[X] in the following cases? Why or why not?
1. Y ~ Uniform[0, 10]
2. Y = 𝔼[X] + Z, where Z ~ Uniform[0, 1]
3. Y = 𝔼[X] + Z, where Z ~ N(0, 100²)
4. Y = X
5. How would you estimate 𝔼[X]?
Bias
Definition: The bias of an estimator X̂ of a quantity X is its expected difference from the true value: Bias(X̂) = 𝔼[X̂ − X].
• Bias can be positive, negative, or zero
• When Bias(X̂) = 0, we say that the estimator X̂ is unbiased
Questions: What is the bias of the following estimators of 𝔼[X]?
1. Y ~ Uniform[0, 10]
2. Y = 𝔼[X] + Z, where Z ~ Uniform[0, 1]
3. Y = 𝔼[X] + Z, where Z ~ N(0, 100²)
4. Y = X
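A Monte Carlo sketch of the bias of each candidate estimator. To make the target concrete, X is given an illustrative distribution with 𝔼[X] = 3; that choice (and the exponential form) is an assumption, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 1_000_000

mu = 3.0                                        # E[X] for an illustrative X (assumption)
x = rng.exponential(scale=mu, size=n_trials)    # one possible X with E[X] = 3

estimators = {
    "Y ~ Uniform[0, 10]": rng.uniform(0, 10, n_trials),
    "Y = E[X] + Uniform[0, 1]": mu + rng.uniform(0, 1, n_trials),
    "Y = E[X] + N(0, 100^2)": mu + rng.normal(0, 100, n_trials),
    "Y = X": x,
}

for name, y in estimators.items():
    print(f"{name}: estimated bias = {np.mean(y) - mu:+.3f}")
# Expected: +2, +0.5, ~0 (up to Monte Carlo noise), and ~0 respectively.
```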
Independent and Identically Distributed (i.i.d.) Samples
• We usually won't try to estimate anything about a distribution based on only a single sample
• Usually, we use multiple samples from the same distribution
  • Multiple samples: This gives us more information
  • Same distribution: We want to learn about a single population
• One additional condition: the samples must be independent (why?)
Definition: When a set of random variables X_1, X_2, … are all independent, and each has the same distribution X ~ F, we say they are i.i.d. (independent and identically distributed), written X_1, X_2, … ~ F i.i.d.
Estimating Expected Value via the Sample Mean
Example: We have n i.i.d. samples X_1, X_2, …, X_n ~ F from the same distribution F, with 𝔼[X_i] = μ and Var(X_i) = σ² for each X_i.
We want to estimate μ.
Let's use the sample mean X̄ = (1/n) Σ_{i=1}^n X_i to estimate μ.
Question: Is this estimator unbiased?
𝔼[X̄] = 𝔼[(1/n) Σ_{i=1}^n X_i] = (1/n) Σ_{i=1}^n 𝔼[X_i] = (1/n) Σ_{i=1}^n μ = (1/n) · nμ = μ. ∎
Question: Are more samples better? Why?
Variance of the Estimator
• Intuitively, more samples should make the estimator "closer" to the estimated quantity
• We can formalize this intuition partly by characterizing the variance of the estimator itself
• The variance of the estimator should decrease as the number of samples increases
• Example: for X̄ estimating μ:
  Var[X̄] = Var[(1/n) Σ_{i=1}^n X_i] = (1/n²) Σ_{i=1}^n Var[X_i] (by independence) = (1/n²) Σ_{i=1}^n σ² = (1/n²) · nσ² = σ²/n.
• The variance of the estimator shrinks like 1/n as the number of samples grows (see the simulation sketch below).
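A simulation sketch of both facts about the sample mean, unbiasedness and Var[X̄] = σ²/n; the normal distribution and the values of μ and σ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0            # illustrative choices, not from the slides
n_trials = 20_000

for n in [1, 10, 100, 1000]:
    # n_trials independent sample means, each computed from n i.i.d. samples
    x_bar = rng.normal(mu, sigma, size=(n_trials, n)).mean(axis=1)
    print(f"n={n:5d}  E[X_bar]~{x_bar.mean():.3f}  "
          f"Var[X_bar]~{x_bar.var():.4f}  sigma^2/n={sigma**2 / n:.4f}")
```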
Concentration Inequalities
• We would like to be able to claim Pr(|X̄ − μ| < ε) > 1 − δ for some δ, ε > 0
• Var[X̄] = σ²/n means that with "enough" data, Pr(|X̄ − μ| < ε) > 1 − δ for any δ, ε > 0 that we pick (why?)
• Suppose we have n = 10 samples, and we know σ² = 81, so Var[X̄] = 8.1.
• Question: What is Pr(|X̄ − μ| < 2)?
Variance Is Not Enough
Knowing Var[X̄] = 8.1 is not enough to compute Pr(|X̄ − μ| < 2)!
Examples:
• p(x̄) = 0.9 if x̄ = μ, 0.05 if x̄ = μ ± 9  ⟹  Var[X̄] = 8.1 and Pr(|X̄ − μ| < 2) = 0.9
• p(x̄) = 0.999 if x̄ = μ, 0.0005 if x̄ = μ ± 90  ⟹  Var[X̄] = 8.1 and Pr(|X̄ − μ| < 2) = 0.999
• p(x̄) = 0.1 if x̄ = μ, 0.45 if x̄ = μ ± 3  ⟹  Var[X̄] = 8.1 and Pr(|X̄ − μ| < 2) = 0.1
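A quick check that the three distributions above all have Var[X̄] = 8.1 while giving very different values of Pr(|X̄ − μ| < 2):

```python
import numpy as np

# Each distribution from the slide: (deviation from mu, probability) pairs.
distributions = {
    "0.9 at mu, 0.05 at mu +/- 9": ([0.0, 9.0, -9.0], [0.9, 0.05, 0.05]),
    "0.999 at mu, 0.0005 at mu +/- 90": ([0.0, 90.0, -90.0], [0.999, 0.0005, 0.0005]),
    "0.1 at mu, 0.45 at mu +/- 3": ([0.0, 3.0, -3.0], [0.1, 0.45, 0.45]),
}

for name, (devs, probs) in distributions.items():
    devs, probs = np.array(devs), np.array(probs)
    var = np.sum(probs * devs ** 2)               # Var[X_bar] (mean deviation is 0)
    prob_close = np.sum(probs[np.abs(devs) < 2])  # Pr(|X_bar - mu| < 2)
    print(f"{name}: Var = {var}, Pr(|X_bar - mu| < 2) = {prob_close}")
```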
Hoeffding's Inequality
Theorem (Hoeffding's Inequality): Suppose that X_1, …, X_n are distributed i.i.d., with a ≤ X_i ≤ b. Then for any ε > 0,
Pr(|X̄ − 𝔼[X̄]| ≥ ε) ≤ 2 exp(−2nε² / (b − a)²).
Equivalently, Pr(|X̄ − 𝔼[X̄]| ≤ (b − a) √(ln(2/δ) / (2n))) ≥ 1 − δ.
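A minimal sketch that evaluates the Hoeffding width ε for given n and δ, and inverts the bound to get a required sample size; the function names and the numeric inputs are illustrative assumptions:

```python
import numpy as np

def hoeffding_epsilon(n, delta, a, b):
    """Width epsilon such that Pr(|X_bar - E[X_bar]| <= epsilon) >= 1 - delta."""
    return (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))

def hoeffding_n(epsilon, delta, a, b):
    """Smallest n guaranteeing the (epsilon, delta) claim via Hoeffding."""
    return int(np.ceil((b - a) ** 2 * np.log(2 / delta) / (2 * epsilon ** 2)))

# Illustrative: samples bounded in [0, 1], target epsilon = 0.05 with probability 0.95.
print(hoeffding_epsilon(n=100, delta=0.05, a=0.0, b=1.0))   # ~0.136
print(hoeffding_n(epsilon=0.05, delta=0.05, a=0.0, b=1.0))  # 738
```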
Chebyshev's Inequality
Theorem (Chebyshev's Inequality): Suppose that X_1, …, X_n are distributed i.i.d. with variance σ². Then for any ε > 0,
Pr(|X̄ − 𝔼[X̄]| ≥ ε) ≤ σ² / (nε²).
Equivalently, Pr(|X̄ − 𝔼[X̄]| ≤ √(σ² / (δn))) ≥ 1 − δ.
When to Use Chebyshev, When to Use Hoeffding?
• If a ≤ X_i ≤ b, then Var[X_i] ≤ (b − a)²/4
  • Hoeffding's inequality gives ε = (b − a) √(ln(2/δ) / (2n));
    Chebyshev's inequality gives ε = √(σ² / (δn)) ≤ √((b − a)² / (4δn)) = (b − a) · (1/2) √(1/(δn))
• Hoeffding's inequality gives a tighter bound*, but it can only be used on bounded random variables
  ✴ whenever √(ln(2/δ) / (2n)) < (1/2) √(1/(δn)), i.e., δ < ~0.232
• Chebyshev's inequality can be applied even for unbounded variables
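A sketch comparing the two widths as a function of δ, using the worst-case bounded variance σ² = (b − a)²/4; the choices of n and [a, b] are illustrative assumptions:

```python
import numpy as np

a, b, n = 0.0, 1.0, 100        # illustrative bounded range and sample size
sigma_sq = (b - a) ** 2 / 4    # worst-case variance for a bounded variable

for delta in [0.5, 0.232, 0.1, 0.05, 0.01]:
    eps_hoeffding = (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))
    eps_chebyshev = np.sqrt(sigma_sq / (delta * n))
    winner = "Hoeffding" if eps_hoeffding < eps_chebyshev else "Chebyshev"
    print(f"delta={delta:5.3f}  Hoeffding eps={eps_hoeffding:.3f}  "
          f"Chebyshev eps={eps_chebyshev:.3f}  tighter: {winner}")
# Hoeffding is tighter for delta below roughly 0.232.
```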
Consistency
Definition: A sequence of random variables X_1, X_2, … converges in probability to a random variable X (written X_n →ᵖ X) if for all ε > 0,
lim_{n→∞} Pr(|X_n − X| > ε) = 0.
Definition: An estimator X̂ for a quantity X is consistent if X̂ →ᵖ X.
Weak Law of Large Numbers
Theorem (Weak Law of Large Numbers): Let X_1, …, X_n be distributed i.i.d. with 𝔼[X_i] = μ and Var[X_i] = σ². Then the sample mean X̄ = (1/n) Σ_{i=1}^n X_i is a consistent estimator for μ.
Proof:
1. We have already shown that 𝔼[X̄] = μ
2. By Chebyshev, Pr(|X̄ − 𝔼[X̄]| ≥ ε) ≤ σ² / (nε²) for arbitrary ε > 0
3. Hence lim_{n→∞} Pr(|X̄ − μ| ≥ ε) = 0 for any ε > 0
4. Hence X̄ →ᵖ μ. ∎
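A simulation sketch of the convergence in probability promised by the theorem; the Uniform[0, 1] samples and the value of ε are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, epsilon, n_trials = 0.5, 0.05, 2_000   # illustrative choices

for n in [10, 100, 1000, 10_000]:
    # n_trials independent sample means of n i.i.d. Uniform[0, 1] samples (mu = 0.5)
    x_bar = rng.uniform(0, 1, size=(n_trials, n)).mean(axis=1)
    prob_far = np.mean(np.abs(x_bar - mu) > epsilon)
    print(f"n={n:6d}  Pr(|X_bar - mu| > {epsilon}) ~ {prob_far:.4f}")
# The probability of being more than epsilon away shrinks toward 0 as n grows.
```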
Summary
• The variance Var[X] of a random variable X is its expected squared distance from the mean
• An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data
• Concentration inequalities let us bound the probability of a given estimator being at least ε away from the estimated quantity
• An estimator is consistent if it converges in probability to the estimated quantity