18.650 Statistics for Applications, Chapter 3: Maximum Likelihood Estimation (PowerPoint presentation)


  1. 18.650 Statistics for Applications, Chapter 3: Maximum Likelihood Estimation

  2. Total variation distance (1)
     Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample of
     i.i.d. r.v. X_1, …, X_n. Assume that there exists θ* ∈ Θ such that
     X_1 ∼ P_{θ*}: θ* is the true parameter.
     Statistician's goal: given X_1, …, X_n, find an estimator
     θ̂ = θ̂(X_1, …, X_n) such that P_θ̂ is close to P_{θ*} for the true
     parameter θ*. This means: |P_θ̂(A) − P_{θ*}(A)| is small for all A ⊂ E.
     Definition: The total variation distance between two probability
     measures P_θ and P_{θ′} is defined by
     TV(P_θ, P_{θ′}) = max_{A⊂E} |P_θ(A) − P_{θ′}(A)|.

  3. Total variation distance (2)
     Assume that E is discrete (i.e., finite or countable). This includes
     Bernoulli, Binomial, Poisson, …
     Therefore X has a PMF (probability mass function):
     P_θ(X = x) = p_θ(x) for all x ∈ E, with p_θ(x) ≥ 0 and Σ_{x∈E} p_θ(x) = 1.
     The total variation distance between P_θ and P_{θ′} is a simple function
     of the PMFs p_θ and p_{θ′}:
     TV(P_θ, P_{θ′}) = (1/2) Σ_{x∈E} |p_θ(x) − p_{θ′}(x)|.
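
The discrete formula above is easy to compute directly. A minimal sketch, with two Bernoulli PMFs as illustrative example distributions (not from the slides):

```python
# Total variation distance between two discrete PMFs:
# TV(P, Q) = (1/2) * sum_x |p(x) - q(x)| over the common support.

def tv_discrete(p, q, support):
    """TV distance between PMFs p and q given as dicts on a common support."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in support)

# Example: Bernoulli(0.3) vs Bernoulli(0.5), PMFs on E = {0, 1}
p = {0: 0.7, 1: 0.3}
q = {0: 0.5, 1: 0.5}
print(round(tv_discrete(p, q, [0, 1]), 6))  # 0.2
```

For two Bernoulli laws the formula collapses to |p − p′|, which matches the printed value.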

  4. Total variation distance (3)
     Assume that E is continuous. This includes Gaussian, Exponential, …
     Assume that X has a density: P_θ(X ∈ A) = ∫_A f_θ(x) dx for all A ⊂ E,
     with f_θ(x) ≥ 0 and ∫_E f_θ(x) dx = 1.
     The total variation distance between P_θ and P_{θ′} is a simple function
     of the densities f_θ and f_{θ′}:
     TV(P_θ, P_{θ′}) = (1/2) ∫_E |f_θ(x) − f_{θ′}(x)| dx.
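
The continuous formula can be approximated numerically. A sketch using a plain Riemann sum for two Gaussian densities; the grid, the integration window, and the parameters N(0,1) vs N(1,1) are illustrative choices, not part of the slides:

```python
import math

# TV(P, Q) = (1/2) * integral of |f(x) - g(x)| dx, approximated by a
# left Riemann sum on a wide truncation window.

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def tv_continuous(f, g, lo=-20.0, hi=20.0, n=80_000):
    dx = (hi - lo) / n
    return 0.5 * sum(abs(f(lo + i * dx) - g(lo + i * dx)) for i in range(n)) * dx

tv = tv_continuous(lambda x: gauss_pdf(x, 0.0, 1.0),
                   lambda x: gauss_pdf(x, 1.0, 1.0))
print(round(tv, 3))  # 0.383
```

For two unit-variance Gaussians a mean shift of 1 gives TV = 2Φ(1/2) − 1 ≈ 0.383, so the numerical value can be checked against the closed form.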

  5. Total variation distance (4)
     Properties of total variation:
     ◮ TV(P_θ, P_{θ′}) = TV(P_{θ′}, P_θ) (symmetric)
     ◮ TV(P_θ, P_{θ′}) ≥ 0 (nonnegative)
     ◮ If TV(P_θ, P_{θ′}) = 0 then P_θ = P_{θ′} (definite)
     ◮ TV(P_θ, P_{θ′}) ≤ TV(P_θ, P_{θ″}) + TV(P_{θ″}, P_{θ′}) (triangle inequality)
     These properties imply that the total variation is a distance between
     probability distributions.

  6. Total variation distance (5)
     An estimation strategy: build an estimator TV-hat(P_θ, P_{θ*}) for all
     θ ∈ Θ. Then find θ̂ that minimizes the function θ ↦ TV-hat(P_θ, P_{θ*}).

  7. Total variation distance (5)
     An estimation strategy: build an estimator TV-hat(P_θ, P_{θ*}) for all
     θ ∈ Θ. Then find θ̂ that minimizes the function θ ↦ TV-hat(P_θ, P_{θ*}).
     Problem: it is unclear how to build TV-hat(P_θ, P_{θ*})!

  8. Kullback-Leibler (KL) divergence (1)
     There are many distances between probability measures that could replace
     total variation. Let us choose one that is more convenient.
     Definition: The Kullback-Leibler (KL) divergence between two probability
     measures P_θ and P_{θ′} is defined by
     KL(P_θ, P_{θ′}) = Σ_{x∈E} p_θ(x) log( p_θ(x) / p_{θ′}(x) )    if E is discrete,
     KL(P_θ, P_{θ′}) = ∫_E f_θ(x) log( f_θ(x) / f_{θ′}(x) ) dx    if E is continuous.
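
The discrete case of the definition can be computed directly. A minimal sketch, reusing two Bernoulli PMFs as illustrative inputs:

```python
import math

# Discrete KL divergence: KL(P, Q) = sum_x p(x) * log(p(x) / q(x)).
# Terms with p(x) = 0 contribute 0 and are skipped.

def kl_discrete(p, q, support):
    return sum(p[x] * math.log(p[x] / q[x]) for x in support if p[x] > 0)

p = {0: 0.7, 1: 0.3}  # Bernoulli(0.3)
q = {0: 0.5, 1: 0.5}  # Bernoulli(0.5)
print(kl_discrete(p, q, [0, 1]))  # positive
print(kl_discrete(q, p, [0, 1]))  # a different positive value: KL is asymmetric
```

Running both directions makes the asymmetry noted on the next slide concrete: the two numbers differ.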

  9. Kullback-Leibler (KL) divergence (2)
     Properties of the KL divergence:
     ◮ KL(P_θ, P_{θ′}) ≠ KL(P_{θ′}, P_θ) in general (not symmetric)
     ◮ KL(P_θ, P_{θ′}) ≥ 0
     ◮ If KL(P_θ, P_{θ′}) = 0 then P_θ = P_{θ′} (definite)
     ◮ KL(P_θ, P_{θ′}) ≤ KL(P_θ, P_{θ″}) + KL(P_{θ″}, P_{θ′}) fails in general
       (no triangle inequality)
     Not a distance; this is called a divergence. Asymmetry is the key to
     our ability to estimate it!

  10. Kullback-Leibler (KL) divergence (3)
     KL(P_{θ*}, P_θ) = E_{θ*}[ log( p_{θ*}(X) / p_θ(X) ) ]
                     = E_{θ*}[ log p_{θ*}(X) ] − E_{θ*}[ log p_θ(X) ]
     So the function θ ↦ KL(P_{θ*}, P_θ) is of the form:
     "constant" − E_{θ*}[ log p_θ(X) ]
     The expectation can be estimated: E_{θ*}[h(X)] ≈ (1/n) Σ_{i=1}^n h(X_i)
     (by the LLN). Hence
     KL-hat(P_{θ*}, P_θ) = "constant" − (1/n) Σ_{i=1}^n log p_θ(X_i).

  11. Kullback-Leibler (KL) divergence (4)
     KL-hat(P_{θ*}, P_θ) = "constant" − (1/n) Σ_{i=1}^n log p_θ(X_i)
     min_{θ∈Θ} KL-hat(P_{θ*}, P_θ) ⇔ min_{θ∈Θ} − (1/n) Σ_{i=1}^n log p_θ(X_i)
                                   ⇔ max_{θ∈Θ} (1/n) Σ_{i=1}^n log p_θ(X_i)
                                   ⇔ max_{θ∈Θ} Σ_{i=1}^n log p_θ(X_i)
                                   ⇔ max_{θ∈Θ} Π_{i=1}^n p_θ(X_i)
     This is the maximum likelihood principle.
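
The chain of equivalences above can be sketched numerically: minimizing the estimated KL is the same as maximizing the average log-likelihood. A minimal Bernoulli sketch; the true parameter θ* = 0.3, the sample size, and the grid of candidate values are illustrative choices:

```python
import math
import random

# Maximum likelihood principle: pick theta maximizing
# (1/n) * sum_i log p_theta(X_i) over a grid of candidates.

random.seed(0)
theta_star = 0.3
xs = [1 if random.random() < theta_star else 0 for _ in range(10_000)]

def avg_log_lik(theta):
    # Bernoulli log-PMF: log p_theta(x) = log(theta) if x = 1 else log(1 - theta)
    return sum(math.log(theta if x == 1 else 1 - theta) for x in xs) / len(xs)

grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=avg_log_lik)
print(theta_hat)  # close to theta* = 0.3
```

By the LLN argument on the previous slide, the maximizer concentrates near θ* as n grows.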

  12. Interlude: maximizing/minimizing functions (1)
     Note that min_{θ∈Θ} − h(θ) ⇔ max_{θ∈Θ} h(θ).
     In this class, we focus on maximization.
     Maximization of arbitrary functions can be difficult.
     Example: θ ↦ Π_{i=1}^n (θ − X_i)

  13. Interlude: maximizing/minimizing functions (2)
     Definition: A twice differentiable function h: Θ ⊂ ℝ → ℝ is said to be
     concave if its second derivative satisfies h″(θ) ≤ 0 for all θ ∈ Θ.
     It is said to be strictly concave if the inequality is strict: h″(θ) < 0.
     Moreover, h is said to be (strictly) convex if −h is (strictly) concave,
     i.e. h″(θ) ≥ 0 (h″(θ) > 0).
     Examples:
     ◮ Θ = ℝ, h(θ) = −θ²
     ◮ Θ = (0, ∞), h(θ) = √θ
     ◮ Θ = (0, ∞), h(θ) = log θ
     ◮ Θ = [0, π], h(θ) = sin(θ)
     ◮ Θ = ℝ, h(θ) = 2θ − 3

  14. Interlude: maximizing/minimizing functions (3)
     More generally, for a multivariate function h: Θ ⊂ ℝ^d → ℝ, d ≥ 2, define the
     ◮ gradient vector: ∇h(θ) = ( ∂h/∂θ_1(θ), …, ∂h/∂θ_d(θ) )ᵀ ∈ ℝ^d
     ◮ Hessian matrix: ∇²h(θ) ∈ ℝ^{d×d}, with (i, j) entry ∂²h/∂θ_i∂θ_j(θ)
     h is concave ⇔ xᵀ ∇²h(θ) x ≤ 0 for all x ∈ ℝ^d, θ ∈ Θ.
     h is strictly concave ⇔ xᵀ ∇²h(θ) x < 0 for all x ∈ ℝ^d, x ≠ 0, θ ∈ Θ.
     Examples:
     ◮ Θ = ℝ², h(θ) = −θ_1² − 2θ_2²  or  h(θ) = −(θ_1 − θ_2)²
     ◮ Θ = (0, ∞)², h(θ) = log(θ_1 + θ_2)
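
The quadratic-form test above can be checked numerically for the first example. A sketch: for h(θ) = −θ_1² − 2θ_2² the Hessian is the constant matrix diag(−2, −4), and random probes of xᵀ∇²h x never come out positive (the probe count and range are illustrative):

```python
import random

# Concavity check via the Hessian quadratic form x^T H x <= 0.
# Hessian of h(t1, t2) = -t1^2 - 2*t2^2 is constant: diag(-2, -4).
hessian = [[-2.0, 0.0], [0.0, -4.0]]

def quad_form(H, x):
    return sum(x[i] * H[i][j] * x[j] for i in range(2) for j in range(2))

random.seed(0)
checks = [quad_form(hessian, [random.uniform(-5, 5), random.uniform(-5, 5)])
          for _ in range(1000)]
print(max(checks) <= 0.0)  # True: the form is never positive, so h is concave
```

Since the form is strictly negative for every x ≠ 0, this h is in fact strictly concave.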

  15. Interlude: maximizing/minimizing functions (4)
     Strictly concave functions are easy to maximize: if they have a maximum,
     then it is unique. It is the unique solution to h′(θ) = 0, or, in the
     multivariate case, ∇h(θ) = 0 ∈ ℝ^d.
     There are many algorithms to find it numerically: this is the theory of
     "convex optimization". In this class, we will often have a closed-form
     formula for the maximum.
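
One standard numerical approach is Newton's method applied to h′. A minimal sketch on an illustrative strictly concave function h(θ) = log θ − θ on (0, ∞), whose unique maximizer is θ = 1 (this example is not from the slides):

```python
# Maximize a strictly concave h by solving h'(theta) = 0 with Newton's method:
# theta <- theta - h'(theta) / h''(theta).

def newton_max(h_prime, h_second, theta0, steps=50):
    theta = theta0
    for _ in range(steps):
        theta = theta - h_prime(theta) / h_second(theta)
    return theta

# h(t) = log t - t  =>  h'(t) = 1/t - 1,  h''(t) = -1/t^2 < 0 (strictly concave)
theta_max = newton_max(lambda t: 1 / t - 1, lambda t: -1 / t ** 2, theta0=0.5)
print(theta_max)  # 1.0
```

Because h is strictly concave, the stationary point found is the unique global maximizer.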

  16. Likelihood, Discrete case (1)
     Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample of
     i.i.d. r.v. X_1, …, X_n. Assume that E is discrete (i.e., finite or
     countable).
     Definition: The likelihood of the model is the map L_n (or just L)
     defined as:
     L_n: E^n × Θ → ℝ
     (x_1, …, x_n, θ) ↦ P_θ[X_1 = x_1, …, X_n = x_n].

  17. Likelihood, Discrete case (2)
     Example 1 (Bernoulli trials): If X_1, …, X_n iid ∼ Ber(p) for some
     p ∈ (0, 1):
     ◮ E = {0, 1};
     ◮ Θ = (0, 1);
     ◮ for all (x_1, …, x_n) ∈ {0, 1}^n and all p ∈ (0, 1),
     L(x_1, …, x_n, p) = Π_{i=1}^n P_p[X_i = x_i]
                       = Π_{i=1}^n p^{x_i} (1 − p)^{1−x_i}
                       = p^{Σ_i x_i} (1 − p)^{n − Σ_i x_i}.
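
The two forms of the Bernoulli likelihood above (the product of PMF values and the closed form in Σ x_i) must agree. A sketch with an illustrative sample and parameter value:

```python
# Bernoulli likelihood computed two ways: term-by-term product vs.
# the closed form p^(sum x_i) * (1 - p)^(n - sum x_i).

def bernoulli_likelihood_product(xs, p):
    out = 1.0
    for x in xs:
        out *= p ** x * (1 - p) ** (1 - x)
    return out

def bernoulli_likelihood_closed(xs, p):
    s = sum(xs)
    return p ** s * (1 - p) ** (len(xs) - s)

xs = [1, 0, 1, 1, 0]
print(bernoulli_likelihood_product(xs, 0.6))
print(bernoulli_likelihood_closed(xs, 0.6))  # same value
```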

  18. Likelihood, Discrete case (3)
     Example 2 (Poisson model): If X_1, …, X_n iid ∼ Poiss(λ) for some λ > 0:
     ◮ E = ℕ;
     ◮ Θ = (0, ∞);
     ◮ for all (x_1, …, x_n) ∈ ℕ^n and all λ > 0,
     L(x_1, …, x_n, λ) = Π_{i=1}^n P_λ[X_i = x_i]
                       = Π_{i=1}^n λ^{x_i} e^{−λ} / x_i!
                       = λ^{Σ_i x_i} e^{−nλ} / (x_1! ⋯ x_n!).
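
As with the Bernoulli case, the Poisson product and its closed form can be checked against each other. A sketch with an illustrative sample and rate:

```python
import math

# Poisson likelihood: product of lambda^x_i e^-lambda / x_i! versus the
# closed form lambda^(sum x_i) e^(-n lambda) / (x_1! ... x_n!).

def poisson_likelihood_product(xs, lam):
    out = 1.0
    for x in xs:
        out *= lam ** x * math.exp(-lam) / math.factorial(x)
    return out

def poisson_likelihood_closed(xs, lam):
    num = lam ** sum(xs) * math.exp(-lam * len(xs))
    den = math.prod(math.factorial(x) for x in xs)
    return num / den

xs = [2, 0, 3, 1]
print(poisson_likelihood_product(xs, 1.5))
print(poisson_likelihood_closed(xs, 1.5))  # same value
```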

  19. Likelihood, Continuous case (1)
     Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample of
     i.i.d. r.v. X_1, …, X_n. Assume that all the P_θ have density f_θ.
     Definition: The likelihood of the model is the map L defined as:
     L: E^n × Θ → ℝ
     (x_1, …, x_n, θ) ↦ Π_{i=1}^n f_θ(x_i).

  20. Likelihood, Continuous case (2)
     Example 1 (Gaussian model): If X_1, …, X_n iid ∼ N(µ, σ²) for some
     µ ∈ ℝ, σ² > 0:
     ◮ E = ℝ;
     ◮ Θ = ℝ × (0, ∞);
     ◮ for all (x_1, …, x_n) ∈ ℝ^n and all (µ, σ²) ∈ ℝ × (0, ∞),
     L(x_1, …, x_n, µ, σ²) = Π_{i=1}^n 1/(σ√(2π)) exp( −(x_i − µ)²/(2σ²) )
                           = 1/(σ√(2π))^n exp( −(1/(2σ²)) Σ_{i=1}^n (x_i − µ)² ).

  21. Maximum likelihood estimator (1)
     Let X_1, …, X_n be an i.i.d. sample associated with a statistical model
     (E, (P_θ)_{θ∈Θ}) and let L be the corresponding likelihood.
     Definition: The maximum likelihood estimator of θ is defined as:
     θ̂_n^{MLE} = argmax_{θ∈Θ} L(X_1, …, X_n, θ),
     provided it exists.
     Remark (log-likelihood): In practice, we use the fact that
     θ̂_n^{MLE} = argmax_{θ∈Θ} log L(X_1, …, X_n, θ).

  22. Maximum likelihood estimator (2)
     Examples:
     ◮ Bernoulli trials: p̂_n^{MLE} = X̄_n.
     ◮ Poisson model: λ̂_n^{MLE} = X̄_n.
     ◮ Gaussian model: (µ̂_n, σ̂_n²) = (X̄_n, Ŝ_n), where
       Ŝ_n = (1/n) Σ_{i=1}^n (X_i − X̄_n)² is the sample variance.
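
The closed-form MLEs above can be checked on simulated data. A sketch for the Gaussian model; the true parameters µ = 2, σ² = 9 and the sample size are illustrative choices:

```python
import random

# Gaussian MLE check: the MLE is (sample mean, sample variance with 1/n),
# which should land near the true (mu, sigma^2) for large n.

random.seed(1)
n = 50_000
sample = [random.gauss(2.0, 3.0) for _ in range(n)]  # N(mu=2, sigma^2=9)

mu_hat = sum(sample) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
print(mu_hat, sigma2_hat)  # close to (2, 9)
```

The Bernoulli and Poisson MLEs, both equal to X̄_n, can be checked the same way with `random.random()` or a Poisson sampler.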
