Maximum likelihood and EM algorithm (after Chapter 8)
Pasha Zusmanovich, deCODE Statistics Colloquium, March 30, 2007
What is likelihood and what is it good for?

Likelihood is just a conditional probability.

Formal definition. Given random events $A$ and $B$, the likelihood function of $A$ relative to $B$ is
$$\{\text{set of states of } B\} \to [0,1], \qquad x \mapsto \Pr(A \mid B = x).$$
Nothing fancy so far. Consider an example.
Example: alleles and genotypes

Frequencies of alleles $a: \theta$, $A: 1-\theta$ give frequencies of genotypes
$$aa: \theta^2, \qquad aA: 2\theta(1-\theta), \qquad AA: (1-\theta)^2,$$
with observed numbers $n_{aa}$, $n_{aA}$, $n_{AA}$ respectively.

The probability that the numbers of genotypes would be exactly $(n_{aa}, n_{aA}, n_{AA})$:
$$f(\theta) = \frac{(n_{aa}+n_{aA}+n_{AA})!}{n_{aa}!\, n_{aA}!\, n_{AA}!}\, \theta^{2n_{aa}} \big(2\theta(1-\theta)\big)^{n_{aA}} (1-\theta)^{2n_{AA}}.$$

$f$ is a likelihood function: {probability of alleles} $\to$ {conditional probability of genotypes, assuming the given probability of alleles}. This is a model with parameter $\theta$.

Question: Which parameter makes the model the "best"?
Answer: The one which makes the observed data most likely, i.e. which maximizes
$$f(\theta) = \frac{(n_{aa}+n_{aA}+n_{AA})!}{n_{aa}!\, n_{aA}!\, n_{AA}!}\, \theta^{2n_{aa}} \big(2\theta(1-\theta)\big)^{n_{aA}} (1-\theta)^{2n_{AA}}$$
on $[0,1]$.

Solution:
$$\hat\theta = \frac{2n_{aa} + n_{aA}}{2(n_{aa} + n_{aA} + n_{AA})}.$$

But this is exactly the Hardy-Weinberg equilibrium!
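As a sanity check, the closed form can be compared with a direct numerical maximization; a minimal Python sketch, with made-up counts and helper names of my own (the multinomial coefficient is dropped, since it does not depend on $\theta$):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical genotype counts (not from the talk's data)
n_aa, n_aA, n_AA = 30, 50, 20

# Closed-form maximum likelihood estimate: allele `a` appears twice in
# each aa individual and once in each aA individual.
theta_hat = (2 * n_aa + n_aA) / (2 * (n_aa + n_aA + n_AA))

# Numerical check: maximize the log-likelihood directly.
def neg_log_lik(theta):
    return -(2 * n_aa * np.log(theta)
             + n_aA * np.log(2 * theta * (1 - theta))
             + 2 * n_AA * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(theta_hat, res.x)  # the two estimates should agree
```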
Another example: linear regression

Fitting a line to a set of points in the plane $\{(x_1, y_1), \dots, (x_n, y_n)\}$, assuming observations are independent and errors are normally distributed. The model is
$$Y = \beta_1 X + \beta_0 + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2).$$

What is the "probability" to have the observed data under the given model?
$$\Pr(Y \text{ lies in a } \delta\text{-neighbourhood of } y_i \mid X = x_i) \approx \operatorname{density}(Y)\big|_{X = x_i,\, Y = y_i} \cdot 2\delta,$$
so "probability" is replaced by density. If $X$ is fixed, $Y - \beta_1 X - \beta_0 \sim N(0, \sigma^2)$, hence $Y \sim N(\beta_1 X + \beta_0, \sigma^2)$.
Maximizing
$$\prod_{i=1}^n \operatorname{density}(Y)\big|_{X = x_i,\, Y = y_i} = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(\beta_1 x_i + \beta_0 - y_i)^2}{2\sigma^2}\Big) = \Big(\frac{1}{\sqrt{2\pi}\,\sigma}\Big)^n \exp\Big(-\frac{1}{2\sigma^2} \sum_{i=1}^n (\beta_1 x_i + \beta_0 - y_i)^2\Big)$$
is equivalent to minimizing
$$\sum_{i=1}^n (\beta_1 x_i + \beta_0 - y_i)^2.$$

But this is exactly least squares!
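One can watch the two procedures agree numerically; a minimal sketch with simulated data (the data, seed, and variable names are mine, not from the talk):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=50)  # true line: slope 2, intercept 1

# Negative Gaussian log-likelihood of the residuals (sigma fixed at 1;
# its value does not affect where the maximum over beta lies).
def neg_log_lik(beta):
    b1, b0 = beta
    return 0.5 * np.sum((b1 * x + b0 - y) ** 2)

ml = minimize(neg_log_lik, x0=[0.0, 0.0]).x
ls = np.polyfit(x, y, deg=1)  # ordinary least squares
print(ml, ls)  # both should recover (slope, intercept) close to (2, 1)
```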
Refined formal definition. Assuming a random variable $X$ has a density function $f(x, \theta)$ parametrized by $\theta$, the likelihood function is
$$\theta \mapsto f(x, \theta).$$

"Conceptual" definition. Likelihood is the probability of the observed data under the given model. Thus, the maximum likelihood corresponds to the model (in the given parametrized class of models) which makes the observed data "most likely".

One usually maximizes $\log f(x, \theta)$ instead of $f(x, \theta)$ (the log-likelihood function). This is fine, since $\log$ is monotonic. But ...
Why logarithm?

◮ Turns multiplicative things into additive. In practice, the likelihood function is most often a product of several functions. E.g., if $X_1, \dots, X_n$ are independent random variables, then their likelihood function is
$$f(x_1, \dots, x_n, \theta) = f(x_1, \theta) \cdots f(x_n, \theta),$$
so the logarithm turns the product into a sum, which is easier to deal with. (And the logarithm is the only "good" function taking multiplication to addition.) See also the numerical sketch after this list.

◮ Diminishes the "long tail". A random variable with values in $\mathbb{R}_+$ (say, results of a measurement) tends to have a distribution skewed to the right, because there is a lower limit but no upper limit. Passing to the logarithm diminishes this skewness.
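One practical consequence: multiplying many densities underflows in floating point, while summing their logarithms does not. A minimal sketch:

```python
import numpy as np

# 1000 standard-normal density values, each well below 1
z = np.random.default_rng(1).normal(size=1000)
densities = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

print(np.prod(densities))         # underflows to 0.0 in double precision
print(np.sum(np.log(densities)))  # the log-likelihood is an ordinary number
```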
Maximum likelihood behaves nicely asymptotically

Taylor series (the first-order term vanishes, since $\ell'(\hat\theta) = 0$ at the maximum):
$$\ell(\theta) = \ell(\hat\theta) + \tfrac{1}{2}(\theta - \hat\theta)^2 \ell''(\hat\theta) + \dots$$

$i(\theta) = E(-\ell''(\theta))$ is the Fisher information. As the number of samples $\to \infty$,
$$\hat\theta \sim N\big(\theta_0,\, i(\theta_0)^{-1}\big).$$
This can be used to assess the precision of $\hat\theta$.
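Applied to the allele example, this gives an approximate standard error for $\hat\theta$; a sketch using the observed information in place of the expectation (the second derivative is computed by hand; counts are the same made-up ones as before):

```python
import numpy as np

n_aa, n_aA, n_AA = 30, 50, 20
theta_hat = (2 * n_aa + n_aA) / (2 * (n_aa + n_aA + n_AA))

# Observed information: minus the second derivative of the log-likelihood
#   2*n_aa*log(t) + n_aA*log(2*t*(1-t)) + 2*n_AA*log(1-t)
# evaluated at theta_hat.
def observed_info(t):
    return (2 * n_aa + n_aA) / t**2 + (n_aA + 2 * n_AA) / (1 - t) ** 2

se = 1.0 / np.sqrt(observed_info(theta_hat))
print(theta_hat, se)  # estimate with an asymptotic standard error
```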
Connection with some fancy areas of Mathematics

Back to the alleles and genotypes example: a model with inbreeding coefficient $\lambda$. Frequencies of alleles $a: \theta$, $A: 1-\theta$; frequencies of genotypes, with observed numbers in parentheses:
$$aa: \theta^2 + \theta(1-\theta)\lambda \;\; (38), \qquad aA: 2\theta(1-\theta)(1-\lambda) \;\; (95), \qquad AA: (1-\theta)^2 + \theta(1-\theta)\lambda \;\; (53)$$
(some real blood group data from the UK, 1947).

The scoring equations are equivalent to:
$$372\theta^3\lambda^2 - 744\theta^3\lambda - 558\theta^2\lambda^2 + 372\theta^3 + 1131\theta^2\lambda + 186\theta\lambda^2 - 573\theta^2 - 668\theta\lambda + 201\theta + 148\lambda = 0,$$
$$186\theta^2\lambda^2 - 372\theta^2\lambda - 186\theta\lambda^2 + 186\theta^2 + 387\theta\lambda - 201\theta - 148\lambda + 53 = 0.$$

Statistics + Algebraic Geometry = Algebraic Statistics.
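Instead of attacking the polynomial system symbolically, the two-parameter likelihood can be maximized numerically; a minimal sketch (the optimizer and starting point are my choices):

```python
import numpy as np
from scipy.optimize import minimize

n = np.array([38, 95, 53])  # observed counts of aa, aA, AA (UK, 1947)

def neg_log_lik(params):
    t, lam = params
    p = np.array([t**2 + t * (1 - t) * lam,
                  2 * t * (1 - t) * (1 - lam),
                  (1 - t) ** 2 + t * (1 - t) * lam])
    if np.any(p <= 0):
        return np.inf  # step outside the valid parameter region
    return -np.sum(n * np.log(p))  # multinomial log-likelihood, up to a constant

res = minimize(neg_log_lik, x0=[0.5, 0.1], method="Nelder-Mead")
print(res.x)  # maximum likelihood estimates of (theta, lambda)
```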
Advantages (to summarize)

◮ Agrees with intuition.
◮ Confirmed by other methods.
◮ "Nice" asymptotic behavior.
◮ Very good practical results.
◮ Universal.
◮ Connections with other areas of Mathematics.

Disadvantages

◮ No "theoretical" justification.
◮ Could be bad for small samples.
◮ No way to compare "disjoint" models.
◮ The "Bayesian" issue ...