Stat 451 Lecture Notes 04: EM Algorithm
Ryan Martin, UIC
www.math.uic.edu/~rgmartin
Based on Ch. 4 in Givens & Hoeting and Ch. 13 in Lange. Updated: March 9, 2016.
Outline
1 Problem and motivation
2 Definition of the EM algorithm
3 Properties of EM
4 Examples
5 Estimating standard errors
6 Different versions of EM
7 Summary
Notion of “missing data”
Let $X$ denote the observable data and $\theta$ the parameter to be estimated. The EM algorithm is particularly suited for problems in which there is a notion of “missing data”. The missing data can be actual data that is missing, or some “imaginary” data that exists only in our minds (and is necessarily missing). The point is that IF the missing data were available, then finding the MLE for $\theta$ would be relatively straightforward.
Notation
Again, $X$ is the observable data. Let $Y$ denote the complete data. (This is the notation used in G&H which, as they admit, is not standard in the EM literature.) Usually we think of $Y$ as being composed of observable data $X$ and missing data $Z$, that is, $Y = (X, Z)$. Perhaps more generally, we think of the observable data $X$ as a sort of projection of the complete data, i.e., “$X = M(Y)$”. This suggests a notion of marginalization... The basic idea behind the EM algorithm is to iteratively impute the missing data.
Example – mixture model
Here is an example where the “missing data” is not real. Suppose $X = (X_1, \dots, X_n)$ consists of iid samples from the mixture
\[ \alpha\, \mathrm{N}(\mu_1, 1) + (1 - \alpha)\, \mathrm{N}(\mu_2, 1), \]
where $\theta = (\alpha, \mu_1, \mu_2)$ is to be estimated. IF we knew which of the two groups each $X_i$ was from, then it would be straightforward to get the MLE for $\theta$, i.e., just calculate the group means (and, for $\alpha$, the proportion of observations in the first group). The missing part $Z = (Z_1, \dots, Z_n)$ is the group label, i.e.,
\[ Z_i = \begin{cases} 1 & \text{if } X_i \sim \mathrm{N}(\mu_1, 1), \\ 0 & \text{if } X_i \sim \mathrm{N}(\mu_2, 1), \end{cases} \qquad i = 1, \dots, n. \]
More notation
Complete data $Y = (X, Z)$ splits into the observed data $X$ and the missing data $Z$.
The complete data likelihood $\theta \mapsto L_Y(\theta)$ is the joint distribution of $(X, Z)$.
The observed likelihood $\theta \mapsto L_X(\theta)$ is obtained by marginalizing the joint distribution of $(X, Z)$.
The conditional distribution of $Z$, given $X$, is an essential piece: $\theta \mapsto L_{Z|X}(\theta)$.
Though the same notation “$L$” is used for all the likelihoods, it should be clear that these are all distinct functions of $\theta$.
Example – mixture model (cont.)
Complete data $Y = (Y_1, \dots, Y_n)$, where each $Y_i$ consists of the observed data $X_i$ together with the missing group label $Z_i$.
Observed data likelihood is
\[ L_X(\theta) = \prod_{i=1}^n \bigl\{ \alpha\, \mathrm{N}(X_i \mid \mu_1, 1) + (1 - \alpha)\, \mathrm{N}(X_i \mid \mu_2, 1) \bigr\}; \]
not a nice function, since the sum is inside the product. Complete data likelihood is much nicer: write it out!
The conditional distribution of $Z$, given $X$, is determined by the conditional probabilities
\[ P_\theta(Z_i = 1 \mid X_i) = \frac{\alpha\, \mathrm{N}(X_i \mid \mu_1, 1)}{\alpha\, \mathrm{N}(X_i \mid \mu_1, 1) + (1 - \alpha)\, \mathrm{N}(X_i \mid \mu_2, 1)}. \]
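One way to write out the complete data likelihood, as a sketch: use the labels $Z_i \in \{0, 1\}$ as exponents, so that each observation contributes only the factor for its own group. The arrangement below is one convenient form, not necessarily the one intended in lecture.
\[ L_Y(\theta) = \prod_{i=1}^n \bigl\{ \alpha\, \mathrm{N}(X_i \mid \mu_1, 1) \bigr\}^{Z_i} \bigl\{ (1 - \alpha)\, \mathrm{N}(X_i \mid \mu_2, 1) \bigr\}^{1 - Z_i}, \]
\[ \log L_Y(\theta) = \sum_{i=1}^n \Bigl[ Z_i \bigl\{ \log \alpha + \log \mathrm{N}(X_i \mid \mu_1, 1) \bigr\} + (1 - Z_i) \bigl\{ \log(1 - \alpha) + \log \mathrm{N}(X_i \mid \mu_2, 1) \bigr\} \Bigr]. \]
The log is now a sum that is linear in the $Z_i$, which is what makes the conditional expectation given $X$ easy to take.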
EM formulation
EM works with a new function:
\[ Q(\theta' \mid \theta) = E_\theta\{ \log L_Y(\theta') \mid X \}, \]
the conditional expectation of the complete data log likelihood, at $\theta'$, given $X$ and the particular value $\theta$.
Implicit in this expression is that, given $X$, the only “random” part of $Y$ is the missing data $Z$. So, in this expression, the expectation is actually with respect to $Z$, given $X$, i.e.,
\[ Q(\theta' \mid \theta) = \int \log\{ L_{(X, z)}(\theta') \}\, L_{z|X}(\theta)\, dz. \]
EM formulation (cont.)
The EM algorithm iterates computing $Q(\theta' \mid \theta)$, which involves an expectation, and then maximizing it.
Start with a fixed $\theta^{(0)}$. At iteration $t \ge 1$ do:
E-step. Evaluate $Q_t(\theta) := Q(\theta \mid \theta^{(t-1)})$;
M-step. Update $\theta^{(t)} = \arg\max_\theta Q_t(\theta)$.
Repeat these steps until practical convergence is reached.
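To make the E-step/M-step loop concrete, here is a minimal Python sketch for the two-component mixture example from earlier in these notes. The M-step formulas (weighted proportion and weighted means) are the standard updates for this model but are not derived in the notes, so treat them as an assumption; the function names, the simulated data, and the starting values are all hypothetical.

```python
import math
import random


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def observed_loglik(x, alpha, mu1, mu2):
    """Observed-data log-likelihood log L_X(theta) for the mixture."""
    return sum(
        math.log(alpha * normal_pdf(xi, mu1) + (1.0 - alpha) * normal_pdf(xi, mu2))
        for xi in x
    )


def em_mixture(x, alpha, mu1, mu2, n_iter=200):
    """Run n_iter EM iterations; return estimates and the log-likelihood path."""
    path = [observed_loglik(x, alpha, mu1, mu2)]
    for _ in range(n_iter):
        # E-step: w[i] = P_theta(Z_i = 1 | X_i) under the current theta.
        w = []
        for xi in x:
            a = alpha * normal_pdf(xi, mu1)
            b = (1.0 - alpha) * normal_pdf(xi, mu2)
            w.append(a / (a + b))
        # M-step (assumed standard updates): weighted complete-data MLEs.
        sw = sum(w)
        alpha = sw / len(x)
        mu1 = sum(wi * xi for wi, xi in zip(w, x)) / sw
        mu2 = sum((1.0 - wi) * xi for wi, xi in zip(w, x)) / (len(x) - sw)
        path.append(observed_loglik(x, alpha, mu1, mu2))
    return alpha, mu1, mu2, path


# Simulated data (hypothetical truth: alpha = 0.3, mu1 = -2, mu2 = 2).
rng = random.Random(451)
x = [rng.gauss(-2.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
     for _ in range(500)]
alpha_hat, mu1_hat, mu2_hat, path = em_mixture(x, alpha=0.5, mu1=-1.0, mu2=1.0)
```

The `path` list records $\log L_X$ after every iteration, which gives a simple way to monitor practical convergence.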
A super-simple example
Goal is to maximize the observed data likelihood. But EM iteratively maximizes some other function, so it’s not clear that we are doing something reasonable. Before we get to theory, it helps to consider a simple example to see that EM is doing the right thing.
$Y = (X, Z)$, where $X, Z \overset{\text{iid}}{\sim} \mathrm{N}(\theta, 1)$, but $Z$ is missing.
Observed data MLE is $\hat\theta = X$.
The $Q$ function in the E-step is, up to an additive constant not depending on $\theta$,
\[ Q(\theta \mid \theta^{(t)}) = -\tfrac{1}{2} \bigl\{ (\theta - X)^2 + (\theta - \theta^{(t)})^2 \bigr\}. \]
Find the M-step update. What should happen as $t \to \infty$?
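As a quick numerical check of this example: setting the derivative of $Q(\theta \mid \theta^{(t)})$ to zero gives the update $\theta^{(t+1)} = (X + \theta^{(t)})/2$. The sketch below simply iterates that update; the observed value and the starting point are hypothetical.

```python
def em_simple(x, theta0, n_iter=30):
    """Iterate the M-step update theta <- (x + theta) / 2."""
    thetas = [theta0]
    for _ in range(n_iter):
        thetas.append(0.5 * (x + thetas[-1]))
    return thetas


# Hypothetical observed value X = 3 and a deliberately poor starting value.
thetas = em_simple(x=3.0, theta0=-10.0)
```

Each step halves the distance to $X$, so the iterates approach the observed data MLE $\hat\theta = X$.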
Ascent property
The claimed ascent property of EM is as follows:
\[ L_X(\theta^{(t+1)}) \ge L_X(\theta^{(t)}), \quad \forall\, t. \]
To prove this, we first need a simple identity involving joint, conditional, and marginal densities:
\[ \log f_V(v) = \log f_{U,V}(u, v) - \log f_{U|V}(u \mid v). \]
The next general fact is the non-negativity of relative entropy, or Kullback–Leibler divergence:
\[ \int \log \frac{p(x)}{q(x)}\, p(x)\, dx \ge 0, \quad \text{with equality iff } p = q. \]
This follows from Jensen’s inequality, since $y \mapsto -\log y$ is convex.
Ascent property (cont.)
Using the density identity, we can write
\[ \log L_X(\theta) = \log L_Y(\theta) - \log L_{Z|X}(\theta). \]
Taking expectation with respect to $Z$, given $X$ and $\theta^{(t)}$, gives
\[ \log L_X(\theta) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)}), \]
where
\[ H(\theta \mid \theta^{(t)}) = E_{\theta^{(t)}}\{ \log L_{Z|X}(\theta) \mid X \}. \]
It follows from non-negativity of KL that
\[ H(\theta^{(t)} \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)}) \ge 0, \quad \forall\, \theta. \]
Ascent property (cont.)
Key observation: picking $\theta^{(t+1)}$ such that
\[ Q(\theta^{(t+1)} \mid \theta^{(t)}) \ge Q(\theta^{(t)} \mid \theta^{(t)}) \]
guarantees that neither term in the expression $\log L_X(\theta) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)})$ decreases: the $Q$ term by the choice of $\theta^{(t+1)}$, and the $-H$ term by the KL inequality.
So maximizing $Q(\cdot \mid \theta^{(t)})$ in the M-step will result in updates with the desired ascent property:
\[ L_X(\theta^{(t+1)}) \ge L_X(\theta^{(t)}), \quad \forall\, t. \]
This does not imply that the EM updates will necessarily converge to the MLE, just that they are surely moving in the right direction.
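For reference, the argument can be written as a single chain. This is only a restatement of the decomposition $\log L_X(\theta) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)})$ evaluated at $\theta^{(t+1)}$ and at $\theta^{(t)}$:
\[ \log L_X(\theta^{(t+1)}) - \log L_X(\theta^{(t)}) = \underbrace{Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)})}_{\ge 0 \text{ by the M-step}} + \underbrace{H(\theta^{(t)} \mid \theta^{(t)}) - H(\theta^{(t+1)} \mid \theta^{(t)})}_{\ge 0 \text{ by KL}} \ge 0. \]
Note that the first bracket only needs $\theta^{(t+1)}$ to not decrease $Q$; an exact maximizer is sufficient but not necessary.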
Further properties
One can express the EM updates through an abstract mapping $\Psi$, i.e., $\theta^{(t+1)} = \Psi(\theta^{(t)})$.
If EM converges to $\hat\theta$, then $\hat\theta$ must be a fixed point of $\Psi$.
Do a Taylor approximation of $\Psi(\theta^{(t)})$ near $\hat\theta$:
\[ \theta^{(t+1)} - \hat\theta = \Psi(\theta^{(t)}) - \Psi(\hat\theta) \approx \Psi'(\theta^{(t)}) (\theta^{(t)} - \hat\theta). \]
If the parameter is one-dimensional, then the convergence can be seen to be linear with rate $\Psi'(\hat\theta)$, provided that $\hat\theta$ is a (local) maximum.
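The rate can be observed numerically by tracking the ratio of successive errors. The sketch below uses the super-simple example from earlier in these notes, where maximizing $Q$ gives $\Psi(\theta) = (X + \theta)/2$, so $\Psi'(\hat\theta) = 1/2$; the helper name `error_ratios` and the numerical values are hypothetical.

```python
def error_ratios(psi, theta0, theta_hat, n_iter=10):
    """Return the ratios (theta^(t+1) - theta_hat) / (theta^(t) - theta_hat)."""
    ratios = []
    theta = theta0
    for _ in range(n_iter):
        theta_next = psi(theta)
        ratios.append((theta_next - theta_hat) / (theta - theta_hat))
        theta = theta_next
    return ratios


x_obs = 3.0  # hypothetical observed value, so theta_hat = x_obs
ratios = error_ratios(lambda th: 0.5 * (x_obs + th), theta0=-10.0, theta_hat=x_obs)
```

Here the map is linear, so every ratio equals $\Psi'(\hat\theta)$; for a nonlinear $\Psi$ the ratios would only approach $\Psi'(\hat\theta)$ as $t$ grows.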
EM for exponential family models
Recall that a model/joint distribution $P_\theta$ for data $Y$ is a natural exponential family if the log-likelihood is of the form
\[ \log L_Y(\theta) = \text{const} + \log a(\theta) + \theta^\top s(y), \]
where $s(y)$ is the “sufficient statistic.”
For problems where the complete data $Y$ is modeled as an exponential family, EM takes a relatively simple form. This is an important case since many examples involve exponential families.
EM for exponential family models (cont.)
For exponential families, the $Q$ function looks like
\[ Q(\theta \mid \theta^{(t)}) = \text{const} + \log a(\theta) + \int \theta^\top s(y)\, L_{z|X}(\theta^{(t)})\, dz. \]
To maximize this, take the derivative with respect to $\theta$ and set it to zero:
\[ \Longrightarrow \quad -\frac{a'(\theta)}{a(\theta)} = \int s(y)\, L_{z|X}(\theta^{(t)})\, dz. \]
From Stat 411, you know that the left-hand side is $E_\theta\{ s(Y) \}$. Let $s^{(t)}$ be the right-hand side.
M-step updates $\theta^{(t)} \to \theta^{(t+1)}$ by solving the equation
\[ E_\theta\{ s(Y) \} = s^{(t)}. \]
EM for exponential family models (cont.)
E-step. Compute $s^{(t)}$ based on the guess $\theta^{(t)}$.
M-step. Update the guess to $\theta^{(t+1)}$ by solving the equation
\[ E_\theta\{ s(Y) \} = s^{(t)}. \]
Example 1 – censored exponential model
Complete data $Y_1, \dots, Y_n \overset{\text{iid}}{\sim} \mathrm{Exp}(\theta)$, where $\theta$ is the rate.
Complete data log-likelihood:
\[ \log L_Y(\theta) = n \log \theta - \theta \underbrace{\sum_{i=1}^n Y_i}_{s(Y)}. \]
Suppose some observations are right-censored, i.e., only a lower bound is observed.
Write the observed data as pairs $(X_i, \delta_i)$, where
\[ X_i = \min(Y_i, c_i) \quad (\text{the } c_i\text{’s are non-random}), \qquad \delta_i = I\{ X_i = Y_i \}. \]
Missing data $Z$ consists of the actual event times for the censored observations.
Example 1 – censored exponential model (cont.)
For EM, we first need to compute $s^{(t)}$...
Only censored cases are of concern. If an observation $Y_i$ is right-censored at $c_i$, then we know that $X_i = c_i$ is a lower bound for $Y_i$.
Recall that the exponential distribution has the memoryless property, so that $E_\theta\{ Y_i \mid Y_i > c_i \} = c_i + 1/\theta$.
So, the E-step of EM requires
\[ s^{(t)} = \sum_{i=1}^n \bigl[ \delta_i X_i + (1 - \delta_i)\, E_{\theta^{(t)}}\{ Y_i \mid \text{censored} \} \bigr] = \sum_{i=1}^n \bigl[ \delta_i X_i + (1 - \delta_i)(X_i + 1/\theta^{(t)}) \bigr] = n \bar X + \frac{1}{\theta^{(t)}} \sum_{i=1}^n (1 - \delta_i). \]
Example 1 – censored exponential model (cont.)
Clearly, $E_\theta\{ s(Y) \} = n/\theta$. So, the M-step requires that we solve for $\theta$ in
\[ n \bar X + \frac{1}{\theta^{(t)}} \sum_{i=1}^n (1 - \delta_i) = \frac{n}{\theta}. \]
In particular, the EM update in this case is
\[ \theta^{(t+1)} = \Bigl\{ \bar X + \frac{1}{\theta^{(t)}} \cdot \frac{1}{n} \sum_{i=1}^n (1 - \delta_i) \Bigr\}^{-1}. \]
Iterate this update until convergence.
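The update above is simple enough to code directly. The following Python sketch iterates it on simulated data; the function name, the tolerance, the true rate, and the common censoring time are all hypothetical choices. As a sanity check that is not stated in the notes, setting $\theta^{(t+1)} = \theta^{(t)} = \theta$ in the update and solving gives the fixed point $\theta = \sum_i \delta_i / \sum_i X_i$, which the iteration should reproduce.

```python
import random


def em_censored_exp(x, delta, theta0, tol=1e-12, max_iter=10_000):
    """EM for right-censored exponential data with rate theta.

    x[i] is the observed value min(Y_i, c_i); delta[i] is 1 if Y_i was
    observed exactly and 0 if it was censored.
    """
    n = len(x)
    xbar = sum(x) / n
    frac_censored = sum(1 - d for d in delta) / n
    theta = theta0
    for _ in range(max_iter):
        # E-step and M-step combined into the closed-form update.
        theta_new = 1.0 / (xbar + frac_censored / theta)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta


# Simulated data (hypothetical: true rate 2, common censoring time 0.5).
rng = random.Random(451)
y = [rng.expovariate(2.0) for _ in range(200)]
c = 0.5
x = [min(yi, c) for yi in y]
delta = [1 if yi <= c else 0 for yi in y]

theta_em = em_censored_exp(x, delta, theta0=1.0)
theta_fixed_point = sum(delta) / sum(x)  # fixed point of the EM update
```

Because the update depends on the data only through $\bar X$ and the censored fraction, both are computed once outside the loop.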