  1. 6.864 (Fall 2007)
     The EM Algorithm, Part I

An Experiment / Some Intuition
• I have three coins in my pocket:
  - Coin 0 has probability λ of heads;
  - Coin 1 has probability p_1 of heads;
  - Coin 2 has probability p_2 of heads
• For each trial I do the following:
  - First I toss Coin 0
  - If Coin 0 turns up heads, I toss Coin 1 three times
  - If Coin 0 turns up tails, I toss Coin 2 three times
  - I don't tell you whether Coin 0 came up heads or tails, or whether Coin 1 or Coin 2 was tossed three times, but I do tell you how many heads/tails are seen at each trial
• You see the following sequence:
  ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩
  What would you estimate as the values for λ, p_1 and p_2?

Overview
• Maximum-Likelihood Estimation
• Models with hidden variables
• The EM algorithm for a simple example (3 coins)
• The general form of the EM algorithm
• Hidden Markov models

Maximum Likelihood Estimation
• We have data points x_1, x_2, ..., x_n drawn from some set X
• We have a parameter vector Θ
• We have a parameter space Ω
• We have a distribution P(x | Θ) for any Θ ∈ Ω, such that
  Σ_{x∈X} P(x | Θ) = 1  and  P(x | Θ) ≥ 0 for all x
• We assume that our data points x_1, x_2, ..., x_n are drawn at random (independently, identically distributed) from a distribution P(x | Θ*) for some Θ* ∈ Ω
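
To make the generative story of the three-coins experiment concrete, here is a minimal simulation sketch (not part of the original slides); the function name and the parameter values passed to it are illustrative assumptions.

```python
import random

def three_coins_trial(lam, p1, p2):
    """One trial: toss Coin 0, then toss Coin 1 or Coin 2 three times.
    Only the visible H/T string is returned; the Coin 0 outcome is the
    hidden variable y."""
    y_heads = random.random() < lam        # hidden: did Coin 0 come up heads?
    p = p1 if y_heads else p2              # which coin gets tossed three times
    return "".join("H" if random.random() < p else "T" for _ in range(3))

# Illustrative parameter values (not from the slides):
random.seed(0)
sample = [three_coins_trial(lam=0.5, p1=0.9, p2=0.1) for _ in range(5)]
print(sample)   # e.g. ['HHH', 'TTT', ...]; only head/tail counts are observed
```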

  2. Log-Likelihood
• We have data points x_1, x_2, ..., x_n drawn from some set X
• We have a parameter vector Θ, and a parameter space Ω
• We have a distribution P(x | Θ) for any Θ ∈ Ω
• The likelihood is
  Likelihood(Θ) = P(x_1, x_2, ..., x_n | Θ) = ∏_{i=1..n} P(x_i | Θ)
• The log-likelihood is
  L(Θ) = log Likelihood(Θ) = Σ_{i=1..n} log P(x_i | Θ)

A First Example: Coin Tossing
• X = {H, T}. Our data points x_1, x_2, ..., x_n are a sequence of heads and tails, e.g.
  HHTTHHHTHH
• The parameter vector Θ is a single parameter, i.e., the probability of the coin coming up heads
• The parameter space is Ω = [0, 1]
• The distribution P(x | Θ) is defined as
  P(x | Θ) = Θ if x = H, and P(x | Θ) = 1 − Θ if x = T

Maximum Likelihood Estimation
• Given a sample x_1, x_2, ..., x_n, choose
  Θ_ML = argmax_{Θ∈Ω} L(Θ) = argmax_{Θ∈Ω} Σ_i log P(x_i | Θ)
• For example, take the coin example: say x_1 ... x_n has Count(H) heads and (n − Count(H)) tails
  ⇒ L(Θ) = log ( Θ^Count(H) × (1 − Θ)^(n − Count(H)) )
         = Count(H) log Θ + (n − Count(H)) log(1 − Θ)
• We now have
  Θ_ML = Count(H) / n

A Second Example: Probabilistic Context-Free Grammars
• X is the set of all parse trees generated by the underlying context-free grammar. Our sample is n trees T_1 ... T_n such that each T_i ∈ X.
• R is the set of rules in the context-free grammar; N is the set of non-terminals in the grammar
• Θ_r for r ∈ R is the parameter for rule r
• Let R(α) ⊂ R be the rules of the form α → β for some β
• The parameter space Ω is the set of Θ ∈ [0, 1]^|R| such that
  Σ_{r∈R(α)} Θ_r = 1  for all α ∈ N
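
The closed form Θ_ML = Count(H) / n from the coin example above can be sanity-checked numerically. The sketch below uses the example sequence HHTTHHHTHH from the coin-tossing slide; the grid search is only an illustrative check, not part of the original derivation.

```python
import math

def log_likelihood(theta, data):
    """L(theta) = Count(H) * log(theta) + (n - Count(H)) * log(1 - theta)."""
    count_h = data.count("H")
    n = len(data)
    return count_h * math.log(theta) + (n - count_h) * math.log(1 - theta)

data = "HHTTHHHTHH"                       # example sequence from the slides
theta_ml = data.count("H") / len(data)    # closed form: Count(H) / n = 0.7

# Sanity check: no point on a fine grid beats the closed-form estimate.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, data))
print(theta_ml, best)                     # 0.7 and 0.7
```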

  3. A Second Example: Probabilistic Context-Free Grammars (continued)
• We have
  P(T | Θ) = ∏_{r∈R} Θ_r^Count(T,r)
  where Count(T, r) is the number of times rule r is seen in the tree T
  ⇒ log P(T | Θ) = Σ_{r∈R} Count(T, r) log Θ_r

Maximum Likelihood Estimation for PCFGs
• We have
  log P(T | Θ) = Σ_{r∈R} Count(T, r) log Θ_r
  where Count(T, r) is the number of times rule r is seen in the tree T
• And
  L(Θ) = Σ_i log P(T_i | Θ) = Σ_i Σ_{r∈R} Count(T_i, r) log Θ_r
• Solving Θ_ML = argmax_{Θ∈Ω} L(Θ) gives
  Θ_r = Σ_i Count(T_i, r) / Σ_i Σ_{s∈R(α)} Count(T_i, s)
  where r is of the form α → β for some β

Multinomial Distributions
• X is a finite set, e.g., X = {dog, cat, the, saw}
• Our sample x_1, x_2, ..., x_n is drawn from X, e.g., x_1, x_2, x_3 = dog, the, saw
• The parameter Θ is a vector in R^m where m = |X|, e.g., Θ_1 = P(dog), Θ_2 = P(cat), Θ_3 = P(the), Θ_4 = P(saw)
• The parameter space is
  Ω = {Θ : Σ_{i=1..m} Θ_i = 1 and Θ_i ≥ 0 for all i}
• If our sample is x_1, x_2, x_3 = dog, the, saw, then
  L(Θ) = log P(x_1, x_2, x_3 = dog, the, saw) = log Θ_1 + log Θ_3 + log Θ_4

Overview
• Maximum-Likelihood Estimation
• Models with hidden variables
• The EM algorithm for a simple example (3 coins)
• The general form of the EM algorithm
• Hidden Markov models
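
The PCFG solution above is relative-frequency estimation: count each rule and normalise within the set R(α) of rules that share a left-hand side. Here is a minimal sketch under made-up rule counts; the grammar and the numbers are illustrative assumptions, not read off real trees.

```python
from collections import defaultdict

# Hypothetical rule counts summed over a sample of trees; rules are (lhs, rhs) pairs.
rule_counts = {
    ("S", "NP VP"): 100,
    ("NP", "DT NN"): 70,
    ("NP", "NP PP"): 30,
    ("VP", "VB NP"): 80,
    ("VP", "VP PP"): 20,
}

# Group counts by left-hand side (the set R(alpha)) and normalise within each group.
lhs_totals = defaultdict(int)
for (lhs, _), c in rule_counts.items():
    lhs_totals[lhs] += c

theta = {rule: c / lhs_totals[rule[0]] for rule, c in rule_counts.items()}
print(theta[("NP", "DT NN")])   # 70 / (70 + 30) = 0.7
```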

  4. Models with Hidden Variables
• Now say we have two sets X and Y, and a joint distribution P(x, y | Θ)
• If we had fully observed data, (x_i, y_i) pairs, then
  L(Θ) = Σ_i log P(x_i, y_i | Θ)
• If we have partially observed data, x_i examples only, then
  L(Θ) = Σ_i log P(x_i | Θ) = Σ_i log Σ_{y∈Y} P(x_i, y | Θ)
• The EM (Expectation Maximization) algorithm is a method for finding
  Θ_ML = argmax_Θ Σ_i log Σ_{y∈Y} P(x_i, y | Θ)

Overview
• Maximum-Likelihood Estimation
• Models with hidden variables
• The EM algorithm for a simple example (3 coins)
• The general form of the EM algorithm
• Hidden Markov models

The Three Coins Example
• e.g., in the three coins example:
  Y = {H, T}
  X = {⟨HHH⟩, ⟨TTT⟩, ⟨HTT⟩, ⟨THH⟩, ⟨HHT⟩, ⟨TTH⟩, ⟨HTH⟩, ⟨THT⟩}
  Θ = {λ, p_1, p_2}
• and P(x, y | Θ) = P(y | Θ) P(x | y, Θ), where
  P(y | Θ) = λ if y = H, and 1 − λ if y = T
  P(x | y, Θ) = p_1^h (1 − p_1)^t if y = H, and p_2^h (1 − p_2)^t if y = T
  where h = number of heads in x, t = number of tails in x
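
The partially observed log-likelihood L(Θ) = Σ_i log Σ_{y∈Y} P(x_i, y | Θ) can be written down directly for the three-coins model just defined. A minimal sketch; the helper names and the parameter values plugged in at the end are assumptions for illustration.

```python
import math

def p_xy(x, y, lam, p1, p2):
    """Joint P(x, y | Theta) = P(y | Theta) * P(x | y, Theta) for the three-coins model."""
    h, t = x.count("H"), x.count("T")
    if y == "H":
        return lam * p1 ** h * (1 - p1) ** t
    return (1 - lam) * p2 ** h * (1 - p2) ** t

def log_likelihood(sample, lam, p1, p2):
    """Partially observed L(Theta) = sum_i log sum_y P(x_i, y | Theta)."""
    return sum(math.log(sum(p_xy(x, y, lam, p1, p2) for y in "HT")) for x in sample)

sample = ["HHH", "TTT", "HHH", "TTT", "HHH"]              # observed sequence from the slides
print(log_likelihood(sample, lam=0.3, p1=0.3, p2=0.6))    # illustrative parameter values
```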

  5. The Three Coins Example
• Various probabilities can be calculated, for example:
  P(x = ⟨THT⟩, y = H | Θ) = λ p_1 (1 − p_1)^2
  P(x = ⟨THT⟩, y = T | Θ) = (1 − λ) p_2 (1 − p_2)^2
  P(x = ⟨THT⟩ | Θ) = P(x = ⟨THT⟩, y = H | Θ) + P(x = ⟨THT⟩, y = T | Θ)
                   = λ p_1 (1 − p_1)^2 + (1 − λ) p_2 (1 − p_2)^2
  P(y = H | x = ⟨THT⟩, Θ) = P(x = ⟨THT⟩, y = H | Θ) / P(x = ⟨THT⟩ | Θ)
                          = λ p_1 (1 − p_1)^2 / ( λ p_1 (1 − p_1)^2 + (1 − λ) p_2 (1 − p_2)^2 )
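
The posterior on the slide above is simply the joint divided by the marginal. A small sketch for x = ⟨THT⟩; the parameter values are illustrative, not taken from the slides.

```python
def posterior_h(x, lam, p1, p2):
    """P(y = H | x, Theta) = P(x, y = H | Theta) / P(x | Theta)."""
    h, t = x.count("H"), x.count("T")
    joint_h = lam * p1 ** h * (1 - p1) ** t            # P(x, y = H | Theta)
    joint_t = (1 - lam) * p2 ** h * (1 - p2) ** t      # P(x, y = T | Theta)
    return joint_h / (joint_h + joint_t)

# For x = THT (h = 1, t = 2) with illustrative parameters:
print(posterior_h("THT", lam=0.4, p1=0.8, p2=0.3))
```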

  6. The Three Coins Example
• Fully observed data might look like:
  (⟨HHH⟩, H), (⟨TTT⟩, T), (⟨HHH⟩, H), (⟨TTT⟩, T), (⟨HHH⟩, H)
• In this case the maximum likelihood estimates are:
  λ = 3/5
  p_1 = 9/9
  p_2 = 0/6

The Three Coins Example
• Partially observed data might look like:
  ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩
• How do we find the maximum likelihood parameters?
• If the current parameters are λ, p_1, p_2:
  P(y = H | x = ⟨HHH⟩) = P(⟨HHH⟩, H) / ( P(⟨HHH⟩, H) + P(⟨HHH⟩, T) )
                       = λ p_1^3 / ( λ p_1^3 + (1 − λ) p_2^3 )
  P(y = H | x = ⟨TTT⟩) = P(⟨TTT⟩, H) / ( P(⟨TTT⟩, H) + P(⟨TTT⟩, T) )
                       = λ (1 − p_1)^3 / ( λ (1 − p_1)^3 + (1 − λ) (1 − p_2)^3 )
• If λ = 0.3, p_1 = 0.3, p_2 = 0.6:
  P(y = H | x = ⟨HHH⟩) = 0.0508
  P(y = H | x = ⟨TTT⟩) = 0.6967
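
Both calculations on these slides can be checked directly: the posteriors 0.0508 and 0.6967 under λ = 0.3, p_1 = 0.3, p_2 = 0.6, and the fully observed maximum-likelihood estimates, which are plain relative frequencies. A minimal sketch (the helper function is the same assumed posterior as above); the re-estimation step for the partially observed case is what the EM algorithm itself will supply.

```python
def posterior_h(x, lam, p1, p2):
    """P(y = H | x, Theta) for the three-coins model."""
    h, t = x.count("H"), x.count("T")
    joint_h = lam * p1 ** h * (1 - p1) ** t
    joint_t = (1 - lam) * p2 ** h * (1 - p2) ** t
    return joint_h / (joint_h + joint_t)

# Partially observed case: posteriors under lambda = 0.3, p1 = 0.3, p2 = 0.6.
print(round(posterior_h("HHH", 0.3, 0.3, 0.6), 4))   # 0.0508
print(round(posterior_h("TTT", 0.3, 0.3, 0.6), 4))   # 0.6967

# Fully observed case: maximum-likelihood estimates are relative frequencies.
observed = [("HHH", "H"), ("TTT", "T"), ("HHH", "H"), ("TTT", "T"), ("HHH", "H")]
lam = sum(y == "H" for _, y in observed) / len(observed)        # 3/5
heads_1 = sum(x.count("H") for x, y in observed if y == "H")    # 9 heads from Coin 1
tosses_1 = 3 * sum(y == "H" for _, y in observed)               # 9 tosses of Coin 1
heads_2 = sum(x.count("H") for x, y in observed if y == "T")    # 0 heads from Coin 2
tosses_2 = 3 * sum(y == "T" for _, y in observed)               # 6 tosses of Coin 2
print(lam, heads_1 / tosses_1, heads_2 / tosses_2)              # 0.6 1.0 0.0
```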
