A Gentle Introduction to the EM Algorithm

Ted Pedersen
Department of Computer Science
University of Minnesota Duluth
tpederse@d.umn.edu
EMNLP, June 2001

A unifying methodology
• Dempster, Laird & Rubin (1977) unified many strands of apparently unrelated work under the banner of the EM Algorithm
• EM had gone incognito for many years
  – Newcomb (1887)
  – McKendrick (1926)
  – Hartley (1958)
  – Baum et al. (1970)

EM allows us to make MLEs under adverse circumstances
• What are Maximum Likelihood Estimates?
• What are these adverse circumstances?
• How does EM triumph over adversity?
• PANEL: When does it really work?

A general framework for solving many kinds of problems
• Filling in missing data in a sample
• Discovering the value of latent variables
• Estimating parameters of HMMs
• Estimating parameters of finite mixtures
• Unsupervised learning of clusters
• …

Maximum Likelihood Estimates
• Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.
• An MLE is a parameter estimate that is most consistent with the sampled data: it maximizes the likelihood function.

Coin Tossing!
• How likely am I to toss a head? A series of 10 trials/tosses yields (h,t,t,t,h,t,t,h,t,t)
  – (x1=3, x2=7), n=10
• Probability of tossing a head = 3/10
• That’s an MLE! This estimate is absolutely consistent with the observed data.
• A few underlying details are masked…
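The coin-toss MLE above is just the count ratio heads/n. A minimal sketch of that computation in Python (the toss sequence comes from the slide; the function name is mine):

```python
# Closed-form MLE for P(head) under a binomial model:
# theta_hat = (number of heads) / (number of tosses).
def coin_mle(tosses):
    """Return the maximum likelihood estimate of P(head)."""
    return tosses.count("h") / len(tosses)

tosses = ["h", "t", "t", "t", "h", "t", "t", "h", "t", "t"]
print(coin_mle(tosses))  # 0.3 -- 3 heads out of 10 tosses
```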
Coin tossing unmasked
• Coin tossing is well described by the binomial distribution, since there are n independent trials with two outcomes.
• Given 10 tosses, how likely is 3 heads?

  L(θ) = C(10,3) · θ³ · (1−θ)⁷

Maximum Likelihood Estimates
• We seek the estimate of the parameter that maximizes the likelihood function.
• Take the first derivative of the likelihood function with respect to the parameter θ, set it equal to zero, and solve. The resulting value maximizes the likelihood function and is the MLE.

Maximizing the likelihood

  L(θ) = C(10,3) · θ³ · (1−θ)⁷
  log L(θ) = log C(10,3) + 3·log θ + 7·log(1−θ)
  d log L(θ)/dθ = 3/θ − 7/(1−θ) = 0
  3/θ = 7/(1−θ)  ⇒  θ = 3/10

Multinomial MLE example
• There are n animals classified into one of four possible categories (Rao 1973).
  – Category counts are the sufficient statistics to estimate multinomial parameters
• The technique for finding MLEs is the same
  – Take the derivative of the likelihood function
  – Solve for zero

Multinomial MLE example

  There are n = 197 animals classified into one of 4 categories:
  Y = (y1, y2, y3, y4) = (125, 18, 20, 34)

  The probability associated with each category is given as:
  Θ = (½ + ¼·π, ¼·(1−π), ¼·(1−π), ¼·π)

  The resulting likelihood function for this multinomial is:
  L(π) = n! / (y1!·y2!·y3!·y4!) · (½ + ¼·π)^y1 · (¼·(1−π))^y2 · (¼·(1−π))^y3 · (¼·π)^y4

Multinomial MLE example

  log L(π) = y1·log(½ + ¼·π) + y2·log(¼·(1−π)) + y3·log(¼·(1−π)) + y4·log(¼·π)
  d log L(π)/dπ = y1/(2+π) − (y2+y3)/(1−π) + y4/π = 0
  d log L(π)/dπ = 125/(2+π) − 38/(1−π) + 34/π = 0  ⇒  π = 0.627
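The solution π = 0.627 can be checked numerically. A minimal sketch, assuming the Rao (1973) counts from the slide; the bisection approach is mine, not from the talk (the derivative is strictly decreasing on (0,1), so its unique root can be bracketed):

```python
# Observed category counts from Rao (1973).
Y1, Y2, Y3, Y4 = 125, 18, 20, 34

def dloglik(p):
    """Derivative of the multinomial log-likelihood with respect to pi."""
    return Y1 / (2 + p) - (Y2 + Y3) / (1 - p) + Y4 / p

# dloglik is +inf near 0 and -inf near 1, and strictly decreasing,
# so bisection on (0, 1) converges to the unique MLE.
lo, hi = 1e-9, 1 - 1e-9
for _ in range(100):
    mid = (lo + hi) / 2
    if dloglik(mid) > 0:
        lo = mid
    else:
        hi = mid
print(round((lo + hi) / 2, 3))  # 0.627
```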
Multinomial MLE runs aground?
• Adversity strikes! The observed data is incomplete. There are really 5 categories.
• y1 is the composite of 2 categories (x1+x2)
  – p(y1) = ½ + ¼·π, p(x1) = ½, p(x2) = ¼·π
• How can we make an MLE, since we can’t observe the category counts x1 and x2?!
  – Unobserved sufficient statistics!?

EM triumphs over adversity!
• E-STEP: Find the expected values of the sufficient statistics for the complete data X, given the incomplete data Y and the current parameter estimates
• M-STEP: Use those sufficient statistics to make an MLE as usual!

MLE for complete data

  X = (x1, x2, x3, x4, x5) = (x1, x2, 18, 20, 34), where x1 + x2 = 125
  Θ = (½, ¼·π, ¼·(1−π), ¼·(1−π), ¼·π)
  L(π) = n! / (x1!·x2!·x3!·x4!·x5!) · (½)^x1 · (¼·π)^x2 · (¼·(1−π))^x3 · (¼·(1−π))^x4 · (¼·π)^x5

MLE for complete data

  log L(π) = x2·log(¼·π) + x3·log(¼·(1−π)) + x4·log(¼·(1−π)) + x5·log(¼·π)
  (dropping the terms that do not depend on π)
  d log L(π)/dπ = (x2 + x5)/π − (x3 + x4)/(1−π) = 0
  d log L(π)/dπ = (x2 + 34)/π − 38/(1−π) = 0

E-step
• What are the sufficient statistics?
  – x1 ⇒ x2 = 125 − x1
• How can their expected value be computed?
  – E[x1|y1] = n·p(x1)
• The unobserved counts x1 and x2 are the categories of a binomial distribution with a sample size of 125.
  – p(x1) + p(x2) = p(y1) = ½ + ¼·π

E-Step
• E[x1|y1] = n·p(x1)
  – p(x1) = ½ / (½ + ¼·π)
• E[x2|y1] = n·p(x2) = 125 − E[x1|y1]
  – p(x2) = ¼·π / (½ + ¼·π)
• Iteration 1? Start with π = 0.5 (this is just a random guess…)
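A minimal sketch of this E-step in Python (the formulas are the slides'; the function name is mine):

```python
def e_step(pi, y1=125):
    """Split the composite count y1 into expected complete-data counts.

    x1 and x2 are a binomial split of y1 with cell probabilities
    1/2 and pi/4, so each expectation is y1 times a normalized
    cell probability.
    """
    p_y1 = 0.5 + 0.25 * pi
    e_x1 = y1 * (0.5 / p_y1)
    e_x2 = y1 - e_x1  # equivalently y1 * (0.25 * pi / p_y1)
    return e_x1, e_x2

print(e_step(0.5))  # (100.0, 25.0) -- matches E-Step Iteration 1 below
```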
E-Step Iteration 1
• E[x1|y1] = 125 · (½ / (½ + ¼·0.5)) = 100
• E[x2|y1] = 125 − 100 = 25
• These are the expected values of the sufficient statistics, given the observed data and the current parameter estimate (which was just a guess)

M-Step Iteration 1
• Given sufficient statistics, make MLEs as usual

  d log L(π)/dπ = (x2 + 34)/π − 38/(1−π) = 0
  (25 + 34)/π − 38/(1−π) = 0
  π = 0.608

E-Step Iteration 2
• E[x1|y1] = 125 · (½ / (½ + ¼·0.608)) = 95.86
• E[x2|y1] = 125 − 95.86 = 29.14
• These are the expected values of the sufficient statistics, given the observed data and the current parameter estimate (from iteration 1)

M-Step Iteration 2
• Given sufficient statistics, make MLEs as usual

  d log L(π)/dπ = (x2 + 34)/π − 38/(1−π) = 0
  (29.14 + 34)/π − 38/(1−π) = 0
  π = 0.624

Result?
• Converges in 4 iterations to π = 0.627
  – E[x1|y1] = 95.2
  – E[x2|y1] = 29.8

Conclusion
• Distribution must be appropriate to the problem
• Sufficient statistics should be identifiable and have computable expected values
• Maximization operation should be possible
• Initialization should be good or lucky, to avoid saddle points and local maxima
• Then…it might be safe to proceed…
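Putting the two steps together, here is a minimal sketch of the full EM loop for this example (function and variable names are mine; the update formulas come directly from the slides — the M-step uses the closed-form root π = (x2+x5)/(x2+x5+x3+x4) of the derivative above):

```python
def em(pi=0.5, y1=125, x3=18, x4=20, x5=34, iters=10):
    """EM for the Rao (1973) multinomial with a hidden split of y1."""
    for i in range(iters):
        # E-step: expected complete-data count x2, given y1 and current pi.
        e_x2 = y1 * (0.25 * pi) / (0.5 + 0.25 * pi)
        # M-step: closed-form root of (x2+x5)/pi - (x3+x4)/(1-pi) = 0.
        pi = (e_x2 + x5) / (e_x2 + x5 + x3 + x4)
        print(f"iteration {i + 1}: pi = {pi:.3f}")
    return pi

em()  # pi = 0.608, 0.624, 0.626, 0.627, ... -- matches the slides
```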