Expectation Maximization CMSC 691 UMBC
Outline: EM (Expectation Maximization) - basic idea; three coins example; why EM works
Expectation Maximization (EM): a two-step, iterative algorithm.
0. Assume some value for your parameters.
1. E-step: count under uncertainty (compute expectations).
2. M-step: maximize the log-likelihood, assuming these uncertain counts.
Expectation Maximization (EM): E-step. A two-step, iterative algorithm.
0. Assume some value for your parameters.
1. E-step: count under uncertainty, assuming these parameters: the expected count of each configuration $(z_i, w_i)$ is its posterior probability $p(z_i \mid w_i)$ under the current parameters. We've already seen this type of counting, when computing the gradient in maxent models.
2. M-step: maximize the log-likelihood, assuming these uncertain counts.
Expectation Maximization (EM): M-step. A two-step, iterative algorithm.
0. Assume some value for your parameters.
1. E-step: count under uncertainty, assuming these parameters.
2. M-step: maximize the log-likelihood, assuming these uncertain counts: the new distribution $p^{(t+1)}(z)$ is re-estimated from the counts estimated under the current $p^{(t)}(z)$.
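As a minimal illustrative sketch (not code from the course), the loop looks like this in Python; `e_step` and `m_step` are hypothetical callables supplied by whatever model is being trained:

```python
def run_em(data, init_params, e_step, m_step, n_iters=50):
    """Generic EM loop (illustrative sketch).

    e_step(params, data) -> expected ("uncertain") counts: counts of each
        (z, w) configuration weighted by the posterior p(z | w) under the
        current params.
    m_step(expected_counts) -> new params that maximize the log-likelihood
        as if the expected counts had actually been observed.
    """
    params = init_params
    for _ in range(n_iters):
        expected_counts = e_step(params, data)  # E-step: count under uncertainty
        params = m_step(expected_counts)        # M-step: maximize log-likelihood
    return params
```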
EM Math
Maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is:
$\max_{\theta} \; \mathbb{E}_{z \sim p^{(t)}(\cdot \mid w)} \left[ \log p_{\theta}(z, w) \right]$
Here $\theta$ denotes the new parameters, and $p^{(t)}(\cdot \mid w)$ is the posterior distribution under the current parameters.
E-step: count under uncertainty. M-step: maximize the log-likelihood.
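Written out as two alternating updates (standard EM notation, added here for clarity rather than taken verbatim from the slides):

E-step, using the current parameters to form the posterior: $q^{(t)}(z) = p_{\theta^{(t)}}(z \mid w)$

M-step, maximizing the expected complete-data log-likelihood: $\theta^{(t+1)} = \arg\max_{\theta} \; \mathbb{E}_{z \sim q^{(t)}}\left[\log p_{\theta}(z, w)\right]$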
Why EM? Unsupervised Learning
[Figure: a grid of examples all marked "?"; EM is applied to the unlabeled data alone.]
NO labeled data (human annotated; relatively small/few examples), only unlabeled data (raw, not annotated; plentiful). EM/generative models in this case can be seen as a type of clustering.
Why EM? Semi-Supervised Learning
[Figure: a small set of labeled examples (checkmarks) alongside a much larger set of unlabeled examples ("?"); EM lets the model learn from both.]
Labeled data: human annotated; relatively small/few examples. Unlabeled data: raw, not annotated; plentiful.
Outline: EM (Expectation Maximization) - basic idea; three coins example; why EM works
Three Coins Example
Imagine three coins. Flip the 1st coin (penny). If heads: flip the 2nd coin (dollar coin). If tails: flip the 3rd coin (dime).
Three Coins Example
Flip the 1st coin (penny): we don't observe this flip. If heads: flip the 2nd coin (dollar coin); if tails: flip the 3rd coin (dime): we only observe these (record the heads vs. tails outcome).
Three Coins Example
The unobserved penny flip is like a hidden label: part of speech? genre? The observed flip is like the data we actually see: the items a, b, e, etc., e.g., "We run the code" vs. "The run failed".
Three Coins Example
Flip the 1st coin (penny): p(heads) = λ, p(tails) = 1 - λ. If heads: flip the 2nd coin (dollar coin): p(heads) = γ, p(tails) = 1 - γ. If tails: flip the 3rd coin (dime): p(heads) = ψ, p(tails) = 1 - ψ.
Three Coins Example
Penny: p(heads) = λ, p(tails) = 1 - λ. Dollar coin: p(heads) = γ, p(tails) = 1 - γ. Dime: p(heads) = ψ, p(tails) = 1 - ψ.
Three parameters to estimate: λ, γ, and ψ.
Generative Story for Three Coins
Without latent variables:
$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$
Add complexity to better explain what we see:
$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$
Generative Story: λ = distribution over the penny, γ = distribution for the dollar coin, ψ = distribution over the dime.
for item i = 1 to N:
  z_i ~ Bernoulli(λ)
  if z_i = H: w_i ~ Bernoulli(γ)
  else: w_i ~ Bernoulli(ψ)
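A small Python simulation of this generative story (a sketch; the function name and the default parameter values are illustrative, not from the slides):

```python
import random

def sample_three_coins(n, lam=0.6, gamma=0.8, psi=0.6, seed=0):
    """Sample n (z_i, w_i) pairs from the three-coins generative story.

    lam = p(penny = H), gamma = p(dollar = H), psi = p(dime = H).
    z_i is the hidden penny flip; w_i is the observed flip.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z = 'H' if rng.random() < lam else 'T'        # z_i ~ Bernoulli(lambda)
        if z == 'H':
            w = 'H' if rng.random() < gamma else 'T'  # w_i ~ Bernoulli(gamma)
        else:
            w = 'H' if rng.random() < psi else 'T'    # w_i ~ Bernoulli(psi)
        pairs.append((z, w))
    return pairs

# e.g. sample_three_coins(6) returns six (penny, observed) pairs; only the
# second element of each pair would be visible to the learner.
```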
Three Coins Example
Penny flips (hidden): H H T H T H. Observed flips: H T H T T T.
If all flips were observed:
Penny: p(heads) = λ, p(tails) = 1 - λ. Dollar coin: p(heads) = γ, p(tails) = 1 - γ. Dime: p(heads) = ψ, p(tails) = 1 - ψ.
Three Coins Example
Penny flips (hidden): H H T H T H. Observed flips: H T H T T T.
If all flips were observed, estimate by counting:
Penny: p(heads) = 4/6, p(tails) = 2/6. Dollar coin: p(heads) = 1/4, p(tails) = 3/4. Dime: p(heads) = 1/2, p(tails) = 1/2.
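When both rows are visible, maximum likelihood estimation is just counting; a short Python sketch reproducing the fractions above:

```python
penny = ['H', 'H', 'T', 'H', 'T', 'H']   # hidden coin (pretend it is observed here)
obs   = ['H', 'T', 'H', 'T', 'T', 'T']   # observed coin

lam   = penny.count('H') / len(penny)                                   # 4/6
gamma = sum(z == 'H' and w == 'H' for z, w in zip(penny, obs)) \
        / penny.count('H')                                              # 1/4
psi   = sum(z == 'T' and w == 'H' for z, w in zip(penny, obs)) \
        / penny.count('T')                                              # 1/2
```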
Three Coins Example
Penny flips (not observed): H H T H T H. Observed flips: H T H T T T.
But not all flips are observed, so set parameter values:
Penny: p(heads) = λ = .6, p(tails) = .4. Dollar coin: p(heads) = .8, p(tails) = .2. Dime: p(heads) = .6, p(tails) = .4.
Use these values to compute posteriors, rewriting the joint using Bayes' rule:
$p(\text{heads} \mid \text{observed H}) = \frac{p(\text{heads} \,\&\, \text{H})}{p(\text{H})} = \frac{p(\text{H} \mid \text{heads})\, p(\text{heads})}{p(\text{H})}$, and similarly $p(\text{heads} \mid \text{observed T}) = \frac{p(\text{T} \mid \text{heads})\, p(\text{heads})}{p(\text{T})}$.
Here $p(\text{H} \mid \text{heads}) = .8$ and $p(\text{T} \mid \text{heads}) = .2$, and the marginal likelihood is
$p(\text{H}) = p(\text{H} \mid \text{heads})\, p(\text{heads}) + p(\text{H} \mid \text{tails})\, p(\text{tails}) = .8 \cdot .6 + .6 \cdot .4 = .72$.
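A quick numeric check of the marginal and posterior above, using the assumed parameter values (a sketch, not course code):

```python
lam, gamma, psi = 0.6, 0.8, 0.6     # assumed p(heads) for penny, dollar, dime

p_H = gamma * lam + psi * (1 - lam)      # p(H) = .8*.6 + .6*.4 = 0.72
p_heads_given_H = gamma * lam / p_H      # posterior p(heads | obs. H), about 0.667
print(p_H, p_heads_given_H)
```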
Three Coins Example
Penny flips (not observed): H H T H T H. Observed flips: H T H T T T.
Use the posteriors to update the parameters:
$p(\text{heads} \mid \text{obs. H}) = \frac{p(\text{H} \mid \text{heads})\, p(\text{heads})}{p(\text{H})} = \frac{.8 \cdot .6}{.8 \cdot .6 + .6 \cdot .4} \approx 0.667$
$p(\text{heads} \mid \text{obs. T}) = \frac{p(\text{T} \mid \text{heads})\, p(\text{heads})}{p(\text{T})} = \frac{.2 \cdot .6}{.2 \cdot .6 + .4 \cdot .4} \approx 0.429$
Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?
A: No. The two posteriors condition on different observations, so there is no reason for them to sum to 1.
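Putting the E-step and M-step together for the three coins: the sketch below runs a few full EM iterations on the six observed flips, re-estimating λ, γ, and ψ from the posterior-weighted ("uncertain") counts. The update formulas are the standard mixture-of-Bernoullis ones, not copied verbatim from the slides.

```python
obs = ['H', 'T', 'H', 'T', 'T', 'T']   # only the second-coin outcomes are observed
lam, gamma, psi = 0.6, 0.8, 0.6        # initial guesses for penny, dollar, dime

for _ in range(20):
    # E-step: posterior probability that the hidden penny came up heads
    post = []
    for w in obs:
        joint_h = (gamma if w == 'H' else 1 - gamma) * lam       # p(penny=H, w)
        joint_t = (psi   if w == 'H' else 1 - psi) * (1 - lam)   # p(penny=T, w)
        post.append(joint_h / (joint_h + joint_t))               # p(penny=H | w)

    # M-step: re-estimate parameters from the expected (fractional) counts
    lam   = sum(post) / len(obs)
    gamma = sum(q for q, w in zip(post, obs) if w == 'H') / sum(post)
    psi   = sum(1 - q for q, w in zip(post, obs) if w == 'H') / sum(1 - q for q in post)

print(lam, gamma, psi)
```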