Expectation Maximization, and Learning from Partly Unobserved Data
Machine Learning 10-701, November 11, 2005
Tom M. Mitchell, Carnegie Mellon University

Recommended readings:
• Mitchell, Chapter 6.12
• "Text Classification from Labeled and Unlabeled Documents using EM", K. Nigam, et al., 2000. Machine Learning, 39. http://www.cs.cmu.edu/%7Eknigam/papers/emcat-mlj99.ps
Outline
• EM 1: Learning Bayes network CPTs from partly unobserved data
• EM 2: Mixture of Gaussians – clustering
• EM: the general story
• Text application: learning a Naïve Bayes classifier from labeled and unlabeled data
1. Learning Bayes net parameters from partly unobserved data
Learning CPTs from Fully Observed Data
[Bayes net: Flu, Allergy → Sinus; Sinus → Nose, Headache]
• Example: consider learning the parameter $\theta_{s|ij} \equiv P(S=1 \mid F=i, A=j)$
• MLE (Max Likelihood Estimate) is
  $\hat\theta_{s|ij} = \frac{\#D\{S^{(k)}=1,\, F^{(k)}=i,\, A^{(k)}=j\}}{\#D\{F^{(k)}=i,\, A^{(k)}=j\}}$
  where $\#D\{\cdot\}$ counts the training examples $k$ satisfying the condition
• Remember why?
MLE estimate of $\theta_{s|ij}$ from fully observed data
• Maximum likelihood estimate: $\hat\theta = \arg\max_\theta \log P(\text{data} \mid \theta)$
• Our case: $\log P(\text{data} \mid \theta) = \sum_k \log P(f^{(k)}, a^{(k)}, s^{(k)}, h^{(k)}, n^{(k)} \mid \theta)$, which decomposes over the CPTs, so
  $\hat\theta_{s|ij} = \frac{\#D\{S^{(k)}=1,\, F^{(k)}=i,\, A^{(k)}=j\}}{\#D\{F^{(k)}=i,\, A^{(k)}=j\}}$
Estimate $\theta_{s|ij}$ from partly observed data
• What if F, A, H, N are observed, but not S?
• Can't calculate the MLE counts, since $S^{(k)}$ is never observed
• Let X be all observed variable values (over all examples)
• Let Z be all unobserved variable values
• Can't calculate MLE: $\arg\max_\theta \log P(X, Z \mid \theta)$
• EM seeks the estimate: $\arg\max_\theta E_{Z|X,\theta}[\log P(X, Z \mid \theta)]$
• EM seeks estimate:
  $\hat\theta = \arg\max_\theta E_{Z|X,\theta}[\log P(X, Z \mid \theta)]$
• here, observed X = {F, A, H, N}, unobserved Z = {S}
EM Algorithm
EM is a general procedure for solving such problems.
Given observed variables X, unobserved Z (here X = {F, A, H, N}, Z = {S})
Define $Q(\theta' \mid \theta) = E_{Z|X,\theta}[\log P(X, Z \mid \theta')]$
Iterate until convergence:
• E Step: Use X and current θ to estimate P(Z|X, θ)
• M Step: Replace current θ by $\theta \leftarrow \arg\max_{\theta'} Q(\theta' \mid \theta)$
Guaranteed to find a local maximum. Each iteration increases $\log P(X \mid \theta)$.
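The general procedure can be written as a short loop. Below is a minimal sketch (not from the lecture; the helper functions e_step, m_step, and log_likelihood are hypothetical, problem-specific arguments) showing the alternation and the monotone increase in data log-likelihood.

```python
from typing import Any, Callable

def em(X: Any,
       theta: Any,
       e_step: Callable[[Any, Any], Any],    # expected statistics of Z given X, theta
       m_step: Callable[[Any, Any], Any],    # argmax over theta' of Q(theta' | theta)
       log_likelihood: Callable[[Any, Any], float],
       max_iters: int = 100,
       tol: float = 1e-6) -> Any:
    ll = log_likelihood(X, theta)
    for _ in range(max_iters):
        expected_z = e_step(X, theta)        # E step
        theta = m_step(X, expected_z)        # M step
        new_ll = log_likelihood(X, theta)    # guaranteed not to decrease
        if new_ll - ll < tol:                # stop when improvement is tiny
            break
        ll = new_ll
    return theta
```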
E Step: Use X, θ, to Calculate P(Z|X, θ)
• In our case, for each training example k, calculate
  $P(S^{(k)}=1 \mid F^{(k)}, A^{(k)}, H^{(k)}, N^{(k)}; \theta)$
• How? This is a Bayes net inference problem.
M step: modify the MLE to achieve $\arg\max_{\theta'} Q(\theta' \mid \theta)$
• Maximum likelihood estimate (fully observed):
  $\hat\theta_{s|ij} = \frac{\#D\{S^{(k)}=1,\, F^{(k)}=i,\, A^{(k)}=j\}}{\#D\{F^{(k)}=i,\, A^{(k)}=j\}}$
• Our case: replace the unobserved count by its expected value:
  $\theta_{s|ij} \leftarrow \frac{\sum_{k:\,F^{(k)}=i,\,A^{(k)}=j} P(S^{(k)}=1 \mid X^{(k)}; \theta)}{\#D\{F^{(k)}=i,\, A^{(k)}=j\}}$
EM and estimating $\theta_{s|ij}$
(observed X = {F, A, H, N}, unobserved Z = {S})
E step: Calculate, for each training example k,
  $P(S^{(k)}=1 \mid F^{(k)}, A^{(k)}, H^{(k)}, N^{(k)}; \theta)$
M step:
  $\theta_{s|ij} \leftarrow \frac{\sum_{k:\,F^{(k)}=i,\,A^{(k)}=j} P(S^{(k)}=1 \mid X^{(k)}; \theta)}{\#D\{F^{(k)}=i,\, A^{(k)}=j\}}$
Recall the MLE was:
  $\hat\theta_{s|ij} = \frac{\#D\{S^{(k)}=1,\, F^{(k)}=i,\, A^{(k)}=j\}}{\#D\{F^{(k)}=i,\, A^{(k)}=j\}}$
EM and estimating θ
More generally, given observed set X, unobserved set Z of boolean values:
E step: Calculate, for each training example k, the expected value of each unobserved variable
M step: Calculate parameter estimates as in the MLE, but replacing each count by its expected count
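To make the expected-count idea concrete, here is a minimal sketch (illustrative, not the lecture's code) of EM for this network when only S is hidden; the toy data, initial parameter values, and variable names are assumptions.

```python
import numpy as np

# Each row is one example: observed F, A, H, N (S is hidden).
data = np.array([[1, 0, 1, 1],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0],
                 [1, 1, 1, 1]])
F, A, H, N = data.T

# CPTs touching S (the only ones that need EM here):
theta_S = np.full((2, 2), 0.5)     # theta_S[i, j] = P(S=1 | F=i, A=j)
theta_H = np.array([0.3, 0.7])     # P(H=1 | S=0), P(H=1 | S=1)
theta_N = np.array([0.3, 0.7])     # P(N=1 | S=0), P(N=1 | S=1)

for _ in range(50):
    # E step: q_k = P(S=1 | f, a, h, n; theta) for each example (Bayes net inference).
    p_s1 = theta_S[F, A] * theta_H[1]**H * (1 - theta_H[1])**(1 - H) \
                         * theta_N[1]**N * (1 - theta_N[1])**(1 - N)
    p_s0 = (1 - theta_S[F, A]) * theta_H[0]**H * (1 - theta_H[0])**(1 - H) \
                               * theta_N[0]**N * (1 - theta_N[0])**(1 - N)
    q = p_s1 / (p_s1 + p_s0)

    # M step: MLE formulas with hard counts replaced by expected counts.
    for i in (0, 1):
        for j in (0, 1):
            mask = (F == i) & (A == j)
            if mask.any():
                theta_S[i, j] = q[mask].sum() / mask.sum()
    theta_H = np.array([((1 - q) * H).sum() / (1 - q).sum(), (q * H).sum() / q.sum()])
    theta_N = np.array([((1 - q) * N).sum() / (1 - q).sum(), (q * N).sum() / q.sum()])

print(theta_S)
```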
2. Unsupervised clustering: K-means and Mixtures of Gaussians
Clustering • Given set of data points, group them • Unsupervised learning • Which patients are similar? (or which earthquakes, customers, faces, web pages, …)
K-means Clustering
Given data <x_1 … x_n> and K, assign each x_i to one of K clusters, C_1 … C_K, minimizing
  $J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$
where $\mu_j$ is the mean over all points in cluster C_j.
K-Means Algorithm:
Initialize $\mu_1 \dots \mu_K$ randomly
Repeat until convergence:
1. Assign each point x_i to the cluster with the closest mean μ_j
2. Calculate the new mean for each cluster
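A minimal NumPy sketch of this algorithm (an assumed implementation, not from the slides): hard-assign each point to its nearest mean, then recompute the means.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize means randomly
    for _ in range(n_iters):
        # Step 1: assign each point to the cluster with the closest mean.
        assign = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 2: recompute each cluster mean (keep the old mean if a cluster is empty).
        new_mu = np.array([X[assign == j].mean(0) if (assign == j).any() else mu[j]
                           for j in range(K)])
        if np.allclose(new_mu, mu):      # converged: assignments no longer change means
            break
        mu = new_mu
    return mu, assign

# Example: 15 random 2-D points, 3 clusters (as suggested for the applet below).
mu, assign = kmeans(np.random.default_rng(1).normal(size=(15, 2)), K=3)
```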
K Means Applet • Run K-means applet – http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html • Try 3 clusters, 15 pts
Mixtures of Gaussians
K-means is EM-ish, but makes 'hard' assignments of x_i to clusters. Let's derive a real EM algorithm for clustering.
What objective function shall we optimize?
• Maximize data likelihood!
What form of P(X) should we assume?
• Mixture of Gaussians
Mixture of Gaussians:
• Assume P(x) is a mixture of K different Gaussians
• Then each data point x is generated by a 2-step process:
  1. choose z, one of the K Gaussians, according to π_1 … π_{K-1}
  2. generate x according to the Gaussian N(μ_z, Σ_z)
Mixture Distributions
P(X|φ) is a "mixture" of K different distributions:
• P_1(X|θ_1), P_2(X|θ_2), … P_K(X|θ_K)
where φ = <θ_1 … θ_K, π_1 … π_{K-1}>
We generate a draw X ~ P(X|φ) in two steps:
1. Choose Z ∈ {1, … K} according to P(Z | π_1 … π_{K-1})
2. Generate X ~ P_Z(X|θ_Z)
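A small sketch of this two-step generative process for a 1-D Gaussian mixture (the parameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.5, 0.2])      # mixing weights pi_1 ... pi_K
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def draw():
    z = rng.choice(len(pi), p=pi)            # step 1: choose the component Z
    return rng.normal(mu[z], sigma[z]), z    # step 2: generate X ~ N(mu_z, sigma_z)

samples = [draw() for _ in range(5)]
```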
EM for Mixture of Gaussians
[Graphical model: Z → X_1, X_2, X_3, X_4]
Simplify to make this easier:
1. assume X = <X_1 … X_T>, and the X_i are conditionally independent given Z
2. assume only 2 mixture components, so Z ∈ {0, 1}
3. assume σ known; π_1 … π_K and μ_1i … μ_Ki unknown
Observed: X = <X_1 … X_T>; Unobserved: Z
EM
Given observed variables X, unobserved Z
Define $Q(\theta' \mid \theta) = E_{Z|X,\theta}[\log P(X, Z \mid \theta')]$
Iterate until convergence:
• E Step: Calculate P(Z(n)|X(n), θ) for each example X(n). Use this to construct $Q(\theta' \mid \theta)$
• M Step: Replace current θ by $\theta \leftarrow \arg\max_{\theta'} Q(\theta' \mid \theta)$
EM – E Step
Calculate P(Z(n)|X(n), θ) for each observed example X(n), where X(n) = <x_1(n), x_2(n), … x_T(n)>:
$P(z(n)=1 \mid x(n), \theta) = \frac{\pi \exp\left(-\sum_{i=1}^{T} \frac{(x_i(n)-\mu_{1i})^2}{2\sigma^2}\right)}{\pi \exp\left(-\sum_i \frac{(x_i(n)-\mu_{1i})^2}{2\sigma^2}\right) + (1-\pi) \exp\left(-\sum_i \frac{(x_i(n)-\mu_{0i})^2}{2\sigma^2}\right)}$
EM – M Step
First consider the update for π. Terms of $Q(\theta' \mid \theta)$ that do not contain π' have no influence, and maximizing gives
$\pi' = \frac{1}{N} \sum_{n=1}^{N} P(z(n)=1 \mid x(n), \theta)$
i.e., the expected count of examples with z(n) = 1, divided by N.
EM – M Step
Now consider the update for μ_ji. Terms of $Q(\theta' \mid \theta)$ that do not contain μ_ji' have no influence, and maximizing gives
$\mu_{ji}' = \frac{\sum_{n=1}^{N} P(z(n)=j \mid x(n), \theta)\, x_i(n)}{\sum_{n=1}^{N} P(z(n)=j \mid x(n), \theta)}$
Compare the above to the MLE if Z were observable:
$\mu_{ji} = \frac{\sum_{n=1}^{N} \delta(z(n)=j)\, x_i(n)}{\sum_{n=1}^{N} \delta(z(n)=j)}$
EM – putting it together
Given observed variables X, unobserved Z
Define $Q(\theta' \mid \theta) = E_{Z|X,\theta}[\log P(X, Z \mid \theta')]$
Iterate until convergence:
• E Step: For each observed example X(n), calculate $P(Z(n) \mid X(n), \theta)$
• M Step: Update
  $\pi' = \frac{1}{N} \sum_n P(z(n)=1 \mid x(n), \theta)$
  $\mu_{ji}' = \frac{\sum_n P(z(n)=j \mid x(n), \theta)\, x_i(n)}{\sum_n P(z(n)=j \mid x(n), \theta)}$
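Putting the E and M steps into code, here is a minimal sketch (an assumed NumPy implementation, not from the slides) for the simplified 2-component case above: conditionally independent features, a known shared σ, and unknown π and component means.

```python
import numpy as np

def em_mixture(X, sigma=1.0, n_iters=100, seed=0):
    N, T = X.shape
    rng = np.random.default_rng(seed)
    pi = 0.5
    mu = X[rng.choice(N, size=2, replace=False)]   # mu[j, i] = mean of feature i in component j

    for _ in range(n_iters):
        # E step: responsibility q[n] = P(z(n)=1 | x(n), theta); constants cancel
        # because sigma is shared across components.
        log_p1 = np.log(pi)     - ((X - mu[1]) ** 2).sum(1) / (2 * sigma ** 2)
        log_p0 = np.log(1 - pi) - ((X - mu[0]) ** 2).sum(1) / (2 * sigma ** 2)
        q = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

        # M step: expected-count versions of the MLE formulas.
        pi = q.mean()
        mu = np.vstack([(1 - q) @ X / (1 - q).sum(),
                        q @ X / q.sum()])
    return pi, mu, q
```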
Mixture of Gaussians applet • Run applet http://www.neurosci.aist.go.jp/%7Eakaho/MixtureEM.html
K-Means vs Mixture of Gaussians
• Both are iterative algorithms to assign points to clusters
• Objective function
  – K-Means: minimize $\sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$
  – Mixture of Gaussians: maximize $P(X \mid \theta)$
• Mixture of Gaussians is the more general formulation
  – Equivalent to K-Means when $\Sigma_k = \sigma I$, and $\sigma \to 0$
Using Unlabeled Data to Help Train a Naïve Bayes Classifier
Learn P(Y|X)
[Naïve Bayes model: Y → X_1, X_2, X_3, X_4]

Y  X1 X2 X3 X4
1  0  0  1  1
0  0  1  0  0
0  0  0  1  0
?  0  1  1  0
?  0  1  0  1
From [Nigam et al., 2000]
EM for the Naïve Bayes text classifier (w_t is the t-th word in the vocabulary):
E Step: for each document d_i, estimate the class membership probabilities
  $P(c_j \mid d_i; \hat\theta) = \frac{P(c_j \mid \hat\theta) \prod_k P(w_{d_i,k} \mid c_j; \hat\theta)}{\sum_r P(c_r \mid \hat\theta) \prod_k P(w_{d_i,k} \mid c_r; \hat\theta)}$
M Step: re-estimate the parameters using expected counts
  $\hat\theta_{w_t \mid c_j} \equiv P(w_t \mid c_j; \hat\theta) = \frac{1 + \sum_i N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_i N(w_s, d_i)\, P(c_j \mid d_i)}$
where N(w_t, d_i) is the number of times w_t occurs in document d_i.
Elaboration 1: Downweight the influence of unlabeled examples by a factor λ, chosen by cross validation.
New M step: weight each unlabeled document's expected counts by λ, e.g.
  $\hat\theta_{w_t \mid c_j} = \frac{1 + \sum_{i \in \text{labeled}} N(w_t, d_i)\, P(c_j \mid d_i) + \lambda \sum_{i \in \text{unlabeled}} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_s \left[ \sum_{i \in \text{labeled}} N(w_s, d_i)\, P(c_j \mid d_i) + \lambda \sum_{i \in \text{unlabeled}} N(w_s, d_i)\, P(c_j \mid d_i) \right]}$
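A minimal sketch of this semi-supervised scheme (illustrative Python, not Nigam et al.'s code): initialize Naïve Bayes from the labeled documents, then alternate an E step over the unlabeled documents with an M step that adds their λ-weighted expected counts.

```python
import numpy as np

def nb_em(X_lab, y_lab, X_unl, n_classes, lam=1.0, n_iters=20):
    """X_lab, X_unl: document-by-word count matrices; y_lab: integer class labels."""
    V = X_lab.shape[1]
    resp_lab = np.eye(n_classes)[y_lab]          # labeled docs: fixed one-hot weights

    def m_step(unl_word_counts, unl_doc_counts):
        counts = resp_lab.T @ X_lab + unl_word_counts                # (n_classes, V)
        theta = (1 + counts) / (V + counts.sum(1, keepdims=True))    # Laplace smoothing
        docs = resp_lab.sum(0) + unl_doc_counts
        return docs / docs.sum(), theta                              # class priors, P(w_t|c_j)

    prior, theta = m_step(0.0, 0.0)              # initialize from labeled data only
    for _ in range(n_iters):
        # E step: P(c_j | d_i) for each unlabeled document.
        log_post = np.log(prior) + X_unl @ np.log(theta).T
        resp_unl = np.exp(log_post - log_post.max(1, keepdims=True))
        resp_unl /= resp_unl.sum(1, keepdims=True)
        # M step: labeled counts plus lambda-weighted expected unlabeled counts.
        prior, theta = m_step(lam * (resp_unl.T @ X_unl), lam * resp_unl.sum(0))
    return prior, theta
```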
Using one labeled example per class
Experimental Evaluation
• Newsgroup postings
  – 20 newsgroups, 1000 postings/group
• Web page classification
  – student, faculty, course, project
  – 4199 web pages
• Reuters newswire articles
  – 12,902 articles
  – 90 topic categories
20 Newsgroups (results figures)
What you should know about EM
• For learning from partly unobserved data
• MLE estimate: $\theta = \arg\max_\theta \log P(X, Z \mid \theta)$
• EM estimate: $\theta = \arg\max_\theta E_{Z|X,\theta}[\log P(X, Z \mid \theta)]$
  where X is the observed part of the data, Z is the unobserved part
• EM for training Bayes networks
• Can also develop MAP version of EM
• Be able to derive your own EM algorithm for your own problem
Combining Labeled and Unlabeled Data How else can unlabeled data be useful for supervised learning/function approximation?
Combining Labeled and Unlabeled Data
How can unlabeled data {x} be useful for learning f: X → Y?
1. Using EM, if we know the form of P(Y|X)
2. By letting us estimate P(X) and reweight labeled examples
3. Co-Training [Blum & Mitchell, 1998]
4. To detect overfitting [Schuurmans, 2002]