SLIDE 1

Gaussian Mixture Models & EM

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2016

SLIDE 2

Mixture Models: definition

Mixture models: a linear superposition of mixtures (components):

$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} P(z_k)\, p(\mathbf{x} \mid z_k; \boldsymbol{\theta}_k), \qquad \sum_{k=1}^{K} P(z_k) = 1$

- $P(z_k)$: the prior probability of the $k$-th mixture component
- $\boldsymbol{\theta}_k$: the parameters of the $k$-th component
- $p(\mathbf{x} \mid z_k; \boldsymbol{\theta}_k)$: the probability of $\mathbf{x}$ according to the $k$-th component

A framework for building more complex probability distributions.

Goal: estimate $p(\mathbf{x} \mid \boldsymbol{\theta})$, e.g., for multi-modal density estimation.

SLIDE 3

Gaussian Mixture Models (GMMs)

Gaussian Mixture Model: each component is a Gaussian, $p(\mathbf{x} \mid z_k; \boldsymbol{\theta}_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, so

$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

Fitting the Gaussian mixture model:
- Input: data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$
- Goal: find the parameters of the GMM ($\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k$ for $k = 1, \dots, K$), subject to $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$

SLIDE 4

GMM: 1-D Example

[Figure: a 1-D mixture of three Gaussians with mixing coefficients $\pi_1 = 0.6$, $\pi_2 = 0.3$, $\pi_3 = 0.1$; the component means and variances are shown in the figure.]
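The latent-variable view of such an example suggests ancestral sampling: draw a component index with probabilities $\boldsymbol{\pi}$, then draw $x$ from that component. A sketch of this (the means and standard deviations below are illustrative assumptions, since only the mixing weights are legible on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.6, 0.3, 0.1])      # mixing weights from the slide
mus = np.array([-4.0, 0.0, 4.0])     # illustrative means (assumed)
stds = np.array([1.0, 2.0, 0.5])     # illustrative std deviations (assumed)

# z ~ Categorical(pi), then x | z ~ N(mu_z, std_z^2)
z = rng.choice(len(pis), size=1000, p=pis)
x = rng.normal(mus[z], stds[z])
```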

SLIDE 5

GMM: 2-D Example

$K = 3$

$\boldsymbol{\mu}_1 = \begin{pmatrix} -2 \\ 3 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 4 \end{pmatrix}$, $\pi_1 = 0.6$

$\boldsymbol{\mu}_2 = \begin{pmatrix} 0 \\ -4 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\pi_2 = 0.25$

$\boldsymbol{\mu}_3 = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{pmatrix} 3 & 1 \\ 1 & 1 \end{pmatrix}$, $\pi_3 = 0.15$

SLIDE 6

GMM: 2-D Example

$K = 3$, with the same component parameters as on Slide 5.

[Figure: the resulting GMM distribution.]

SLIDE 7

How to Fit GMM?

To fit a GMM we maximize the log-likelihood:

$\ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad X = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$

The sum over components appears inside the log, so there is no closed-form solution for the maximum likelihood. Setting the derivatives to zero (for $l = 1, \dots, K$) gives coupled equations:

$\frac{\partial \ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\mu}_l} = \mathbf{0}, \qquad \frac{\partial \ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\Sigma}_l} = \mathbf{0}, \qquad \frac{\partial}{\partial \pi_l}\Big[\ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \Big(\sum_{k=1}^{K} \pi_k - 1\Big)\Big] = 0$
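A small sketch of this objective (my addition). A production implementation would work in log space (e.g., with `scipy.special.logsumexp`) for numerical robustness, but the direct form mirrors the formula:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, sigmas):
    """ln p(X) = sum_i ln sum_k pi_k N(x_i | mu_k, Sigma_k), X of shape (N, d)."""
    densities = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=sigma)
        for pi, mu, sigma in zip(pis, mus, sigmas)
    ])  # densities[i, k] = pi_k * N(x_i | mu_k, Sigma_k)
    return np.log(densities.sum(axis=1)).sum()
```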

SLIDE 8

ML for GMM

Define the responsibilities

$\gamma_l^{(i)} = \frac{\pi_l\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}, \qquad N_l = \sum_{i=1}^{N} \gamma_l^{(i)}$

Then the stationary conditions of the log-likelihood are:

$\boldsymbol{\mu}_l = \frac{1}{N_l} \sum_{i=1}^{N} \gamma_l^{(i)}\, \mathbf{x}^{(i)}$

$\boldsymbol{\Sigma}_l = \frac{1}{N_l} \sum_{i=1}^{N} \gamma_l^{(i)}\, (\mathbf{x}^{(i)} - \boldsymbol{\mu}_l^{\text{new}})(\mathbf{x}^{(i)} - \boldsymbol{\mu}_l^{\text{new}})^{T}$

$\pi_l^{\text{new}} = \frac{N_l}{N}$

(Useful matrix identities for the derivation: $\frac{\partial \ln |\mathbf{B}^{-1}|}{\partial \mathbf{B}^{-1}} = \mathbf{B}^{T}$ and $\frac{\partial\, \mathbf{x}^{T} \mathbf{B} \mathbf{x}}{\partial \mathbf{B}} = \mathbf{x}\mathbf{x}^{T}$.)

These are not closed-form solutions: the responsibilities on the right-hand side themselves depend on the parameters, which motivates the iterative EM scheme that follows.

SLIDE 9

EM algorithm

An iterative algorithm in which each iteration is guaranteed to improve the log-likelihood function.

A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).

EM finds the maximum-likelihood parameters in cases where the model involves unobserved variables $Z$ in addition to unknown parameters $\boldsymbol{\theta}$ and known data observations $X$.

SLIDE 10

Mixture models: discrete latent variables

$p(\mathbf{x}) = \sum_{k=1}^{K} P(z_k = 1)\, p(\mathbf{x} \mid z_k = 1) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x} \mid z_k = 1)$

- $\mathbf{z}$: latent or hidden variable, specifying the mixture component
- $P(z_k = 1) = \pi_k$, with $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$

SLIDE 11

EM for GMM

Initialize $\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l, \pi_l$ for $l = 1, \dots, K$. Here $\boldsymbol{\theta} = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$, and the latent $z^{(i)} \in \{1, 2, \dots, K\}$ shows the mixture component from which $\mathbf{x}^{(i)}$ is generated.

E step: for $i = 1, \dots, N$ and $k = 1, \dots, K$:

$\gamma_k^{(i)} = P\big(z_k^{(i)} = 1 \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}^{\text{old}}\big) = \frac{\pi_k^{\text{old}}\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k^{\text{old}}, \boldsymbol{\Sigma}_k^{\text{old}})}{\sum_{l=1}^{K} \pi_l^{\text{old}}\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_l^{\text{old}}, \boldsymbol{\Sigma}_l^{\text{old}})}$

M step: for $k = 1, \dots, K$:

$\boldsymbol{\mu}_k^{\text{new}} = \frac{\sum_{i=1}^{N} \gamma_k^{(i)}\, \mathbf{x}^{(i)}}{\sum_{i=1}^{N} \gamma_k^{(i)}}$

$\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{\sum_{i=1}^{N} \gamma_k^{(i)}} \sum_{i=1}^{N} \gamma_k^{(i)}\, (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k^{\text{new}})^{T}$

$\pi_k^{\text{new}} = \frac{\sum_{i=1}^{N} \gamma_k^{(i)}}{N}$

Repeat the E and M steps until convergence.
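Putting the two steps together, a didactic NumPy/SciPy sketch (an addition, not from the slides; the random-means initialization, fixed iteration count, and the small ridge added to the covariances are choices of mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a GMM to data X (shape N x d) with EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: random data points as means, shared empirical
    # covariance, uniform mixing weights (one simple choice among many)
    mus = X[rng.choice(N, size=K, replace=False)]
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # E step: responsibilities gamma[i, k]
        weighted = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
            for k in range(K)
        ])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)

        # M step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)                 # effective counts per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] \
                        + 1e-6 * np.eye(d)     # small ridge for stability
        pis = Nk / N
    return pis, mus, sigmas
```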

SLIDE 12

EM & GMM: example

[Figure: EM iterations when fitting a Gaussian mixture. [Bishop]]

SLIDE 13

EM & GMM: Example

[Figure: EM iterations when fitting a Gaussian mixture, continued. [Bishop]]

SLIDE 14

Local Minima


SLIDE 15

Local Minima

True parameters (as on Slide 5): $\boldsymbol{\mu}_1 = \begin{pmatrix} -2 \\ 3 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 4 \end{pmatrix}$, $\pi_1 = 0.6$; $\boldsymbol{\mu}_2 = \begin{pmatrix} 0 \\ -4 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\pi_2 = 0.25$; $\boldsymbol{\mu}_3 = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{pmatrix} 3 & 1 \\ 1 & 1 \end{pmatrix}$, $\pi_3 = 0.15$

Two EM runs from different initializations reach different solutions:

Run 1: $\boldsymbol{\mu}_1 = \begin{pmatrix} 0.36 \\ -4.09 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 0.89 & 0.26 \\ 0.26 & 0.83 \end{pmatrix}$, $\pi_1 = 0.249$; $\boldsymbol{\mu}_2 = \begin{pmatrix} 3.25 \\ 2.09 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 2.23 & 1.08 \\ 1.09 & 1.41 \end{pmatrix}$, $\pi_2 = 0.146$; $\boldsymbol{\mu}_3 = \begin{pmatrix} -2.11 \\ 3.36 \end{pmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{pmatrix} 1.12 & 0.61 \\ 0.61 & 3.61 \end{pmatrix}$, $\pi_3 = 0.604$ (this run recovers the true components, up to relabeling)

Run 2: $\boldsymbol{\mu}_1 = \begin{pmatrix} 1.45 \\ -1.81 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 3.30 & 4.76 \\ 4.76 & 10.01 \end{pmatrix}$, $\pi_1 = 0.392$; $\boldsymbol{\mu}_2 = \begin{pmatrix} -2.20 \\ 3.16 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 1.30 & 1.10 \\ 1.10 & 2.80 \end{pmatrix}$, $\pi_2 = 0.429$; $\boldsymbol{\mu}_3 = \begin{pmatrix} -1.88 \\ 3.74 \end{pmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{pmatrix} 5.83 & -0.82 \\ -0.82 & 5.83 \end{pmatrix}$, $\pi_3 = 0.178$ (this run is stuck in a poor local optimum)

SLIDE 16

EM+GMM vs. k-means

k-means:
- is not probabilistic
- has fewer parameters (and is faster)
- is limited by the underlying assumption of spherical clusters; it can be extended to use covariances, giving "hard EM" (ellipsoidal k-means)

Both EM and k-means depend on initialization and can get stuck in local optima; EM+GMM has more local minima.

Useful trick: first run k-means and then use its result to initialize EM (a sketch follows below).
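A minimal sketch of that trick (my addition), assuming scikit-learn's KMeans and clusters with at least two points each; the returned triple can seed an EM implementation such as the sketch after Slide 11:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(X, K, seed=0):
    """Initialize GMM parameters from a k-means clustering of X (N x d)."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    N, d = X.shape
    pis = np.array([(labels == k).mean() for k in range(K)])       # cluster fractions
    mus = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # cluster means
    sigmas = np.array([np.cov(X[labels == k].T) + 1e-6 * np.eye(d)   # cluster covariances
                       for k in range(K)])
    return pis, mus, sigmas
```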

SLIDE 17

EM algorithm: general

A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).

SLIDE 18

Incomplete log likelihood

Complete log-likelihood:
- Maximizing the likelihood (i.e., $\log P(X, Z \mid \boldsymbol{\theta})$) for labeled (complete) data is straightforward.

Incomplete log-likelihood:
- With $Z$ unobserved, our objective becomes the log of a marginal probability: $\log P(X \mid \boldsymbol{\theta}) = \log \sum_{Z} P(X, Z \mid \boldsymbol{\theta})$
- This objective does not decouple, and we use the EM algorithm to optimize it.

SLIDE 19

EM Algorithm

Assumptions: $X$ (observed or known variables), $Z$ (unobserved or latent variables); $X$ comes from a specific model with unknown parameters $\boldsymbol{\theta}$.

If $Z$ is relevant to $X$ (in any way), we can hope to extract information about it from $X$, assuming a specific parametric model on the data.

Steps:
- Initialization: initialize the unknown parameters $\boldsymbol{\theta}$.
- Iterate the following steps until convergence:
  - Expectation step: find the probability of the unobserved variables given the current parameter estimates and the observed data.
  - Maximization step: from the observed data and the probability of the unobserved data, find the most likely parameters (a better estimate for the parameters).

SLIDE 20

EM algorithm intuition

When learning with hidden variables, we are trying to solve two problems at once:
- hypothesizing values for the unobserved variables in each data sample
- learning the parameters

Each of these tasks is fairly easy when we have the solution to the other:
- Given complete data, we have the sufficient statistics and can estimate the parameters using the MLE formulas.
- Conversely, computing the probability of the missing data given the parameters is a probabilistic inference problem.

SLIDE 21

EM algorithm


SLIDE 22

EM theoretical analysis

What is the underlying theory for the use of the expected complete log-likelihood in the M-step,

$E_{P(Z \mid X, \boldsymbol{\theta}^{\text{old}})}\big[\log P(X, Z \mid \boldsymbol{\theta})\big]\,?$

We now show that maximizing this function also maximizes the likelihood.

SLIDE 23

EM theoretical foundation: Objective function

The objective is the incomplete log-likelihood

$\ell(\boldsymbol{\theta}; X) = \log P(X \mid \boldsymbol{\theta}) = \log \sum_{Z} P(X, Z \mid \boldsymbol{\theta})$

For any distribution $q(Z)$ over the latent variables, define the auxiliary lower bound

$F(\boldsymbol{\theta}, q) = \sum_{Z} q(Z) \log \frac{P(X, Z \mid \boldsymbol{\theta})}{q(Z)}$

used in the slides that follow.

SLIDE 24

Jensen’s inequality

Jensen's inequality: for a concave function $f$ (such as $\log$), $f\big(E[u]\big) \ge E\big[f(u)\big]$. Applied with $f = \log$, it shows that $F(\boldsymbol{\theta}, q)$ defined on the previous slide is a lower bound on $\ell(\boldsymbol{\theta}; X)$ for every distribution $q$.

SLIDE 25

EM theoretical foundation: Algorithm in general form


SLIDE 26

EM theoretical foundation: E-step

$q_t = P(Z \mid X, \boldsymbol{\theta}_t) \;\Rightarrow\; q_t = \underset{q}{\arg\max}\; F(\boldsymbol{\theta}_t, q)$

Proof:

$F\big(\boldsymbol{\theta}_t, P(Z \mid X, \boldsymbol{\theta}_t)\big) = \sum_{Z} P(Z \mid X, \boldsymbol{\theta}_t) \log \frac{P(X, Z \mid \boldsymbol{\theta}_t)}{P(Z \mid X, \boldsymbol{\theta}_t)} = \sum_{Z} P(Z \mid X, \boldsymbol{\theta}_t) \log P(X \mid \boldsymbol{\theta}_t) = \log P(X \mid \boldsymbol{\theta}_t) = \ell(\boldsymbol{\theta}_t; X)$

$F(\boldsymbol{\theta}, q)$ is a lower bound on $\ell(\boldsymbol{\theta}; X)$. Thus $F(\boldsymbol{\theta}_t, q)$ has been maximized by setting $q$ to $P(Z \mid X, \boldsymbol{\theta}_t)$:

$F\big(\boldsymbol{\theta}_t, P(Z \mid X, \boldsymbol{\theta}_t)\big) = \ell(\boldsymbol{\theta}_t; X) \;\Rightarrow\; P(Z \mid X, \boldsymbol{\theta}_t) = \underset{q}{\arg\max}\; F(\boldsymbol{\theta}_t, q)$

SLIDE 27

EM algorithm: illustration

[Figure: $F(\boldsymbol{\theta}, q_t)$ is a lower bound that touches $\ell(\boldsymbol{\theta}; X)$ at $\boldsymbol{\theta}_t$; maximizing the bound over $\boldsymbol{\theta}$ yields $\boldsymbol{\theta}_{t+1}$.]

SLIDE 28

EM theoretical foundation: M-step

The M-step can be equivalently viewed as maximizing the expected complete log-likelihood:

$\boldsymbol{\theta}_{t+1} = \underset{\boldsymbol{\theta}}{\arg\max}\; F(\boldsymbol{\theta}, q_t) = \underset{\boldsymbol{\theta}}{\arg\max}\; E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta})\big]$

Proof:

$F(\boldsymbol{\theta}, q_t) = \sum_{Z} q_t(Z) \log \frac{P(X, Z \mid \boldsymbol{\theta})}{q_t(Z)} = \sum_{Z} q_t(Z) \log P(X, Z \mid \boldsymbol{\theta}) - \sum_{Z} q_t(Z) \log q_t(Z)$

$\Rightarrow\; F(\boldsymbol{\theta}, q_t) = E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta})\big] + H\big(q_t(Z)\big)$

where the entropy term $H\big(q_t(Z)\big)$ is independent of $\boldsymbol{\theta}$.

SLIDE 29

EM iteration increases $\ell(\boldsymbol{\theta}; X)$

$\ell(\boldsymbol{\theta}_t; X) = E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_t)\big] + H\big(q_t(Z)\big)$

$\ell(\boldsymbol{\theta}_{t+1}; X) \ge E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_{t+1})\big] + H\big(q_t(Z)\big)$

$\Rightarrow\; \ell(\boldsymbol{\theta}_{t+1}; X) - \ell(\boldsymbol{\theta}_t; X) \ge E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_{t+1})\big] - E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_t)\big]$

Moreover, since $\boldsymbol{\theta}_{t+1} = \arg\max_{\boldsymbol{\theta}} E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta})\big]$, we have $E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_{t+1})\big] \ge E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_t)\big]$, and therefore $\ell(\boldsymbol{\theta}_{t+1}; X) - \ell(\boldsymbol{\theta}_t; X) \ge 0$.

EM is guaranteed to find a local maximum of the log-likelihood.

SLIDE 30


SLIDE 31

EM for GMM M step: details

$p(X, Z \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}^{(i)}, \mathbf{z}^{(i)} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i)}, \boldsymbol{\theta})\, p(\mathbf{z}^{(i)} \mid \boldsymbol{\pi}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k^{(i)}}\, \pi_k^{z_k^{(i)}}$

$\log p(X, Z \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \sum_{k=1}^{K} z_k^{(i)} \big[\log \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) + \log \pi_k\big]$

$E_{Z \sim P(Z \mid X, \boldsymbol{\theta}^{\text{old}})}\big[\log p(X, Z \mid \boldsymbol{\theta})\big] = \sum_{i=1}^{N} \sum_{k=1}^{K} E_{P(z_k^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}^{\text{old}})}\big[z_k^{(i)}\big] \big[\log \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) + \log \pi_k\big]$

where $\boldsymbol{\theta} = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$, $\boldsymbol{\theta}^{\text{old}} = [\boldsymbol{\pi}^{\text{old}}, \boldsymbol{\mu}^{\text{old}}, \boldsymbol{\Sigma}^{\text{old}}]$, and $E\big[z_k^{(i)}\big] = \gamma_k^{(i)}$ is the responsibility computed in the E-step.

SLIDE 32

EM for GMM M step: details

Write $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}})$ for the expected complete log-likelihood from the previous slide. Setting its derivatives to zero gives the M-step updates:

$\frac{\partial Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}})}{\partial \boldsymbol{\mu}_k} = 0 \;\Rightarrow\; \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} \gamma_k^{(i)}\, \mathbf{x}^{(i)}}{\sum_{i=1}^{N} \gamma_k^{(i)}}$

$\frac{\partial Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}})}{\partial \boldsymbol{\Sigma}_k} = 0 \;\Rightarrow\; \boldsymbol{\Sigma}_k = \frac{1}{\sum_{i=1}^{N} \gamma_k^{(i)}} \sum_{i=1}^{N} \gamma_k^{(i)}\, (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^{T}$

$\frac{\partial}{\partial \pi_k}\Big[Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}}) + \lambda \Big(\sum_{m=1}^{K} \pi_m - 1\Big)\Big] = 0 \;\Rightarrow\; \pi_k = \frac{\sum_{i=1}^{N} \gamma_k^{(i)}}{N}$

where $\lambda$ is a Lagrange multiplier due to the constraint $\sum_{k=1}^{K} \pi_k = 1$.

SLIDE 33

EM algorithm: general

EM: a general procedure for learning from partly observed data.

Define

$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}}) = E_{Z \sim P(Z \mid X, \boldsymbol{\theta}^{\text{old}})}\big[\log p(X, Z \mid \boldsymbol{\theta})\big] = \sum_{Z} P(Z \mid X, \boldsymbol{\theta}^{\text{old}}) \log p(X, Z \mid \boldsymbol{\theta})$

the expectation of the complete log-likelihood evaluated using the current estimate for the parameters $\boldsymbol{\theta}^{\text{old}}$.

Choose an initial setting $\boldsymbol{\theta}^{\text{old}} = \boldsymbol{\theta}_0$ and iterate until convergence:
- E step: use $X$ and the current $\boldsymbol{\theta}^{\text{old}}$ to calculate $P(Z \mid X, \boldsymbol{\theta}^{\text{old}})$
- M step: $\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}})$, then set $\boldsymbol{\theta}^{\text{old}} \leftarrow \boldsymbol{\theta}^{\text{new}}$
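In code form, the loop might look like the following skeleton (a sketch, not from the slides); `e_step` and `m_step` are model-specific callbacks the user supplies, as in the GMM case above:

```python
def em(X, theta0, e_step, m_step, n_iters=100, tol=1e-6):
    """Generic EM loop: alternate computing P(Z|X, theta) and maximizing Q."""
    theta = theta0
    prev_ll = -float("inf")
    for _ in range(n_iters):
        posterior, ll = e_step(X, theta)   # P(Z | X, theta) and log-likelihood
        theta = m_step(X, posterior)       # argmax_theta Q(theta; theta_old)
        if ll - prev_ll < tol:             # monotone improvement: stop when flat
            break
        prev_ll = ll
    return theta
```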

SLIDE 34

EM advantages and disadvantages

Some good things about EM:
- no learning-rate (step-size) parameter
- automatically enforces parameter constraints
- very fast for low dimensions
- each iteration is guaranteed to improve the likelihood

Some bad things about EM:
- can be slower than some other iterative gradient-based methods

SLIDE 35

Semi-supervised learning

Supervised learning models require labeled data:
- Supervised learning usually requires plenty of labeled data, which is usually expensive to obtain.
- Unlabeled data is often abundant at little or no cost.

Semi-supervised learning: learning from both labeled and unlabeled data.
- Labeled training data: $\mathcal{L} = \{(\mathbf{x}^{(i)}, z^{(i)})\}_{i=1}^{L}$
- Unlabeled data available during training: $\mathcal{U} = \{\mathbf{x}^{(i)}\}_{i=L+1}^{L+U}$

SLIDE 36

Semi-supervised learning: example


Zhu, Semi-Supervised Learning Tutorial, ICML, 2007.

SLIDE 37


Zhu, Semi-Supervised Learning Tutorial, ICML, 2007.

SLIDE 38

Semi-supervised generative model

Start from the MLE $\boldsymbol{\theta} = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$ on the labeled data $\mathcal{L} = \{(\mathbf{x}^{(i)}, z^{(i)})\}_{i=1}^{L}$.

Repeat:
- E-step: compute $p(z^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta})$ for the unlabeled points, $i = L+1, \dots, L+U$.
- M-step: re-estimate the parameters $\boldsymbol{\theta} = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$ from both the labeled data and the unlabeled data, using the distribution found over their labels in the E-step (a sketch follows below).
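A rough sketch of one such iteration for a GMM (my addition, not from the slides): labeled points get one-hot responsibilities since their component is observed, while unlabeled points get posterior responsibilities; labels in `z_lab` are assumed to be integers in $\{0, \dots, K-1\}$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_em_step(X_lab, z_lab, X_unl, pis, mus, sigmas):
    """One EM iteration over labeled (X_lab, z_lab) and unlabeled X_unl data."""
    K = len(pis)
    # Labeled points: one-hot responsibilities (their component is observed)
    gamma_lab = np.eye(K)[z_lab]
    # Unlabeled points: E-step posterior responsibilities
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X_unl, mean=mus[k], cov=sigmas[k])
        for k in range(K)
    ])
    gamma_unl = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step on the pooled data
    X = np.vstack([X_lab, X_unl])
    gamma = np.vstack([gamma_lab, gamma_unl])
    Nk = gamma.sum(axis=0)
    mus = (gamma.T @ X) / Nk[:, None]
    d = X.shape[1]
    sigmas = np.array([
        (gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
        + 1e-6 * np.eye(d)   # small ridge for numerical stability (my choice)
        for k in range(K)
    ])
    pis = Nk / len(X)
    return pis, mus, sigmas
```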

SLIDE 39

Resource

C. Bishop, "Pattern Recognition and Machine Learning", Chapter 9.