Parametric Models Part II: Expectation-Maximization and Mixture Density Estimation (PowerPoint presentation transcript)


  1. Parametric Models Part II: Expectation-Maximization and Mixture Density Estimation
Selim Aksoy, Department of Computer Engineering, Bilkent University, saksoy@cs.bilkent.edu.tr
CS 551, Spring 2019. © 2019, Selim Aksoy (Bilkent University)

  2. Missing Features
◮ Suppose that we have a Bayesian classifier that uses the feature vector $\mathbf{x}$, but only a subset $\mathbf{x}_g$ of $\mathbf{x}$ is observed and the values of the remaining features $\mathbf{x}_b$ are missing.
◮ How can we make a decision?
◮ Throw away the observations with missing values.
◮ Or, substitute $\mathbf{x}_b$ by its average $\bar{\mathbf{x}}_b$ in the training data, and use $\mathbf{x} = (\mathbf{x}_g, \bar{\mathbf{x}}_b)$.
◮ Or, marginalize the posterior over the missing features, and use the resulting posterior
$$P(\omega_i \mid \mathbf{x}_g) = \frac{\int P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_b)\, p(\mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{\int p(\mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}.$$
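The third option can be carried out numerically when the class-conditional densities and priors are known. The sketch below assumes a two-class problem with two-dimensional Gaussian class-conditionals in which the second feature is missing; all parameter values, priors, and function names are illustrative and not taken from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative two-class problem over x = (x_g, x_b); priors and densities are made up.
priors = np.array([0.6, 0.4])
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.0]])]

def posterior_given_observed(x_g, grid=np.linspace(-10.0, 10.0, 2001)):
    """P(w_i | x_g), integrating the missing feature x_b out numerically.

    Uses the identity  int P(w_i | x_g, x_b) p(x_g, x_b) dx_b = P(w_i) int p(x_g, x_b | w_i) dx_b.
    """
    joint = np.array([
        prior * multivariate_normal.pdf(
            np.column_stack([np.full_like(grid, x_g), grid]), mean=m, cov=c)
        for prior, m, c in zip(priors, means, covs)
    ])                              # shape (num_classes, grid_size)
    # Riemann-sum approximation of the integral over x_b; the constant grid
    # spacing cancels in the normalized ratio, so a plain sum is enough.
    numerators = joint.sum(axis=1)
    return numerators / numerators.sum()

print(posterior_given_observed(x_g=1.0))
```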

  3. Expectation-Maximization
◮ We can also extend maximum likelihood techniques to allow learning of parameters when some training patterns have missing features.
◮ The Expectation-Maximization (EM) algorithm is a general iterative method of finding the maximum likelihood estimates of the parameters of a distribution from training data.

  4. Expectation-Maximization
◮ There are two main applications of the EM algorithm:
◮ Learning when the data is incomplete or has missing values.
◮ Optimizing a likelihood function that is analytically intractable but can be simplified by assuming the existence of, and values for, additional but missing (or hidden) parameters.
◮ The second problem is more common in pattern recognition applications.

  5. Expectation-Maximization
◮ Assume that the observed data $\mathcal{X}$ is generated by some distribution.
◮ Assume that a complete dataset $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$ exists as a combination of the observed but incomplete data $\mathcal{X}$ and the missing data $\mathcal{Y}$.
◮ The observations in $\mathcal{Z}$ are assumed to be i.i.d. from the joint density
$$p(\mathbf{z} \mid \Theta) = p(\mathbf{x}, \mathbf{y} \mid \Theta) = p(\mathbf{y} \mid \mathbf{x}, \Theta)\, p(\mathbf{x} \mid \Theta).$$

  6. Expectation-Maximization
◮ We can define a new likelihood function
$$L(\Theta \mid \mathcal{Z}) = L(\Theta \mid \mathcal{X}, \mathcal{Y}) = p(\mathcal{X}, \mathcal{Y} \mid \Theta),$$
called the complete-data likelihood, where $L(\Theta \mid \mathcal{X})$ is referred to as the incomplete-data likelihood.
◮ The EM algorithm:
◮ First, finds the expected value of the complete-data log-likelihood using the current parameter estimates (expectation step).
◮ Then, maximizes this expectation (maximization step).

  7. Expectation-Maximization
◮ Define
$$Q(\Theta, \Theta^{(i-1)}) = E\left[\log p(\mathcal{X}, \mathcal{Y} \mid \Theta) \;\middle|\; \mathcal{X}, \Theta^{(i-1)}\right]$$
as the expected value of the complete-data log-likelihood w.r.t. the unknown data $\mathcal{Y}$, given the observed data $\mathcal{X}$ and the current parameter estimates $\Theta^{(i-1)}$.
◮ The expected value can be computed as
$$E\left[\log p(\mathcal{X}, \mathcal{Y} \mid \Theta) \;\middle|\; \mathcal{X}, \Theta^{(i-1)}\right] = \int \log p(\mathcal{X}, \mathbf{y} \mid \Theta)\, p(\mathbf{y} \mid \mathcal{X}, \Theta^{(i-1)})\, d\mathbf{y}.$$
◮ This is called the E-step.

  8. Expectation-Maximization
◮ Then, the expectation can be maximized by finding optimum values for the new parameters $\Theta$ as
$$\Theta^{(i)} = \arg\max_{\Theta}\, Q(\Theta, \Theta^{(i-1)}).$$
◮ This is called the M-step.
◮ These two steps are repeated iteratively, where each iteration is guaranteed to increase the log-likelihood.
◮ The EM algorithm is also guaranteed to converge to a local maximum of the likelihood function.
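The overall iteration looks the same regardless of the underlying model. The skeleton below is a minimal sketch of that loop, assuming the user supplies the E-step, the M-step, and the incomplete-data log-likelihood as callables; the function names and the tolerance-based stopping rule are my own choices, not part of the lecture.

```python
def em(theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=200):
    """Generic EM loop: alternate E- and M-steps until the incomplete-data
    log-likelihood improves by less than `tol` (or max_iter is reached)."""
    prev_ll = log_likelihood(theta)
    for _ in range(max_iter):
        expectations = e_step(theta)     # E-step: expected complete-data statistics
        theta = m_step(expectations)     # M-step: maximize Q(theta, theta_old)
        ll = log_likelihood(theta)
        if ll - prev_ll < tol:           # each iteration is non-decreasing, so this is safe
            break
        prev_ll = ll
    return theta
```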

  9. Mixture Densities
◮ A mixture model is a linear combination of $m$ densities
$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{m} \alpha_j\, p_j(\mathbf{x} \mid \theta_j)$$
where $\Theta = (\alpha_1, \ldots, \alpha_m, \theta_1, \ldots, \theta_m)$ such that $\alpha_j \geq 0$ and $\sum_{j=1}^{m} \alpha_j = 1$.
◮ $\alpha_1, \ldots, \alpha_m$ are called the mixing parameters.
◮ $p_j(\mathbf{x} \mid \theta_j)$, $j = 1, \ldots, m$, are called the component densities.

  10. Mixture Densities
◮ Suppose that $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ is a set of observations drawn i.i.d. from $p(\mathbf{x} \mid \Theta)$.
◮ The log-likelihood function of $\Theta$ becomes
$$\log L(\Theta \mid \mathcal{X}) = \sum_{i=1}^{n} \log p(\mathbf{x}_i \mid \Theta) = \sum_{i=1}^{n} \log\left(\sum_{j=1}^{m} \alpha_j\, p_j(\mathbf{x}_i \mid \theta_j)\right).$$
◮ We cannot obtain an analytical solution for $\Theta$ by simply setting the derivatives of $\log L(\Theta \mid \mathcal{X})$ to zero because of the logarithm of the sum.
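Although the maximizer has no closed form, the log-likelihood itself is easy to evaluate, which is useful for monitoring EM's progress. Below is a minimal sketch for a Gaussian mixture using the log-sum-exp trick for numerical stability; the function name and the choice of Gaussian components are assumptions for illustration.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, alphas, means, covs):
    """log L(Theta | X) = sum_i log sum_j alpha_j p_j(x_i | theta_j); X has shape (n, d)."""
    # log(alpha_j) + log p_j(x_i | theta_j) for every sample i and component j
    log_terms = np.column_stack([
        np.log(a) + multivariate_normal.logpdf(X, mean=m, cov=c)
        for a, m, c in zip(alphas, means, covs)
    ])                                    # shape (n, m)
    return logsumexp(log_terms, axis=1).sum()
```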

  11. Mixture Density Estimation via EM
◮ Consider $\mathcal{X}$ as incomplete and define hidden variables $\mathcal{Y} = \{y_i\}_{i=1}^{n}$, where $y_i$ corresponds to which mixture component generated the data vector $\mathbf{x}_i$.
◮ In other words, $y_i = j$ if the $i$'th data vector was generated by the $j$'th mixture component.
◮ Then, the log-likelihood becomes
$$\log L(\Theta \mid \mathcal{X}, \mathcal{Y}) = \log p(\mathcal{X}, \mathcal{Y} \mid \Theta) = \sum_{i=1}^{n} \log\left(p(\mathbf{x}_i \mid y_i, \theta_{y_i})\, P(y_i)\right) = \sum_{i=1}^{n} \log\left(\alpha_{y_i}\, p_{y_i}(\mathbf{x}_i \mid \theta_{y_i})\right).$$

  12. Mixture Density Estimation via EM
◮ Assume we have the initial parameter estimates $\Theta^{(g)} = (\alpha^{(g)}_1, \ldots, \alpha^{(g)}_m, \theta^{(g)}_1, \ldots, \theta^{(g)}_m)$.
◮ Compute
$$p(y_i \mid \mathbf{x}_i, \Theta^{(g)}) = \frac{\alpha^{(g)}_{y_i}\, p_{y_i}(\mathbf{x}_i \mid \theta^{(g)}_{y_i})}{p(\mathbf{x}_i \mid \Theta^{(g)})} = \frac{\alpha^{(g)}_{y_i}\, p_{y_i}(\mathbf{x}_i \mid \theta^{(g)}_{y_i})}{\sum_{j=1}^{m} \alpha^{(g)}_j\, p_j(\mathbf{x}_i \mid \theta^{(g)}_j)}$$
and
$$p(\mathbf{y} \mid \mathcal{X}, \Theta^{(g)}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \Theta^{(g)}).$$
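These per-sample posteriors $p(j \mid \mathbf{x}_i, \Theta^{(g)})$ are often called responsibilities. A minimal sketch of their computation for a Gaussian mixture, under the same illustrative naming as the earlier sketches:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, alphas, means, covs):
    """p(j | x_i, Theta^(g)) for each sample i and component j; returns shape (n, m).
    X is (n, d); alphas, means, covs hold the current (g) parameter estimates."""
    weighted = np.column_stack([
        a * multivariate_normal.pdf(X, mean=m, cov=c)   # alpha_j p_j(x_i | theta_j)
        for a, m, c in zip(alphas, means, covs)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)  # divide by p(x_i | Theta^(g))
```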

  13. Mixture Density Estimation via EM
◮ Then, $Q(\Theta, \Theta^{(g)})$ takes the form
$$Q(\Theta, \Theta^{(g)}) = \sum_{\mathbf{y}} \log p(\mathcal{X}, \mathbf{y} \mid \Theta)\, p(\mathbf{y} \mid \mathcal{X}, \Theta^{(g)}) = \sum_{j=1}^{m} \sum_{i=1}^{n} \log\left(\alpha_j\, p_j(\mathbf{x}_i \mid \theta_j)\right) p(j \mid \mathbf{x}_i, \Theta^{(g)})$$
$$= \sum_{j=1}^{m} \sum_{i=1}^{n} \log(\alpha_j)\, p(j \mid \mathbf{x}_i, \Theta^{(g)}) + \sum_{j=1}^{m} \sum_{i=1}^{n} \log\left(p_j(\mathbf{x}_i \mid \theta_j)\right) p(j \mid \mathbf{x}_i, \Theta^{(g)}).$$

  14. Mixture Density Estimation via EM
◮ We can maximize the two sets of summations for $\alpha_j$ and $\theta_j$ independently because they are not related.
◮ The estimate for $\alpha_j$ can be computed as
$$\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})$$
where
$$p(j \mid \mathbf{x}_i, \Theta^{(g)}) = \frac{\alpha^{(g)}_j\, p_j(\mathbf{x}_i \mid \theta^{(g)}_j)}{\sum_{t=1}^{m} \alpha^{(g)}_t\, p_t(\mathbf{x}_i \mid \theta^{(g)}_t)}.$$
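Given a responsibility matrix like the one returned by the earlier sketch (rows indexed by samples, columns by components), this update reduces to a column average; the function name is mine.

```python
import numpy as np

def update_mixing_weights(resp):
    """alpha_j = (1/n) sum_i p(j | x_i, Theta^(g)); resp has shape (n, m)."""
    return resp.mean(axis=0)
```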

  15. Mixture of Gaussians
◮ We can obtain analytical expressions for $\theta_j$ for the special case of a Gaussian mixture where $\theta_j = (\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ and
$$p_j(\mathbf{x} \mid \theta_j) = p_j(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_j|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)\right).$$
◮ Equating the partial derivative of $Q(\Theta, \Theta^{(g)})$ with respect to $\boldsymbol{\mu}_j$ to zero gives
$$\hat{\boldsymbol{\mu}}_j = \frac{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, \mathbf{x}_i}{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})}.$$
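In matrix form, this mean update is one weighted average per component. A minimal sketch with the same assumed array shapes as before (X is (n, d), resp is (n, m)):

```python
import numpy as np

def update_means(X, resp):
    """mu_j = sum_i p(j | x_i, Theta^(g)) x_i / sum_i p(j | x_i, Theta^(g)); returns (m, d)."""
    return (resp.T @ X) / resp.sum(axis=0)[:, None]
```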

  16. Mixture of Gaussians
◮ We consider five models for the covariance matrix $\boldsymbol{\Sigma}_j$:
◮ $\boldsymbol{\Sigma}_j = \sigma^2 \mathbf{I}$:
$$\hat{\sigma}^2 = \frac{1}{nd} \sum_{j=1}^{m} \sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, \|\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j\|^2$$
◮ $\boldsymbol{\Sigma}_j = \sigma_j^2 \mathbf{I}$:
$$\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, \|\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j\|^2}{d \sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})}$$

  17. Mixture of Gaussians
◮ Covariance models continued:
◮ $\boldsymbol{\Sigma}_j = \mathrm{diag}(\{\sigma_{jk}^2\}_{k=1}^{d})$:
$$\hat{\sigma}_{jk}^2 = \frac{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, (x_{ik} - \hat{\mu}_{jk})^2}{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})}$$
◮ $\boldsymbol{\Sigma}_j = \boldsymbol{\Sigma}$:
$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{j=1}^{m} \sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)^T$$
◮ $\boldsymbol{\Sigma}_j$ arbitrary:
$$\hat{\boldsymbol{\Sigma}}_j = \frac{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})\, (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)^T}{\sum_{i=1}^{n} p(j \mid \mathbf{x}_i, \Theta^{(g)})}$$
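A minimal sketch of three of these covariance updates (shared spherical, per-component diagonal, and arbitrary per-component), with the same assumed shapes as the earlier sketches; the string labels for the models are my own, not terminology from the lecture.

```python
import numpy as np

def update_covariances(X, resp, means, model="full"):
    """Covariance M-step for three of the models above.
    X: (n, d) data, resp: (n, m) responsibilities, means: (m, d) updated means."""
    n, d = X.shape
    m = resp.shape[1]
    if model == "spherical_shared":              # Sigma_j = sigma^2 I for all j
        sq = np.stack([((X - means[j]) ** 2).sum(axis=1) for j in range(m)], axis=1)
        return (resp * sq).sum() / (n * d)       # scalar sigma^2
    if model == "diagonal":                      # Sigma_j = diag(sigma_jk^2)
        return np.stack([
            (resp[:, j, None] * (X - means[j]) ** 2).sum(axis=0) / resp[:, j].sum()
            for j in range(m)
        ])                                       # shape (m, d)
    # model == "full": Sigma_j arbitrary
    return np.stack([
        (resp[:, j, None] * (X - means[j])).T @ (X - means[j]) / resp[:, j].sum()
        for j in range(m)
    ])                                           # shape (m, d, d)
```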

  18. Mixture of Gaussians
◮ Summary:
◮ The estimates for $\alpha_j$, $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ perform both the expectation and maximization steps simultaneously.
◮ EM iterations proceed by using the current estimates as the initial estimates for the next iteration.
◮ The priors are computed from the proportion of examples belonging to each mixture component.
◮ The means are the component centroids.
◮ The covariance matrices are calculated as the sample covariance of the points associated with each component.
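Putting the pieces together, one possible EM loop for a Gaussian mixture with arbitrary covariances is sketched below on purely illustrative synthetic data. The initialization scheme, the small ridge added to the covariances for numerical stability, and the fixed iteration count are my own choices rather than part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Illustrative synthetic data: two 2-D Gaussian blobs.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (200, 2)),
               rng.normal([4.0, 3.0], 0.7, (150, 2))])
n, d = X.shape
m = 2

# Crude initialization: equal weights, means at random samples, shared sample covariance.
alphas = np.full(m, 1.0 / m)
means = X[rng.choice(n, m, replace=False)]
covs = np.stack([np.cov(X.T)] * m)

for _ in range(100):
    # E-step: responsibilities p(j | x_i, Theta^(g)), computed in the log domain.
    log_w = np.column_stack([
        np.log(a) + multivariate_normal.logpdf(X, mean=mu, cov=S)
        for a, mu, S in zip(alphas, means, covs)
    ])
    resp = np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))
    # M-step: closed-form updates for alpha_j, mu_j, Sigma_j.
    nj = resp.sum(axis=0)
    alphas = nj / n
    means = (resp.T @ X) / nj[:, None]
    covs = np.stack([
        (resp[:, j, None] * (X - means[j])).T @ (X - means[j]) / nj[j]
        + 1e-6 * np.eye(d)                    # tiny ridge to keep covariances invertible
        for j in range(m)
    ])

print("mixing weights:", alphas)
print("means:", means)
```

In this sketch the responsibility computation is the E-step and the three closed-form updates are the M-step, matching the summary above.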

  19. Examples
◮ Mixture of Gaussians examples
◮ 1-D Bayesian classification examples
◮ 2-D Bayesian classification examples

  20. Figure 1: Illustration of the EM algorithm iterations for a mixture of two Gaussians. (Six panels, (a) through (f), with both axes running from -2 to 2; the figure itself is not reproduced in this transcript.)
