Online k-MLE for Mixture Modelling with Exponential Families
Christophe Saint-Jean, Frank Nielsen
Geometric Science of Information (GSI) 2015, Oct 28-30, 2015, Ecole Polytechnique, Paris-Saclay
Application Context
We are interested in building a system (a model) which evolves when new data is available: $x_1, x_2, \ldots, x_N, \ldots$
- The time needed for processing a new observation must be constant w.r.t. the number of observations.
- The memory required by the system is bounded.
Denote by $\pi$ the unknown distribution of $X$.
Outline of this talk
1. Online learning of exponential families
2. Online learning of mixtures of exponential families
   - Introduction, EM, k-MLE
   - Recursive EM, Online EM
   - Stochastic approximations of k-MLE
   - Experiments
3. Conclusions
Reminder: (Regular) Exponential Family
Firstly, $\pi$ will be approximated by a member of a (regular) exponential family (EF):
$$\mathcal{E}_F = \left\{ f(x;\theta) = \exp\{\langle s(x), \theta\rangle + k(x) - F(\theta)\} \;\middle|\; \theta \in \Theta \right\}$$
Terminology:
- $\lambda$: source parameters
- $\theta$: natural parameters
- $\eta$: expectation parameters
- $s(x)$: sufficient statistic
- $k(x)$: auxiliary carrier measure
- $F(\theta)$: the log-normalizer, differentiable and strictly convex; $\Theta = \{\theta \in \mathbb{R}^D \mid F(\theta) < \infty\}$ is an open convex set
Almost all common distributions are EF members, except e.g. the uniform and Cauchy distributions.
Reminder: Maximum Likelihood Estimate (MLE)
Maximum likelihood estimate for a general p.d.f., assuming a sample $\chi = \{x_1, x_2, \ldots, x_N\}$ of i.i.d. observations:
$$\hat{\theta}^{(N)} = \operatorname*{argmax}_\theta \prod_{i=1}^N f(x_i;\theta) = \operatorname*{argmin}_\theta \; -\frac{1}{N}\sum_{i=1}^N \log f(x_i;\theta)$$
Maximum likelihood estimate for an EF:
$$\hat{\theta}^{(N)} = \operatorname*{argmin}_\theta \; \frac{1}{N}\sum_i \left\{ -\langle s(x_i), \theta\rangle \right\} - \mathrm{cst}(\chi) + F(\theta)$$
which is exactly solved in $H$, the space of expectation parameters:
$$\hat{\eta}^{(N)} = \nabla F(\hat{\theta}^{(N)}) = \frac{1}{N}\sum_i s(x_i) \;\equiv\; \hat{\theta}^{(N)} = (\nabla F)^{-1}\!\left(\frac{1}{N}\sum_i s(x_i)\right)$$
Exact Online MLE for exponential families
A recursive formulation is easily obtained.
Algorithm 1: Exact Online MLE for EF
  Input: a sequence S of observations
  Input: functions $s$ and $(\nabla F)^{-1}$ for some EF
  Output: a sequence of MLEs for all observations seen so far
  $\hat{\eta}^{(0)} = 0$; $N = 1$;
  for $x_N \in S$ do
    $\hat{\eta}^{(N)} = \hat{\eta}^{(N-1)} + N^{-1}\,(s(x_N) - \hat{\eta}^{(N-1)})$;
    yield $\hat{\eta}^{(N)}$ or yield $(\nabla F)^{-1}(\hat{\eta}^{(N)})$;
    $N = N + 1$;
Analytical expressions of $(\nabla F)^{-1}$ exist for most EFs (but not all).
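As a concrete illustration, here is a minimal Python sketch of Algorithm 1 (assuming NumPy; the function names are mine). For readability, the univariate-normal example maps the running average of sufficient statistics back to the source parameters $(\mu, \sigma^2)$ rather than to the natural parameters.

```python
import numpy as np

def online_mle(stream, s, grad_F_inv):
    """Algorithm 1 sketch: maintain the running average of sufficient
    statistics and map it back through (nabla F)^{-1} after each point."""
    eta = None
    for N, x in enumerate(stream, start=1):
        sx = np.asarray(s(x), dtype=float)
        if eta is None:
            eta = np.zeros_like(sx)
        eta += (sx - eta) / N          # recursive update of the MLE in H
        yield grad_F_inv(eta)

# Illustration with a univariate normal, s(x) = (x, x^2):
# eta = (E[x], E[x^2]); the source parameters are mu = eta1, var = eta2 - eta1^2.
s_norm = lambda x: (x, x * x)
to_source = lambda eta: (eta[0], eta[1] - eta[0] ** 2)

rng = np.random.default_rng(0)
for mu, var in online_mle(rng.normal(1.0, 2.0, size=5000), s_norm, to_source):
    pass
print(mu, var)   # should approach (1.0, 4.0)
```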
Case of the multivariate normal distribution (MVN)
Probability density function of the MVN:
$$\mathcal{N}(x;\mu,\Sigma) = (2\pi)^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}} \exp\left\{-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$$
One possible decomposition:
$$\mathcal{N}(x;\theta_1,\theta_2) = \exp\left\{\langle\theta_1, x\rangle + \langle\theta_2, -xx^T\rangle - F(\theta_1,\theta_2)\right\},\qquad F(\theta_1,\theta_2) = \tfrac{1}{4}\,{}^t\theta_1\,\theta_2^{-1}\theta_1 + \tfrac{d}{2}\log(\pi) - \tfrac{1}{2}\log|\theta_2|$$
$$s(x) = (x, -xx^T) \;\Longrightarrow\; (\nabla F)^{-1}(\eta_1,\eta_2) = \left( (-\eta_1\eta_1^T - \eta_2)^{-1}\eta_1,\; \tfrac{1}{2}(-\eta_1\eta_1^T - \eta_2)^{-1} \right)$$
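A possible NumPy transcription of this decomposition (the helper names are mine); the last function maps the expectation parameters back to the familiar $(\mu, \Sigma)$, which is often what one wants to report:

```python
import numpy as np

def s_mvn(x):
    """Sufficient statistic of the MVN: s(x) = (x, -x x^T)."""
    x = np.asarray(x, dtype=float)
    return x, -np.outer(x, x)

def grad_F_inv_mvn(eta1, eta2):
    """(nabla F)^{-1}: expectation -> natural parameters, as above."""
    P = np.linalg.inv(-np.outer(eta1, eta1) - eta2)   # = Sigma^{-1}
    return P @ eta1, 0.5 * P                          # (theta1, theta2)

def mvn_source_params(eta1, eta2):
    """Expectation parameters -> source parameters (mu, Sigma)."""
    return eta1, -eta2 - np.outer(eta1, eta1)
```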
Case of the Wishart distribution
See details in the paper.
Finite (parametric) mixture models
Now, $\pi$ will be approximated by a finite (parametric) mixture $f(\cdot;\theta)$ indexed by $\theta$:
$$\pi(x) \approx f(x;\theta) = \sum_{j=1}^K w_j\, f_j(x;\theta_j), \qquad 0 \le w_j \le 1,\quad \sum_{j=1}^K w_j = 1$$
where the $w_j$ are the mixing proportions and the $f_j$ are the component distributions.
When all the $f_j$'s are EFs, it is called a mixture of EFs (MEF).
[Figure: unknown true distribution f*, mixture distribution f = 0.1*dnorm(x) + 0.6*dnorm(x, 4, 2) + 0.3*dnorm(x, -2, 0.5), and the component density functions f_j.]
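For instance, the mixture in the figure can be evaluated with a short SciPy sketch (parameter values taken from the figure; dnorm's third argument is a standard deviation):

```python
import numpy as np
from scipy.stats import norm

# Mixture of the figure: weights and (mean, standard deviation) per component.
weights = np.array([0.1, 0.6, 0.3])
components = [(0.0, 1.0), (4.0, 2.0), (-2.0, 0.5)]

def mixture_pdf(x):
    """f(x; theta) = sum_j w_j f_j(x; theta_j)."""
    return sum(w * norm.pdf(x, mu, sd) for w, (mu, sd) in zip(weights, components))

xs = np.linspace(-5, 10, 300)
density = mixture_pdf(xs)     # the curve plotted in the figure
```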
Incompleteness in mixture models
- Incomplete (observable) data: $\chi = \{x_1, \ldots, x_N\}$
- Complete (partly unobservable) data: $\chi_c = \{y_1 = (x_1, z_1), \ldots, y_N = (x_N, z_N)\}$
with $Z_i \sim \mathrm{Cat}_K(w)$ and $X_i \mid Z_i = j \sim f_j(\cdot;\theta_j)$.
For a MEF, the joint density $p(x,z;\theta)$ is an EF:
$$\log p(x,z;\theta) = \sum_{j=1}^K [z=j]\,\left\{\log(w_j) + \langle\theta_j, s_j(x)\rangle + k_j(x) - F_j(\theta_j)\right\}$$
$$= \sum_{j=1}^K \left\langle \begin{pmatrix}[z=j] \\ [z=j]\, s_j(x)\end{pmatrix},\; \begin{pmatrix}\log w_j - F_j(\theta_j) \\ \theta_j\end{pmatrix} \right\rangle + k(x,z)$$
Expectation-Maximization (EM) [1]
The EM algorithm iteratively maximizes $Q(\theta;\hat{\theta}^{(t)},\chi)$.
Algorithm 2: EM algorithm
  Input: $\hat{\theta}^{(0)}$, initial parameters of the model
  Input: $\chi^{(N)} = \{x_1, \ldots, x_N\}$
  Output: a (local) maximizer $\hat{\theta}^{(t^*)}$ of $\log f(\chi;\theta)$
  $t \leftarrow 0$;
  repeat
    Compute $Q(\theta;\hat{\theta}^{(t)},\chi) := E_{\hat{\theta}^{(t)}}[\log p(\chi_c;\theta) \mid \chi]$;  // E-Step
    Choose $\hat{\theta}^{(t+1)} = \operatorname*{argmax}_\theta Q(\theta;\hat{\theta}^{(t)},\chi)$;  // M-Step
    $t \leftarrow t+1$;
  until convergence of the complete log-likelihood;
EM for MEF
For a mixture, the E-Step is always explicit:
$$\hat{z}^{(t)}_{i,j} = \hat{w}^{(t)}_j f(x_i;\hat{\theta}^{(t)}_j) \Big/ \sum_{j'} \hat{w}^{(t)}_{j'} f(x_i;\hat{\theta}^{(t)}_{j'})$$
For a MEF, the M-Step then reduces to:
$$\hat{\theta}^{(t+1)} = \operatorname*{argmax}_{\{w_j,\theta_j\}} \sum_{j=1}^K \left\langle \begin{pmatrix}\sum_i \hat{z}^{(t)}_{i,j} \\ \sum_i \hat{z}^{(t)}_{i,j}\, s_j(x_i)\end{pmatrix},\; \begin{pmatrix}\log w_j - F_j(\theta_j) \\ \theta_j\end{pmatrix}\right\rangle$$
$$\hat{w}^{(t+1)}_j = \sum_{i=1}^N \hat{z}^{(t)}_{i,j} \Big/ N, \qquad \hat{\eta}^{(t+1)}_j = \nabla F_j(\hat{\theta}^{(t+1)}_j) = \frac{\sum_i \hat{z}^{(t)}_{i,j}\, s_j(x_i)}{\sum_i \hat{z}^{(t)}_{i,j}} \quad (\text{weighted average of SS})$$
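A minimal batch EM sketch for the simplest MEF, a univariate Gaussian mixture with $s_j(x) = (x, x^2)$ (assuming NumPy/SciPy; the function name and fixed iteration count are mine, and convergence monitoring is omitted):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, w, mu, var, n_iter=50):
    """Batch EM for a 1D Gaussian mixture, written with the sufficient
    statistics s_j(x) = (x, x^2) exactly as in the M-step above."""
    x = np.asarray(x, dtype=float)
    w, mu, var = (np.asarray(a, dtype=float) for a in (w, mu, var))
    N, K = x.size, w.size
    for _ in range(n_iter):
        # E-step: posterior responsibilities z[i, j]
        dens = np.stack([w[j] * norm.pdf(x, mu[j], np.sqrt(var[j]))
                         for j in range(K)], axis=1)
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted averages of the sufficient statistics
        Nj = z.sum(axis=0)
        w = Nj / N
        eta1 = (z * x[:, None]).sum(axis=0) / Nj          # per-component E[x]
        eta2 = (z * (x ** 2)[:, None]).sum(axis=0) / Nj   # per-component E[x^2]
        mu, var = eta1, eta2 - eta1 ** 2                  # (nabla F_j)^{-1}
    return w, mu, var
```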
k-Maximum Likelihood Estimator (k-MLE) [2]
The k-MLE introduces a geometric split $\chi = \bigcup_{j=1}^K \hat{\chi}^{(t)}_j$ to accelerate EM:
$$\tilde{z}^{(t)}_{i,j} = \left[\operatorname*{argmax}_{j'} \hat{w}_{j'} f(x_i;\hat{\theta}^{(t)}_{j'}) = j\right]$$
Equivalently, it amounts to maximizing $Q$ over the partition $Z$ [3].
For a MEF, the M-Step of the k-MLE then reduces to:
$$\hat{\theta}^{(t+1)} = \operatorname*{argmax}_{\{w_j,\theta_j\}} \sum_{j=1}^K \left\langle \begin{pmatrix} |\hat{\chi}^{(t)}_j| \\ \sum_{x_i\in\hat{\chi}^{(t)}_j} s_j(x_i) \end{pmatrix},\; \begin{pmatrix}\log w_j - F_j(\theta_j) \\ \theta_j\end{pmatrix}\right\rangle$$
$$\hat{w}^{(t+1)}_j = |\hat{\chi}^{(t)}_j| / N, \qquad \hat{\eta}^{(t+1)}_j = \nabla F_j(\hat{\theta}^{(t+1)}_j) = \frac{\sum_{x_i\in\hat{\chi}^{(t)}_j} s_j(x_i)}{|\hat{\chi}^{(t)}_j|} \quad (\text{cluster-wise unweighted average of SS})$$
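For comparison, one k-MLE iteration in the same 1D Gaussian setting replaces the responsibilities by a hard geometric split (a sketch under the same assumptions as the EM sketch above; the handling of empty clusters is a simplification of mine):

```python
import numpy as np
from scipy.stats import norm

def kmle_step_1d(x, w, mu, var):
    """One k-MLE iteration: hard assignment, then cluster-wise
    unweighted averages of the sufficient statistics."""
    x = np.asarray(x, dtype=float)
    K = w.size
    dens = np.stack([w[j] * norm.pdf(x, mu[j], np.sqrt(var[j]))
                     for j in range(K)], axis=1)
    labels = dens.argmax(axis=1)                 # geometric split of chi
    for j in range(K):
        xj = x[labels == j]
        if xj.size == 0:
            continue                             # keep old parameters of empty clusters
        w[j] = xj.size / x.size
        mu[j] = xj.mean()                        # eta1: unweighted average of x
        var[j] = (xj ** 2).mean() - mu[j] ** 2   # from eta2: unweighted average of x^2
    return w, mu, var, labels
```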
Online learning of mixtures
Consider now the online setting $x_1, x_2, \ldots, x_N, \ldots$
- Denote by $\hat{\theta}^{(N)}$ or $\hat{\eta}^{(N)}$ the parameter estimate after processing $N$ observations.
- Denote by $\hat{\theta}^{(0)}$ or $\hat{\eta}^{(0)}$ their initial values.
Remark: for a fixed-size dataset $\chi$, one may apply multiple passes (with shuffling) over $\chi$. However, the increase of the likelihood function is no longer guaranteed after an iteration.
Stochastic approximations of EM (1)
Two main approaches to online EM-like estimation.
Stochastic M-Step: Recursive EM (1984) [5]
$$\hat{\theta}^{(N)} = \hat{\theta}^{(N-1)} + \{N\, I_c(\hat{\theta}^{(N-1)})\}^{-1}\, \nabla_\theta \log f(x_N;\hat{\theta}^{(N-1)})$$
where $I_c$ is the Fisher information matrix for the complete data:
$$I_c(\hat{\theta}^{(N-1)}) = -E_{\hat{\theta}^{(N-1)}}\!\left[\frac{\partial^2 \log p(x,z;\theta)}{\partial\theta\,\partial\theta^T}\right]$$
A justification for this formula comes from Fisher's identity:
$$\nabla_\theta \log f(x;\theta) = E_\theta[\nabla_\theta \log p(x,z;\theta) \mid x]$$
One can recognize a second-order stochastic gradient ascent, which requires updating and inverting $I_c$ after each iteration.
Stochastic approximations of EM (2)
Stochastic E-Step: Online EM (2009) [7]
$$\hat{Q}^{(N)}(\theta) = \hat{Q}^{(N-1)}(\theta) + \alpha^{(N)}\left( E_{\hat{\theta}^{(N-1)}}[\log p(x_N, z_N;\theta) \mid x_N] - \hat{Q}^{(N-1)}(\theta) \right)$$
In the case of a MEF, the algorithm works only with the conditional expectation of the sufficient statistics of the complete data, $\hat{z}_{N,j} = E_{\hat{\theta}^{(N-1)}}[z_{N,j} \mid x_N]$:
$$\begin{pmatrix}\hat{S}^{(N)}_{w_j} \\ \hat{S}^{(N)}_{\theta_j}\end{pmatrix} = \begin{pmatrix}\hat{S}^{(N-1)}_{w_j} \\ \hat{S}^{(N-1)}_{\theta_j}\end{pmatrix} + \alpha^{(N)}\left( \begin{pmatrix}\hat{z}_{N,j} \\ \hat{z}_{N,j}\, s_j(x_N)\end{pmatrix} - \begin{pmatrix}\hat{S}^{(N-1)}_{w_j} \\ \hat{S}^{(N-1)}_{\theta_j}\end{pmatrix} \right)$$
The M-Step is unchanged:
$$\hat{w}^{(N)}_j = \hat{S}^{(N)}_{w_j}, \qquad \hat{\eta}^{(N)}_{\theta_j} = \hat{S}^{(N)}_{\theta_j} / \hat{S}^{(N)}_{w_j}, \qquad \hat{\theta}^{(N)}_j = (\nabla F_j)^{-1}\big(\hat{S}^{(N)}_{\theta_j} / \hat{S}^{(N)}_{w_j}\big)$$
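The update is easy to transcribe; below is a minimal sketch for a 1D Gaussian mixture (assuming NumPy/SciPy; names are mine). The step-size policy $\alpha^{(N)} = (N+1)^{-0.6}$ is only an illustrative choice compatible with the usual Robbins-Monro conditions, not the one prescribed in [7].

```python
import numpy as np
from scipy.stats import norm

def online_em_gmm_1d(stream, w0, mu0, var0, alpha=lambda N: (N + 1) ** -0.6):
    """Online EM for a 1D Gaussian mixture: stochastic E-step on the
    sufficient statistics (S_w, S_x, S_xx), then the explicit M-step."""
    w, mu, var = (np.asarray(a, dtype=float).copy() for a in (w0, mu0, var0))
    S_w = w.copy()                       # S^(0) plays the role of a "prior"
    S_x = w * mu                         # S^(0)_theta_j = w_j * eta^(0)_theta_j
    S_xx = w * (var + mu ** 2)
    for N, x in enumerate(stream, start=1):
        # Stochastic E-step: responsibilities of the new point
        dens = w * norm.pdf(x, mu, np.sqrt(var))
        z = dens / dens.sum()
        a = alpha(N)
        S_w += a * (z - S_w)
        S_x += a * (z * x - S_x)
        S_xx += a * (z * x * x - S_xx)
        # M-step (unchanged): w = S_w, eta = S_theta / S_w, theta = (nabla F_j)^{-1}(eta)
        w = S_w.copy()
        mu = S_x / S_w
        var = S_xx / S_w - mu ** 2
        yield w, mu, var
```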
Stochastic approximations of EM (3)
Some properties:
- The initial values $\hat{S}^{(0)}$ may be used to introduce a "prior": $\hat{S}^{(0)}_{w_j} = w_j$, $\hat{S}^{(0)}_{\theta_j} = w_j\, \eta^{(0)}_{\theta_j}$.
- Parameter constraints are automatically respected.
- No matrix to invert!
- A policy for $\alpha^{(N)}$ has to be chosen (see [7]).
- Consistent, asymptotically equivalent to the recursive EM!
Stochastic approximations of k-MLE (1)
In order to keep the previous advantages of the online EM in an online k-MLE, our only choice concerns the way to assign $x_N$ to a cluster.
Strategy 1: maximize the likelihood of the complete data $(x_N, z_N)$:
$$\tilde{z}_{N,j} = \left[\operatorname*{argmax}_{j'} \hat{w}^{(N-1)}_{j'} f(x_N;\hat{\theta}^{(N-1)}_{j'}) = j\right]$$
Equivalent to online CEM, and similar to MacQueen's iterative k-means.
Stochastic approximations of k-MLE (2)
Strategy 2: maximize the likelihood of the complete data $(x_N, z_N)$ after the M-Step:
$$\tilde{z}_{N,j} = \left[\operatorname*{argmax}_{j'} \hat{w}^{(N)}_{j'} f(x_N;\hat{\theta}^{(N)}_{j'}) = j\right]$$
Similar to Hartigan's method for k-means.
Additional cost: pre-compute all possible M-Steps for the stochastic E-Step.
Stochastic approximations of k-MLE (3)
Strategy 3: draw $\tilde{z}_N$ from the categorical distribution
$$\tilde{z}_N \sim \mathrm{Cat}_K\big(\{p_j \propto \hat{w}^{(N-1)}_j f_j(x_N;\hat{\theta}^{(N-1)}_j)\}_j\big)$$
Similar to the sampling in stochastic EM [3]. The motivation is to try to break the inconsistency of the k-MLE.
For strategies 1 and 3, the M-Step reduces to updating the parameters of a single component.
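In the same 1D Gaussian setting as before, strategies 1 and 3 only change how the label of $x_N$ is produced; here is a minimal sketch reusing the sufficient-statistic bookkeeping of the online EM sketch (names and step-size policy are mine; strategy 2 is omitted since it requires pre-computing every candidate M-Step):

```python
import numpy as np
from scipy.stats import norm

def assign_strategy1(x, w, mu, var, rng=None):
    """Strategy 1: hard label by maximum complete-data likelihood."""
    dens = w * norm.pdf(x, mu, np.sqrt(var))
    return int(np.argmax(dens))

def assign_strategy3(x, w, mu, var, rng=None):
    """Strategy 3: draw the label from the posterior probabilities."""
    rng = rng or np.random.default_rng()
    dens = w * norm.pdf(x, mu, np.sqrt(var))
    return int(rng.choice(w.size, p=dens / dens.sum()))

def online_kmle_gmm_1d(stream, w0, mu0, var0, assign=assign_strategy1,
                       alpha=lambda N: (N + 1) ** -0.6):
    """Online k-MLE sketch: same bookkeeping as online EM, but with a hard
    (or sampled) indicator vector in place of the responsibilities."""
    w, mu, var = (np.asarray(a, dtype=float).copy() for a in (w0, mu0, var0))
    S_w, S_x, S_xx = w.copy(), w * mu, w * (var + mu ** 2)
    for N, x in enumerate(stream, start=1):
        j = assign(x, w, mu, var)            # stochastic E-step (hard label)
        z = np.zeros_like(w)
        z[j] = 1.0
        a = alpha(N)
        S_w += a * (z - S_w)
        S_x += a * (z * x - S_x)
        S_xx += a * (z * x * x - S_xx)
        w = S_w.copy()                       # all weights are rescaled ...
        mu = S_x / S_w                       # ... but only component j's
        var = S_xx / S_w - mu ** 2           # eta (hence theta) changes
        yield w, mu, var
```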