On learning statistical mixtures maximizing the complete likelihood
The k-MLE methodology using geometric hard clustering

Frank Nielsen
École Polytechnique / Sony Computer Science Laboratories

MaxEnt 2014, September 21-26, 2014, Amboise, France
Finite mixtures: Semi-parametric statistical models

◮ Mixture $M \sim \mathrm{MM}(W, \Lambda)$ with density $m(x) = \sum_{i=1}^k w_i p(x|\lambda_i)$ (a mixture of densities, not a sum of random variables!), where $\Lambda = \{\lambda_i\}_i$ and $W = \{w_i\}_i$
◮ Multimodal, universally models smooth densities
◮ Gaussian MMs with support $\mathcal{X} = \mathbb{R}$, Gamma MMs with support $\mathcal{X} = \mathbb{R}^+$ (modeling distances [34])
◮ Pioneered by Karl Pearson [29] (1894); precursors: Francis Galton [13] (1869), Adolphe Quetelet [31] (1846), etc.
◮ Capture sub-populations within an overall population (k = 2, crab data in Pearson [29])
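To make the density definition concrete, here is a minimal sketch that evaluates $m(x) = \sum_i w_i p(x|\lambda_i)$ for a toy univariate GMM; the weights and component parameters are illustrative values, not taken from the slides.

```python
import numpy as np
from scipy.stats import norm

# Toy univariate GMM (illustrative parameters).
weights = np.array([0.3, 0.5, 0.2])     # W, a point of the open probability simplex
means   = np.array([-2.0, 0.0, 3.0])    # component parameters lambda_i
stds    = np.array([0.5, 1.0, 0.8])

def mixture_density(x):
    """m(x) = sum_i w_i p(x | lambda_i): a convex combination of densities,
    not the density of a sum of random variables."""
    x = np.atleast_1d(x)
    return (weights * norm.pdf(x[:, None], means, stds)).sum(axis=1)

print(mixture_density([0.0, 2.5]))
```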
Example of a k = 2-component mixture [17]

Sub-populations (k = 2) within an overall population... sub-species within species, etc.
Truncated distributions (what is the support? Black swans?!)
Sampling from mixtures: Doubly stochastic process

To sample a variate x from a MM:
◮ Choose a component l according to the weight distribution $w_1, ..., w_k$ (multinomial),
◮ Draw a variate x according to $p(x | \lambda_l)$.
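A minimal sketch of this doubly stochastic sampler for a univariate Gaussian mixture; the parameters are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1D Gaussian mixture (hypothetical parameters).
weights = np.array([0.3, 0.5, 0.2])
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.8])

def sample_mixture(n):
    # 1) choose components according to the multinomial weight distribution
    labels = rng.choice(len(weights), size=n, p=weights)
    # 2) draw each variate from the selected component density p(x | lambda_l)
    return rng.normal(means[labels], stds[labels]), labels

x, z = sample_mixture(10000)
```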
Statistical mixtures: Generative data models

Image = 5D xyRGB point set. GMM = feature descriptor for information retrieval (IR).
Increase the dimension d using $s \times s$ color image patches: $d = 2 + 3s^2$.
[Figure: source image, its GMM, and a sample drawn from the GMM (statistical image)]
Low-frequency information is encoded into a compact statistical model.
Mixtures: ε-statistically learnable and ε-estimates

Problem statement: Given n IID d-dimensional observations $x_1, ..., x_n \sim \mathrm{MM}(\Lambda, W)$, estimate $\mathrm{MM}(\hat\Lambda, \hat{W})$:

◮ Theoretical Computer Science (TCS) approach: ε-close parameter recovery (π: permutation)
  ◮ $|w_i - \hat{w}_{\pi(i)}| \leq \epsilon$
  ◮ $\mathrm{KL}(p(x|\lambda_i) : p(x|\hat\lambda_{\pi(i)})) \leq \epsilon$ (or other divergences like TV, etc.)
  Consider ε-learnable MMs:
  ◮ $\min_i w_i \geq \epsilon$
  ◮ $\mathrm{KL}(p(x|\lambda_i) : p(x|\lambda_j)) \geq \epsilon, \ \forall i \neq j$ (or another divergence)
◮ Statistical approach: Define the best model/MM as the one maximizing the likelihood function $l(\Lambda, W) = \prod_i m(x_i | \Lambda, W)$.
Mixture inference: Incomplete versus complete likelihood

◮ Sub-populations within an overall population: the observed data $x_i$ does not include the sub-population label $l_i$
◮ k = 2: Classification and Bayes error (upper bounded by the Chernoff information [24])
◮ Inference: Assume IID data, maximize the (log-)likelihood:
◮ Complete, using indicator variables $z_{i,j}$ (for label $l_i$: $z_{i,l_i} = 1$):
  $l_c = \log \prod_{i=1}^n \prod_{j=1}^k (w_j p(x_i|\theta_j))^{z_{i,j}} = \sum_i \sum_j z_{i,j} \log(w_j p(x_i|\theta_j))$
◮ Incomplete (hidden/latent variables), with log-sum intractability:
  $l_i = \sum_i \log m(x_i | W, \Lambda) = \sum_i \log \sum_j w_j p(x_i|\theta_j)$
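The two criteria are easy to contrast numerically. A small sketch, assuming a toy 1D Gaussian mixture with known labels; all parameter values are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# Toy 1D Gaussian mixture (hypothetical parameters) and data with known labels z.
w = np.array([0.4, 0.6])
mu = np.array([0.0, 4.0])
sigma = np.array([1.0, 1.5])

rng = np.random.default_rng(1)
z = rng.choice(2, size=500, p=w)
x = rng.normal(mu[z], sigma[z])

# Component-wise log(w_j p(x_i | theta_j)), shape (n, k).
log_wp = np.log(w) + norm.logpdf(x[:, None], mu, sigma)

# Complete log-likelihood l_c: uses the indicators z_{i,j} (here the known labels).
l_complete = log_wp[np.arange(len(x)), z].sum()

# Incomplete log-likelihood l_i: log-sum over components (labels hidden).
l_incomplete = logsumexp(log_wp, axis=1).sum()

print(l_complete, l_incomplete)   # l_complete <= l_incomplete, term by term
```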
Mixture learnability and inference algorithms

◮ Which criterion to maximize, the incomplete or the complete likelihood? What kind of evaluation criteria?
◮ From Expectation-Maximization [8] (1977) to TCS methods: polynomial learnability of mixtures [22, 15] (2014), mixtures and core-sets [10] for massive data sets, etc.

Some technicalities:
◮ Many local maxima of the likelihood functions $l_i$ and $l_c$ (EM converges locally and needs a stopping criterion)
◮ Multimodal density (# modes can exceed k [9]; ghost modes even for isotropic GMMs)
◮ Identifiability (permutation of labels, parameter distinctness)
◮ Irregularity: the Fisher information may be zero [6], convergence speed of EM
◮ etc.
Learning MMs: A geometric hard clustering viewpoint

$\max_{W,\Lambda} l_c(W,\Lambda) = \max_{W,\Lambda} \sum_{i=1}^n \max_{j=1}^k \log(w_j p(x_i|\theta_j))$
$\equiv \min_{W,\Lambda} \sum_i \min_j \left(-\log p(x_i|\theta_j) - \log w_j\right)$
$= \min_{W,\Lambda} \sum_{i=1}^n \min_{j=1}^k D_j(x_i),$

where $c_j = (w_j, \theta_j)$ (the cluster prototype) and $D_j(x_i) = -\log p(x_i|\theta_j) - \log w_j$ are potential distance-like functions.

◮ Maximizing the complete likelihood amounts to a geometric hard clustering [37, 11] for fixed $w_j$'s (the distance $D_j(\cdot)$ depends on the cluster prototype $c_j$): $\min_\Lambda \sum_i \min_j D_j(x_i)$.
◮ Related to classification EM [5] (CEM), hard/truncated EM
◮ The solution of $\arg\max l_c$ can be used to initialize $l_i$ (then optimized by EM)
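A quick numerical check of this equivalence on a toy 1D Gaussian mixture (illustrative parameters): the hard-clustering cost $\sum_i \min_j D_j(x_i)$ equals minus the complete log-likelihood under the per-point maximizing labels.

```python
import numpy as np
from scipy.stats import norm

# Toy 1D GMM (illustrative parameters) and synthetic data.
w  = np.array([0.4, 0.6])
mu = np.array([0.0, 4.0])
sd = np.array([1.0, 1.5])
rng = np.random.default_rng(2)
labels_true = rng.choice(2, size=300, p=w)
x = rng.normal(mu[labels_true], sd[labels_true])

# D_j(x_i) = -log p(x_i | theta_j) - log w_j, shape (n, k).
D = -norm.logpdf(x[:, None], mu, sd) - np.log(w)

# Hard-clustering cost  sum_i min_j D_j(x_i) ...
cost = D.min(axis=1).sum()

# ... equals minus the complete log-likelihood under the best per-point labels.
labels = D.argmin(axis=1)
l_c = (np.log(w[labels]) + norm.logpdf(x, mu[labels], sd[labels])).sum()
assert np.isclose(cost, -l_c)
```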
The k-MLE method: k-means type clustering algorithms

k-MLE:
1. Initialize the weights W (in the open probability simplex $\Delta_k$)
2. Solve $\min_\Lambda \sum_i \min_j D_j(x_i)$ (center-based clustering, W fixed)
3. Solve $\min_W \sum_i \min_j D_j(x_i)$ ($\Lambda$ fixed)
4. Test for convergence, otherwise go to step 2.

⇒ group coordinate ascent (on the likelihood) / descent (on the distance) optimization.
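A minimal runnable sketch of these four steps for a univariate Gaussian mixture; this is my own toy instantiation, not the authors' reference implementation. The inner assignment/relocation of step 2 and the weight update of step 3 follow the later slides.

```python
import numpy as np
from scipy.stats import norm

def k_mle_gaussian_1d(x, k, iters=50, seed=0):
    """Toy k-MLE for a univariate GMM: block-coordinate ascent on the complete
    likelihood, i.e. descent on the cost sum_i min_j D_j(x_i)."""
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                      # step 1: weights in the open simplex
    mu = rng.choice(x, size=k, replace=False)    # crude initialization of the components
    sd = np.full(k, x.std())
    prev_cost = np.inf
    for _ in range(iters):
        # Step 2 (Lambda update, W fixed): assignment by D_j, then MLE relocation.
        D = -norm.logpdf(x[:, None], mu, sd) - np.log(w)
        labels = D.argmin(axis=1)
        for j in range(k):
            cj = x[labels == j]
            if cj.size > 1:                      # the MLE needs enough points per cluster
                mu[j], sd[j] = cj.mean(), max(cj.std(), 1e-3)
        # Step 3 (W update, Lambda fixed): weights = cluster proportions.
        counts = np.maximum(np.bincount(labels, minlength=k), 1)
        w = counts / counts.sum()
        # Step 4: convergence test on the hard-clustering cost (pre-relocation value).
        cost = D.min(axis=1).sum()
        if abs(prev_cost - cost) < 1e-8:
            break
        prev_cost = cost
    return w, mu, sd

# Usage on synthetic data:
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 2, 700)])
print(k_mle_gaussian_1d(data, k=2))
```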
k-MLE: Center-based clustering, W fixed

Solve $\min_\Lambda \sum_i \min_j D_j(x_i)$

k-means-type convergence proof for assignment/relocation:
◮ Data assignment: $\forall i$, $l_i = \arg\max_j w_j p(x_i|\lambda_j) = \arg\min_j D_j(x_i)$, $\mathcal{C}_j = \{x_i \mid l_i = j\}$
◮ Center relocation: $\forall j$, $\lambda_j = \mathrm{MLE}(\mathcal{C}_j)$

Farthest Maximum Likelihood (FML) Voronoi diagram:
$\mathrm{Vor}_{\mathrm{FML}}(c_i) = \{x \in \mathcal{X} : w_i p(x|\lambda_i) \geq w_j p(x|\lambda_j), \ \forall j \neq i\}$
$\mathrm{Vor}(c_i) = \{x \in \mathcal{X} : D_i(x) \leq D_j(x), \ \forall j \neq i\}$

FML Voronoi ≡ additively weighted Voronoi with $D_l(x) = -\log p(x|\lambda_l) - \log w_l$.
k-MLE: Example for mixtures of exponential families

Exponential family: the component density $p(x|\theta) = \exp(t(x)^\top \theta - F(\theta) + k(x))$ is log-concave, with:
◮ $t(x)$: sufficient statistic in $\mathbb{R}^D$, D: family order
◮ $k(x)$: auxiliary carrier term (wrt the Lebesgue/counting measure)
◮ $F(\theta)$: log-normalizer, cumulant function, log-partition

$D_j(x)$ is convex: k-means-type clustering wrt convex "distances".
Farthest ML Voronoi ≡ additively-weighted Bregman Voronoi [4]:

$-\log p(x;\theta) - \log w = F(\theta) - t(x)^\top \theta - k(x) - \log w = B_{F^*}(t(x) : \eta) - F^*(t(x)) - k(x) - \log w$

where $F^*(\eta) = \max_\theta (\theta^\top \eta - F(\theta))$ is the Legendre-Fenchel convex conjugate.
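A small numerical check of this Bregman rewriting, instantiated (my choice, not from the slide) on the unit-variance univariate Gaussian family, where $t(x) = x$, $\theta = \mu$, $F(\theta) = \theta^2/2$, $k(x) = -x^2/2 - \frac{1}{2}\log 2\pi$, so $F^*(\eta) = \eta^2/2$ and $B_{F^*}$ is half the squared Euclidean distance.

```python
import numpy as np
from scipy.stats import norm

# Unit-variance univariate Gaussian as an exponential family (illustrative choice):
# t(x) = x, theta = mu, F(theta) = theta**2/2, k(x) = -x**2/2 - 0.5*log(2*pi).
F      = lambda th:  0.5 * th**2
Fstar  = lambda eta: 0.5 * eta**2               # Legendre-Fenchel conjugate of F
gradFs = lambda eta: eta                        # grad F* = (grad F)^{-1}
kx     = lambda x:  -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def bregman(Fn, grad, p, q):
    # Bregman divergence B_Fn(p : q)
    return Fn(p) - Fn(q) - (p - q) * grad(q)

theta, x = 1.3, 0.4
eta = theta                                     # eta = grad F(theta) = theta here
lhs = -norm.logpdf(x, loc=theta, scale=1.0)     # -log p(x; theta)
rhs = bregman(Fstar, gradFs, x, eta) - Fstar(x) - kx(x)
assert np.isclose(lhs, rhs)
```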
Exponential families: Rayleigh distributions [36, 25]

Application: IntraVascular UltraSound (IVUS) imaging.

Rayleigh distribution (a Weibull distribution with shape k = 2):
$p(x;\lambda) = \frac{x}{\lambda^2} e^{-\frac{x^2}{2\lambda^2}}, \quad x \in \mathbb{R}^+ = \mathcal{X}$

d = 1 (univariate), D = 1 (order 1)
$\theta = -\frac{1}{2\lambda^2}$, $\Theta = (-\infty, 0)$
$F(\theta) = -\log(-2\theta)$
$t(x) = x^2$
$k(x) = \log x$

Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs): used for segmentation and classification tasks.
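A minimal sketch of the closed-form MLE for a single Rayleigh component via the moment equation of this family (synthetic data; the true scale is an arbitrary test value): with $\eta = \nabla F(\theta) = -1/\theta = 2\lambda^2$, the moment equation $\hat\eta = \overline{t(x)}$ gives $\hat\lambda = \sqrt{\overline{x^2}/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 2.0
x = rng.rayleigh(scale=lam_true, size=10000)    # synthetic IVUS-like amplitudes

# Sufficient statistic t(x) = x^2; eta = grad F(theta) = -1/theta = 2*lambda^2.
# Moment equation  eta_hat = mean(t(x))  gives the MLE in closed form:
eta_hat = np.mean(x**2)
lam_hat = np.sqrt(eta_hat / 2.0)
print(lam_hat)                                  # close to lam_true
```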
Exponential families: Multivariate Gaussians [14, 25]

Gaussian Mixture Models (GMMs). (A color image is interpreted as a 5D xyRGB point set.)

Gaussian density:
$p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{|\Sigma|}} \exp\left(-\frac{1}{2} D_{\Sigma^{-1}}(x, \mu)\right)$

Squared Mahalanobis distance: $D_Q(x,y) = (x-y)^\top Q (x-y)$

$x \in \mathbb{R}^d = \mathcal{X}$ (multivariate), $D = \frac{d(d+3)}{2}$ (order)
$\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) = (\theta_v, \theta_M)$, $\Theta = \mathbb{R}^d \times S_{++}^d$
$F(\theta) = \frac{1}{4}\theta_v^\top \theta_M^{-1} \theta_v - \frac{1}{2}\log|\theta_M| + \frac{d}{2}\log\pi$
$t(x) = (x, -xx^\top)$
$k(x) = 0$
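A small sketch converting source parameters $(\mu, \Sigma)$ to the natural parameters above and checking the exponential-family decomposition $\log p = \langle t(x), \theta\rangle - F(\theta)$ (the 2D parameter values are illustrative).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Source parameters of a 2D Gaussian (illustrative values).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
d = mu.size

# Natural parameters of the slide: theta = (Sigma^{-1} mu, 0.5 * Sigma^{-1}).
P = np.linalg.inv(Sigma)
theta_v, theta_M = P @ mu, 0.5 * P

# Log-normalizer F(theta) = 1/4 theta_v^T theta_M^{-1} theta_v - 1/2 log|theta_M| + d/2 log(pi).
F = (0.25 * theta_v @ np.linalg.inv(theta_M) @ theta_v
     - 0.5 * np.linalg.slogdet(theta_M)[1] + 0.5 * d * np.log(np.pi))

# Check  log p = <t(x), theta> - F(theta)  with k(x) = 0, where t(x) = (x, -x x^T)
# and the matrix part uses the trace inner product: <-x x^T, theta_M> = -x^T theta_M x.
x = np.array([0.5, 0.7])
inner = x @ theta_v - x @ theta_M @ x
assert np.isclose(inner - F, multivariate_normal.logpdf(x, mu, Sigma))
```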
The k-MLE method for exponential families

k-MLE-EF:
1. Initialize the weights W (in the open probability simplex $\Delta_k$)
2. Solve $\min_\Lambda \sum_i \min_j (B_{F^*}(t(x_i) : \eta_j) - \log w_j)$
3. Solve $\min_W \sum_i \min_j D_j(x_i)$
4. Test for convergence, otherwise go to step 2.

Assignment condition in step 2: additively-weighted Bregman Voronoi diagram.
k-MLE: Solving for the weights given the component parameters

Solve $\min_W \sum_i \min_j D_j(x_i)$

Amounts to $\arg\min_W -\sum_j n_j \log w_j = \arg\min_W -\sum_j \frac{n_j}{n} \log w_j$, where $n_j = \#\{x_i \in \mathrm{Vor}(c_j)\} = |\mathcal{C}_j|$, i.e.,

$\min_{W \in \Delta_k} H^\times(N : W)$

where $N = (\frac{n_1}{n}, ..., \frac{n_k}{n}) \in \Delta_k$ is the cluster point proportion vector.

The cross-entropy $H^\times(N : W)$ is minimized when $H^\times(N : W) = H(N)$, that is, when W = N.
Kullback-Leibler divergence: $\mathrm{KL}(N : W) = H^\times(N : W) - H(N) = 0$ when W = N.
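A tiny numerical illustration of the cross-entropy argument (the proportion vector N is an arbitrary example): $H^\times(N : W) \geq H(N)$ for any W on the simplex, with equality at W = N, so the weight step simply sets W to the cluster proportions.

```python
import numpy as np

# Cluster-size proportions N (illustrative values) on the simplex.
N = np.array([0.2, 0.5, 0.3])

def cross_entropy(N, W):              # H_x(N : W) = -sum_j (n_j/n) log w_j
    return -np.sum(N * np.log(W))

H_N = cross_entropy(N, N)             # H_x(N : N) = H(N), the Shannon entropy of N

# KL(N : W) = H_x(N : W) - H(N) >= 0, with equality iff W = N,
# so the minimizing weight vector is the proportion vector itself.
rng = np.random.default_rng(0)
for _ in range(5):
    W = rng.dirichlet(np.ones(3))
    assert cross_entropy(N, W) >= H_N - 1e-12
print("W_opt =", N)
```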
MLE for exponential families

Given an ML farthest Voronoi partition, compute the MLEs $\hat\theta_j$:

$\hat\theta_j = \arg\max_{\theta \in \Theta} \prod_{x_i \in \mathrm{Vor}(c_j)} p_F(x_i; \theta)$

The maximum is unique (***) since $\nabla^2 F(\theta) \succ 0$:

Moment equation: $\nabla F(\hat\theta_j) = \eta(\hat\theta_j) = \frac{1}{n_j} \sum_{x_i \in \mathrm{Vor}(c_j)} t(x_i) = \bar{t} = \hat\eta$

The MLE is consistent and efficient, with asymptotically normal distribution:

$\hat\theta_j \sim N\left(\theta_j, \frac{1}{n_j} I^{-1}(\theta_j)\right)$

Fisher information matrix: $I(\theta_j) = \mathrm{var}[t(X)] = \nabla^2 F(\theta_j) = (\nabla^2 F^*)^{-1}(\eta_j)$

The MLE may be biased (e.g., for normal distributions).
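A minimal sketch of the moment equation in action, instantiated (my choice, not from the slide) on the Poisson family, where $t(x) = x$, $F(\theta) = e^\theta$ and hence $\eta = \nabla F(\theta) = \lambda$.

```python
import numpy as np

# Moment equation for an exponential family:  grad F(theta_hat) = mean of t(x).
# Poisson instance: t(x) = x, F(theta) = exp(theta), so eta = exp(theta) = lambda.
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=20000)

eta_hat = np.mean(x)              # eta_hat = bar t = (1/n) sum_i t(x_i)
theta_hat = np.log(eta_hat)       # theta_hat = (grad F)^{-1}(eta_hat) = grad F*(eta_hat)
lam_hat = np.exp(theta_hat)       # back to the source parameter: the MLE of lambda
print(lam_hat)                    # close to 3.5

# Asymptotic normality: var(theta_hat) ~ (1/n) I(theta)^{-1}, with I(theta) = grad^2 F(theta).
```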
Existence of MLEs for exponential families (***)

For minimal and full EFs, the MLE is guaranteed to exist [3, 21] provided that the matrix

$T = \begin{pmatrix} 1 & t_1(x_1) & \cdots & t_D(x_1) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & t_1(x_n) & \cdots & t_D(x_n) \end{pmatrix} \qquad (1)$

of dimension $n \times (D + 1)$ has rank D + 1 [3]. For example, the MLE of a multivariate normal is problematic with n < d observations (undefined, with likelihood → ∞).

Condition: $\bar{t} = \frac{1}{n_j} \sum_{x_i \in \mathrm{Vor}(c_j)} t(x_i) \in \mathrm{int}(C)$, where C is the closed convex support.
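A sketch of checking the rank condition on the design matrix T, illustrated (hypothetical helper, my construction) for the multivariate normal where $D = d(d+3)/2$; with too few observations the condition fails, matching the n < d remark above.

```python
import numpy as np

def mle_exists_condition(t_values):
    """Check the rank condition above: the n x (D+1) matrix
    T = [1, t_1(x_i), ..., t_D(x_i)] must have rank D + 1."""
    t_values = np.atleast_2d(t_values)           # shape (n, D)
    n, D = t_values.shape
    T = np.hstack([np.ones((n, 1)), t_values])
    return np.linalg.matrix_rank(T) == D + 1

def mvn_sufficient_stats(X):
    # t(x) = (x, -x x^T); keep only the d(d+1)/2 distinct entries of the matrix part.
    d = X.shape[1]
    quad = np.array([(-np.outer(x, x))[np.triu_indices(d)] for x in X])
    return np.hstack([X, quad])                  # shape (n, d(d+3)/2)

rng = np.random.default_rng(0)
d = 3
print(mle_exists_condition(mvn_sufficient_stats(rng.normal(size=(2, d)))))   # False: n too small
print(mle_exists_condition(mvn_sufficient_stats(rng.normal(size=(20, d)))))  # True for generic data
```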