On learning statistical mixtures maximizing the complete likelihood
The k-MLE methodology using geometric hard clustering

Frank Nielsen
École Polytechnique / Sony Computer Science Laboratories

MaxEnt 2014, September 21-26, 2014, Amboise, France
Finite mixtures: Semi-parametric statistical models

◮ Mixture $M \sim \mathrm{MM}(W, \Lambda)$ with density $m(x) = \sum_{i=1}^k w_i p(x|\lambda_i)$ (a mixture of densities, not a sum of random variables!), where $\Lambda = \{\lambda_i\}_i$ and $W = \{w_i\}_i$
◮ Multimodal, universally models smooth densities
◮ Gaussian MMs with support $\mathcal{X} = \mathbb{R}$, Gamma MMs with support $\mathcal{X} = \mathbb{R}^+$ (modeling distances [34])
◮ Pioneered by Karl Pearson [29] (1894); precursors: Francis Galton [13] (1869), Adolphe Quetelet [31] (1846), etc.
◮ Capture sub-populations within an overall population (k = 2, crab data in Pearson [29])
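To make the density definition concrete, here is a minimal sketch that evaluates $m(x) = \sum_i w_i p(x|\lambda_i)$ for a toy univariate GMM; the weights and component parameters are illustrative values, not taken from the slides.

```python
import numpy as np
from scipy.stats import norm

# Toy univariate GMM (illustrative parameters).
weights = np.array([0.3, 0.5, 0.2])     # W, a point of the open probability simplex
means   = np.array([-2.0, 0.0, 3.0])    # component parameters lambda_i
stds    = np.array([0.5, 1.0, 0.8])

def mixture_density(x):
    """m(x) = sum_i w_i p(x | lambda_i): a convex combination of densities,
    not the density of a sum of random variables."""
    x = np.atleast_1d(x)
    return (weights * norm.pdf(x[:, None], means, stds)).sum(axis=1)

print(mixture_density([0.0, 2.5]))
```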
Example of a k = 2-component mixture [17]

Sub-populations (k = 2) within an overall population... sub-species within species, etc.
Truncated distributions (what is the support? Black swans?!)
Sampling from mixtures: Doubly stochastic process

To sample a variate x from a MM:
◮ Choose a component l according to the weight distribution $w_1, ..., w_k$ (multinomial),
◮ Draw a variate x according to $p(x | \lambda_l)$.
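A minimal sketch of this doubly stochastic sampler for a univariate Gaussian mixture; the parameters are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1D Gaussian mixture (hypothetical parameters).
weights = np.array([0.3, 0.5, 0.2])
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.8])

def sample_mixture(n):
    # 1) choose components according to the multinomial weight distribution
    labels = rng.choice(len(weights), size=n, p=weights)
    # 2) draw each variate from the selected component density p(x | lambda_l)
    return rng.normal(means[labels], stds[labels]), labels

x, z = sample_mixture(10000)
```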
Statistical mixtures: Generative data models

Image = 5D xyRGB point set. GMM = feature descriptor for information retrieval (IR).
Increase the dimension d using $s \times s$ color image patches: $d = 2 + 3s^2$.
[Figure: source image, its GMM, and a sample drawn from the GMM (statistical image)]
Low-frequency information is encoded into a compact statistical model.
Mixtures: ε-statistically learnable and ε-estimates

Problem statement: Given n IID d-dimensional observations $x_1, ..., x_n \sim \mathrm{MM}(\Lambda, W)$, estimate $\mathrm{MM}(\hat\Lambda, \hat{W})$:

◮ Theoretical Computer Science (TCS) approach: ε-close parameter recovery (π: permutation)
  ◮ $|w_i - \hat{w}_{\pi(i)}| \leq \epsilon$
  ◮ $\mathrm{KL}(p(x|\lambda_i) : p(x|\hat\lambda_{\pi(i)})) \leq \epsilon$ (or other divergences like TV, etc.)
  Consider ε-learnable MMs:
  ◮ $\min_i w_i \geq \epsilon$
  ◮ $\mathrm{KL}(p(x|\lambda_i) : p(x|\lambda_j)) \geq \epsilon, \ \forall i \neq j$ (or another divergence)
◮ Statistical approach: Define the best model/MM as the one maximizing the likelihood function $l(\Lambda, W) = \prod_i m(x_i | \Lambda, W)$.
Mixture inference: Incomplete versus complete likelihood

◮ Sub-populations within an overall population: the observed data $x_i$ does not include the sub-population label $l_i$
◮ k = 2: Classification and Bayes error (upper bounded by the Chernoff information [24])
◮ Inference: Assume IID data, maximize the (log-)likelihood:
◮ Complete, using indicator variables $z_{i,j}$ (for label $l_i$: $z_{i,l_i} = 1$):
  $l_c = \log \prod_{i=1}^n \prod_{j=1}^k (w_j p(x_i|\theta_j))^{z_{i,j}} = \sum_i \sum_j z_{i,j} \log(w_j p(x_i|\theta_j))$
◮ Incomplete (hidden/latent variables), with log-sum intractability:
  $l_i = \sum_i \log m(x_i | W, \Lambda) = \sum_i \log \sum_j w_j p(x_i|\theta_j)$
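The two criteria are easy to contrast numerically. A small sketch, assuming a toy 1D Gaussian mixture with known labels; all parameter values are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# Toy 1D Gaussian mixture (hypothetical parameters) and data with known labels z.
w = np.array([0.4, 0.6])
mu = np.array([0.0, 4.0])
sigma = np.array([1.0, 1.5])

rng = np.random.default_rng(1)
z = rng.choice(2, size=500, p=w)
x = rng.normal(mu[z], sigma[z])

# Component-wise log(w_j p(x_i | theta_j)), shape (n, k).
log_wp = np.log(w) + norm.logpdf(x[:, None], mu, sigma)

# Complete log-likelihood l_c: uses the indicators z_{i,j} (here the known labels).
l_complete = log_wp[np.arange(len(x)), z].sum()

# Incomplete log-likelihood l_i: log-sum over components (labels hidden).
l_incomplete = logsumexp(log_wp, axis=1).sum()

print(l_complete, l_incomplete)   # l_complete <= l_incomplete, term by term
```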
Mixture learnability and inference algorithms

◮ Which criterion to maximize, the incomplete or the complete likelihood? What kind of evaluation criteria?
◮ From Expectation-Maximization [8] (1977) to TCS methods: polynomial learnability of mixtures [22, 15] (2014), mixtures and core-sets [10] for massive data sets, etc.

Some technicalities:
◮ Many local maxima of the likelihood functions $l_i$ and $l_c$ (EM converges locally and needs a stopping criterion)
◮ Multimodal density (# modes can exceed k [9]; ghost modes even for isotropic GMMs)
◮ Identifiability (permutation of labels, parameter distinctness)
◮ Irregularity: the Fisher information may be zero [6], convergence speed of EM
◮ etc.
Learning MMs: A geometric hard clustering viewpoint

$\max_{W,\Lambda} l_c(W,\Lambda) = \max_{W,\Lambda} \sum_{i=1}^n \max_{j=1}^k \log(w_j p(x_i|\theta_j))$
$\equiv \min_{W,\Lambda} \sum_i \min_j \left(-\log p(x_i|\theta_j) - \log w_j\right)$
$= \min_{W,\Lambda} \sum_{i=1}^n \min_{j=1}^k D_j(x_i),$

where $c_j = (w_j, \theta_j)$ (the cluster prototype) and $D_j(x_i) = -\log p(x_i|\theta_j) - \log w_j$ are potential distance-like functions.

◮ Maximizing the complete likelihood amounts to a geometric hard clustering [37, 11] for fixed $w_j$'s (the distance $D_j(\cdot)$ depends on the cluster prototype $c_j$): $\min_\Lambda \sum_i \min_j D_j(x_i)$.
◮ Related to classification EM [5] (CEM), hard/truncated EM
◮ The solution of $\arg\max l_c$ can be used to initialize $l_i$ (then optimized by EM)
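A quick numerical check of this equivalence on a toy 1D Gaussian mixture (illustrative parameters): the hard-clustering cost $\sum_i \min_j D_j(x_i)$ equals minus the complete log-likelihood under the per-point maximizing labels.

```python
import numpy as np
from scipy.stats import norm

# Toy 1D GMM (illustrative parameters) and synthetic data.
w  = np.array([0.4, 0.6])
mu = np.array([0.0, 4.0])
sd = np.array([1.0, 1.5])
rng = np.random.default_rng(2)
labels_true = rng.choice(2, size=300, p=w)
x = rng.normal(mu[labels_true], sd[labels_true])

# D_j(x_i) = -log p(x_i | theta_j) - log w_j, shape (n, k).
D = -norm.logpdf(x[:, None], mu, sd) - np.log(w)

# Hard-clustering cost  sum_i min_j D_j(x_i) ...
cost = D.min(axis=1).sum()

# ... equals minus the complete log-likelihood under the best per-point labels.
labels = D.argmin(axis=1)
l_c = (np.log(w[labels]) + norm.logpdf(x, mu[labels], sd[labels])).sum()
assert np.isclose(cost, -l_c)
```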
The k-MLE method: k-means type clustering algorithms

k-MLE:
1. Initialize the weights W (in the open probability simplex $\Delta_k$)
2. Solve $\min_\Lambda \sum_i \min_j D_j(x_i)$ (center-based clustering, W fixed)
3. Solve $\min_W \sum_i \min_j D_j(x_i)$ ($\Lambda$ fixed)
4. Test for convergence, otherwise go to step 2.

⇒ group coordinate ascent (on the likelihood) / descent (on the distance) optimization.
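A minimal runnable sketch of these four steps for a univariate Gaussian mixture; this is my own toy instantiation, not the authors' reference implementation. The inner assignment/relocation of step 2 and the weight update of step 3 follow the later slides.

```python
import numpy as np
from scipy.stats import norm

def k_mle_gaussian_1d(x, k, iters=50, seed=0):
    """Toy k-MLE for a univariate GMM: block-coordinate ascent on the complete
    likelihood, i.e. descent on the cost sum_i min_j D_j(x_i)."""
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                      # step 1: weights in the open simplex
    mu = rng.choice(x, size=k, replace=False)    # crude initialization of the components
    sd = np.full(k, x.std())
    prev_cost = np.inf
    for _ in range(iters):
        # Step 2 (Lambda update, W fixed): assignment by D_j, then MLE relocation.
        D = -norm.logpdf(x[:, None], mu, sd) - np.log(w)
        labels = D.argmin(axis=1)
        for j in range(k):
            cj = x[labels == j]
            if cj.size > 1:                      # the MLE needs enough points per cluster
                mu[j], sd[j] = cj.mean(), max(cj.std(), 1e-3)
        # Step 3 (W update, Lambda fixed): weights = cluster proportions.
        counts = np.maximum(np.bincount(labels, minlength=k), 1)
        w = counts / counts.sum()
        # Step 4: convergence test on the hard-clustering cost (pre-relocation value).
        cost = D.min(axis=1).sum()
        if abs(prev_cost - cost) < 1e-8:
            break
        prev_cost = cost
    return w, mu, sd

# Usage on synthetic data:
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 2, 700)])
print(k_mle_gaussian_1d(data, k=2))
```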
k-MLE: Center-based clustering, W fixed

Solve $\min_\Lambda \sum_i \min_j D_j(x_i)$

k-means-type convergence proof for assignment/relocation:
◮ Data assignment: $\forall i$, $l_i = \arg\max_j w_j p(x_i|\lambda_j) = \arg\min_j D_j(x_i)$, $\mathcal{C}_j = \{x_i \mid l_i = j\}$
◮ Center relocation: $\forall j$, $\lambda_j = \mathrm{MLE}(\mathcal{C}_j)$

Farthest Maximum Likelihood (FML) Voronoi diagram:
$\mathrm{Vor}_{\mathrm{FML}}(c_i) = \{x \in \mathcal{X} : w_i p(x|\lambda_i) \geq w_j p(x|\lambda_j), \ \forall j \neq i\}$
$\mathrm{Vor}(c_i) = \{x \in \mathcal{X} : D_i(x) \leq D_j(x), \ \forall j \neq i\}$

FML Voronoi ≡ additively weighted Voronoi with $D_l(x) = -\log p(x|\lambda_l) - \log w_l$.
k-MLE: Example for mixtures of exponential families

Exponential family: the component density $p(x|\theta) = \exp(t(x)^\top \theta - F(\theta) + k(x))$ is log-concave, with:
◮ $t(x)$: sufficient statistic in $\mathbb{R}^D$, D: family order
◮ $k(x)$: auxiliary carrier term (wrt the Lebesgue/counting measure)
◮ $F(\theta)$: log-normalizer, cumulant function, log-partition

$D_j(x)$ is convex: k-means-type clustering wrt convex "distances".
Farthest ML Voronoi ≡ additively-weighted Bregman Voronoi [4]:

$-\log p(x;\theta) - \log w = F(\theta) - t(x)^\top \theta - k(x) - \log w = B_{F^*}(t(x) : \eta) - F^*(t(x)) - k(x) - \log w$

where $F^*(\eta) = \max_\theta (\theta^\top \eta - F(\theta))$ is the Legendre-Fenchel convex conjugate.
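A small numerical check of this Bregman rewriting, instantiated (my choice, not from the slide) on the unit-variance univariate Gaussian family, where $t(x) = x$, $\theta = \mu$, $F(\theta) = \theta^2/2$, $k(x) = -x^2/2 - \frac{1}{2}\log 2\pi$, so $F^*(\eta) = \eta^2/2$ and $B_{F^*}$ is half the squared Euclidean distance.

```python
import numpy as np
from scipy.stats import norm

# Unit-variance univariate Gaussian as an exponential family (illustrative choice):
# t(x) = x, theta = mu, F(theta) = theta**2/2, k(x) = -x**2/2 - 0.5*log(2*pi).
F      = lambda th:  0.5 * th**2
Fstar  = lambda eta: 0.5 * eta**2               # Legendre-Fenchel conjugate of F
gradFs = lambda eta: eta                        # grad F* = (grad F)^{-1}
kx     = lambda x:  -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def bregman(Fn, grad, p, q):
    # Bregman divergence B_Fn(p : q)
    return Fn(p) - Fn(q) - (p - q) * grad(q)

theta, x = 1.3, 0.4
eta = theta                                     # eta = grad F(theta) = theta here
lhs = -norm.logpdf(x, loc=theta, scale=1.0)     # -log p(x; theta)
rhs = bregman(Fstar, gradFs, x, eta) - Fstar(x) - kx(x)
assert np.isclose(lhs, rhs)
```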
Exponential families: Rayleigh distributions [36, 25]

Application: IntraVascular UltraSound (IVUS) imaging.

Rayleigh distribution (a Weibull distribution with shape k = 2):
$p(x;\lambda) = \frac{x}{\lambda^2} e^{-\frac{x^2}{2\lambda^2}}, \quad x \in \mathbb{R}^+ = \mathcal{X}$

d = 1 (univariate), D = 1 (order 1)
$\theta = -\frac{1}{2\lambda^2}$, $\Theta = (-\infty, 0)$
$F(\theta) = -\log(-2\theta)$
$t(x) = x^2$
$k(x) = \log x$

Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs): used for segmentation and classification tasks.
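A minimal sketch of the closed-form MLE for a single Rayleigh component via the moment equation of this family (synthetic data; the true scale is an arbitrary test value): with $\eta = \nabla F(\theta) = -1/\theta = 2\lambda^2$, the moment equation $\hat\eta = \overline{t(x)}$ gives $\hat\lambda = \sqrt{\overline{x^2}/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 2.0
x = rng.rayleigh(scale=lam_true, size=10000)    # synthetic IVUS-like amplitudes

# Sufficient statistic t(x) = x^2; eta = grad F(theta) = -1/theta = 2*lambda^2.
# Moment equation  eta_hat = mean(t(x))  gives the MLE in closed form:
eta_hat = np.mean(x**2)
lam_hat = np.sqrt(eta_hat / 2.0)
print(lam_hat)                                  # close to lam_true
```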
Exponential families: Multivariate Gaussians [14, 25]

Gaussian Mixture Models (GMMs). (A color image is interpreted as a 5D xyRGB point set.)

Gaussian density:
$p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{|\Sigma|}} \exp\left(-\frac{1}{2} D_{\Sigma^{-1}}(x, \mu)\right)$

Squared Mahalanobis distance: $D_Q(x,y) = (x-y)^\top Q (x-y)$

$x \in \mathbb{R}^d = \mathcal{X}$ (multivariate), $D = \frac{d(d+3)}{2}$ (order)
$\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) = (\theta_v, \theta_M)$, $\Theta = \mathbb{R}^d \times S_{++}^d$
$F(\theta) = \frac{1}{4}\theta_v^\top \theta_M^{-1} \theta_v - \frac{1}{2}\log|\theta_M| + \frac{d}{2}\log\pi$
$t(x) = (x, -xx^\top)$
$k(x) = 0$
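A small sketch converting source parameters $(\mu, \Sigma)$ to the natural parameters above and checking the exponential-family decomposition $\log p = \langle t(x), \theta\rangle - F(\theta)$ (the 2D parameter values are illustrative).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Source parameters of a 2D Gaussian (illustrative values).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
d = mu.size

# Natural parameters of the slide: theta = (Sigma^{-1} mu, 0.5 * Sigma^{-1}).
P = np.linalg.inv(Sigma)
theta_v, theta_M = P @ mu, 0.5 * P

# Log-normalizer F(theta) = 1/4 theta_v^T theta_M^{-1} theta_v - 1/2 log|theta_M| + d/2 log(pi).
F = (0.25 * theta_v @ np.linalg.inv(theta_M) @ theta_v
     - 0.5 * np.linalg.slogdet(theta_M)[1] + 0.5 * d * np.log(np.pi))

# Check  log p = <t(x), theta> - F(theta)  with k(x) = 0, where t(x) = (x, -x x^T)
# and the matrix part uses the trace inner product: <-x x^T, theta_M> = -x^T theta_M x.
x = np.array([0.5, 0.7])
inner = x @ theta_v - x @ theta_M @ x
assert np.isclose(inner - F, multivariate_normal.logpdf(x, mu, Sigma))
```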
The k-MLE method for exponential families

k-MLE-EF:
1. Initialize the weights W (in the open probability simplex $\Delta_k$)
2. Solve $\min_\Lambda \sum_i \min_j (B_{F^*}(t(x_i) : \eta_j) - \log w_j)$
3. Solve $\min_W \sum_i \min_j D_j(x_i)$
4. Test for convergence, otherwise go to step 2.

Assignment condition in step 2: additively-weighted Bregman Voronoi diagram.
k-MLE: Solving for the weights given the component parameters

Solve $\min_W \sum_i \min_j D_j(x_i)$

Amounts to $\arg\min_W -\sum_j n_j \log w_j = \arg\min_W -\sum_j \frac{n_j}{n} \log w_j$, where $n_j = \#\{x_i \in \mathrm{Vor}(c_j)\} = |\mathcal{C}_j|$, i.e.,

$\min_{W \in \Delta_k} H^\times(N : W)$

where $N = (\frac{n_1}{n}, ..., \frac{n_k}{n}) \in \Delta_k$ is the cluster point proportion vector.

The cross-entropy $H^\times(N : W)$ is minimized when $H^\times(N : W) = H(N)$, that is, when W = N.
Kullback-Leibler divergence: $\mathrm{KL}(N : W) = H^\times(N : W) - H(N) = 0$ when W = N.
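A tiny numerical illustration of the cross-entropy argument (the proportion vector N is an arbitrary example): $H^\times(N : W) \geq H(N)$ for any W on the simplex, with equality at W = N, so the weight step simply sets W to the cluster proportions.

```python
import numpy as np

# Cluster-size proportions N (illustrative values) on the simplex.
N = np.array([0.2, 0.5, 0.3])

def cross_entropy(N, W):              # H_x(N : W) = -sum_j (n_j/n) log w_j
    return -np.sum(N * np.log(W))

H_N = cross_entropy(N, N)             # H_x(N : N) = H(N), the Shannon entropy of N

# KL(N : W) = H_x(N : W) - H(N) >= 0, with equality iff W = N,
# so the minimizing weight vector is the proportion vector itself.
rng = np.random.default_rng(0)
for _ in range(5):
    W = rng.dirichlet(np.ones(3))
    assert cross_entropy(N, W) >= H_N - 1e-12
print("W_opt =", N)
```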
MLE for exponential families

Given an ML farthest Voronoi partition, compute the MLEs $\hat\theta_j$:

$\hat\theta_j = \arg\max_{\theta \in \Theta} \prod_{x_i \in \mathrm{Vor}(c_j)} p_F(x_i; \theta)$

The maximum is unique (***) since $\nabla^2 F(\theta) \succ 0$:

Moment equation: $\nabla F(\hat\theta_j) = \eta(\hat\theta_j) = \frac{1}{n_j} \sum_{x_i \in \mathrm{Vor}(c_j)} t(x_i) = \bar{t} = \hat\eta$

The MLE is consistent and efficient, with asymptotically normal distribution:

$\hat\theta_j \sim N\left(\theta_j, \frac{1}{n_j} I^{-1}(\theta_j)\right)$

Fisher information matrix: $I(\theta_j) = \mathrm{var}[t(X)] = \nabla^2 F(\theta_j) = (\nabla^2 F^*)^{-1}(\eta_j)$

The MLE may be biased (e.g., for normal distributions).
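A minimal sketch of the moment equation in action, instantiated (my choice, not from the slide) on the Poisson family, where $t(x) = x$, $F(\theta) = e^\theta$ and hence $\eta = \nabla F(\theta) = \lambda$.

```python
import numpy as np

# Moment equation for an exponential family:  grad F(theta_hat) = mean of t(x).
# Poisson instance: t(x) = x, F(theta) = exp(theta), so eta = exp(theta) = lambda.
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=20000)

eta_hat = np.mean(x)              # eta_hat = bar t = (1/n) sum_i t(x_i)
theta_hat = np.log(eta_hat)       # theta_hat = (grad F)^{-1}(eta_hat) = grad F*(eta_hat)
lam_hat = np.exp(theta_hat)       # back to the source parameter: the MLE of lambda
print(lam_hat)                    # close to 3.5

# Asymptotic normality: var(theta_hat) ~ (1/n) I(theta)^{-1}, with I(theta) = grad^2 F(theta).
```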
Existence of MLEs for exponential families (***)

For minimal and full EFs, the MLE is guaranteed to exist [3, 21] provided that the matrix

$T = \begin{pmatrix} 1 & t_1(x_1) & \cdots & t_D(x_1) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & t_1(x_n) & \cdots & t_D(x_n) \end{pmatrix} \qquad (1)$

of dimension $n \times (D + 1)$ has rank D + 1 [3]. For example, the MLE of a multivariate normal is problematic with n < d observations (undefined, with likelihood → ∞).

Condition: $\bar{t} = \frac{1}{n_j} \sum_{x_i \in \mathrm{Vor}(c_j)} t(x_i) \in \mathrm{int}(C)$, where C is the closed convex support.
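A sketch of checking the rank condition on the design matrix T, illustrated (hypothetical helper, my construction) for the multivariate normal where $D = d(d+3)/2$; with too few observations the condition fails, matching the n < d remark above.

```python
import numpy as np

def mle_exists_condition(t_values):
    """Check the rank condition above: the n x (D+1) matrix
    T = [1, t_1(x_i), ..., t_D(x_i)] must have rank D + 1."""
    t_values = np.atleast_2d(t_values)           # shape (n, D)
    n, D = t_values.shape
    T = np.hstack([np.ones((n, 1)), t_values])
    return np.linalg.matrix_rank(T) == D + 1

def mvn_sufficient_stats(X):
    # t(x) = (x, -x x^T); keep only the d(d+1)/2 distinct entries of the matrix part.
    d = X.shape[1]
    quad = np.array([(-np.outer(x, x))[np.triu_indices(d)] for x in X])
    return np.hstack([X, quad])                  # shape (n, d(d+3)/2)

rng = np.random.default_rng(0)
d = 3
print(mle_exists_condition(mvn_sufficient_stats(rng.normal(size=(2, d)))))   # False: n too small
print(mle_exists_condition(mvn_sufficient_stats(rng.normal(size=(20, d)))))  # True for generic data
```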