Bag-of-components: an online algorithm for batch learning of mixture models

Olivier Schwander (Université Pierre et Marie Curie, Paris, France)
Frank Nielsen (École polytechnique, Palaiseau, France)

October 29, 2015
Exponential families

Definition
  p(x; λ) = p_F(x; θ) = exp(⟨t(x), θ⟩ − F(θ) + k(x))

◮ λ: source parameter
◮ t(x): sufficient statistic
◮ θ: natural parameter
◮ F(θ): log-normalizer
◮ k(x): carrier measure

F is a strictly convex and differentiable function; ⟨·,·⟩ is a scalar product.
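As a concrete illustration (not part of the slides), here is a minimal Python sketch of the univariate Gaussian N(μ, σ²) written in this canonical form, with t(x) = (x, x²), θ = (μ/σ², −1/(2σ²)), k(x) = 0 and the standard log-normalizer; the helper names are arbitrary.

```python
# Sketch: the univariate Gaussian as an exponential family,
# p_F(x; theta) = exp(<t(x), theta> - F(theta)), with k(x) = 0.
import numpy as np

def t(x):
    """Sufficient statistic of the univariate Gaussian."""
    return np.array([x, x * x])

def F(theta):
    """Log-normalizer of the univariate Gaussian."""
    t1, t2 = theta
    return -t1 * t1 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def pdf(x, mu, sigma2):
    """Density evaluated through the exponential-family decomposition."""
    theta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])
    return np.exp(t(x) @ theta - F(theta))

# Sanity check against the usual Gaussian density formula.
x, mu, sigma2 = 0.3, 1.0, 2.0
direct = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(pdf(x, mu, sigma2), direct)  # the two values agree
```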
Multiple parameterizations: dual parameter spaces

Multiple source parameterizations (not unique)
  λ_1 ∈ Λ_1, λ_2 ∈ Λ_2, ..., λ_n ∈ Λ_n

Two canonical parameterizations, related by the Legendre transform (F, Θ) ↔ (F*, H)
◮ Natural parameters θ ∈ Θ, with θ = ∇F*(η)
◮ Expectation parameters η ∈ H, with η = ∇F(θ)
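Continuing the Gaussian example above, a sketch of the two gradient maps between the dual coordinate systems: η = ∇F(θ) = E[t(x)] = (μ, μ² + σ²) and θ = ∇F*(η). These closed forms are standard for the Gaussian but are stated here as assumptions of the illustration.

```python
# Sketch: dual parameterizations of the univariate Gaussian.
import numpy as np

def grad_F(theta):
    """Natural -> expectation parameters."""
    t1, t2 = theta
    mu = -t1 / (2.0 * t2)
    sigma2 = -1.0 / (2.0 * t2)
    return np.array([mu, mu * mu + sigma2])

def grad_F_star(eta):
    """Expectation -> natural parameters (inverse map)."""
    e1, e2 = eta
    sigma2 = e2 - e1 * e1
    return np.array([e1 / sigma2, -1.0 / (2.0 * sigma2)])

mu, sigma2 = 1.0, 2.0
theta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])
eta = grad_F(theta)
print(eta)               # [1.0, 3.0] = (mu, mu^2 + sigma^2)
print(grad_F_star(eta))  # recovers theta: the two maps are inverses
```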
Bregman divergences

Definition and properties
  B_F(x ‖ y) = F(x) − F(y) − ⟨x − y, ∇F(y)⟩

◮ F is a strictly convex and differentiable function
◮ No symmetry!

Contains a lot of common divergences
◮ Squared Euclidean, Mahalanobis, Kullback-Leibler, Itakura-Saito, ...
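A hedged sketch of the definition for a user-supplied generator F and its gradient; the two generators below (squared norm and negative Shannon entropy) are standard examples, not taken from the slides.

```python
# Sketch: generic Bregman divergence B_F(x || y) = F(x) - F(y) - <x - y, grad F(y)>.
import numpy as np

def bregman(F, gradF, x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return F(x) - F(y) - (x - y) @ gradF(y)

# F(x) = ||x||^2 gives exactly the squared Euclidean distance.
sq = lambda x: x @ x
sq_grad = lambda x: 2 * x

# F(x) = sum_i x_i log x_i gives the Kullback-Leibler divergence
# (for probability vectors, where the extra sum(y - x) term vanishes).
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1

p = np.array([0.2, 0.5, 0.3]); q = np.array([0.3, 0.4, 0.3])
print(bregman(sq, sq_grad, p, q))          # == ||p - q||^2
print(bregman(negent, negent_grad, p, q))  # == KL(p || q)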
Bregman centroids

Right-sided centroid
  c_R = arg min_c Σ_i ω_i B_F(x_i ‖ c)
Left-sided centroid
  c_L = arg min_c Σ_i ω_i B_F(c ‖ x_i)

Closed forms
  c_R = Σ_i ω_i x_i
  c_L = ∇F*( Σ_i ω_i ∇F(x_i) )
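A minimal sketch of the two closed forms, assuming the negative-entropy generator of the previous example (so ∇F(x) = log x + 1 and its inverse is exp(· − 1)); the point set and weights are arbitrary.

```python
# Sketch: sided Bregman centroids in closed form.
import numpy as np

def right_centroid(points, weights):
    # arg min_c sum_i w_i B_F(x_i || c) = weighted arithmetic mean
    return np.average(points, axis=0, weights=weights)

def left_centroid(points, weights, gradF, gradF_inv):
    # arg min_c sum_i w_i B_F(c || x_i) = (grad F)^{-1}(sum_i w_i grad F(x_i))
    return gradF_inv(np.average(gradF(points), axis=0, weights=weights))

pts = np.array([[0.2, 0.5, 0.3],
                [0.1, 0.6, 0.3],
                [0.4, 0.4, 0.2]])
w = np.array([0.5, 0.3, 0.2])
print(right_centroid(pts, w))
print(left_centroid(pts, w, lambda x: np.log(x) + 1, lambda y: np.exp(y - 1)))
```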
Link with exponential families [Banerjee 2005]

Bijection with exponential families
  log p_F(x; θ) = −B_F*(t(x) ‖ η) + F*(t(x)) + k(x)

Kullback-Leibler divergence between exponential families
◮ Between members of the same exponential family:
  KL(p_F(x; θ_1) ‖ p_F(x; θ_2)) = B_F(θ_2 ‖ θ_1) = B_F*(η_1 ‖ η_2)

Kullback-Leibler centroids
◮ In closed form through the Bregman divergence
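A small numerical check of the identity KL(p_F(·; θ_1) ‖ p_F(·; θ_2)) = B_F(θ_2 ‖ θ_1) on the univariate Gaussian; the closed-form Gaussian KL and the log-normalizer are standard formulas, restated so the snippet runs on its own.

```python
# Sketch: KL between two Gaussians equals a Bregman divergence on natural parameters.
import numpy as np

def F(theta):                       # log-normalizer of the univariate Gaussian
    t1, t2 = theta
    return -t1 * t1 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def grad_F(theta):                  # expectation parameters (E[x], E[x^2])
    t1, t2 = theta
    mu, s2 = -t1 / (2.0 * t2), -1.0 / (2.0 * t2)
    return np.array([mu, mu * mu + s2])

def bregman(F, gradF, x, y):
    return F(x) - F(y) - (x - y) @ gradF(y)

def natural(mu, s2):                # (mu, sigma^2) -> natural parameters
    return np.array([mu / s2, -1.0 / (2.0 * s2)])

def gaussian_kl(mu1, s1, mu2, s2):  # closed-form KL between univariate Gaussians
    return 0.5 * (np.log(s2 / s1) + (s1 + (mu1 - mu2) ** 2) / s2 - 1.0)

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
print(gaussian_kl(mu1, s1, mu2, s2))
print(bregman(F, grad_F, natural(mu2, s2), natural(mu1, s1)))  # same value
```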
Maximum likelihood estimator

A Bregman centroid
  η̂ = arg max_η Σ_i log p_F(x_i; η)
    = arg min_η Σ_i [ B_F*(t(x_i) ‖ η) − F*(t(x_i)) − k(x_i) ]   (the last two terms do not depend on η)
    = arg min_η Σ_i B_F*(t(x_i) ‖ η)
    = (1/N) Σ_i t(x_i)

and θ̂ = ∇F*(η̂).
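A minimal sketch of this estimator for the univariate Gaussian: the MLE in expectation parameters is simply the average sufficient statistic, from which the usual parameters are read off. The sample sizes and seed are arbitrary.

```python
# Sketch: MLE as a right-sided Bregman centroid (average of sufficient statistics).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=10_000)

t = np.stack([x, x * x], axis=1)   # sufficient statistics t(x) = (x, x^2)
eta_hat = t.mean(axis=0)           # arg min_eta sum_i B_{F*}(t(x_i) || eta)

mu_hat = eta_hat[0]
sigma2_hat = eta_hat[1] - eta_hat[0] ** 2
print(mu_hat, sigma2_hat)          # close to (1.0, 4.0)
```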
Mixtures of exponential families

  m(x; ω, θ) = Σ_{i=1}^{k} ω_i p_F(x; θ_i),   with Σ_i ω_i = 1

Fixed parameters
◮ Family of the components p_F
◮ Number of components k (model selection techniques to choose)

Learning a mixture
◮ Input: observations x_1, ..., x_N
◮ Output: weights ω_i and component parameters θ_i
Bregman Soft Clustering: EM for exponential families [Banerjee 2005]

E-step
  p(i, j) = ω_j p_F(x_i; θ_j) / m(x_i)

M-step
  η_j = arg max_η Σ_i p(i, j) log p_F(x_i; θ_j)
      = arg min_η Σ_i p(i, j) [ B_F*(t(x_i) ‖ η) − F*(t(x_i)) − k(x_i) ]   (the last two terms do not depend on η)
      = Σ_i p(i, j) t(x_i) / Σ_i p(i, j)
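A minimal sketch of Bregman soft clustering for a univariate Gaussian mixture, working entirely in expectation parameters η_j = (E[x], E[x²]). The initialization and the weight update ω_j = Σ_i p(i, j)/N are assumptions of this sketch (the slide only shows the η update).

```python
# Sketch: EM for a Gaussian mixture expressed with expectation parameters.
import numpy as np

def pdf(x, eta):
    mu, sigma2 = eta[0], eta[1] - eta[0] ** 2
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def bregman_soft_clustering(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    t = np.stack([x, x * x], axis=1)               # sufficient statistics t(x) = (x, x^2)
    mus = rng.choice(x, size=k, replace=False)
    eta = np.stack([mus, mus ** 2 + x.var()], axis=1)  # initial expectation parameters
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior p(i, j) proportional to w_j * p_F(x_i; eta_j)
        resp = np.stack([w[j] * pdf(x, eta[j]) for j in range(k)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: eta_j is a weighted average of the sufficient statistics
        nj = resp.sum(axis=0)
        eta = (resp.T @ t) / nj[:, None]
        w = nj / len(x)
    return w, eta

# Toy data: two Gaussian clusters.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])
w, eta = bregman_soft_clustering(x, k=2)
print(w)                                      # ~[0.5, 0.5]
print(eta[:, 0], eta[:, 1] - eta[:, 0] ** 2)  # component means and variances
```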
Joint estimation of mixture models

Exploit shared information between multiple pointsets
◮ to improve quality
◮ to improve speed

Inspiration
◮ Dictionary methods
◮ Transfer learning

Efficient algorithms
◮ Building
◮ Comparing
Co-Mixtures

Sharing the components of all the mixtures
  m_1(x; ω^(1), η) = Σ_{i=1}^{k} ω_i^(1) p_F(x; η_i)
  ...
  m_S(x; ω^(S), η) = Σ_{i=1}^{k} ω_i^(S) p_F(x; η_i)

◮ Same components η_1, ..., η_k everywhere
◮ Different weights ω^(l)
co-Expectation-Maximization

Maximize the mean of the likelihoods over all the mixtures.

E-step
◮ A posterior matrix for each dataset:
  p^(l)(i, j) = ω_j^(l) p_F(x_i^(l); η_j) / m(x_i^(l); ω^(l), η)

M-step
◮ Maximization on each dataset:
  η_j^(l) = Σ_i p^(l)(i, j) t(x_i^(l)) / Σ_i p^(l)(i, j)
◮ Aggregation:
  η_j = (1/S) Σ_{l=1}^{S} η_j^(l)
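A sketch of one co-EM round for S datasets sharing the same k Gaussian components: per-dataset E- and M-steps, then averaging the per-dataset η estimates. The per-dataset weight update and the initialization are assumptions of this sketch; the pdf helper repeats the one from the Bregman-soft-clustering sketch above.

```python
# Sketch: one round of co-EM with shared components and per-dataset weights.
import numpy as np

def pdf(x, eta):
    mu, s2 = eta[0], eta[1] - eta[0] ** 2
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def co_em_step(datasets, weights, eta):
    """datasets: list of 1-D arrays; weights: list of (k,) arrays; eta: (k, 2) shared parameters."""
    k = eta.shape[0]
    new_weights, eta_per_dataset = [], []
    for x, w in zip(datasets, weights):
        t = np.stack([x, x * x], axis=1)
        # E-step: posterior matrix p^(l)(i, j) for this dataset
        resp = np.stack([w[j] * pdf(x, eta[j]) for j in range(k)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        nj = resp.sum(axis=0)
        # M-step, per dataset: eta_j^(l) and the dataset-specific weights
        eta_per_dataset.append((resp.T @ t) / nj[:, None])
        new_weights.append(nj / len(x))
    # Aggregation: the shared components are the average of the per-dataset estimates
    new_eta = np.mean(eta_per_dataset, axis=0)
    return new_weights, new_eta

# Toy usage: two datasets drawn from the same two components with different weights.
rng = np.random.default_rng(2)
d1 = np.concatenate([rng.normal(-2, 1, 800), rng.normal(3, 1, 200)])
d2 = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 800)])
eta = np.array([[-1.0, 2.0], [2.0, 6.0]])          # rough initial guesses of (E[x], E[x^2])
w = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
for _ in range(30):
    w, eta = co_em_step([d1, d2], w, eta)
print(w)     # dataset-specific weights
print(eta)   # shared component parameters
```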
Variational approximation of the Kullback-Leibler divergence [Hershey, Olsen 2007]

  KL_variational(m_1 ‖ m_2) = Σ_{i=1}^{K} ω_i^(1) log [ Σ_j ω_j^(1) e^{−KL(p_F(·; θ_i) ‖ p_F(·; θ_j))} / Σ_j ω_j^(2) e^{−KL(p_F(·; θ_i) ‖ p_F(·; θ_j))} ]

With shared parameters
◮ Precompute D_ij = e^{−KL(p_F(·; η_i) ‖ p_F(·; η_j))}

Fast version
  KL_var(m_1 ‖ m_2) = Σ_i ω_i^(1) log [ Σ_j ω_j^(1) D_ij / Σ_j ω_j^(2) D_ij ]
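A sketch of the fast version: once the matrix D of exp(−KL) values between the shared components has been precomputed, the approximation only involves the two weight vectors. The toy D matrix below is arbitrary; in practice it would come from the closed-form KL of the chosen exponential family.

```python
# Sketch: Hershey-Olsen variational KL for mixtures sharing the same components.
import numpy as np

def kl_variational_shared(w1, w2, D):
    """w1, w2: (k,) weight vectors; D: (k, k) matrix with D[i, j] = exp(-KL(p_i || p_j))."""
    num = D @ w1   # num[i] = sum_j w1[j] * exp(-KL(p_i || p_j))
    den = D @ w2   # den[i] = sum_j w2[j] * exp(-KL(p_i || p_j))
    return float(w1 @ np.log(num / den))

# Toy usage with an assumed D matrix.
k = 3
rng = np.random.default_rng(0)
kl = rng.uniform(0.1, 2.0, size=(k, k)); np.fill_diagonal(kl, 0.0)
D = np.exp(-kl)
w1 = np.array([0.6, 0.3, 0.1]); w2 = np.array([0.2, 0.3, 0.5])
print(kl_variational_shared(w1, w2, D))   # approximation of KL(m_1 || m_2)
print(kl_variational_shared(w1, w1, D))   # exactly 0 when the two mixtures coincide
```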
co-Segmentation

Segmentation from 5D RGBxy mixtures
[Figure: original images with their EM and co-EM segmentations]
Transfer learning

Increase the quality of one particular mixture of interest
◮ First image: only 1% of the points
◮ Two other images: full set of points
◮ Not enough points for EM
Bag of Components

Training step
◮ Run co-mixture learning (Comix) on some training set
◮ Keep the component parameters: the dictionary D = {θ_1, ..., θ_K}
◮ Costly, but offline

Online learning of mixtures
◮ For a new pointset, for each arriving observation x_j, assign it to
  arg max_{θ ∈ D} p_F(x_j; θ)   or equivalently   arg min_{θ ∈ D} B_F*(t(x_j) ‖ η(θ))
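A minimal sketch of the online step for a dictionary of univariate Gaussian components: each incoming observation is assigned to the most likely dictionary component, in a single pass and without re-estimating any component. Turning the assignment counts into mixture weights is an assumption of this sketch (the slide only shows the per-observation assignment).

```python
# Sketch: Bag-of-Components online learning of a new mixture from a fixed dictionary.
import numpy as np

def gaussian_logpdf(x, mu, sigma2):
    return -0.5 * ((x - mu) ** 2 / sigma2 + np.log(2 * np.pi * sigma2))

def boc_fit(x, dictionary):
    """dictionary: list of (mu, sigma2) pairs learned offline (e.g. by co-EM)."""
    counts = np.zeros(len(dictionary))
    for xj in x:                                   # purely online: one pass, no iteration
        scores = [gaussian_logpdf(xj, mu, s2) for (mu, s2) in dictionary]
        counts[int(np.argmax(scores))] += 1
    return counts / counts.sum()                   # mixture weights over the shared components

# Toy usage with an assumed dictionary of three components.
dictionary = [(-2.0, 1.0), (0.0, 0.5), (3.0, 2.0)]
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.4, 700)])
print(boc_fit(x, dictionary))   # most of the mass on the first and third components
```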
Nearest neighbor search

Naive version
◮ Linear search
◮ O(number of samples × number of components)
◮ Same order of magnitude as one step of EM

Improvement
◮ Computational Bregman geometry to speed up the search
◮ Bregman ball trees
◮ Hierarchical clustering
◮ Approximate nearest neighbor search
Image segmentation

Segmentation on a random subset of the pixels (100%, 10%, 1%)
[Figure: EM and BoC segmentations for each subset size]
Computation times

[Figure: bar chart comparing EM and BoC computation times for the training step and for the 100%, 10% and 1% subsets]
Summary

Comix
◮ Mixtures with shared components
◮ Compact description of a lot of mixtures
◮ Fast KL approximations
◮ Dictionary-like methods

Bag of Components
◮ Online method
◮ Predictable time (no iteration)
◮ Works with only a few points
◮ Fast