Learning Mixtures of Spherical Gaussians: Moment Methods and Spectral Decompositions
Daniel Hsu and Sham M. Kakade
Microsoft Research, New England
Also based on work with Anima Anandkumar (UCI), Rong Ge (Princeton), and Matus Telgarsky (UCSD).
Unsupervised machine learning
◮ Many applications in machine learning and statistics: lots of high-dimensional data, but mostly unlabeled.
◮ Unsupervised learning: discover interesting structure of a population from unlabeled data.
◮ This talk: learn about sub-populations in a data source.
Learning mixtures of Gaussians
Mixture of Gaussians: $\sum_{i=1}^{k} w_i \, \mathcal{N}(\mu_i, \Sigma_i)$
$k$ sub-populations; each modeled as a multivariate Gaussian $\mathcal{N}(\mu_i, \Sigma_i)$ together with a mixing weight $w_i$.
Goal: an efficient algorithm that approximately recovers the parameters from samples.
(Alternative goal: density estimation. Not in this talk.)
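Written out, the model density is a convex combination of multivariate Gaussian densities (the standard formula, spelled out here for reference; it does not appear on the slide):

```latex
p(\vec{x}) \;=\; \sum_{i=1}^{k} w_i \,
    \frac{1}{(2\pi)^{d/2} \det(\Sigma_i)^{1/2}}
    \exp\!\Big( -\tfrac{1}{2} (\vec{x}-\mu_i)^\top \Sigma_i^{-1} (\vec{x}-\mu_i) \Big),
\qquad w_i \ge 0, \quad \sum_{i=1}^{k} w_i = 1 .
```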
Learning setup
◮ Input: an i.i.d. sample $S \subset \mathbb{R}^d$ from an unknown mixture of Gaussians with parameters $\theta^\star := \{(\mu_i^\star, \Sigma_i^\star, w_i^\star) : i \in [k]\}$.
◮ Each data point is drawn from one of the $k$ Gaussians: choose $\mathcal{N}(\mu_i^\star, \Sigma_i^\star)$ with probability $w_i^\star$ (see the sketch below).
◮ But the "labels" (component assignments) are not observed.
◮ Goal: estimate parameters $\theta = \{(\mu_i, \Sigma_i, w_i) : i \in [k]\}$ such that $\theta \approx \theta^\star$.
◮ In practice: local search for maximum-likelihood parameters (the E-M algorithm).
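A minimal NumPy sketch of this generative process (the function name and array layout are illustrative assumptions, not part of the talk):

```python
import numpy as np

def sample_mixture(n, weights, means, sigmas, rng=None):
    """Draw n i.i.d. points from sum_i w_i N(mu_i, sigma_i^2 I).

    weights: (k,) mixing weights, summing to 1
    means:   (k, d) component means
    sigmas:  (k,) per-component standard deviations (spherical case)
    """
    rng = np.random.default_rng() if rng is None else rng
    k, d = means.shape
    # Latent component labels: the learner never sees these.
    labels = rng.choice(k, size=n, p=weights)
    # Spherical Gaussian noise around each chosen component mean.
    x = means[labels] + sigmas[labels, None] * rng.standard_normal((n, d))
    return x  # observed data only; the labels stay hidden
```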
When are there efficient algorithms?
Well-separated mixtures: estimation is easier if there is a large minimum separation between component means (Dasgupta, '99):
$$\mathrm{sep} := \min_{i \neq j} \frac{\|\mu_i - \mu_j\|}{\max\{\sigma_i, \sigma_j\}}.$$
◮ $\mathrm{sep} = \Omega(d^c)$ or $\mathrm{sep} = \Omega(k^c)$: simple clustering methods suffice, perhaps after dimension reduction (Dasgupta, '99; Vempala-Wang, '02; and many more).
Recent developments:
◮ No minimum separation requirement, but current methods require $\exp(\Omega(k))$ running time / sample size (Kalai-Moitra-Valiant, '10; Belkin-Sinha, '10; Moitra-Valiant, '10).
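The separation statistic is straightforward to compute when the true parameters are in hand; a direct transcription (illustrative only):

```python
import numpy as np

def separation(means, sigmas):
    """sep = min over i != j of ||mu_i - mu_j|| / max(sigma_i, sigma_j)."""
    k = means.shape[0]
    sep = np.inf
    for i in range(k):
        for j in range(i + 1, k):
            gap = np.linalg.norm(means[i] - means[j])
            sep = min(sep, gap / max(sigmas[i], sigmas[j]))
    return sep
```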
Overcoming barriers to efficient estimation
Information-theoretic barrier: Gaussian mixtures in $\mathbb{R}^1$ can require $\exp(\Omega(k))$ samples to estimate the parameters, even when the components are well-separated (Moitra-Valiant, '10).
These hard instances are degenerate in high dimensions!
Our result: efficient algorithms for non-degenerate models in high dimensions ($d \geq k$) with spherical covariances.
Main result
Theorem (H-Kakade, '13)
Assume $\{\mu_1^\star, \mu_2^\star, \ldots, \mu_k^\star\}$ are linearly independent, $w_i^\star > 0$ for all $i \in [k]$, and $\Sigma_i^\star = \sigma_i^{\star 2} I$ for all $i \in [k]$.
There is an algorithm that, given independent draws from a mixture of $k$ spherical Gaussians, returns $\varepsilon$-accurate parameters (up to permutation, under the $\ell_2$ metric) w.h.p. The running time and sample complexity are
$$\mathrm{poly}(d, k, 1/\varepsilon, 1/w_{\min}, 1/\lambda_{\min}),$$
where $\lambda_{\min}$ is the $k$-th largest singular value of $[\mu_1^\star \mid \mu_2^\star \mid \cdots \mid \mu_k^\star]$.
(Also using new techniques from Anandkumar-Ge-H-Kakade-Telgarsky, '12.)
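The non-degeneracy parameter $\lambda_{\min}$ is just a singular value, so it is cheap to check when the means are known; a small NumPy sketch (illustrative, not from the paper):

```python
import numpy as np

def lambda_min(means):
    """k-th largest singular value of the d x k matrix [mu_1 | ... | mu_k].

    means: (k, d) array of component means, with d >= k.
    A value bounded away from 0 certifies the means are linearly
    independent, i.e., the model is non-degenerate.
    """
    M = means.T                              # d x k, means as columns
    s = np.linalg.svd(M, compute_uv=False)   # k singular values, descending
    return s[-1]
```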
2. Learning algorithm
Outline:
◮ Introduction
◮ Learning algorithm
  ◮ Method-of-moments
  ◮ Choice of moments
  ◮ Solving the moment equations
◮ Concluding remarks
Method-of-moments
Let $S \subset \mathbb{R}^d$ be an i.i.d. sample from an unknown mixture of spherical Gaussians:
$$\vec{x} \sim \sum_{i=1}^{k} w_i^\star \, \mathcal{N}(\mu_i^\star, \sigma_i^{\star 2} I).$$
Estimation via method-of-moments (Pearson, 1894): find parameters $\theta$ such that
$$\mathbb{E}_\theta[p(\vec{x})] \approx \hat{\mathbb{E}}_{\vec{x} \in S}[p(\vec{x})]$$
for some functions $p : \mathbb{R}^d \to \mathbb{R}$ (typically multivariate polynomials).
Q1: Which moments to use?
Q2: How to (approximately) solve the moment equations?
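The empirical side of these equations is just a sample average; a minimal sketch for the first two moments (illustrative only; the specific moments the algorithm actually uses are chosen below):

```python
import numpy as np

def empirical_moments(X):
    """Empirical first- and second-order moments of a sample.

    X: (n, d) data matrix.
    Returns (E_hat[x], E_hat[x x^T]); method-of-moments matches
    these against the model quantities E_theta[x], E_theta[x x^T].
    """
    n = X.shape[0]
    m1 = X.mean(axis=0)       # E_hat[x], a d-vector
    m2 = (X.T @ X) / n        # E_hat[x ⊗ x], a d x d matrix
    return m1, m2
```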
Which moments to use?

moment order | reliable estimates? | unique solution?
1st, 2nd     | ✓                   | ✗
Ω(k)-th      |                     | ✓

1st- and 2nd-order moments (e.g., mean, covariance):
◮ Fairly easy to estimate reliably: $\hat{\mathbb{E}}_{\vec{x} \in S}[\vec{x} \otimes \vec{x}] \approx \mathbb{E}_{\theta^\star}[\vec{x} \otimes \vec{x}]$.
◮ But the moment equations can have multiple solutions: $\mathbb{E}_{\theta_1}[\vec{x} \otimes \vec{x}] \approx \hat{\mathbb{E}}_{\vec{x} \in S}[\vec{x} \otimes \vec{x}] \approx \mathbb{E}_{\theta_2}[\vec{x} \otimes \vec{x}]$ with $\theta_1 \neq \theta_2$.
(Achlioptas-McSherry, '05; Chaudhuri-Rao, '08; Vempala-Wang, '02)

Ω(k)-th-order moments (e.g., $\mathbb{E}_\theta[\text{degree-}k\text{ poly}(\vec{x})]$):
◮ Uniquely pin down the solution.
(Prony, 1795; Lindsay, '89; Belkin-Sinha, '10; Moitra-Valiant, '10)
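For intuition about the non-uniqueness in the first row, the first two moments of a spherical mixture take the following standard form (a routine computation, not reproduced from the slides):

```latex
\mathbb{E}[\vec{x}] \;=\; \sum_{i=1}^{k} w_i \,\mu_i,
\qquad
\mathbb{E}[\vec{x} \otimes \vec{x}]
  \;=\; \sum_{i=1}^{k} w_i \,\mu_i \mu_i^{\top}
        \;+\; \Big( \sum_{i=1}^{k} w_i \,\sigma_i^{2} \Big) I .
```

Any two parameter sets $\theta_1 \neq \theta_2$ that agree on these aggregate quantities satisfy both equations, which is exactly the ✗ recorded in the table.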