  1. Efficient algorithms for estimating multi-view mixture models Daniel Hsu Microsoft Research, New England

  2. Outline: Multi-view mixture models; Multi-view method-of-moments; Some applications and open questions; Concluding remarks

  3. Part 1. Multi-view mixture models. Outline: Multi-view mixture models (Unsupervised learning and mixture models; Multi-view mixture models; Complexity barriers); Multi-view method-of-moments; Some applications and open questions; Concluding remarks

  4. Unsupervised learning ◮ Many modern applications of machine learning: ◮ high-dimensional data from many diverse sources, ◮ but mostly unlabeled.

  5. Unsupervised learning ◮ Many modern applications of machine learning: ◮ high-dimensional data from many diverse sources, ◮ but mostly unlabeled. ◮ Unsupervised learning: extract useful info from this data. ◮ Disentangle sub-populations in data source. ◮ Discover useful representations for downstream stages of learning pipeline (e.g., supervised learning).

  6. Mixture models Simple latent variable model: mixture model. h ∈ [k] := {1, 2, ..., k} (hidden); x ∈ R^d (observed); Pr[h = j] = w_j; x | h ∼ P_h; so x has the mixture distribution P(x) = w_1 P_1(x) + w_2 P_2(x) + · · · + w_k P_k(x).

  7. Mixture models Simple latent variable model: mixture model. h ∈ [k] := {1, 2, ..., k} (hidden); x ∈ R^d (observed); Pr[h = j] = w_j; x | h ∼ P_h; so x has the mixture distribution P(x) = w_1 P_1(x) + w_2 P_2(x) + · · · + w_k P_k(x). Typical use: learn about constituent sub-populations (e.g., clusters) in data source.
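To make the generative process above concrete, here is a minimal sketch in Python/NumPy; the choice of isotropic Gaussian components and the particular weights and means are illustrative assumptions, not anything from the talk. It draws the hidden label h with probabilities w, then draws x from P_h, so the observed data follow P(x) = w_1 P_1(x) + · · · + w_k P_k(x).

```python
import numpy as np

rng = np.random.default_rng(0)

k, d, n = 3, 5, 1000                      # components, dimension, samples
w = np.array([0.5, 0.3, 0.2])             # mixing weights (sum to 1); illustrative values
mu = rng.normal(size=(k, d))              # component means, one row per component

# Draw the hidden component h for each sample, then draw x | h ~ P_h.
# Here each P_j is an isotropic Gaussian N(mu_j, I), purely for illustration.
h = rng.choice(k, size=n, p=w)            # hidden labels (never seen by the learner)
x = mu[h] + rng.normal(size=(n, d))       # observed data with mixture distribution P

print(np.bincount(h, minlength=k) / n)    # empirical mixing weights, close to w
```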

  8. Multi-view mixture models Can we take advantage of diverse sources of information?

  9. Multi-view mixture models Can we take advantage of diverse sources of information? h ∈ [k] (hidden); x_1 ∈ R^{d_1}, x_2 ∈ R^{d_2}, ..., x_ℓ ∈ R^{d_ℓ} (observed views). k = # components, ℓ = # views (e.g., audio, video, text). View 1: x_1 ∈ R^{d_1}; View 2: x_2 ∈ R^{d_2}; View 3: x_3 ∈ R^{d_3}.

  11. Multi-view mixture models Multi-view assumption: views are conditionally independent given the component. View 1: x_1 ∈ R^{d_1}; View 2: x_2 ∈ R^{d_2}; View 3: x_3 ∈ R^{d_3}. Larger k (# components): more sub-populations to disentangle. Larger ℓ (# views): more non-redundant sources of information.
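A sketch of how the multi-view assumption plays out when sampling, again with hypothetical Gaussian views (ℓ = 3, with arbitrarily chosen dimensions and parameters): all views share the same hidden label h, but given h they are drawn independently.

```python
import numpy as np

rng = np.random.default_rng(1)

k = 3                                     # number of components
dims = [4, 6, 5]                          # d_1, d_2, d_3 for the ell = 3 views
w = np.array([0.5, 0.3, 0.2])             # mixing weights; illustrative values
# Conditional means mu[v][j] in R^{d_v}, one (k, d_v) array per view.
mu = [rng.normal(size=(k, d_v)) for d_v in dims]

def sample(n):
    """Draw n samples (x_1, x_2, x_3); the views are independent given h."""
    h = rng.choice(k, size=n, p=w)
    # Each view shares only the hidden label h with the others;
    # the noise is drawn independently per view (conditional independence).
    views = [mu_v[h] + rng.normal(size=(n, d_v)) for mu_v, d_v in zip(mu, dims)]
    return h, views

h, (x1, x2, x3) = sample(2000)
print(x1.shape, x2.shape, x3.shape)       # (2000, 4) (2000, 6) (2000, 5)
```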

  12. Semi-parametric estimation task “Parameters” of component distributions: mixing weights w_j := Pr[h = j], j ∈ [k]; conditional means μ_{v,j} := E[x_v | h = j] ∈ R^{d_v}, j ∈ [k], v ∈ [ℓ]. Goal: estimate mixing weights and conditional means from independent copies of (x_1, x_2, ..., x_ℓ).

  13. Semi-parametric estimation task “Parameters” of component distributions: mixing weights w_j := Pr[h = j], j ∈ [k]; conditional means μ_{v,j} := E[x_v | h = j] ∈ R^{d_v}, j ∈ [k], v ∈ [ℓ]. Goal: estimate mixing weights and conditional means from independent copies of (x_1, x_2, ..., x_ℓ). Questions: 1. How do we estimate {w_j} and {μ_{v,j}} without observing h? 2. How many views ℓ are sufficient to learn with poly(k) computational / sample complexity?
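To pin down the estimation targets, the following sketch computes the plug-in estimates of w_j and μ_{v,j} that would be available if the hidden label h were observed (reusing the hypothetical multi-view sampler above); the whole point of the task is to recover these quantities without ever seeing h.

```python
import numpy as np

def oracle_estimates(h, views, k):
    """Plug-in estimates of w_j and mu_{v,j} using the hidden labels h.

    Only meant to make the estimation target concrete: the actual problem
    is to recover these quantities without observing h.
    """
    n = len(h)
    w_hat = np.bincount(h, minlength=k) / n                       # w_j = Pr[h = j]
    mu_hat = [np.stack([x_v[h == j].mean(axis=0) for j in range(k)])
              for x_v in views]                                   # mu_{v,j} = E[x_v | h = j]
    return w_hat, mu_hat

# Example, using h, (x1, x2, x3) from the sampler sketched earlier:
# w_hat, mu_hat = oracle_estimates(h, [x1, x2, x3], k=3)
```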

  14. Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem.

  15. Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, ’06).

  16. Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, ’06). Statistical barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate parameters, even if components are well-separated (Moitra-Valiant, ’10).

  17. Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, ’06). Statistical barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate parameters, even if components are well-separated (Moitra-Valiant, ’10). In practice: resort to local search (e.g., EM), often subject to slow convergence and inaccurate local optima.

  18. Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if one assumes a large minimum separation between component means (Dasgupta, ’99): sep := min_{i ≠ j} ‖μ_i − μ_j‖ / max{σ_i, σ_j}.

  19. Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if one assumes a large minimum separation between component means (Dasgupta, ’99): sep := min_{i ≠ j} ‖μ_i − μ_j‖ / max{σ_i, σ_j}. ◮ sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, ’99; Dasgupta-Schulman, ’00; Arora-Kannan, ’00) ◮ sep = Ω(k^c): first use PCA to project to k dimensions (Vempala-Wang, ’02; Kannan-Salmasian-Vempala, ’05; Achlioptas-McSherry, ’05) ◮ Also works for mixtures of log-concave distributions.

  20. Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if one assumes a large minimum separation between component means (Dasgupta, ’99): sep := min_{i ≠ j} ‖μ_i − μ_j‖ / max{σ_i, σ_j}. ◮ sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, ’99; Dasgupta-Schulman, ’00; Arora-Kannan, ’00) ◮ sep = Ω(k^c): first use PCA to project to k dimensions (Vempala-Wang, ’02; Kannan-Salmasian-Vempala, ’05; Achlioptas-McSherry, ’05) ◮ Also works for mixtures of log-concave distributions. ◮ No minimum separation requirement: method-of-moments, but exp(Ω(k)) running time / sample size (Kalai-Moitra-Valiant, ’10; Belkin-Sinha, ’10; Moitra-Valiant, ’10)
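The separation quantity on these slides is straightforward to evaluate when parameters are known; the sketch below (with made-up means and spherical scales, and a hypothetical helper name) simply implements the definition sep = min_{i ≠ j} ‖μ_i − μ_j‖ / max{σ_i, σ_j}, and is not part of any estimation algorithm.

```python
import numpy as np

def separation(mu, sigma):
    """sep = min over i != j of ||mu_i - mu_j|| / max(sigma_i, sigma_j).

    mu: (k, d) array of component means; sigma: length-k array of
    per-component scales (e.g., spherical standard deviations).
    """
    k = len(sigma)
    sep = np.inf
    for i in range(k):
        for j in range(i + 1, k):
            dist = np.linalg.norm(mu[i] - mu[j])
            sep = min(sep, dist / max(sigma[i], sigma[j]))
    return sep

mu = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])   # toy means
sigma = np.array([1.0, 1.0, 2.0])                      # toy scales
print(separation(mu, sigma))   # 2.0: the closest pair relative to its scales
```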

  21. Making progress: discrete hidden Markov models Hardness reductions create HMMs with degenerate output and next-state distributions. (Figure: output distributions over symbols 1–8 for three hidden states, with Pr[x_t = · | h_t = 1] ≈ 0.6 Pr[x_t = · | h_t = 2] + 0.4 Pr[x_t = · | h_t = 3].)

  22. Making progress: discrete hidden Markov models Hardness reductions create HMMs with degenerate output and next-state distributions. (Figure: output distributions over symbols 1–8 for three hidden states, with Pr[x_t = · | h_t = 1] ≈ 0.6 Pr[x_t = · | h_t = 2] + 0.4 Pr[x_t = · | h_t = 3].) These instances are avoided by assuming parameter matrices are full-rank (Mossel-Roch, ’06; Hsu-Kakade-Zhang, ’09)
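The following sketch builds a toy emission matrix with exactly the degenerate structure pictured on the slide: state 1's output distribution equals 0.6 times state 2's plus 0.4 times state 3's, so the matrix is rank-deficient and violates the full-rank condition. The specific probabilities are invented purely for illustration.

```python
import numpy as np

# Toy emission probabilities over 8 output symbols for 3 hidden states.
# The numbers are made up; what matters is the structure: column 0 is
# exactly 0.6 * column 1 + 0.4 * column 2, as in the slide's hardness
# picture, so the emission matrix is rank-deficient.
O = np.zeros((8, 3))
O[:, 1] = np.array([0.30, 0.05, 0.05, 0.10, 0.10, 0.10, 0.10, 0.20])  # Pr[x_t = . | h_t = 2]
O[:, 2] = np.array([0.05, 0.25, 0.25, 0.05, 0.10, 0.10, 0.10, 0.10])  # Pr[x_t = . | h_t = 3]
O[:, 0] = 0.6 * O[:, 1] + 0.4 * O[:, 2]                               # Pr[x_t = . | h_t = 1]

print(O.sum(axis=0))                # each column sums to 1: a valid distribution
print(np.linalg.matrix_rank(O))     # 2 < 3: violates the full-rank condition
```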

  23. What we do This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.

  24. What we do This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. ◮ Non-degeneracy condition for multi-view mixture model: conditional means {μ_{v,1}, μ_{v,2}, ..., μ_{v,k}} are linearly independent for each view v ∈ [ℓ], and all mixing weights w_j > 0. Requires high-dimensional observations (d_v ≥ k)!

  25. What we do This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. ◮ Non-degeneracy condition for multi-view mixture model: conditional means {μ_{v,1}, μ_{v,2}, ..., μ_{v,k}} are linearly independent for each view v ∈ [ℓ], and all mixing weights w_j > 0. Requires high-dimensional observations (d_v ≥ k)! ◮ New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs)

  26. What we do This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. ◮ Non-degeneracy condition for multi-view mixture model: conditional means {μ_{v,1}, μ_{v,2}, ..., μ_{v,k}} are linearly independent for each view v ∈ [ℓ], and all mixing weights w_j > 0. Requires high-dimensional observations (d_v ≥ k)! ◮ New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs) ◮ General tensor decomposition framework applicable to a wide variety of estimation problems.
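The non-degeneracy condition itself is easy to state as a check on given parameters; here is a sketch (with hypothetical inputs and a hypothetical helper name) that tests whether each view's conditional means are linearly independent and all mixing weights are positive.

```python
import numpy as np

def check_nondegeneracy(mus, w, tol=1e-8):
    """Check the slide's non-degeneracy condition on given parameters.

    mus: list over views v of (k, d_v) arrays whose rows are the
         conditional means mu_{v,1}, ..., mu_{v,k}.
    w:   length-k array of mixing weights.
    Returns True iff every view's means are linearly independent
    (which already forces d_v >= k) and every mixing weight is positive.
    """
    k = len(w)
    if not np.all(w > tol):
        return False
    for M in mus:
        # The k rows are linearly independent iff the matrix has rank k.
        if np.linalg.matrix_rank(M, tol=tol) < k:
            return False
    return True

rng = np.random.default_rng(2)
w = np.array([0.5, 0.3, 0.2])
good = [rng.normal(size=(3, d_v)) for d_v in (4, 6, 5)]   # generic means: full rank
bad = [np.ones((3, 2))] + good[1:]                        # d_v < k, so rank < k
print(check_nondegeneracy(good, w))   # True (with probability 1 for random means)
print(check_nondegeneracy(bad, w))    # False
```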

  27. Part 2. Multi-view method-of-moments. Outline: Multi-view mixture models; Multi-view method-of-moments (Overview; Structure of moments; Uniqueness of decomposition; Computing the decomposition; Asymmetric views); Some applications and open questions; Concluding remarks

  28. The plan ◮ First, assume views are (conditionally) exchangeable, and derive the basic algorithm.

  29. The plan ◮ First, assume views are (conditionally) exchangeable, and derive the basic algorithm. ◮ Then, provide a reduction from the general multi-view setting to the exchangeable case.
