A Method of Moments for Mixture Models and Hidden Markov Models


  1. A Method of Moments for Mixture Models and Hidden Markov Models. Anima Anandkumar (University of California, Irvine), Daniel Hsu (Microsoft Research, New England), Sham M. Kakade (Microsoft Research, New England)

  2. Outline
     1. Latent class models and parameter estimation
     2. Multi-view method of moments
     3. Some applications
     4. Concluding remarks

  3. 1. Latent class models and parameter estimation

  4–8. Latent class models / multi-view mixture models
     Random vectors h ∈ {e_1, e_2, ..., e_k} (the standard coordinate basis of R^k) and x_1, x_2, ..., x_ℓ ∈ R^d.
     [Diagram: hidden h with children x_1, x_2, ..., x_ℓ.]
     ◮ Bag-of-words clustering model: k = number of topics, d = vocabulary size, h = topic of the document, x_1, x_2, ..., x_ℓ ∈ {e_1, e_2, ..., e_d} = words in the document.
     ◮ Multi-view clustering: k = number of clusters, ℓ = number of views (e.g., audio, video, text); views are assumed conditionally independent given the cluster.
     ◮ Hidden Markov model (ℓ = 3): past, present, and future observations are conditionally independent given the present hidden state.
     ◮ etc.
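To make the model concrete, here is a minimal sampling sketch of the bag-of-words instantiation (not from the slides; all names and parameter values are illustrative), in which each view x_v is one word drawn independently given the topic h:

```python
import numpy as np

rng = np.random.default_rng(0)

k, d, n_views = 3, 10, 4  # number of topics, vocabulary size, words per document

# Hypothetical parameters: mixing weights w and a shared conditional-mean matrix M.
# Column j of M is the word distribution of topic j, i.e. M[i, j] = Pr[x = e_i | h = e_j].
w = rng.dirichlet(np.ones(k))            # shape (k,), sums to 1
M = rng.dirichlet(np.ones(d), size=k).T  # shape (d, k), each column sums to 1

def sample_document():
    """Draw one (h, x_1, ..., x_l) tuple from the latent class model."""
    h = rng.choice(k, p=w)                          # latent topic, never observed
    words = rng.choice(d, size=n_views, p=M[:, h])  # views are i.i.d. given h
    return h, words

topic, words = sample_document()
print("hidden topic:", topic, "observed word ids:", words)
```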

  9–11. Parameter estimation task
     Model parameters: mixing weights and conditional means
         w_j := Pr[h = e_j], j ∈ [k];
         μ_{v,j} := E[x_v | h = e_j] ∈ R^d, v ∈ [ℓ], j ∈ [k].
     Goal: given i.i.d. copies of (x_1, x_2, ..., x_ℓ), estimate the matrix of conditional means M_v := [μ_{v,1} | μ_{v,2} | ... | μ_{v,k}] for each view v ∈ [ℓ], and the mixing weights w := (w_1, w_2, ..., w_k).
     This is unsupervised learning, as h is not observed.
     This talk: a very general and computationally efficient method-of-moments estimator for w and the M_v.
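A side note on evaluation (not stated on the slides): because h is never observed, w and M_v can only be recovered up to a relabeling of the k classes, so an estimate is usually compared with the truth after matching columns. A hypothetical helper for doing so, assuming numpy and scipy are available:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_and_score(M_true, M_hat):
    """Permute the columns of M_hat to best match M_true and return the
    largest per-column estimation error. Illustrative evaluation helper only."""
    # cost[i, j] = Euclidean distance between true column i and estimated column j
    cost = np.linalg.norm(M_true[:, :, None] - M_hat[:, None, :], axis=0)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return M_hat[:, cols], cost[rows, cols].max()

# Example: a noisy "estimate" with shuffled columns is matched back correctly.
rng = np.random.default_rng(0)
M_true = rng.random((5, 3))
M_hat = M_true[:, [2, 0, 1]] + 0.01 * rng.standard_normal((5, 3))
aligned, err = align_and_score(M_true, M_hat)
print("max column error after alignment:", err)
```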

  12–14. Some barriers to efficient estimation
     Cryptographic barrier: HMM parameter estimation is as hard as learning parity functions with noise (Mossel-Roch, '06).
     Statistical barrier: mixtures of Gaussians in R^1 can require exp(Ω(k)) samples to estimate, even if the components are Ω(1/k)-separated (Moitra-Valiant, '10).
     Practitioners typically resort to local search heuristics (EM), which are plagued by slow convergence and inaccurate local optima.

  15–17. Making progress: Gaussian mixture models
     The problem becomes easier if one assumes a large minimum separation between the component means (Dasgupta, '99):
         sep := min_{i ≠ j} ||μ_i − μ_j|| / max{σ_i, σ_j}
     (a small computational sketch of this statistic follows below).
     ◮ sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, '99; Dasgupta-Schulman, '00; Arora-Kannan, '00).
     ◮ sep = Ω(k^c): first project to k dimensions with PCA (Vempala-Wang, '02; Kannan-Salmasian-Vempala, '05; Achlioptas-McSherry, '05).
     ◮ No minimum separation requirement: method of moments, but exp(Ω(k)) running time / sample size (Kalai-Moitra-Valiant, '10; Belkin-Sinha, '10; Moitra-Valiant, '10).
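The separation statistic can be computed directly from its definition; a small illustrative sketch (assuming spherical components described by `means` and `sigmas`, which are hypothetical inputs, not part of the talk):

```python
import numpy as np
from itertools import combinations

def separation(means, sigmas):
    """sep := min over pairs i != j of ||mu_i - mu_j|| / max(sigma_i, sigma_j)."""
    return min(
        np.linalg.norm(means[i] - means[j]) / max(sigmas[i], sigmas[j])
        for i, j in combinations(range(len(means)), 2)
    )

# Example with three spherical components in R^2.
means = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
sigmas = np.array([1.0, 1.0, 2.0])
print("sep =", separation(means, sigmas))  # 5 / 2 = 2.5 here
```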

  18–19. Making progress: hidden Markov models
     Hardness reductions create HMMs in which different states may have near-identical output and next-state distributions.
     [Figure: two near-identical output distributions, Pr[x_t = · | h_t = e_1] ≈ Pr[x_t = · | h_t = e_2], plotted over outputs 1–8.]
     These hard instances can be avoided by assuming the transition and output parameter matrices have full rank.
     ◮ d = k: eigenvalue decompositions (Chang, '96; Mossel-Roch, '06).
     ◮ d ≥ k: subspace identification + observable operator model (Hsu-Kakade-Zhang, '09).

  20–22. What we do
     This work: the notion of "full rank" parameter matrices is generic and very powerful; we adapt Chang's method to more general mixture models.
     ◮ Non-degeneracy condition for the latent class model: M_v has full column rank (for all v ∈ [ℓ]), and w > 0 entrywise (see the sketch after this list).
     ◮ New efficient learning results for:
        ◮ certain Gaussian mixture models, with no minimum separation requirement and poly(k) sample / computational complexity;
        ◮ HMMs with discrete or continuous output distributions (e.g., Gaussian mixture outputs).
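A literal, illustrative reading of the non-degeneracy condition as a numerical check (the function name and tolerance are assumptions, not part of the talk):

```python
import numpy as np

def is_non_degenerate(M_list, w, tol=1e-10):
    """Check the non-degeneracy condition: every conditional-mean matrix
    M_v (d x k) has full column rank k, and all mixing weights are strictly positive."""
    full_rank = all(np.linalg.matrix_rank(M, tol=tol) == M.shape[1] for M in M_list)
    return bool(full_rank) and bool(np.all(np.asarray(w) > 0))

# Generic random d x k matrices have full column rank with high probability.
rng = np.random.default_rng(0)
M_views = [rng.random((10, 3)) for _ in range(3)]
print(is_non_degenerate(M_views, w=[0.5, 0.3, 0.2]))  # True
```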

  23. 2. Multi-view method of moments

  24–25. Simplified model and low-order statistics
     Simplification: M_v ≡ M (the same conditional means for all views).
     If x_v ∈ {e_1, e_2, ..., e_d} (discrete outputs), then Pr[x_v = e_i | h = e_j] = M_{i,j} for i ∈ [d], j ∈ [k].
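Under this simplification and the conditional independence of the views, the pairwise moment satisfies E[x_1 x_2^T] = M diag(w) M^T, which is the kind of low-order observable statistic the method builds on. A quick numerical sanity check of that identity (illustrative; all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n = 3, 10, 50_000

# Hypothetical ground truth: columns of M are the conditional output distributions.
w = rng.dirichlet(np.ones(k))
M = rng.dirichlet(np.ones(d), size=k).T  # M[i, j] = Pr[x_v = e_i | h = e_j]

# Sample n pairs (x_1, x_2): both views drawn independently from column h of M.
h = rng.choice(k, size=n, p=w)
x1 = np.array([rng.choice(d, p=M[:, j]) for j in h])
x2 = np.array([rng.choice(d, p=M[:, j]) for j in h])

# Empirical pairwise moment E[x_1 x_2^T] (with x_v as one-hot vectors)
# versus its population value M diag(w) M^T.
pairs_hat = np.zeros((d, d))
np.add.at(pairs_hat, (x1, x2), 1.0 / n)
pairs_pop = M @ np.diag(w) @ M.T
print("max entrywise error:", np.abs(pairs_hat - pairs_pop).max())
```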
