Example: Diagonal Covariance

\mu_1 = \frac{1}{1033}(74.34 + 73.92 + 72.01 + \cdots) = 73.71
\mu_2 = \frac{1}{1033}(181.29 + 213.79 + 209.52 + \cdots) = 201.69
\sigma_1^2 = \frac{1}{1033}\bigl((74.34 - 73.71)^2 + (73.92 - 73.71)^2 + \cdots\bigr) = 5.43
\sigma_2^2 = \frac{1}{1033}\bigl((181.29 - 201.69)^2 + (213.79 - 201.69)^2 + \cdots\bigr) = 440.62

39 / 106
Example: Diagonal Covariance (figure: scatter of the data with the fitted diagonal-covariance Gaussian) 40 / 106
Example: Full Covariance

Mean and diagonal elements of the covariance matrix are the same as in the diagonal case. Off-diagonal element:

\Sigma_{12} = \Sigma_{21} = \frac{1}{1033}\bigl[(74.34 - 73.71)(181.29 - 201.69) + (73.92 - 73.71)(213.79 - 201.69) + \cdots\bigr] = 25.43

\mu = \begin{bmatrix} 73.71 \\ 201.69 \end{bmatrix} \qquad
\Sigma = \begin{bmatrix} 5.43 & 25.43 \\ 25.43 & 440.62 \end{bmatrix}

41 / 106
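Not part of the original slides: a minimal numpy sketch of these count-and-normalize estimates. The data array here is a small placeholder, not the actual 1033-point set.

```python
import numpy as np

# X: N x 2 array of (feature 1, feature 2) observations, e.g. the 1033 points above.
# Placeholder data for illustration; any real-valued array works.
X = np.array([[74.34, 181.29],
              [73.92, 213.79],
              [72.01, 209.52]])  # ... plus the remaining points

N = X.shape[0]
mu = X.mean(axis=0)                      # MLE mean: sum and divide by N

centered = X - mu
var_diag = (centered ** 2).mean(axis=0)  # diagonal covariance: per-dimension variance
Sigma_full = centered.T @ centered / N   # full covariance (MLE uses 1/N, not 1/(N-1))

print(mu, var_diag, Sigma_full)
```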
Example: Full Covariance (figure: scatter of the data with the fitted full-covariance Gaussian) 42 / 106
Recap: Gaussians Lots of data “looks” Gaussian. Central limit theorem. ML estimation of Gaussians is easy. Count and normalize. In ASR, mostly use diagonal covariance Gaussians. Full covariance matrices have too many parameters. 43 / 106
Part II Gaussian Mixture Models 44 / 106
Problems with Gaussian Assumption 45 / 106
Problems with Gaussian Assumption Sample from MLE Gaussian trained on data on last slide. Not all data is Gaussian! 46 / 106
Problems with Gaussian Assumption What can we do? What about two Gaussians? P ( x ) = p 1 × N ( µ 1 , Σ 1 ) + p 2 × N ( µ 2 , Σ 2 ) where p 1 + p 2 = 1. 47 / 106
Gaussian Mixture Models (GMM's)

More generally, can use arbitrary number of Gaussians:

P(x) = \sum_j p_j \, \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)}

where \sum_j p_j = 1 and all p_j \ge 0.
Also called mixture of Gaussians.
Can approximate any distribution of interest pretty well . . .
If just use enough component Gaussians.

48 / 106
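As an illustration of this density (added here, not from the original slides), a minimal Python sketch that evaluates a GMM with made-up parameters; scipy supplies the per-component Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D GMM parameters, for illustration only.
weights = np.array([0.3, 0.5, 0.2])                    # p_j, sum to 1
means   = [np.array([0.0, 0.0]),
           np.array([3.0, 1.0]),
           np.array([-2.0, 4.0])]                      # mu_j
covs    = [np.eye(2), np.diag([2.0, 0.5]), np.eye(2)]  # Sigma_j

def gmm_pdf(x, weights, means, covs):
    """P(x) = sum_j p_j N(x; mu_j, Sigma_j)."""
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=c)
               for p, m, c in zip(weights, means, covs))

print(gmm_pdf(np.array([1.0, 0.5]), weights, means, covs))
```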
Example: Some Real Acoustic Data 49 / 106
Example: 10-component GMM (Sample) 50 / 106
Example: 10-component GMM ( µ ’s, σ ’s) 51 / 106
ML Estimation For GMM's Given training data, how to estimate parameters . . . i.e., the \mu_j, \Sigma_j, and mixture weights p_j . . . To maximize likelihood of data? No closed-form solution. Can't just count and normalize. Instead, must use an optimization technique . . . To find good local optimum in likelihood. Gradient search; Newton's method. Tool of choice: The Expectation-Maximization Algorithm. 52 / 106
Where Are We? 1. The Expectation-Maximization Algorithm 2. Applying the EM Algorithm to GMM's 53 / 106
Wake Up! This is another key thing to remember from the course. Used to train GMM's, HMM's, and lots of other things. Key paper in 1977 by Dempster, Laird, and Rubin [2]; 43,958 citations to date. "The innovative Dempster-Laird-Rubin paper in the Journal of the Royal Statistical Society received an enthusiastic discussion at the Royal Statistical Society meeting . . . calling the paper 'brilliant.'" 54 / 106
What Does The EM Algorithm Do? Finds ML parameter estimates for models . . . With hidden variables. Iterative hill-climbing method. Adjusts parameter estimates in each iteration . . . Such that likelihood of data . . . Increases (weakly) with each iteration. Actually, finds local optimum for parameters in likelihood. 55 / 106
What is a Hidden Variable? A random variable that isn't observed. Example: in GMM's, the output prob depends on . . . The mixture component that generated the observation . . . But you can't observe it. Important concept. Let's discuss! 56 / 106
Mixtures and Hidden Variables

So, to compute prob of observed x, need to sum over . . . All possible values of hidden variable h:

P(x) = \sum_h P(h, x) = \sum_h P(h) P(x \mid h)

Consider probability distribution that is a mixture of Gaussians:

P(x) = \sum_j p_j N(\mu_j, \Sigma_j)

Can be viewed as hidden model.
h \Leftrightarrow which component generated sample.
P(h) = p_j; \quad P(x \mid h) = N(\mu_j, \Sigma_j).

P(x) = \sum_h P(h) P(x \mid h)

57 / 106
The Basic Idea If we nail down "hidden" value for each x_i . . . Model is no longer hidden! e.g., data partitioned among GMM components. So for each data point x_i, assign single hidden value h_i. Take h_i = argmax_h P(h) P(x_i | h). e.g., identify GMM component generating each point. Easy to train parameters in non-hidden models. Update parameters in P(h), P(x | h). e.g., count and normalize to get MLE for \mu_j, \Sigma_j, p_j. Repeat! 58 / 106
The Basic Idea

Hard decision: For each x_i, assign single h_i = argmax_h P(h, x_i) . . . With count 1.
Test: what is P(h, x_i) for Gaussian distribution?
Soft decision: For each x_i, compute for every h . . . the posterior prob

\tilde{P}(h \mid x_i) = \frac{P(h, x_i)}{\sum_h P(h, x_i)}

Also called the "fractional count".
e.g., partition event across every GMM component.
Rest of algorithm unchanged.

59 / 106
The Basic Idea, using more Formal Terminology

Initialize parameter values somehow.
For each iteration . . .
Expectation step: compute posterior (count) of h for each x_i:

\tilde{P}(h \mid x_i) = \frac{P(h, x_i)}{\sum_h P(h, x_i)}

Maximization step: update parameters. Instead of data x_i with hidden h, pretend . . . Non-hidden data where . . . (Fractional) count of each (h, x_i) is \tilde{P}(h \mid x_i).

60 / 106
Example: Training a 2-component GMM

Two-component univariate GMM; 10 data points.
The data: x_1, ..., x_10
8.4, 7.6, 4.2, 2.6, 5.1, 4.0, 7.8, 3.0, 4.8, 5.8
Initial parameter values:

p_1 = 0.5, \mu_1 = 4, \sigma_1^2 = 1;   p_2 = 0.5, \mu_2 = 7, \sigma_2^2 = 1

(figure: training data; densities of initial Gaussians)

61 / 106
The E Step

 x_i   p_1 N_1   p_2 N_2   P(x_i)   \tilde{P}(1|x_i)   \tilde{P}(2|x_i)
 8.4   0.0000    0.0749    0.0749   0.000              1.000
 7.6   0.0003    0.1666    0.1669   0.002              0.998
 4.2   0.1955    0.0040    0.1995   0.980              0.020
 2.6   0.0749    0.0000    0.0749   1.000              0.000
 5.1   0.1089    0.0328    0.1417   0.769              0.231
 4.0   0.1995    0.0022    0.2017   0.989              0.011
 7.8   0.0001    0.1448    0.1450   0.001              0.999
 3.0   0.1210    0.0001    0.1211   0.999              0.001
 4.8   0.1448    0.0177    0.1626   0.891              0.109
 5.8   0.0395    0.0971    0.1366   0.289              0.711

P(h, x_i) = p_h \cdot N_h, \qquad \tilde{P}(h \mid x_i) = \frac{P(h, x_i)}{P(x_i)}, \qquad P(x_i) = \sum_{h \in \{1,2\}} P(h, x_i)

62 / 106
The M Step

View: have non-hidden corpus for each component GMM.
For h-th component, have \tilde{P}(h \mid x_i) counts for event x_i.
Estimating \mu: fractional events.

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \quad\Rightarrow\quad \mu_h = \frac{1}{\sum_{i=1}^{N} \tilde{P}(h \mid x_i)} \sum_{i=1}^{N} \tilde{P}(h \mid x_i)\, x_i

\mu_1 = \frac{1}{0.000 + 0.002 + 0.980 + \cdots} \times (0.000 \times 8.4 + 0.002 \times 7.6 + 0.980 \times 4.2 + \cdots) = 3.98

Similarly, can estimate \sigma_h^2 with fractional events.

63 / 106
The M Step (cont'd)

What about the mixture weights p_h?
To find MLE, count and normalize!

p_1 = \frac{0.000 + 0.002 + 0.980 + \cdots}{10} = 0.59

64 / 106
The End Result

 iter   p_1    \mu_1   \sigma_1^2   p_2    \mu_2   \sigma_2^2
 0      0.50   4.00    1.00         0.50   7.00    1.00
 1      0.59   3.98    0.92         0.41   7.29    1.29
 2      0.62   4.03    0.97         0.38   7.41    1.12
 3      0.64   4.08    1.00         0.36   7.54    0.88
 10     0.70   4.22    1.13         0.30   7.93    0.12

65 / 106
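For concreteness, here is a minimal numpy sketch of this EM loop (added for illustration, not from the original slides); run on the 10 data points and the initial parameters above, it should reproduce the table up to rounding:

```python
import numpy as np

x = np.array([8.4, 7.6, 4.2, 2.6, 5.1, 4.0, 7.8, 3.0, 4.8, 5.8])

# Initial parameters from the earlier slide.
p   = np.array([0.5, 0.5])
mu  = np.array([4.0, 7.0])
var = np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for it in range(10):
    # E step: fractional counts P~(h | x_i), one row per data point.
    joint = p * normal_pdf(x[:, None], mu, var)        # shape (10, 2): p_h * N_h
    post = joint / joint.sum(axis=1, keepdims=True)    # posteriors; rows sum to 1

    # M step: count and normalize with fractional counts.
    counts = post.sum(axis=0)
    p = counts / len(x)
    mu = (post * x[:, None]).sum(axis=0) / counts
    var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / counts

    print(it + 1, np.round(p, 2), np.round(mu, 2), np.round(var, 2))
```

Note that the M step re-estimates each variance around the newly updated mean, which is what the numbers in the table reflect.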
First Few Iterations of EM (figures for iter 0, iter 1, iter 2) 66 / 106
Later Iterations of EM (figures for iter 2, iter 3, iter 10) 67 / 106
Why the EM Algorithm Works

x = (x_1, x_2, ...) = whole training set; h = hidden.
\theta = parameters of model.
Objective function for MLE: (log) likelihood.

L(\theta) = \log P_\theta(x) = \log P_\theta(x, h) - \log P_\theta(h \mid x)

Form expectation with respect to \theta_n, the estimate of \theta on the n-th estimation iteration:

\sum_h P_{\theta_n}(h \mid x) \log P_\theta(x) = \sum_h P_{\theta_n}(h \mid x) \log P_\theta(x, h) - \sum_h P_{\theta_n}(h \mid x) \log P_\theta(h \mid x)

Rewrite as:

\log P_\theta(x) = Q(\theta \mid \theta_n) + H(\theta \mid \theta_n)

68 / 106
Why the EM Algorithm Works

\log P_\theta(x) = Q(\theta \mid \theta_n) + H(\theta \mid \theta_n)

What is Q? In the Gaussian example above, Q is just

\sum_h P_{\theta_n}(h \mid x) \log \bigl[ p_h N_x(\mu_h, \Sigma_h) \bigr]

It can be shown (using Gibbs' inequality) that H(\theta \mid \theta_n) \ge H(\theta_n \mid \theta_n) for any \theta \ne \theta_n.
So any choice of \theta that increases Q will increase \log P_\theta(x).
Typically we just pick \theta to maximize Q altogether; this can often be done in closed form.

69 / 106
The E Step Compute Q . 70 / 106
The M Step Maximize Q with respect to \theta. Then repeat (E/M, E/M, ...) until the likelihood stops improving significantly. That's the E-M algorithm in a nutshell! 71 / 106
Discussion The EM algorithm is an elegant and general way to . . . Train parameters in hidden models . . . To optimize likelihood. Only finds local optimum. Seeding is of paramount importance. 72 / 106
Where Are We? 1. The Expectation-Maximization Algorithm 2. Applying the EM Algorithm to GMM's 73 / 106
Another Example Data Set 74 / 106
Question: How Many Gaussians?

Method 1 (most common): Guess!
Method 2: Bayesian Information Criterion (BIC) [1].
Penalize likelihood by number of parameters.

BIC(C_k) = \sum_{j=1}^{k} \Bigl\{ -\frac{1}{2} n_j \log |\Sigma_j| \Bigr\} - N k \Bigl( d + \frac{1}{2}\, d(d+1) \Bigr)

k = Gaussian components.
d = dimension of feature vector.
n_j = data points for Gaussian j; N = total data points.
Discuss!

75 / 106
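In practice one often just fits GMMs of several sizes and compares an information criterion. As a sketch (not from the slides), scikit-learn's GaussianMixture exposes a BIC score directly; note its convention differs from the formula above (lower is better, and the penalty uses log N per parameter):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder 2-D data for illustration (two well-separated clumps).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(4.0, 1.5, size=(500, 2))])

# Fit GMMs with different numbers of components and compare BIC (lower is better).
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=0).fit(X)
    print(k, gmm.bic(X))
```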
The Bayesian Information Criterion View GMM as way of coding data for transmission. Cost of transmitting model ⇔ number of params. Cost of transmitting data ⇔ log likelihood of data. Choose number of Gaussians to minimize cost. 76 / 106
Question: How To Initialize Parameters? Set mixture weights p_j to 1/k (for k Gaussians). Pick k data points at random and . . . Use them to seed initial values of \mu_j. Set all \sigma's to arbitrary value . . . Or to global variance of data. Extension: generate multiple starting points. Pick one with highest likelihood. 77 / 106
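A minimal sketch of this random-seeding recipe with multiple restarts (added for illustration; fit_gmm_em() is a hypothetical EM trainer, e.g. a loop like the one sketched earlier):

```python
import numpy as np

def init_gmm(X, k, rng):
    """Random initialization as described above: uniform mixture weights,
    means seeded from k random data points, variances set to the global
    per-dimension variance of the data (diagonal covariances)."""
    n, d = X.shape
    weights = np.full(k, 1.0 / k)
    means = X[rng.choice(n, size=k, replace=False)]
    variances = np.tile(X.var(axis=0), (k, 1))
    return weights, means, variances

# Multiple restarts (sketch): train from several random seeds and keep the
# run with the highest log likelihood. fit_gmm_em() is hypothetical and is
# assumed to return (params, log_likelihood).
#
# candidates = [fit_gmm_em(X, *init_gmm(X, k, np.random.default_rng(s)))
#               for s in range(5)]
# best_params, best_ll = max(candidates, key=lambda c: c[1])
```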
Another Way: Splitting Start with single Gaussian, MLE. Repeat until hit desired number of Gaussians: Double number of Gaussians by perturbing means . . . Of existing Gaussians by ±ε. Run several iterations of EM. 78 / 106
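A sketch of one split step (not from the slides; the perturbation size eps, the diagonal-covariance representation, and the run_em() helper are assumptions):

```python
import numpy as np

def split_gaussians(weights, means, variances, eps=0.2):
    """Double the number of Gaussians by perturbing each mean by +/- eps."""
    new_w = np.concatenate([weights, weights]) / 2.0    # each child gets half the parent's weight
    new_m = np.concatenate([means + eps, means - eps])  # perturbed copies of each mean
    new_v = np.concatenate([variances, variances])      # children inherit parent variances
    return new_w, new_m, new_v

# Usage (sketch): start from the single-Gaussian MLE, then alternate splitting
# and a few EM iterations until the desired mixture size is reached.
# while len(weights) < target_k:
#     weights, means, variances = split_gaussians(weights, means, variances)
#     weights, means, variances = run_em(X, weights, means, variances, iters=5)  # hypothetical
```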
Question: How Long To Train? i.e. , how many iterations of EM? Guess. Look at performance on training data. Stop when change in log likelihood per event . . . Is below fixed threshold. Look at performance on held-out data. Stop when performance no longer improves. 79 / 106
The Data Set 80 / 106
Sample From Best 1-Component GMM 81 / 106
The Data Set, Again 82 / 106
20-Component GMM Trained on Data 83 / 106
20-Component GMM µ ’s, σ ’s 84 / 106
Acoustic Feature Data Set 85 / 106
5-Component GMM; Starting Point A 86 / 106
5-Component GMM; Starting Point B 87 / 106
5-Component GMM; Starting Point C 88 / 106
Solutions With Infinite Likelihood

Consider log likelihood; two-component 1d Gaussian:

\sum_{i=1}^{N} \ln \Bigl( p_1 \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}} + p_2 \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(x_i - \mu_2)^2}{2\sigma_2^2}} \Bigr)

If \mu_1 = x_1, above reduces to

\ln \Bigl( \frac{1}{2\sqrt{2\pi}\,\sigma_1} + \frac{1}{2\sqrt{2\pi}\,\sigma_2} e^{-\frac{(x_1 - \mu_2)^2}{2\sigma_2^2}} \Bigr) + \sum_{i=2}^{N} \ldots

which goes to \infty as \sigma_1 \to 0.
Only consider finite local maxima of likelihood function.
Variance flooring.
Throw away Gaussians with "count" below threshold.

89 / 106
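The two safeguards mentioned above, sketched in code (added for illustration; the floor value and count threshold are arbitrary choices, and diagonal covariances stored as numpy arrays are assumed):

```python
import numpy as np

def floor_variances(variances, floor=1e-3):
    """Variance flooring: clamp each (diagonal) variance to a minimum value
    so no Gaussian can collapse onto a single data point (sigma -> 0)."""
    return np.maximum(variances, floor)

def prune_low_count(weights, means, variances, counts, min_count=1.0):
    """Drop Gaussians whose fractional count fell below a threshold,
    then renormalize the remaining mixture weights."""
    keep = counts >= min_count
    w = weights[keep]
    return w / w.sum(), means[keep], variances[keep]
```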
Recap GMM's are effective for modeling arbitrary distributions. State-of-the-art in ASR for decades (though may be superseded by NNs at some point; discussed later in the course). The EM algorithm is the primary tool for training GMM's (and lots of other things). Very sensitive to starting point. Initializing GMM's is an art. 90 / 106
References [1] S. Chen and P.S. Gopalakrishnan, "Clustering via the Bayesian Information Criterion with Applications in Speech Recognition", ICASSP, vol. 2, pp. 645–648, 1998. [2] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, 1977. 91 / 106
What's Next: Hidden Markov Models

Replace DTW with probabilistic counterpart.
Together, GMM's and HMM's comprise . . . Unified probabilistic framework.

Old paradigm:
w^* = \arg\min_{w \in \text{vocab}} \text{distance}(A'_{\text{test}}, A'_w)

New paradigm:
w^* = \arg\max_{w \in \text{vocab}} P(A'_{\text{test}} \mid w)

92 / 106
Part III Introduction to Hidden Markov Models 93 / 106
Introduction to Hidden Markov Models The issue of weights in DTW. Interpretation of DTW grid as Directed Graph. Adding Transition and Output Probabilities to the Graph gives us an HMM! The three main HMM operations. 94 / 106
Another Issue with Dynamic Time Warping Weights are completely heuristic! Maybe we can learn weights from data? Take many utterances . . . 95 / 106
Learning Weights From Data For each node in DP path, count number of times move up ↑, right →, and diagonally ր. Normalize the count for each direction by total number of times node was actually visited (= C/N). Take some constant times the reciprocal as the weight (α N/C). Example: particular node visited 100 times. Move ր 40 times; → 20 times; ↑ 40 times. Set weights to 2.5, 5, and 2.5 (or 1, 2, and 1). Point: weight distribution should reflect . . . Which directions are taken more frequently at a node. Weight estimation not addressed in DTW . . . But central part of Hidden Markov models. 96 / 106
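A tiny sketch of this count-and-normalize weight estimate, using the numbers from the example above (added for illustration):

```python
# Count-and-normalize weight estimation: a node visited N = 100 times,
# with per-direction counts C.
counts = {"diag": 40, "right": 20, "up": 40}
N = sum(counts.values())
alpha = 1.0  # arbitrary scaling constant

weights = {d: alpha * N / C for d, C in counts.items()}
print(weights)  # {'diag': 2.5, 'right': 5.0, 'up': 2.5}
```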
DTW and Directed Graphs Take following Dynamic Time Warping setup: Let’s look at representation of this as directed graph: 97 / 106
DTW and Directed Graphs Another common DTW structure: As a directed graph: Can represent even more complex DTW structures . . . Resultant directed graphs can get quite bizarre. 98 / 106
Path Probabilities Let's assign probabilities to transitions in directed graph: a_{ij} is the transition probability going from state i to state j, where \sum_j a_{ij} = 1. Can compute probability P of individual path just using transition probabilities a_{ij}. 99 / 106
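For example (a sketch with a made-up 3-state transition matrix, not from the slides):

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row sums to 1.
A = np.array([[0.6, 0.3, 0.1],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])

def path_probability(A, path):
    """Probability of a path = product of transition probabilities a_ij
    along its arcs (observation likelihoods are added on the next slide)."""
    p = 1.0
    for i, j in zip(path[:-1], path[1:]):
        p *= A[i, j]
    return p

print(path_probability(A, [0, 0, 1, 1, 2]))  # 0.6 * 0.3 * 0.7 * 0.3
```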
Path Probabilities It is common to reorient typical DTW pictures: Above only describes path probabilities associated with transitions. Also need to include likelihoods associated with observations . 100 / 106