Lecture 5: The Big Picture / Language Modeling. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom. Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA. {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com. 17 February 2016
Administrivia Slides posted before lecture may not match lecture. Lab 1 Not graded yet; will be graded by next lecture? Awards ceremony for evaluation next week. Grading: what’s up with the optional exercises? Lab 2 Due nine days from now (Friday, Feb. 26) at 6pm. Start early! Avail yourself of Piazza. 2 / 99
Feedback Clear (4); mostly clear (2); unclear (3). Pace: fast (3); OK (2). Muddiest: HMM’s in general (1); Viterbi (1); FB (1). Comments (2+ votes): want better/clearer examples (5) spend more time walking through examples (3) spend more time on high-level intuition before getting into details (3) good examples (2) 3 / 99
Celebrity Sighting New York Times 4 / 99
Part I The HMM/GMM Framework 5 / 99
Where Are We? 1 Review from 10,000 Feet 2 The Model 3 Training 4 Decoding 5 Technical Details 6 / 99
The Raw Data [waveform plot: amplitude vs. sample index] What do we do with waveforms? 7 / 99
Front End Processing Convert waveform to features. 8 / 99
What Have We Gained? Time domain ⇒ frequency domain. Removed vocal-fold excitation. Made features independent. 9 / 99
ASR 1.0: Dynamic Time Warping 10 / 99
Computing the Distance Between Utterances Find “best” alignment between frames. Sum distances between aligned frames. Sum penalties for “weird” alignments. 11 / 99
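The alignment search is a standard dynamic program. Below is a minimal sketch, assuming Euclidean frame distances and a single fixed penalty for the non-diagonal ("weird") steps; the names (`frame_dist`, `dtw_distance`, `skew_penalty`) and the exact penalty scheme are illustrative choices, not the lecture's.

```python
import math

def frame_dist(u, v):
    # Euclidean distance between two equal-length feature frames.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(xs, ys, skew_penalty=1.0):
    # xs, ys: lists of frame vectors. Returns the best alignment cost:
    # summed frame distances plus penalties for non-diagonal steps.
    T, U = len(xs), len(ys)
    INF = float("inf")
    # D[t][u] = best cost of aligning xs[:t+1] with ys[:u+1].
    D = [[INF] * U for _ in range(T)]
    for t in range(T):
        for u in range(U):
            d = frame_dist(xs[t], ys[u])
            if t == 0 and u == 0:
                D[t][u] = d
                continue
            best = INF
            if t > 0 and u > 0:                    # diagonal step: no penalty
                best = min(best, D[t - 1][u - 1])
            if t > 0:                              # skip a frame: penalize
                best = min(best, D[t - 1][u] + skew_penalty)
            if u > 0:                              # repeat a frame: penalize
                best = min(best, D[t][u - 1] + skew_penalty)
            D[t][u] = best + d
    return D[T - 1][U - 1]
```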
ASR 2.0: The HMM/GMM Framework 12 / 99
Notation 13 / 99
How Do We Do Recognition? $x^{\text{test}}$ = test features; $P_\omega(x)$ = word model. (answer) = ??? (answer) $= \arg\max_{\omega \in \text{vocab}} P_\omega(x^{\text{test}})$ Return the word whose model assigns the highest prob to the utterance. 14 / 99
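In code the decision rule is just an argmax over per-word likelihoods. A tiny sketch, assuming each word model is exposed as a callable returning $P_\omega(x)$ (a hypothetical interface, not any particular toolkit's):

```python
def recognize(x_test, word_models):
    # word_models: dict mapping word -> callable returning P_word(x).
    return max(word_models, key=lambda w: word_models[w](x_test))
```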
Putting it All Together $P_\omega(x) = ???$ How do we actually train? How do we actually decode? "It's a puzzlement" by jubgo. Some rights reserved. 15 / 99
Where Are We? 1 Review from 10,000 Feet 2 The Model 3 Training 4 Decoding 5 Technical Details 16 / 99
So What’s the Model? $P_\omega(x) = ???$ Frequency that word $\omega$ generates features $x$. Has something to do with HMM's and GMM's. "Untitled" by Daniel Oines. Some rights reserved. 17 / 99
A Word Is A Sequence of Sounds e.g., the word ONE: W → AH → N. Phoneme inventory: AA AE AH AO AW AX AXR AY B BD CH D DD DH DX EH ER EY F G GD HH IH IX IY JH K KD L M N NG OW OY P PD R S SH T TD TH TS UH UW V W X Y Z ZH. What sounds make up TWO? What do we use to model sequences? 18 / 99
HMM, v1.0 Outputs on arcs, not states. What’s the problem? What are the outputs? 19 / 99
HMM, v2.0 What’s the problem? How many frames per phoneme? 20 / 99
HMM, v3.0 Are we done? 21 / 99
Concept: Alignment ⇔ Path Path through HMM ⇒ sequence of arcs, one per frame. Notation: $A = a_1 \cdots a_T$. $a_t$ = which arc generated frame $t$. 22 / 99
The Game Plan Express $P_\omega(x)$, the total prob of $x$ . . . in terms of $P_\omega(x, A)$, the prob of a single path. How?
$$P(x) = \sum_{\text{paths } A} (\text{path prob}) = \sum_{\text{paths } A} P(x, A)$$
Sum over all paths. 23 / 99
How To Compute the Likelihood of a Path? Path: $A = a_1 \cdots a_T$.
$$P(x, A) = \prod_{t=1}^{T} (\text{arc prob}) \times (\text{output prob}) = \prod_{t=1}^{T} p_{a_t} \times P(\vec{x}_t \mid a_t)$$
Multiply arc, output probs along path. 24 / 99
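A sketch of that product in log space (the usual trick to avoid underflow); `arc_prob` and `output_logprob` are assumed lookups for $p_{a_t}$ and $\log P(\vec{x}_t \mid a_t)$, not part of any real toolkit:

```python
import math

def path_log_likelihood(frames, path, arc_prob, output_logprob):
    # frames: list of feature vectors; path: list of arc ids, one per frame.
    # Adds log transition prob and log output prob for each frame's arc.
    assert len(frames) == len(path), "one arc per frame"
    total = 0.0
    for frame, a in zip(frames, path):
        total += math.log(arc_prob[a]) + output_logprob(frame, a)
    return total
```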
What Do Output Probabilities Look Like? Mixture of diagonal-covariance Gaussians.
$$P(\vec{x} \mid a) = \sum_{\text{comp } j} (\text{mixture wgt}) \prod_{\text{dim } d} (\text{Gaussian for dim } d) = \sum_{\text{comp } j} p_{a,j} \prod_{\text{dim } d} \mathcal{N}(x_d;\, \mu_{a,j,d},\, \sigma^2_{a,j,d})$$
25 / 99
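The same formula as a log-probability computation, using log-sum-exp over components for numerical stability. Parameter names mirror the slides (weights $p_{a,j}$, means $\mu_{a,j,d}$, variances $\sigma^2_{a,j,d}$); the list-of-lists layout is just an assumption of the sketch.

```python
import math

def gmm_log_prob(frame, weights, means, variances):
    # Diagonal-covariance GMM: weighted sum over components of a product
    # of per-dimension Gaussians, computed in the log domain.
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_term = math.log(w)
        for x_d, m_d, v_d in zip(frame, mu, var):
            log_term += -0.5 * (math.log(2 * math.pi * v_d)
                                + (x_d - m_d) ** 2 / v_d)
        log_terms.append(log_term)
    m = max(log_terms)                       # log-sum-exp over components
    return m + math.log(sum(math.exp(t - m) for t in log_terms))
```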
The Full Model
$$P(x) = \sum_{\text{paths } A} P(x, A) = \sum_{\text{paths } A} \prod_{t=1}^{T} p_{a_t} \times P(\vec{x}_t \mid a_t)$$
$$= \sum_{\text{paths } A} \prod_{t=1}^{T} p_{a_t} \sum_{\text{comp } j} p_{a_t,j} \prod_{\text{dim } d} \mathcal{N}(x_{t,d};\, \mu_{a_t,j,d},\, \sigma^2_{a_t,j,d})$$
$p_a$ — transition probability for arc $a$. $p_{a,j}$ — mixture weight, $j$th component of GMM on arc $a$. $\mu_{a,j,d}$ — mean, $d$th dim, $j$th component, GMM on arc $a$. $\sigma^2_{a,j,d}$ — variance, $d$th dim, $j$th component, GMM on arc $a$. 26 / 99
Pop Quiz What was the equation on the last slide? 27 / 99
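One literal (and hopelessly inefficient) answer in code: enumerate every length-$T$ path through the HMM and sum $P(x, A)$, exactly as the full-model equation says. The graph representation (each state mapping to its outgoing (arc_id, destination) pairs) is an assumption of the sketch; the Forward algorithm later computes the same sum efficiently.

```python
def full_model_prob(frames, arcs_from, start_state, final_state,
                    arc_prob, output_prob):
    # arcs_from: dict state -> list of (arc_id, dst_state).
    # Recursively sums over every path that emits one frame per arc and
    # ends in final_state. Exponential in len(frames): intuition only.
    def walk(state, t):
        if t == len(frames):
            return 1.0 if state == final_state else 0.0
        total = 0.0
        for arc_id, dst in arcs_from.get(state, []):
            total += (arc_prob[arc_id]
                      * output_prob(frames[t], arc_id)
                      * walk(dst, t + 1))
        return total
    return walk(start_state, 0)
```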
Where Are We? 1 Review from 10,000 Feet 2 The Model 3 Training 4 Decoding 5 Technical Details 28 / 99
Training How to create model $P_\omega(x)$ from examples $x_{\omega,1}, x_{\omega,2}, \ldots$? 29 / 99
What is the Goal of Training? To estimate parameters . . . to maximize likelihood of training data. "Crossfit 0303" by Runar Eilertsen. Some rights reserved. 30 / 99
What Are the Model Parameters? $p_a$ — transition probability for arc $a$. $p_{a,j}$ — mixture weight, $j$th component of GMM on arc $a$. $\mu_{a,j,d}$ — mean, $d$th dim, $j$th component, GMM on arc $a$. $\sigma^2_{a,j,d}$ — variance, $d$th dim, $j$th component, GMM on arc $a$. 31 / 99
Warm-Up: Non-Hidden ML Estimation e.g., Gaussian estimation, non-hidden Markov Models. How to do this? (Hint: ??? and ???.)
parameter | description | statistic
$p_a$ | arc prob | # times arc taken
$p_{a,j}$ | mixture wgt | # times component used
$\mu_{a,j,d}$ | mean | $x_d$
$\sigma^2_{a,j,d}$ | variance | $x_d^2$
Count and normalize. i.e., collect a statistic; divide by normalizer count. 32 / 99
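A sketch of count-and-normalize for the non-hidden case, shown for the Gaussian rows of the table above (one single Gaussian per arc). The data layout, with the generating arc known for every frame, is an assumption of the sketch.

```python
from collections import defaultdict

def train_single_gaussians(aligned_data):
    # aligned_data: list of (arc_id, frame) pairs; the arc is known (non-hidden).
    # Collect statistics per arc, then divide by the normalizer (the count).
    count = defaultdict(float)
    sum_x, sum_x2 = {}, {}
    for arc, frame in aligned_data:
        count[arc] += 1.0
        if arc not in sum_x:
            sum_x[arc] = [0.0] * len(frame)
            sum_x2[arc] = [0.0] * len(frame)
        for d, x in enumerate(frame):
            sum_x[arc][d] += x
            sum_x2[arc][d] += x * x
    means, variances = {}, {}
    for arc, n in count.items():
        means[arc] = [s / n for s in sum_x[arc]]
        # ML variance: E[x^2] - mean^2, per dimension.
        variances[arc] = [s2 / n - m * m
                          for s2, m in zip(sum_x2[arc], means[arc])]
    return means, variances
```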
How To Estimate Hidden Models? The EM algorithm ⇒ Forward-Backward (FB) algorithm for HMM's. Hill-climbing maximum-likelihood estimation. "Uphill Struggle" by Ewan Cross. Some rights reserved. 33 / 99
The EM Algorithm Expectation step. Using current model, compute posterior counts . . . the prob that each hidden event (e.g., taking a given arc) occurred at time t. Maximization step. Like non-hidden MLE, except . . . use fractional posterior counts instead of whole counts. Repeat. 34 / 99
E step: Calculating Posterior Counts e.g., posterior count $\gamma(a, t)$ of taking arc $a$ at time $t$.
$$\gamma(a, t) = \frac{P(\text{paths with arc } a \text{ at time } t)}{P(\text{all paths})}$$
$$= \frac{1}{P(x)} \times P(\text{paths from start to src}(a)) \times P(\text{arc } a \text{ at time } t) \times P(\text{paths from dst}(a) \text{ to end})$$
$$= \frac{1}{P(x)} \times \alpha(\text{src}(a), t-1) \times p_a \times P(\vec{x}_t \mid a) \times \beta(\text{dst}(a), t)$$
Do Forward algorithm: $\alpha(S, t)$, $P(x)$. Do Backward algorithm: $\beta(S, t)$. Read off posterior counts. 35 / 99
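A sketch of the E step: run Forward and Backward over the arcs and read off $\gamma(a, t)$ exactly as in the formula above. Probabilities are left unscaled for clarity (real systems work in log space or rescale per frame), and the arc-table representation (`arcs`: arc_id → (src, dst)) is an assumption; the final state is assumed reachable.

```python
def posterior_counts(frames, arcs, arc_prob, output_prob,
                     start_state, final_state):
    T = len(frames)
    states = {s for src, dst in arcs.values() for s in (src, dst)}

    # Forward: alpha[t][S] = prob of all partial paths reaching S after t frames.
    alpha = [dict.fromkeys(states, 0.0) for _ in range(T + 1)]
    alpha[0][start_state] = 1.0
    for t in range(T):
        for a, (src, dst) in arcs.items():
            alpha[t + 1][dst] += (alpha[t][src] * arc_prob[a]
                                  * output_prob(frames[t], a))

    # Backward: beta[t][S] = prob of completing the utterance from S at time t.
    beta = [dict.fromkeys(states, 0.0) for _ in range(T + 1)]
    beta[T][final_state] = 1.0
    for t in range(T - 1, -1, -1):
        for a, (src, dst) in arcs.items():
            beta[t][src] += (arc_prob[a] * output_prob(frames[t], a)
                             * beta[t + 1][dst])

    total = alpha[T][final_state]            # P(x)
    gamma = {}                               # gamma[(a, t)]
    for t in range(T):
        for a, (src, dst) in arcs.items():
            gamma[(a, t)] = (alpha[t][src] * arc_prob[a]
                             * output_prob(frames[t], a)
                             * beta[t + 1][dst]) / total
    return gamma, total
```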
M step: Non-Hidden ML Estimation Count and normalize. Same stats as non-hidden, except normalizer is fractional. e.g., arc prob $p_a$:
$$p_a = \frac{\text{count of } a}{\sum_{a': \text{src}(a') = \text{src}(a)} (\text{count of } a')} = \frac{\sum_t \gamma(a, t)}{\sum_{a': \text{src}(a') = \text{src}(a)} \sum_t \gamma(a', t)}$$
e.g., single Gaussian, mean $\mu_{a,d}$ for dim $d$:
$$\mu_{a,d} = (\text{mean weighted by } \gamma(a, t)) = \frac{\sum_t \gamma(a, t)\, x_{t,d}}{\sum_t \gamma(a, t)}$$
36 / 99
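And a matching sketch of the M step, consuming the `gamma` dictionary from the E-step sketch above and plugging the fractional counts into the count-and-normalize formulas for arc probs and single-Gaussian means. The data structures are the same assumptions as before.

```python
from collections import defaultdict

def reestimate(frames, arcs, gamma):
    # Arc probs: normalize fractional counts over arcs leaving the same source.
    arc_count = defaultdict(float)
    for (a, t), g in gamma.items():
        arc_count[a] += g
    src_total = defaultdict(float)
    for a, (src, dst) in arcs.items():
        src_total[src] += arc_count[a]
    arc_prob = {a: arc_count[a] / src_total[src]
                for a, (src, dst) in arcs.items() if src_total[src] > 0}

    # Single-Gaussian means: gamma-weighted average of the frames per arc.
    dim = len(frames[0])
    mean_num = defaultdict(lambda: [0.0] * dim)
    for (a, t), g in gamma.items():
        for d in range(dim):
            mean_num[a][d] += g * frames[t][d]
    means = {a: [num_d / arc_count[a] for num_d in mean_num[a]]
             for a in mean_num if arc_count[a] > 0}
    return arc_prob, means
```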
Where Are We? 1 Review from 10,000 Feet 2 The Model 3 Training 4 Decoding 5 Technical Details 37 / 99
What is Decoding? (answer) $= \arg\max_{\omega \in \text{vocab}} P_\omega(x^{\text{test}})$ 38 / 99
What Algorithm? (answer) $= \arg\max_{\omega \in \text{vocab}} P_\omega(x^{\text{test}})$ For each word $\omega$, how to compute $P_\omega(x^{\text{test}})$? Forward or Viterbi algorithm. 39 / 99
What Are We Trying To Compute?
$$P(x) = \sum_{\text{paths } A} P(x, A) = \sum_{\text{paths } A} \prod_{t=1}^{T} p_{a_t} \times P(\vec{x}_t \mid a_t) = \sum_{\text{paths } A} \prod_{t=1}^{T} (\text{arc cost})$$
40 / 99
Dynamic Programming Shortest path problem:
$$(\text{answer}) = \min_{\text{paths } A} \sum_{t=1}^{T_A} (\text{edge length})$$
Forward algorithm:
$$P(x) = \sum_{\text{paths } A} \prod_{t=1}^{T} (\text{arc cost})$$
Viterbi algorithm:
$$P(x) \approx \max_{\text{paths } A} \prod_{t=1}^{T} (\text{arc cost})$$
Any semiring will do. 41 / 99
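The "any semiring will do" point in code: Forward and Viterbi are the same recursion with a different way of combining incoming scores, sum versus max. The arc-table representation matches the E-step sketch and is an assumption, not any toolkit's API.

```python
def hmm_score(frames, arcs, arc_prob, output_prob,
              start_state, final_state, combine=sum):
    # combine=sum -> Forward algorithm (total prob over all paths)
    # combine=max -> Viterbi algorithm (prob of the single best path)
    T = len(frames)
    states = {s for src, dst in arcs.values() for s in (src, dst)}
    prev = {s: (1.0 if s == start_state else 0.0) for s in states}
    for t in range(T):
        # Collect each state's incoming contributions, then combine them.
        incoming = {s: [0.0] for s in states}
        for a, (src, dst) in arcs.items():
            incoming[dst].append(prev[src] * arc_prob[a]
                                 * output_prob(frames[t], a))
        prev = {s: combine(vals) for s, vals in incoming.items()}
    return prev[final_state]
```

Calling `hmm_score(...)` with the default gives the Forward (total) probability; passing `combine=max` gives the Viterbi (best-path) probability over the same graph.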
Scaling How does decoding time scale with vocab size? 42 / 99
The One Big HMM Paradigm: Before 43 / 99
The One Big HMM Paradigm: After [one big HMM: parallel branches for the words one, two, three, four, five, six, seven, eight, nine, zero] How does this help us? 44 / 99
Pruning What is time complexity of Forward/Viterbi? How many values $\alpha(S, t)$ to fill? Idea: only fill $k$ best cells at each frame. What is time complexity? How does this scale with vocab size? 45 / 99
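A sketch of beam-style pruning on top of Viterbi: at each frame keep only the $k$ highest-scoring cells, so per-frame work depends on $k$ rather than on the full state space. The value of $k$ is the tunable knob; the outgoing-arc graph layout is an assumption of the sketch.

```python
import heapq

def pruned_viterbi(frames, arcs_from, arc_prob, output_prob,
                   start_state, k=100):
    # arcs_from: dict src_state -> list of (arc_id, dst_state).
    active = {start_state: 1.0}              # surviving cells at time t
    for frame in frames:
        nxt = {}
        for state, score in active.items():
            for a, dst in arcs_from.get(state, []):
                cand = score * arc_prob[a] * output_prob(frame, a)
                if cand > nxt.get(dst, 0.0):
                    nxt[dst] = cand          # Viterbi: keep the max per state
        # Prune: keep only the k best cells for the next frame.
        if len(nxt) > k:
            nxt = dict(heapq.nlargest(k, nxt.items(), key=lambda kv: kv[1]))
        active = nxt
    return active                            # best scores of surviving states
```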
How Does This Change Decoding? Run Forward/Viterbi once, on one big HMM . . . instead of once for every word model. Same algorithm; different graph! [one big HMM: parallel branches for the digits one through zero] 46 / 99
Forward or Viterbi? What are we trying to compute? Total prob? Viterbi prob? Best word? [one big HMM: parallel branches for the digits one through zero] 47 / 99
Recovering the Word Identity [one big HMM: parallel branches for the digits one through zero] 48 / 99
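One way to recover the word, sketched under the assumptions that the word branches share a single start and final state, that each arc carries the label of the word it came from, and that the final state is reached: keep Viterbi backpointers and read the label off the traced-back best path.

```python
def best_word(frames, arcs_from, arc_prob, output_prob, arc_word,
              start_state, final_state):
    # cells[t]: state -> (best score after t frames, (prev_state, arc) backpointer)
    cells = [{start_state: (1.0, None)}]
    for frame in frames:
        nxt = {}
        for state, (score, _) in cells[-1].items():
            for a, dst in arcs_from.get(state, []):
                cand = score * arc_prob[a] * output_prob(frame, a)
                if cand > nxt.get(dst, (0.0, None))[0]:
                    nxt[dst] = (cand, (state, a))
        cells.append(nxt)
    # Trace back from the final state; in an isolated-word decoder every arc
    # on the path lies in one word's branch, so the first arc's label
    # already identifies the word.
    state, first_arc = final_state, None
    for t in range(len(frames), 0, -1):
        prev_state, arc = cells[t][state][1]
        first_arc, state = arc, prev_state
    return arc_word[first_arc]
```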
Where Are We? 1 Review from 10,000 Feet 2 The Model 3 Training 4 Decoding 5 Technical Details 49 / 99
Hyperparameters What is a hyperparameter? A tunable knob or something adjustable . . . that can’t be estimated with “normal” training. Can you name some? Number of states in each word HMM. HMM topology. Number of GMM components.