EE E6820: Speech & Audio Processing & Recognition
Lecture 10: ASR: Sequence Recognition
1 Signal template matching
2 Statistical sequence recognition
3 Acoustic modeling
4 The Hidden Markov Model (HMM)
Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
1 Signal template matching
• Framewise comparison of unknown word and stored templates:
[Figure: matrix of framewise distances between the test word (time/frames) and reference templates ONE, TWO, THREE, FOUR, FIVE]
- distance metric?
- comparison between templates?
- constraints?
Dynamic Time Warp (DTW)
• Find lowest-cost constrained path:
- matrix d(i,j) of distances between input frame f_i and reference frame r_j
- allowable predecessors & transition costs T_xy
- lowest cost to reach (i,j):
  D(i,j) = d(i,j) + min{ D(i-1,j) + T_10,  D(i,j-1) + T_01,  D(i-1,j-1) + T_11 }
[Figure: grid of input frames f_i vs. reference frames r_j; local match cost d(i,j) plus best predecessor cost (including transition cost) gives D(i,j)]
• Best path via traceback from final state
- have to store predecessors for (almost) every (i,j)
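The D(i,j) recurrence above maps directly to a short dynamic program. Below is a minimal Python/NumPy sketch; the Euclidean local distance and the particular transition costs T_10, T_01, T_11 are illustrative assumptions, not values fixed by the slides.

import numpy as np

def dtw(test, ref, T10=1.0, T01=1.0, T11=0.0):
    """Lowest-cost alignment of test frames (I x d) to reference frames (J x d).

    D(i,j) = d(i,j) + min(D(i-1,j)+T10, D(i,j-1)+T01, D(i-1,j-1)+T11)
    """
    I, J = len(test), len(ref)
    # local match cost: Euclidean distance between frames (one possible choice)
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=-1)
    D = np.full((I, J), np.inf)
    D[0, 0] = d[0, 0]
    back = np.zeros((I, J), dtype=int)          # store best predecessor for traceback
    moves = [(-1, 0, T10), (0, -1, T01), (-1, -1, T11)]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            costs = [D[i+di, j+dj] + t if i+di >= 0 and j+dj >= 0 else np.inf
                     for di, dj, t in moves]
            k = int(np.argmin(costs))
            back[i, j] = k
            D[i, j] = d[i, j] + costs[k]
    # traceback from the final state (i = I-1, j = J-1)
    path, i, j = [(I-1, J-1)], I-1, J-1
    while (i, j) != (0, 0):
        di, dj, _ = moves[back[i, j]]
        i, j = i + di, j + dj
        path.append((i, j))
    return D[-1, -1], path[::-1]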
DTW-based recognition
• Reference templates for each possible word
• Isolated word:
- mark endpoints of input word
- calculate scores through each template (+ prune)
- choose best
• Continuous speech:
- one matrix of template slices; special-case constraints at word ends
[Figure: concatenated reference templates ONE, TWO, THREE, FOUR along one axis vs. input frames along the other, with paths allowed to cross between word ends]
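Isolated-word recognition then reduces to running the DTW sketch above against each stored template and keeping the lowest-cost word; 'templates' here is a hypothetical dictionary mapping word labels to reference feature arrays.

def recognize_isolated(test_feats, templates):
    # templates: {word_label: reference feature array (J x d)}; test_feats: (I x d)
    scores = {word: dtw(test_feats, ref)[0] for word, ref in templates.items()}
    best = min(scores, key=scores.get)      # word with the lowest alignment cost
    return best, scores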
DTW-based recognition (2)
+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost
- Distance metric?
  - pseudo-Euclidean space?
- Warp penalties?
- How to choose templates?
  - several templates per word?
  - choose 'most representative'?
  - align and average?
→ need a rigorous foundation...
Outline
1 Signal template matching
2 Statistical sequence recognition
- state-based modeling
3 Acoustic modeling
4 The Hidden Markov Model (HMM)
2 Statistical sequence recognition
• DTW limited because it's hard to optimize
- interpretation of distance, transition costs?
• Need a theoretical foundation: Probability
• Formulate as MAP choice among models:
  M* = argmax_{M_j} p(M_j | X, Θ)
- X = observed features
- M_j = word-sequence models
- Θ = all current parameters
Statistical formulation (2)
• Can rearrange via Bayes' rule (& drop p(X)):
  M* = argmax_{M_j} p(M_j | X, Θ)
     = argmax_{M_j} p(X | M_j, Θ_A) · p(M_j | Θ_L)
- p(X | M_j, Θ_A) = likelihood of observations under model
- p(M_j | Θ_L) = prior probability of model
- Θ_A = acoustics-related model parameters
- Θ_L = language-related model parameters
• Questions:
- what form of model to use for p(X | M_j, Θ_A)?
- how to find Θ_A (training)?
- how to solve for M_j (decoding)?
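In log form the MAP rule is argmax_j [ log p(X | M_j, Θ_A) + log p(M_j | Θ_L) ]. A toy sketch, assuming acoustic_loglik and language_logprior are already available functions (they are not defined in the slides):

def map_decode(X, models, acoustic_loglik, language_logprior):
    """M* = argmax_j  log p(X | M_j, Theta_A) + log p(M_j | Theta_L)."""
    return max(models, key=lambda M: acoustic_loglik(X, M) + language_logprior(M))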
State-based modeling
• Assume discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:
[Figure: model M_j generates a state sequence Q_k = q_1 q_2 q_3 ... over time, which generates the observed feature vectors X_1^N = x_1 x_2 x_3 ...]
• Probability of observations given model is:
  p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)
- sum over all possible state sequences Q_k
• How do observations depend on states? How do state sequences depend on model?
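The sum over all state sequences can be written out literally for a toy model; this is exponentially expensive (the point of the HMM forward recursion later), but it makes the definition concrete. The transition matrix A, initial distribution pi, and per-frame observation likelihoods are assumed inputs.

import itertools
import numpy as np

def p_obs_given_model_bruteforce(obs_lik, A, pi):
    """p(X | M) = sum over all Q of p(X | Q, M) * p(Q | M).

    obs_lik: (N x S) array, obs_lik[n, i] = p(x_n | q_n = i)
    A:       (S x S) transition matrix;  pi: (S,) initial state probabilities
    """
    N, S = obs_lik.shape
    total = 0.0
    for Q in itertools.product(range(S), repeat=N):     # every state sequence
        p_Q = pi[Q[0]] * np.prod([A[Q[n-1], Q[n]] for n in range(1, N)])
        p_X_given_Q = np.prod([obs_lik[n, Q[n]] for n in range(N)])
        total += p_X_given_Q * p_Q
    return total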
The speech recognition chain
• After classification, still have the problem of classifying the sequences of frames:
  sound → feature calculation → feature vectors → acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling
• Questions:
- what to use for the acoustic classifier?
- how to represent 'model' sequences?
- how to score matches?
Outline
1 Signal template matching
2 Statistical sequence recognition
3 Acoustic modeling
- defining targets
- neural networks & Gaussian models
4 The Hidden Markov Model (HMM)
3 Acoustic Modeling
• Goal: convert features into probabilities of particular labels:
  i.e. find p(q_n = q_i | X_n) over some state set {q_i}
- conventional statistical classification problem
• Classifier construction is data-driven
- assume we can get examples of known good Xs for each of the q_i s
- calculate model parameters by standard training scheme
• Various classifiers can be used
- GMMs model the distribution under each state
- Neural nets directly estimate posteriors
• Different classifiers have different properties
- features, labels limit ultimate performance
Defining classifier targets
• Choice of {q_i} can make a big difference
- must support recognition task
- must be a practical classification task
• Hand-labeling is one source...
- 'experts' mark spectrogram boundaries
• ...Forced alignment is another
- 'best guess' with existing classifiers, given words
• Result is targets for each training frame:
[Figure: feature vectors over time with framewise training targets labeled by the phone sequence g w eh n]
Forced alignment
• Best labeling given existing classifier, constrained by known word sequence
[Figure: feature vectors → existing classifier → phone posterior probabilities; known word sequence + dictionary pronunciations (e.g. ow, th, r, iy) → constrained alignment → training targets → classifier training]
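One way to read the diagram as an algorithm: forced alignment is a constrained dynamic program over the classifier's framewise phone posteriors, where each frame may only stay in the current phone or advance to the next one in the known sequence. A sketch under those assumptions (the simple stay-or-advance topology and the array names are illustrative):

import numpy as np

def forced_align(log_post, phone_seq):
    """Best monotonic alignment of frames to a known phone sequence.

    log_post:  (N x P) log phone posteriors from the existing classifier
    phone_seq: list of phone indices (e.g. [ow, th, r, iy]) the frames must follow
    Returns one phone index per frame (the new training targets).
    """
    N, K = log_post.shape[0], len(phone_seq)
    score = np.full((N, K), -np.inf)
    back = np.zeros((N, K), dtype=int)
    score[0, 0] = log_post[0, phone_seq[0]]
    for n in range(1, N):
        for k in range(K):
            stay = score[n-1, k]
            advance = score[n-1, k-1] if k > 0 else -np.inf
            back[n, k] = 0 if stay >= advance else 1
            score[n, k] = log_post[n, phone_seq[k]] + max(stay, advance)
    # traceback: the alignment must end in the last phone
    k, targets = K - 1, []
    for n in range(N - 1, -1, -1):
        targets.append(phone_seq[k])
        if n > 0 and back[n, k] == 1:
            k -= 1
    return targets[::-1]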
Gaussian Mixture Models vs. Neural Nets
• GMMs fit distribution of features under states:
- separate 'likelihood' model for each state q_i:
  p(x | q_k) = 1 / ( (2π)^{d/2} |Σ_k|^{1/2} ) · exp( -½ (x - µ_k)^T Σ_k^{-1} (x - µ_k) )
- match any distribution given enough data
• Neural nets estimate posteriors directly:
  p(q_k | x) = F[ Σ_j w_jk · F[ Σ_i w_ij x_i ] ]
- parameters set to discriminate classes
• Posteriors & likelihoods related by Bayes' rule:
  p(q_k | x) = p(x | q_k) · Pr(q_k) / Σ_j p(x | q_j) · Pr(q_j)
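A sketch of the two quantities and the Bayes relation between them: the log-likelihood of a single-Gaussian state model, and the conversion of network posteriors into scaled likelihoods by dividing out the class priors (dropping p(x), which is constant across states). Variable names and shapes are illustrative.

import numpy as np

def gaussian_loglik(x, mu, Sigma):
    """log p(x | q_k) for one Gaussian state model (mean mu, covariance Sigma)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def posteriors_to_scaled_likelihoods(post, priors):
    """p(x | q_k) is proportional to p(q_k | x) / Pr(q_k)  (Bayes' rule, p(x) dropped)."""
    return post / priors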
Outline
1 Signal template matching
2 Statistical sequence recognition
3 Acoustic classification
4 The Hidden Markov Model (HMM)
- generative Markov models
- hidden Markov models
- model fit likelihood
- HMM examples
4 Markov models
• A (first order) Markov model is a finite-state system whose behavior depends only on the current state
• E.g. generative Markov model with states {S, A, B, C, E} and transition probabilities p(q_{n+1} | q_n):

             q_{n+1}:  S    A    B    C    E
   q_n = S             0    1    0    0    0
   q_n = A             0   .8   .1   .1    0
   q_n = B             0   .1   .8   .1    0
   q_n = C             0   .1   .1   .7   .1
   q_n = E             0    0    0    0    1

- example generated state sequence:
  S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
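Sampling from this transition matrix reproduces runs like the example sequence on the slide; a small sketch using the numbers above.

import numpy as np

states = ['S', 'A', 'B', 'C', 'E']
P = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],    # from S
              [0.0, 0.8, 0.1, 0.1, 0.0],    # from A
              [0.0, 0.1, 0.8, 0.1, 0.0],    # from B
              [0.0, 0.1, 0.1, 0.7, 0.1],    # from C
              [0.0, 0.0, 0.0, 0.0, 1.0]])   # from E (absorbing end state)

def sample_sequence(rng=np.random.default_rng()):
    seq, q = ['S'], 0
    while states[q] != 'E':
        q = rng.choice(len(states), p=P[q])  # draw next state from row of current state
        seq.append(states[q])
    return seq

# e.g. ['S', 'A', 'A', ..., 'B', 'B', ..., 'C', ..., 'E']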
Hidden Markov models
• Markov models where state sequence Q = {q_n} is not directly observable (= 'hidden')
• But, observations X do depend on Q:
- x_n is a random variable whose distribution p(x | q) depends on the current state
[Figure: state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC, the per-state emission distributions p(x | q) for q = A, B, C, and two observation sequences x_n drawn from them over time steps n]
- can still tell something about the state sequence...
(Generative) Markov models (2)
• HMM is specified by:
- states q_i
- transition probabilities a_ij ≡ p(q_n = j | q_{n-1} = i)
- initial state probabilities π_i ≡ p(q_1 = i)
- emission distributions b_i(x) ≡ p(x | q_i)
• E.g. a left-to-right model for 'kat' with states {•, k, a, t}:

  transition probabilities a_ij (rows: from •, k, a, t; columns: to k, a, t, •):
    •   1.0  0.0  0.0  0.0
    k   0.9  0.1  0.0  0.0
    a   0.0  0.9  0.1  0.0
    t   0.0  0.0  0.9  0.1

[Figure: emission distributions b_i(x) = p(x | q) for each state]
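Given a_ij, π_i and per-frame emission likelihoods b_i(x_n), p(X | M) can be computed without enumerating state sequences via the forward recursion (previewing the 'model fit likelihood' topic); a minimal sketch, assuming the emission likelihoods are precomputed.

import numpy as np

def forward(obs_lik, A, pi):
    """Forward algorithm: p(X | M) in O(N * S^2) instead of summing over all Q.

    obs_lik[n, i] = b_i(x_n),  A[i, j] = a_ij,  pi[i] = p(q_1 = i)
    """
    N, S = obs_lik.shape
    alpha = np.zeros((N, S))
    alpha[0] = pi * obs_lik[0]                    # alpha_1(i) = pi_i * b_i(x_1)
    for n in range(1, N):
        alpha[n] = (alpha[n-1] @ A) * obs_lik[n]  # sum over predecessors, then emit
    return alpha[-1].sum()

For short sequences this agrees, up to rounding, with the brute-force sum over state sequences sketched earlier.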
Markov models for speech
• Speech models M_j:
- typ. left-to-right HMMs (sequence constraint)
- observation & evolution are conditionally independent of the rest given the (hidden) state q_n
- self-loops for time dilation
[Figure: hidden state chain q_1 ... q_5 emitting observations x_1 ... x_5; a left-to-right phone model S → ae_1 → ae_2 → ae_3 → E with self-loops]
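A left-to-right model with self-loops corresponds to a banded transition matrix; a sketch that builds one, where the self-loop probability (and hence the expected state duration) is an assumed parameter.

import numpy as np

def left_to_right_A(n_states, p_stay=0.9):
    """Transition matrix with self-loops (time dilation) and forward-only moves."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = p_stay
        if i + 1 < n_states:
            A[i, i + 1] = 1.0 - p_stay
        else:
            A[i, i] = 1.0          # final state absorbs (or exits to the next model)
    return A

# e.g. a 3-state /ae/ model: left_to_right_A(3)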