EE E6820: Speech & Audio Processing & Recognition
Lecture 10: ASR: Sequence Recognition
1 Signal template matching
2 Statistical sequence recognition
3 Acoustic modeling
4 The Hidden Markov Model (HMM)
Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
1 Signal template matching
• Framewise comparison of unknown word and stored templates:
[Figure: matrix of framewise distances between the test word (time/frames) and reference templates ONE, TWO, THREE, FOUR, FIVE]
- distance metric?
- comparison between templates?
- constraints?
Dynamic Time Warp (DTW)
• Find lowest-cost constrained path:
- matrix d(i,j) of distances between input frame f_i and reference frame r_j
- allowable predecessors & transition costs T_xy
- lowest cost to reach (i,j):
  D(i,j) = d(i,j) + min{ D(i-1,j) + T_10,  D(i,j-1) + T_01,  D(i-1,j-1) + T_11 }
[Figure: grid of input frames f_i vs. reference frames r_j; local match cost d(i,j) plus best predecessor cost (including transition cost) gives D(i,j)]
• Best path via traceback from final state
- have to store predecessors for (almost) every (i,j)
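The D(i,j) recurrence above maps directly to a short dynamic program. Below is a minimal Python/NumPy sketch; the Euclidean local distance and the particular transition costs T_10, T_01, T_11 are illustrative assumptions, not values fixed by the slides.

import numpy as np

def dtw(test, ref, T10=1.0, T01=1.0, T11=0.0):
    """Lowest-cost alignment of test frames (I x d) to reference frames (J x d).

    D(i,j) = d(i,j) + min(D(i-1,j)+T10, D(i,j-1)+T01, D(i-1,j-1)+T11)
    """
    I, J = len(test), len(ref)
    # local match cost: Euclidean distance between frames (one possible choice)
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=-1)
    D = np.full((I, J), np.inf)
    D[0, 0] = d[0, 0]
    back = np.zeros((I, J), dtype=int)          # store best predecessor for traceback
    moves = [(-1, 0, T10), (0, -1, T01), (-1, -1, T11)]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            costs = [D[i+di, j+dj] + t if i+di >= 0 and j+dj >= 0 else np.inf
                     for di, dj, t in moves]
            k = int(np.argmin(costs))
            back[i, j] = k
            D[i, j] = d[i, j] + costs[k]
    # traceback from the final state (i = I-1, j = J-1)
    path, i, j = [(I-1, J-1)], I-1, J-1
    while (i, j) != (0, 0):
        di, dj, _ = moves[back[i, j]]
        i, j = i + di, j + dj
        path.append((i, j))
    return D[-1, -1], path[::-1]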
DTW-based recognition
• Reference templates for each possible word
• Isolated word:
- mark endpoints of input word
- calculate scores through each template (+ prune)
- choose best
• Continuous speech:
- one matrix of template slices; special-case constraints at word ends
[Figure: concatenated reference templates ONE, TWO, THREE, FOUR along one axis vs. input frames along the other, with paths allowed to cross between word ends]
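Isolated-word recognition then reduces to running the DTW sketch above against each stored template and keeping the lowest-cost word; 'templates' here is a hypothetical dictionary mapping word labels to reference feature arrays.

def recognize_isolated(test_feats, templates):
    # templates: {word_label: reference feature array (J x d)}; test_feats: (I x d)
    scores = {word: dtw(test_feats, ref)[0] for word, ref in templates.items()}
    best = min(scores, key=scores.get)      # word with the lowest alignment cost
    return best, scores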
DTW-based recognition (2)
+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost
- Distance metric?
  - pseudo-Euclidean space?
- Warp penalties?
- How to choose templates?
  - several templates per word?
  - choose 'most representative'?
  - align and average?
→ need a rigorous foundation...
Outline
1 Signal template matching
2 Statistical sequence recognition
- state-based modeling
3 Acoustic modeling
4 The Hidden Markov Model (HMM)
2 Statistical sequence recognition
• DTW limited because it's hard to optimize
- interpretation of distance, transition costs?
• Need a theoretical foundation: Probability
• Formulate as MAP choice among models:
  M* = argmax_{M_j} p(M_j | X, Θ)
- X = observed features
- M_j = word-sequence models
- Θ = all current parameters
Statistical formulation (2)
• Can rearrange via Bayes' rule (& drop p(X)):
  M* = argmax_{M_j} p(M_j | X, Θ)
     = argmax_{M_j} p(X | M_j, Θ_A) · p(M_j | Θ_L)
- p(X | M_j, Θ_A) = likelihood of observations under model
- p(M_j | Θ_L) = prior probability of model
- Θ_A = acoustics-related model parameters
- Θ_L = language-related model parameters
• Questions:
- what form of model to use for p(X | M_j, Θ_A)?
- how to find Θ_A (training)?
- how to solve for M_j (decoding)?
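In log form the MAP rule is argmax_j [ log p(X | M_j, Θ_A) + log p(M_j | Θ_L) ]. A toy sketch, assuming acoustic_loglik and language_logprior are already available functions (they are not defined in the slides):

def map_decode(X, models, acoustic_loglik, language_logprior):
    """M* = argmax_j  log p(X | M_j, Theta_A) + log p(M_j | Theta_L)."""
    return max(models, key=lambda M: acoustic_loglik(X, M) + language_logprior(M))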
State-based modeling
• Assume discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:
[Figure: model M_j generates a state sequence Q_k = q_1 q_2 q_3 ... over time, which generates the observed feature vectors X_1^N = x_1 x_2 x_3 ...]
• Probability of observations given model is:
  p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)
- sum over all possible state sequences Q_k
• How do observations depend on states? How do state sequences depend on model?
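The sum over all state sequences can be written out literally for a toy model; this is exponentially expensive (the point of the HMM forward recursion later), but it makes the definition concrete. The transition matrix A, initial distribution pi, and per-frame observation likelihoods are assumed inputs.

import itertools
import numpy as np

def p_obs_given_model_bruteforce(obs_lik, A, pi):
    """p(X | M) = sum over all Q of p(X | Q, M) * p(Q | M).

    obs_lik: (N x S) array, obs_lik[n, i] = p(x_n | q_n = i)
    A:       (S x S) transition matrix;  pi: (S,) initial state probabilities
    """
    N, S = obs_lik.shape
    total = 0.0
    for Q in itertools.product(range(S), repeat=N):     # every state sequence
        p_Q = pi[Q[0]] * np.prod([A[Q[n-1], Q[n]] for n in range(1, N)])
        p_X_given_Q = np.prod([obs_lik[n, Q[n]] for n in range(N)])
        total += p_X_given_Q * p_Q
    return total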
The speech recognition chain
• After classification, still have the problem of classifying the sequences of frames:
  sound → feature calculation → feature vectors → acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling
• Questions:
- what to use for the acoustic classifier?
- how to represent 'model' sequences?
- how to score matches?
Outline
1 Signal template matching
2 Statistical sequence recognition
3 Acoustic modeling
- defining targets
- neural networks & Gaussian models
4 The Hidden Markov Model (HMM)
3 Acoustic Modeling
• Goal: convert features into probabilities of particular labels:
  i.e. find p(q_n = q_i | X_n) over some state set {q_i}
- conventional statistical classification problem
• Classifier construction is data-driven
- assume we can get examples of known good Xs for each of the q_i s
- calculate model parameters by standard training scheme
• Various classifiers can be used
- GMMs model the distribution under each state
- Neural nets directly estimate posteriors
• Different classifiers have different properties
- features, labels limit ultimate performance
Defining classifier targets
• Choice of {q_i} can make a big difference
- must support recognition task
- must be a practical classification task
• Hand-labeling is one source...
- 'experts' mark spectrogram boundaries
• ...Forced alignment is another
- 'best guess' with existing classifiers, given words
• Result is targets for each training frame:
[Figure: feature vectors over time with framewise training targets labeled by the phone sequence g w eh n]
Forced alignment
• Best labeling given existing classifier, constrained by known word sequence
[Figure: feature vectors → existing classifier → phone posterior probabilities; known word sequence + dictionary pronunciations (e.g. ow, th, r, iy) → constrained alignment → training targets → classifier training]
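One way to read the diagram as an algorithm: forced alignment is a constrained dynamic program over the classifier's framewise phone posteriors, where each frame may only stay in the current phone or advance to the next one in the known sequence. A sketch under those assumptions (the simple stay-or-advance topology and the array names are illustrative):

import numpy as np

def forced_align(log_post, phone_seq):
    """Best monotonic alignment of frames to a known phone sequence.

    log_post:  (N x P) log phone posteriors from the existing classifier
    phone_seq: list of phone indices (e.g. [ow, th, r, iy]) the frames must follow
    Returns one phone index per frame (the new training targets).
    """
    N, K = log_post.shape[0], len(phone_seq)
    score = np.full((N, K), -np.inf)
    back = np.zeros((N, K), dtype=int)
    score[0, 0] = log_post[0, phone_seq[0]]
    for n in range(1, N):
        for k in range(K):
            stay = score[n-1, k]
            advance = score[n-1, k-1] if k > 0 else -np.inf
            back[n, k] = 0 if stay >= advance else 1
            score[n, k] = log_post[n, phone_seq[k]] + max(stay, advance)
    # traceback: the alignment must end in the last phone
    k, targets = K - 1, []
    for n in range(N - 1, -1, -1):
        targets.append(phone_seq[k])
        if n > 0 and back[n, k] == 1:
            k -= 1
    return targets[::-1]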
Gaussian Mixture Models vs. Neural Nets
• GMMs fit distribution of features under states:
- separate 'likelihood' model for each state q_i:
  p(x | q_k) = 1 / ( (2π)^{d/2} |Σ_k|^{1/2} ) · exp( -½ (x - µ_k)^T Σ_k^{-1} (x - µ_k) )
- match any distribution given enough data
• Neural nets estimate posteriors directly:
  p(q_k | x) = F[ Σ_j w_jk · F[ Σ_i w_ij x_i ] ]
- parameters set to discriminate classes
• Posteriors & likelihoods related by Bayes' rule:
  p(q_k | x) = p(x | q_k) · Pr(q_k) / Σ_j p(x | q_j) · Pr(q_j)
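A sketch of the two quantities and the Bayes relation between them: the log-likelihood of a single-Gaussian state model, and the conversion of network posteriors into scaled likelihoods by dividing out the class priors (dropping p(x), which is constant across states). Variable names and shapes are illustrative.

import numpy as np

def gaussian_loglik(x, mu, Sigma):
    """log p(x | q_k) for one Gaussian state model (mean mu, covariance Sigma)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def posteriors_to_scaled_likelihoods(post, priors):
    """p(x | q_k) is proportional to p(q_k | x) / Pr(q_k)  (Bayes' rule, p(x) dropped)."""
    return post / priors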
Outline
1 Signal template matching
2 Statistical sequence recognition
3 Acoustic classification
4 The Hidden Markov Model (HMM)
- generative Markov models
- hidden Markov models
- model fit likelihood
- HMM examples
4 Markov models
• A (first order) Markov model is a finite-state system whose behavior depends only on the current state
• E.g. generative Markov model with states {S, A, B, C, E} and transition probabilities p(q_{n+1} | q_n):

             q_{n+1}:  S    A    B    C    E
   q_n = S             0    1    0    0    0
   q_n = A             0   .8   .1   .1    0
   q_n = B             0   .1   .8   .1    0
   q_n = C             0   .1   .1   .7   .1
   q_n = E             0    0    0    0    1

- example generated state sequence:
  S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
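Sampling from this transition matrix reproduces runs like the example sequence on the slide; a small sketch using the numbers above.

import numpy as np

states = ['S', 'A', 'B', 'C', 'E']
P = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],    # from S
              [0.0, 0.8, 0.1, 0.1, 0.0],    # from A
              [0.0, 0.1, 0.8, 0.1, 0.0],    # from B
              [0.0, 0.1, 0.1, 0.7, 0.1],    # from C
              [0.0, 0.0, 0.0, 0.0, 1.0]])   # from E (absorbing end state)

def sample_sequence(rng=np.random.default_rng()):
    seq, q = ['S'], 0
    while states[q] != 'E':
        q = rng.choice(len(states), p=P[q])  # draw next state from row of current state
        seq.append(states[q])
    return seq

# e.g. ['S', 'A', 'A', ..., 'B', 'B', ..., 'C', ..., 'E']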
Hidden Markov models
• Markov models where state sequence Q = {q_n} is not directly observable (= 'hidden')
• But, observations X do depend on Q:
- x_n is a random variable whose distribution p(x | q) depends on the current state
[Figure: state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC, the per-state emission distributions p(x | q) for q = A, B, C, and two observation sequences x_n drawn from them over time steps n]
- can still tell something about the state sequence...
(Generative) Markov models (2)
• HMM is specified by:
- states q_i
- transition probabilities a_ij ≡ p(q_n = j | q_{n-1} = i)
- initial state probabilities π_i ≡ p(q_1 = i)
- emission distributions b_i(x) ≡ p(x | q_i)
• E.g. a left-to-right model for 'kat' with states {•, k, a, t}:

  transition probabilities a_ij (rows: from •, k, a, t; columns: to k, a, t, •):
    •   1.0  0.0  0.0  0.0
    k   0.9  0.1  0.0  0.0
    a   0.0  0.9  0.1  0.0
    t   0.0  0.0  0.9  0.1

[Figure: emission distributions b_i(x) = p(x | q) for each state]
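Given a_ij, π_i and per-frame emission likelihoods b_i(x_n), p(X | M) can be computed without enumerating state sequences via the forward recursion (previewing the 'model fit likelihood' topic); a minimal sketch, assuming the emission likelihoods are precomputed.

import numpy as np

def forward(obs_lik, A, pi):
    """Forward algorithm: p(X | M) in O(N * S^2) instead of summing over all Q.

    obs_lik[n, i] = b_i(x_n),  A[i, j] = a_ij,  pi[i] = p(q_1 = i)
    """
    N, S = obs_lik.shape
    alpha = np.zeros((N, S))
    alpha[0] = pi * obs_lik[0]                    # alpha_1(i) = pi_i * b_i(x_1)
    for n in range(1, N):
        alpha[n] = (alpha[n-1] @ A) * obs_lik[n]  # sum over predecessors, then emit
    return alpha[-1].sum()

For short sequences this agrees, up to rounding, with the brute-force sum over state sequences sketched earlier.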
Markov models for speech
• Speech models M_j:
- typ. left-to-right HMMs (sequence constraint)
- observation & evolution are conditionally independent of the rest given the (hidden) state q_n
- self-loops for time dilation
[Figure: hidden state chain q_1 ... q_5 emitting observations x_1 ... x_5; a left-to-right phone model S → ae_1 → ae_2 → ae_3 → E with self-loops]
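A left-to-right model with self-loops corresponds to a banded transition matrix; a sketch that builds one, where the self-loop probability (and hence the expected state duration) is an assumed parameter.

import numpy as np

def left_to_right_A(n_states, p_stay=0.9):
    """Transition matrix with self-loops (time dilation) and forward-only moves."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = p_stay
        if i + 1 < n_states:
            A[i, i + 1] = 1.0 - p_stay
        else:
            A[i, i] = 1.0          # final state absorbs (or exits to the next model)
    return A

# e.g. a 3-state /ae/ model: left_to_right_A(3)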