ELEN E6884/COMS 86884 Speech Recognition Lecture 7
Michael Picheny, Ellen Eide, Stanley F. Chen
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
{picheny,eeide,stanchen}@us.ibm.com
20 October 2005
■ main feedback from last lecture
■ Ellen will be away for a while
■ sample answers to Lab 1 posted
■ Lab 2 due Sunday midnight
■ Lab 3 out Monday?
■ weeks 1–4: small vocabulary ASR
■ weeks 5–8: large vocabulary ASR
■ weeks 9–13: advanced topics
■ fundamental equation of ASR: ω* = arg max_ω P(ω | x) = arg max_ω P(x | ω) P(ω)
■ P(x|ω) — acoustic model
■ P(ω) — language model
■ Unit I: you do not talk about Unit I
■ Unit II: acoustic model training for LVCSR
■ Unit III: decoding for LVCSR (inefficient)
■ Unit IV: finite-state transducers (FST's)
■ Unit V: search (lecture 8)
■ small model
■ simple training recipe
■ single Gaussians ⇒ Gaussian mixture models (GMM's)
■ isolated speech ⇒ continuous speech
■ word models ⇒ context-dependent (CD) phone models
■ 2500 Gaussian parameters ⇒ tens of millions of Gaussian parameters
■ flat start and Forward-Backward (FB) training?
■ front end from lab 1; take first two dimensions; 546 frames
■ initialize mean of each Gaussian to 0, variance to 1
■ what do you think will happen?
■ at the GMM level, symmetry is a bad idea.
■ picked 8 random starting points ⇒ 3 different optima found
■ training is not simple, even for simple models
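The effect of random starting points can be sketched in a few lines (a toy 1-D EM trainer written for this illustration, not the lab code): running the same EM recipe from several random seeds may converge to different final log-likelihoods.

```python
import math
import random

def gmm_em_1d(data, k=2, iters=50, seed=0):
    """EM for a 1-D, k-component Gaussian mixture; returns the final
    total log-likelihood (toy illustration only)."""
    rng = random.Random(seed)
    means = rng.sample(data, k)              # random starting point
    variances = [1.0] * k
    weights = [1.0 / k] * k

    def logpdf(x, m, v):
        return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

    ll = 0.0
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp, ll = [], 0.0
        for x in data:
            logs = [math.log(weights[j]) + logpdf(x, means[j], variances[j])
                    for j in range(k)]
            mx = max(logs)
            tot = mx + math.log(sum(math.exp(l - mx) for l in logs))
            ll += tot
            resp.append([math.exp(l - tot) for l in logs])
        # M-step: re-estimate weights, means, variances from soft counts
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, data)) / nj,
                               1e-2)          # variance flooring, as above
    return ll

gen = random.Random(1)
data = [gen.gauss(m, 0.5) for m in [0.0] * 50 + [5.0] * 50]
# several random starting points; different seeds may land in different optima
lls = sorted(round(gmm_em_1d(data, seed=s), 3) for s in range(8))
```

Comparing the values in `lls` shows whether distinct local optima were reached.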
■ example: GMM
■ example: HMM
■ finds “nearest” optimum to where you started
■ picking a good starting point is key
■ chicken-and-egg problem
(figure: likelihood as a function of the parameter values)
■ discuss training process in more depth
■ reveal strategies for finding ML parameter estimates for
■ in practice, training is a tortuous, multistage process
■ variance flooring
■ just as LM’s need to be smoothed or regularized
■ from word models; single Gaussians; isolated words . . .
■ to context-dependent phone models; GMM’s; continuous words
■ Phase 1: Collect underpants
■ Phase 2: Iterate over training data
■ Phase 3: Profit!
■ we believe there’s a huge local minimum in the “middle” of the parameter space
■ another perspective
■ how can we train more complex models . . .
■ start with a model simple enough that a flat start works
■ then, can we use this simple model . . .
■ if so, can iteratively build more and more complex models
■ case study: training mixtures of Gaussians
■ start with single Gaussian per mixture (trained)
■ split each Gaussian into two
■ repeat until reach desired number of mixture components (1, 2, 4, 8, . . . )
■ assumption: n-component Gaussian mixture gives good hints for initializing a 2n-component one
■ train single Gaussian
■ split each Gaussian in two (±0.2 × standard deviation)
■ train, yep
■ split each Gaussian in two (±0.2 × standard deviation)
■ train, yep
■ train model where each output distribution is a single Gaussian
■ split Gaussians in each output distribution simultaneously
■ train whole model with FB
■ repeat
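The splitting step can be sketched for a 1-D, diagonal-covariance mixture (a toy sketch: each component's weight is halved and its mean perturbed by ±0.2 standard deviations; the 0.2 factor follows the splitting recipe above):

```python
import math

def split_gaussians(mixture, eps=0.2):
    """Split each (weight, mean, variance) component into two,
    perturbing the mean by +/- eps standard deviations (1-D sketch)."""
    out = []
    for w, mean, var in mixture:
        sd = math.sqrt(var)
        out.append((w / 2, mean - eps * sd, var))
        out.append((w / 2, mean + eps * sd, var))
    return out

gmm = [(1.0, 0.0, 4.0)]        # start from a single trained Gaussian
gmm = split_gaussians(gmm)     # 1 -> 2 components; retrain with FB here
gmm = split_gaussians(gmm)     # 2 -> 4 components; retrain again
```

In a real recipe, FB re-estimation runs between successive splits so each doubled mixture is trained before being split again.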
■ instead of recursive divide-and-conquer method . . .
■ use clustering algorithm on data to find desired number of Gaussians
■ (discard Gaussians with insufficient counts)
■ select desired number of clusters k
■ choose k data points randomly as initial cluster centers
■ “assign” each data point to nearest cluster center
■ recompute each cluster center as the mean of the points assigned to it
■ repeat until convergence
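The steps above can be sketched in 1-D (a toy illustration, not the lab's implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 1-D points: assign each point to the nearest
    center, then recompute each center as the mean of its points."""
    centers = random.Random(seed).sample(points, k)   # k random data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                              # assignment step
            j = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[j].append(x)
        for j, c in enumerate(clusters):              # re-estimation step
            if c:                                     # skip empty clusters
                centers[j] = sum(c) / len(c)
    return sorted(centers)

pts = [0.1, 0.2, 0.0, 5.1, 4.9, 5.0]
centers = kmeans(pts, 2)
```

The returned centers would then seed the means of the Gaussians before FB training.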
■ pick random cluster centers; assign each point to nearest center
■ recompute cluster centers
■ assign each point to nearest center
■ repeat until convergence
■ use centers as means of Gaussians; train, yep
■ when using Euclidean distance to compute “nearest” center . . .
■ k-means clustering is equivalent to hard-assignment EM estimation of Gaussian means with fixed, identical variances
■ for each GMM/output distribution, use k-means clustering . . .
■ huh?
■ how can we compute this?
■ Viterbi algorithm
(figure: each frame aligned to an output distribution P1(x), . . . , P6(x))
■ need existing model to create alignment . . .
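The alignment step can be sketched with a toy Viterbi pass over a left-to-right HMM (hypothetical log-probabilities; transition scores are omitted for brevity):

```python
import math

def viterbi_align(log_obs, n_states):
    """Align frames to states of a left-to-right HMM (self-loop or
    advance by one). log_obs[t][s] = log P(frame t | state s).
    Returns the best state sequence."""
    T = len(log_obs)
    NEG = float("-inf")
    delta = [[NEG] * n_states for _ in range(T)]  # best score to (t, s)
    back = [[0] * n_states for _ in range(T)]     # best predecessor state
    delta[0][0] = log_obs[0][0]                   # must start in state 0
    for t in range(1, T):
        for s in range(n_states):
            cands = [(delta[t - 1][s], s)]        # self-loop
            if s > 0:
                cands.append((delta[t - 1][s - 1], s - 1))  # advance
            best, back[t][s] = max(cands)
            delta[t][s] = best + log_obs[t][s]
    path = [n_states - 1]                         # must end in last state
    for t in range(T - 1, 0, -1):                 # backtrace
        path.append(back[t][path[-1]])
    return path[::-1]

# toy example: frames 0-1 match state 0, frames 2-3 match state 1
lo, hi = math.log(0.1), math.log(0.9)
obs = [[hi, lo], [hi, lo], [lo, hi], [lo, hi]]
alignment = viterbi_align(obs, 2)
```

The resulting frame-to-state alignment is exactly what k-means-style seeding needs.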
■ hidden models have local minima galore!
■ smaller models can help seed larger models
■ heuristics have been developed that work OK
■ train single Gaussian models (flat start; many iterations of FB)
■ do mixture splitting, say
■ single Gaussians ⇒ Gaussian mixture models (GMM’s)
■ isolated speech ⇒ continuous speech
■ word models ⇒ context-independent (CI) phone models
■ CI phone models ⇒ context-dependent (CD) phone models
■ isolated speech with word models
■ continuous speech
■ what to do?
■ do forced alignment
■ snip each utterance into individual words
■ what are possible issues with this approach?
■ instead of snipping: train directly on the concatenated word HMM for the whole utterance
■ does this make sense?
■ To do one iteration of FB, which strategy is faster?
■ Which strategy is less prone to local minima?
■ in practice, both styles of strategies are used
■ reference transcripts are created by humans . . .
■ typical transcripts don’t contain everything an ASR system needs
■ how can we correctly construct the HMM for an utterance?
■ that is, the human-produced transcript is incomplete
■ Viterbi decoding!
~SIL(01) THE(01) THE(02) ~SIL(01) DOG(01) DOG(02) DOG(03) ~SIL(01) ~SIL(01) THE(01) DOG(02) ~SIL(01)
■ train initial model without silence; single pronunciation per word
■ use HMM containing all alternatives directly in training (e.g., Lab 2)
~SIL(01) THE(01) THE(02) ~SIL(01) DOG(01) DOG(02) DOG(03) ~SIL(01)
■ train an initial GMM system (Lab 2 stopped here)
■ use initial system to refine reference transcripts
■ do more FB on initial system or retrain from scratch
■ single Gaussians ⇒ Gaussian mixture models (GMM’s)
■ isolated speech ⇒ continuous speech
■ word models ⇒ context-independent (CI) phone models
■ CI phone models ⇒ context-dependent (CD) phone models
■ reference transcript
THE DOG
■ replace each word with its HMM
THE1 THE2 THE3 THE4 DOG1 DOG2 DOG3 DOG4 DOG5 DOG6
■ reference transcript
THE DOG
■ pronunciation dictionary
DH AH D AO G
■ replace each phone with its HMM
DH1 DH2 AH1 AH2 D1 D2 AO1 AO2 G1 G2
■ need pronunciation of every word in training data
■ how the HMM for each training utterance is created
■ build pronunciation dictionary for all words in training set
■ train an initial GMM system
■ use initial system to refine reference transcripts
■ do more FB on initial system or retrain from scratch
■ single Gaussians ⇒ Gaussian mixture models (GMM’s)
■ isolated speech ⇒ continuous speech
■ word models ⇒ context-independent (CI) phone models
■ CI phone models ⇒ context-dependent (CD) phone models
■ context-independent phone models
■ context-dependent models
■ not one decision tree per phoneme, but one per phoneme state
■ terminology
■ 39-dimensional feature vectors ⇒ ∼80 parameters/Gaussian
■ big models can have tens of millions of parameters
■ in a CI model, consider the GMM for a state, e.g., AH1
■ context-dependent modeling assumes . . .
■ what do we mean by better model?
■ how do we build this better model?
■ what do we mean by better model?
■ on what data?
■ how do we find this data?
■ forced alignment/Viterbi decoding!
■ where do we get the model to align with?
DH1 DH2 AH1 AH2 D1 D2 AO1 AO2 G1 G2
■ build decision tree for AH1 to optimize likelihood of acoustic data
■ the CD probability distribution: p(x | AH1, phonetic context)
■ one GMM per phone state
■ one GMM per phone state and triphone context (∼ 50 × 50)
■ cluster triphone contexts using decision tree
■ how can we seed the context-dependent GMM parameters?
■ once you have a good model for a domain
THE DOG DH AH D AO G
DH1 DH2 AH1 AH2 D1 D2 AO1 AO2 G1 G2
DH1,3 DH2,7 AH1,2 AH2,4 D1,3 D2,9 AO1,1 AO2,1 G1,2 G2,7
■ build CI model using previous recipe
■ use CI model to align training data
■ use CI model to seed CD model
■ train CD model using FB
■ adaptation (VTLN, fMLLR, mMLLR)
■ discriminative training (LDA, MMI, MPE, fMPE)
■ model combination (cross adaptation, ROVER)
■ iteration
(figure/table: multi-pass Switchboard evaluation system — MFCC and PLP front ends, VTLN, ML- and MMI-trained SAT and adapted models, 4-gram and 100-best rescoring, consensus processing, and ROVER combination; word-error rates reported per stage on Eval’98 (SWB only) and Eval’01)
■ take-home messages
■ the good news is . . .
■ now that we know how to build models for LVCSR . . .
■ how can we use them for decoding?
■ take graph/FSA representing the language model
(figure: small LM graph over the words LIKE, UH)
■ expand to underlying HMM
(figure: same graph with each word replaced by its HMM)
■ run the Viterbi algorithm!
■ Issue 1: Can we express an n-gram model as an FSA?
(figure: bigram FSA — one state per history h=w1, h=w2, with arcs w/P(w|h) between histories)
(figure: trigram FSA — one state per history pair h=w1,w1, . . . , h=w2,w2, with arcs w/P(w|h))
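The bigram construction can be sketched programmatically (a toy illustration with hypothetical probabilities, not production tooling): one state per one-word history, and an arc h → w labeled w/P(w|h).

```python
def bigram_fsa(vocab, P):
    """Build the arcs of a bigram FSA: one state per one-word history h,
    with an arc h --w/P(w|h)--> state w for every word w.
    P maps (w, h) -> P(w|h); arcs are (src, dst, label, prob) tuples."""
    arcs = []
    for h in vocab:
        for w in vocab:
            arcs.append((h, w, w, P[(w, h)]))
    return arcs

# hypothetical two-word vocabulary and conditional probabilities
vocab = ["w1", "w2"]
P = {("w1", "w1"): 0.7, ("w2", "w1"): 0.3,
     ("w1", "w2"): 0.4, ("w2", "w2"): 0.6}
arcs = bigram_fsa(vocab, P)
```

The product of arc probabilities along any path is the bigram LM probability of the words along that path; for an n-gram model the same construction gives |V|^(n-1) states and |V|^n arcs.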
■ probability assigned to a path is the LM probability of the words along the path
■ do bigram example on board
■ how many states in the FSA representing an n-gram model . . .
■ how many arcs?
■ word models
■ CI phone models
(figure: bigram FSA with states h=LIKE, h=UH and arcs LIKE/P(LIKE|·), UH/P(UH|·))
(figure: CI phone graph over the phones DH, AH, D, AO, G)
■ how can we do context-dependent expansion?
■ example of triphone expansion
(figure: triphone expansion — G_D_AO, D_AO_G, AO_G_D, AO_G_DH, G_DH_AH, DH_AH_DH, DH_AH_D, AH_DH_AH, AH_D_AO)
■ other tricky cases
■ word-internal models
■ in graph expansion, just replace each word with its HMM
(figure: word-internal expansion of the LIKE/UH graph)
■ is there some elegant theoretical framework . . . ■ that makes it easy to do this type of expansion . . . ■ and also makes it easy to do lots of other graph operations
■ ⇒ finite-state transducers (FST’s)! (Unit IV)
■ can do same thing we do for small vocabulary decoding
■ how to do the graph expansion? FST’s (Unit IV) ■ how to make decoding efficient? search (Unit V)
■ FST’s closely related to finite-state automata (FSA)
■ natural technology for graph expansion . . .
■ FST’s for ASR pioneered by AT&T in late 1990’s
■ it has states
■ it has arcs
■ ignore probabilities for now
(figure: example FSA with states 1–3 and arcs labeled a, b, c, &lt;epsilon&gt;)
■ What are the differences between the following:
■ Can they express the same class of models?
■ it’s like a finite-state acceptor, except . . . ■ each arc has two labels instead of one
(figure: example FST with arcs a:&lt;epsilon&gt;, c:c, b:a, a:a, &lt;epsilon&gt;:b)
■ finite-state acceptor (FSA): one label on each arc ■ finite-state transducer (FST): input and output label on each arc ■ finite-state machine (FSM): FSA or FST
■ incidentally, an FSA can act like an FST (each label serves as both input and output)
■ perspective: rewriting/transforming token sequences
(figure: FSA accepting “a b d”, composed with an FST rewriting a:A, b:B, d:D, yields an FSA accepting “A B D”)
(figure: the same result with a one-state FST holding self-loops a:A, b:B, c:C, d:D)
(figure: composing a larger, branching FSA with the same one-state FST capitalizes the labels along every path)
■ for every complete path (from initial to final state) in A with label sequence i . . .
■ and for every complete path in T with input sequence i and output sequence o . . .
■ there is a complete path in A ◦ T with label sequence o
■ we will discuss how to construct A ◦ T shortly
■ example 1: optional silence insertion in training graphs
(figure: a word FSA composed with a one-state FST that passes words through and optionally inserts ~SIL yields the same graph with optional ~SIL between words)
(figure: the FSA for THE DOG composed with a pronunciation FST — THE:DH AH or DH IY, DOG:D AO G — yields the phone graph DH {AH|IY} D AO G)
(figure: the phone FSA for D AO G composed with an FST expanding each phone into its two-state HMM with self-loops yields the HMM-state graph D1 D2 AO1 AO2 G1 G2)
■ for now, pretend no ǫ-labels
■ for every state s ∈ A, t ∈ T, create state (s, t) ∈ A ◦ T
■ create arc from (s1, t1) to (s2, t2) with label o iff A has an arc from s1 to s2 with label i and T has an arc from t1 to t2 with input i and output o
■ (s, t) is initial iff s and t are initial; similarly for final states
■ (remove arcs and states that cannot reach both an initial and a final state)
■ efficient
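The ǫ-free construction can be sketched directly (a toy illustration; the trimming of states that cannot reach both an initial and a final state is left out, and initial/final bookkeeping is implicit):

```python
def compose(fsa, fst):
    """Epsilon-free composition A o T.
    fsa: list of (src, dst, label) arcs of the acceptor A;
    fst: list of (src, dst, in_label, out_label) arcs of the transducer T.
    Returns arcs of A o T over pair states (s, t)."""
    arcs = []
    for (s1, s2, i) in fsa:
        for (t1, t2, t_in, t_out) in fst:
            if i == t_in:          # arc labels must match to pair up
                arcs.append(((s1, t1), (s2, t2), t_out))
    return arcs

# toy example: A accepts "a b"; T rewrites a -> A and b -> B
A = [(1, 2, "a"), (2, 3, "b")]
T = [(1, 1, "a", "A"), (1, 1, "b", "B")]
result = compose(A, T)
```

Practical implementations build states outward from the initial state instead of enumerating all pairs, which is what makes composition efficient.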
(figure: composing the FSA for “a b” with the FST a:A, b:B — of the nine candidate pair states, only (1,1), (2,2), (3,3) survive, giving an FSA accepting “A B”)
■ optimization: start from initial state, build outward
(figure: a second composition example with branching labels; the surviving pair states and arcs are traced out the same way)
■ basic idea: can take ǫ-transition in one FSM without moving in the other FSM
(figure: composing an FSA containing an &lt;epsilon&gt;-arc with an FST — the ǫ-transition advances one machine while the other stays put)
■ step 1: rewrite each phone as a triphone
■ step 2: rewrite each triphone with the correct context-dependent model
(figure: an FSA over the phones x, y composed with a transducer mapping each phone to its triphone — e.g., x:x_x_y — yields an FSA over triphones such as x_x_y, x_y_y, y_y_x)
■ point: composition automatically expands the FSA to correctly handle context-dependent models
■ graph expansion can be expressed as a series of composition operations
■ composition is efficient
■ context-dependent expansion can be handled effortlessly
■ e.g., to hold language model probs, transition probs, etc.
■ FSM’s ⇒ weighted FSM’s
■ each arc has a score or cost
(figure: weighted FSA — arcs and final states carry costs, e.g., a/0.3, &lt;epsilon&gt;/0.6, final state 2/1)
■ typically, we take costs to be negative log probabilities
(figure: two equivalent weighted FSA’s — costs can be shifted along a path without changing the total path cost)
■ if two paths have the same labels, they can be combined into one, keeping the lower cost
(figure: arcs a/1 and a/2 between the same states collapse to the single arc a/1)
■ operations (+, min) form a semiring (the tropical semiring)
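The cost arithmetic can be sketched in a few lines (a toy illustration of the tropical-semiring operations, not ASR code):

```python
import math

def cost(p):
    """Cost of an arc with probability p: the negative log probability."""
    return -math.log(p)

# concatenating arcs along a path: probabilities multiply, so costs add
path = cost(0.5) + cost(0.2)      # same as the cost of probability 0.1

# combining two paths with the same labels: keep the cheaper one (min)
best = min(path, cost(0.3))
```

Addition of costs plays the role of semiring multiplication, and min plays the role of semiring addition.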
■ FSM’s are equivalent if they assign the same costs to the same label sequences
(figure: pairs of weighted FSM’s that are equivalent even though the weights are distributed differently)
(figure: weighted composition — each arc cost in A ◦ T is the sum of the corresponding arc costs in A and T)
■ start with weighted FSA representing language model
■ use composition to apply FST for each level of expansion
■ WFSA’s and WFST’s can represent many important structures
■ composition can do lots of useful things, including . . .
■ Unit I: you do not talk about Unit I
■ Unit II: acoustic model training for LVCSR
■ Unit III: decoding for LVCSR (inefficient)
■ Unit IV: finite-state transducers (FST’s)