

SLIDE 1

ELEN E6884/COMS 86884 Speech Recognition Lecture 7

Michael Picheny, Ellen Eide, Stanley F. Chen IBM T.J. Watson Research Center Yorktown Heights, NY, USA {picheny,eeide,stanchen}@us.ibm.com 20 October 2005


ELEN E6884: Speech Recognition

SLIDE 2

Administrivia

■ main feedback from last lecture

  • everything was pretty clear (eventually)

■ Ellen will be away for a while

  • preparing for season 4 tryouts of Nashville Star

■ sample answers to Lab 1 posted

  • in same directory you got Lab 1 files from

■ Lab 2 due Sunday midnight
■ Lab 3 out Monday?

SLIDE 3

The Big Picture

■ weeks 1–4: small vocabulary ASR
■ weeks 5–8: large vocabulary ASR

  • week 5: language modeling (for large vocabularies)
  • week 6: pronunciation modeling — acoustic modeling for large vocabularies
  • weeks 7, 8: training, decoding for large vocabularies

■ weeks 9–13: advanced topics

SLIDE 4

The Fundamental Equation of Speech Recognition

class(x) = arg max_ω P(ω|x) = arg max_ω P(ω)P(x|ω) / P(x) = arg max_ω P(ω)P(x|ω)

■ P(x|ω) — acoustic model
■ P(ω) — language model

SLIDE 5

Outline

■ Unit I: you do not talk about Unit I
■ Unit II: acoustic model training for LVCSR
■ Unit III: decoding for LVCSR (inefficient)
■ Unit IV: introduction to finite-state transducers
■ Unit V: search (lecture 8)
  • making decoding for LVCSR efficient

SLIDE 6

Unit II: Acoustic Model Training for LVCSR

Small vocabulary training — Lab 2

■ small model

  • 102 HMM states spread over 11 word models
  • 102 12-dimensional Gaussians
  • ⇒ 102 × 12 × 2 = 2448 parameters

■ simple training recipe

  • flat start: mean 0, variance 1
  • run a bunch of iterations of Forward-Backward training
  • done!

SLIDE 7

Acoustic Model Training

What happens when we train more complex acoustic models?

■ single Gaussians ⇒ Gaussian mixture models (GMM’s)
■ isolated speech ⇒ continuous speech
■ word models ⇒ context-dependent (CD) phone models
■ 2500 Gaussian parameters ⇒ tens of millions of Gaussian parameters
■ flat start and FB?

SLIDE 8

Case Study: Training a Mixture of Two 2-D Gaussians

The Data: real live acoustic features

■ front end from lab 1; take first two dimensions; 546 frames

[figure: scatter plot of the 546 2-D feature vectors]

SLIDE 9

Training a Mixture of Two 2-D Gaussians

Flat start?

■ initialize mean of each Gaussian to 0, variance to 1 ■ what do you think will happen?

[figure: the data, with both Gaussians initialized at the origin]

SLIDE 10

Training a Mixture of Two 2-D Gaussians

“At the Mr. O level, symmetry is everything.”

■ at the GMM level, symmetry is a bad idea.

[figure: the symmetric fit found from the flat start]

SLIDE 11

Training a Mixture of Two 2-D Gaussians

Random seeding?

■ picked 8 random starting points ⇒ 3 different optima found
■ training is not simple even for simple models

[figure: fits found from different random starting points]

SLIDE 12

Training Hidden Models

Maximum likelihood (MLE) training of models with hidden variables has local minima

■ example: GMM
  • hidden quantity: for each feature vector in the training data, which Gaussian in the mixture generated it
■ example: HMM
  • hidden quantity: the alignment, i.e., for each frame in the training data, which arc generated it

SLIDE 13

Gradient Descent and Local Minima

FB training does hill-climbing/gradient descent

■ finds “nearest” optimum to where you started
■ picking a good starting point is key
■ chicken-and-egg problem

[figure: likelihood as a function of parameter values, with multiple local optima]

SLIDE 14

Secrets of Acoustic Model Training, Uncovered!

Unit overview

■ discuss training process in more depth
■ reveal strategies for finding ML parameter estimates for complex models
  • discovered via sweat and tears
  • not true ML estimates, but as close as we can get
  • art, not science
■ in practice, training is a tortuous multistage process
  • use simpler models to bootstrap more complex models

SLIDE 15

Aside: Not Truly Maximum Likelihood

■ variance flooring

  • don’t let variances go to 0 ⇒ infinite likelihood

■ just as LM’s need to be smoothed or regularized

  • so do acoustic models
  • penalize undesirable parameter values/unsmooth models
  • variance flooring is a poor man’s regularization

SLIDE 16

Baby Steps

Let’s start simple and consider more complex models in turn

■ from word models; single Gaussians; isolated words . . .
■ to context-dependent phone models; GMM’s; continuous words

SLIDE 17

Single Gaussian Word Models, Isolated Word

■ Phase 1: Collect underpants

  • initialize all Gaussian means to 0, variances to 1

■ Phase 2: Iterate over training data

  • for each word, train associated word HMM . . .
  • on all samples of that word in the training data . . .
  • using the Forward-Backward algorithm

■ Phase 3: Profit!

SLIDE 18

Single Gaussian Word Models, Isolated Word

Why does this work?

■ we believe there’s a huge local minimum in the “middle” of the parameter search space
  • with a neutral starting point, we’re apt to fall into it
  • (who knows if this is actually true)

■ another perspective

  • the model doesn’t have enough freedom to screw up
  • the only way it can achieve a good likelihood is . . .
  • if the Gaussians for a particular phone (e.g., AH) . . .
  • actually model the acoustic realizations of that phone

SLIDE 19

Bootstrapping Big Models From Small Models

■ how can we train more complex models . . .
  • where a flat/random start will almost certainly do poorly?
■ start with a model simple enough that a flat start works
■ then, can we use this simple model . . .
  • to give us hints on how to seed the parameters of a larger model?
■ if so, we can iteratively build more and more complex models
■ case study: training mixtures of Gaussians
  • recursive mixture splitting
  • k-means clustering

SLIDE 20

Gaussian Mixture Splitting

■ start with a single (trained) Gaussian per mixture
■ split each Gaussian into two
  • perturb means in opposite directions; same variance
  • train
■ repeat until desired number of mixture components reached (1, 2, 4, 8, . . . )
  • (discard Gaussians with insufficient counts)
■ assumption: an n-component Gaussian mixture gives good hints on how to seed a 2n-component Gaussian mixture
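A minimal sketch of the splitting step for diagonal-covariance Gaussians (the ±0.2σ perturbation matches the examples on the following slides; the function name is illustrative):

```python
import numpy as np

def split_gaussians(means, variances, eps=0.2):
    """Split each diagonal-covariance Gaussian into two: perturb the
    mean by +/- eps * sigma in each dimension, copy the variance.
    (Mixture weights, halved on splitting, are omitted for brevity.)"""
    sigma = np.sqrt(variances)
    new_means = np.concatenate([means + eps * sigma, means - eps * sigma])
    new_vars = np.concatenate([variances, variances])
    return new_means, new_vars

# one 2-D Gaussian becomes two, perturbed along each axis
m, v = split_gaussians(np.array([[0.0, 0.0]]), np.array([[1.0, 4.0]]))
```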

SLIDE 21

Mixture Splitting Example

■ train single Gaussian

[figure: single Gaussian fit to the data]

SLIDE 22

Mixture Splitting Example

■ split each Gaussian in two (±0.2 × σ)

[figure: the two Gaussians after splitting]

SLIDE 23

Mixture Splitting Example

■ train, yep

[figure: the mixture after retraining]

SLIDE 24

Mixture Splitting Example

■ split each Gaussian in two (±0.2 × σ)

[figure: the four Gaussians after splitting]

SLIDE 25

Mixture Splitting Example

■ train, yep

[figure: the mixture after retraining]

SLIDE 26

Using Mixture Splitting in Acoustic Model Training

■ train model where each output distribution is a single Gaussian (à la Lab 2)
■ split the Gaussians in each output distribution simultaneously
■ train whole model with FB
■ repeat

SLIDE 27

Another Seeding Method: Use Automatic Clustering

■ instead of the recursive divide-and-conquer method . . .
■ use a clustering algorithm on the data to find the desired number of cluster centers all at once

  • use cluster centers to seed Gaussian means
  • initialize variances to constant

■ (discard Gaussians with insufficient counts)

SLIDE 28

k-Means Clustering

Simple and effective clustering algorithm

■ select desired number of clusters k
■ choose k data points randomly
  • use these as initial cluster centers
■ “assign” each data point to nearest cluster center
■ recompute each cluster center as . . .
  • mean of data points “assigned” to it

■ repeat until convergence
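The loop above can be sketched in a few lines of NumPy (the data, the explicit initial centers, and the iteration cap are illustrative):

```python
import numpy as np

def kmeans(data, k, n_iter=50, init=None, seed=0):
    """Plain k-means: pick k data points at random as initial centers
    (or use `init`), then alternate assigning each point to the nearest
    center and recomputing each center as the mean of its points."""
    rng = np.random.default_rng(seed)
    if init is None:
        init = data[rng.choice(len(data), size=k, replace=False)]
    centers = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Euclidean distance from every point to every center
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new_centers = np.array([
            data[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, assign

# two tight blobs around (0,0) and (10,10)
pts = np.concatenate([np.zeros((20, 2)), np.full((20, 2), 10.0)])
centers, assign = kmeans(pts, 2, init=[[1.0, 1.0], [9.0, 9.0]])
```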

SLIDE 29

k-Means Example

■ pick random cluster centers; assign each point to nearest center

[figure: data with random initial cluster centers and initial assignments]

SLIDE 30

k-Means Example

■ recompute cluster centers

[figure: recomputed cluster centers]

SLIDE 31

k-Means Example

■ assign each point to nearest center

[figure: points reassigned to the nearest center]

SLIDE 32

k-Means Example

■ repeat until convergence

[figure: the converged clustering]

SLIDE 33

k-Means Example

■ use centers as means of Gaussians; train, yep

[figure: Gaussians trained from the k-means centers]

SLIDE 34

The Final Mixtures, Splitting vs. k-Means

[figures: the final mixtures, mixture splitting vs. k-means seeding]

SLIDE 35

Technical Aside: k-Means Clustering

■ when using Euclidean distance to compute “nearest” center . . .
■ k-means clustering is equivalent to . . .
  • seeding k-component GMM means with the k initial centers
  • doing a “hard” GMM update
  • instead of assigning the true posterior to each Gaussian in the update . . .
  • assign a “posterior” of 1 to the most likely Gaussian and 0 to the others
  • keeping variances constant

SLIDE 36

Using k-Means Clustering in Acoustic Model Training

■ for each GMM/output distribution, use k-means clustering . . .
  • on acoustic feature vectors “associated” with that GMM . . .
  • to seed means of that GMM
■ huh?
  • how to decide which frames belong to which GMM?
  • we are told which word (HMM) belongs to each training utterance
  • but we aren’t told which HMM arc (output distribution) belongs to each frame
■ how can we compute this?

SLIDE 37

Forced Alignment

■ Viterbi algorithm

  • given acoustic model, finds most likely alignment of HMM to

data

  • not perfect, but what can you do?

[figure: left-to-right HMM with output distributions P1(x) . . . P6(x)]

  frame:  1    2    3    4    5    6    7    8    9    10   11   12
  arc:    P1   P1   P1   P2   P3   P4   P4   P5   P5   P5   P6   P6

■ need existing model to create alignment . . .

  • for seeding means for GMM’s in new model
  • use best existing model you have available!
  • alignment will only be as good as model
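A minimal sketch of Viterbi alignment for a strictly left-to-right HMM, assuming fixed self-loop/advance transition probabilities of 0.5 and precomputed per-state log output likelihoods (all names and numbers are illustrative):

```python
import math

def forced_align(log_obs, loop=math.log(0.5), step=math.log(0.5)):
    """Viterbi alignment of frames to a strictly left-to-right HMM:
    start in state 0, either stay (self-loop) or advance one state per
    frame, end in the last state.  log_obs[t][s] is the log-likelihood
    of frame t under state s's output distribution."""
    T, S = len(log_obs), len(log_obs[0])
    NEG = float("-inf")
    delta = [[NEG] * S for _ in range(T)]   # best log-prob ending in s at t
    back = [[0] * S for _ in range(T)]      # backpointers
    delta[0][0] = log_obs[0][0]
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1][s] + loop
            move = delta[t - 1][s - 1] + step if s > 0 else NEG
            if stay >= move:
                delta[t][s], back[t][s] = stay + log_obs[t][s], s
            else:
                delta[t][s], back[t][s] = move + log_obs[t][s], s - 1
    path, s = [S - 1], S - 1                # backtrace from the final state
    for t in range(T - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]

# frames 0-1 look like state 0, frames 2-3 like state 1
align = forced_align([[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]])
```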

SLIDE 38

Lessons: Training GMM’s

■ hidden models have local minima galore!
■ smaller models can help seed larger models
  • mixture splitting: use an n-component GMM to seed a 2n-component GMM
  • k-means: use an existing model to provide the GMM⇔frame alignment
■ heuristics have been developed that work OK
  • mixture splitting and k-means are comparable
  • but no one believes these find global optima, even for relatively small problems
  • these are not the last word!

SLIDE 39

Single Gaussians ⇒ GMM’s

The training recipe so far

■ train single Gaussian models (flat start; many iterations of FB)
■ do mixture splitting, say
  • split each Gaussian in two; many iterations of FB
  • repeat until desired number of Gaussians per mixture

SLIDE 40

Unit II: Acoustic Model Training for LVCSR

What’s next?

■ single Gaussians ⇒ Gaussian mixture models (GMM’s)
■ isolated speech ⇒ continuous speech
■ word models ⇒ context-independent (CI) phone models
■ CI phone models ⇒ context-dependent (CD) phone models

SLIDE 41

From Isolated to Continuous Speech

■ isolated speech with word models

  • train each word HMM using only instances of that word

■ continuous speech

  • don’t have instances of individual words nicely separated out
  • don’t know when each word begins and ends in an utterance

■ what to do?

SLIDE 42

From Isolated to Continuous Speech

Strategy A (Viterbi-style training)

■ do forced alignment

  • for each training utterance, build HMM by . . .
  • concatenating word HMM’s for words in reference transcript
  • do Viterbi algorithm; recover best alignment
  • see board

■ snip each utterance into individual words

  • reduces to isolated word training

■ what are possible issues with this approach?

SLIDE 43

From Isolated to Continuous Speech

Strategy B

■ instead of snipping the concatenated word HMM and snipping the acoustic feature vectors . . .
  • and running FB on each word HMM+segment separately . . .
  • what if we just run FB on the whole darn thing!?
■ does this make sense?
  • like having an HMM for each word sequence rather than for each word . . .
  • where parameters for all instances of same word are tied
  • analogy: like using phonetic models for isolated speech
  • each word (phone sequence) has its own HMM . . .
  • where parameters for all instances of same phone are tied

SLIDE 44

Pop Quiz

■ To do one iteration of FB, which strategy is faster?
  • Hint: what is the time complexity of FB?
■ Which strategy is less prone to local minima?
■ in practice, both styles of strategies are used
  • including an extreme version of Strategy A

SLIDE 45

But Wait, It’s More Complicated Than That!

■ reference transcripts are created by humans . . .
  • who, by their nature, are human (i.e., fallible)
■ typical transcripts don’t contain everything an ASR system wants
  • where silence occurred; noises like coughs, door slams, etc.
  • pronunciation information, e.g., was THE pronounced as DH UH or DH IY?
■ how can we correctly construct the HMM for an utterance?
  • where do we insert the silence HMM?
  • which pronunciation variant to use for each word?
  • if we have different HMM’s for different pronunciations of a word

SLIDE 46

Pronunciation Variants, Silence, and Stuff

■ that is, the human-produced transcript is incomplete

  • how can we produce a more complete transcript?

■ Viterbi decoding!

  • build HMM accepting all word (HMM) sequences consistent with reference transcript
  • compute best path/word HMM sequence

[figure: graph of alternatives over ~SIL(01), THE(01)/THE(02), DOG(01)/DOG(02)/DOG(03), and the best path found]

SLIDE 47

Pronunciation Variants, Silence, and Stuff

Where does the initial acoustic model come from?

■ train initial model without silence; single pronunciation per word
■ use HMM containing all alternatives directly in training (e.g., Lab 2)
  • not clear what interpretation is, but works for bootstrapping

[figure: graph of alternatives over ~SIL(01), THE(01)/THE(02), DOG(01)/DOG(02)/DOG(03)]

SLIDE 48

Isolated Speech ⇒ Continuous Speech

The training recipe so far

■ train an initial GMM system (Lab 2 stopped here)
  • same recipe as before, except create HMM for each training utterance by concatenating word HMM’s
■ use initial system to refine reference transcripts
  • select pronunciation variants, where silence occurs
■ do more FB on initial system or retrain from scratch
  • using refined transcripts to build HMM’s

SLIDE 49

Unit II: Acoustic Model Training for LVCSR

What’s next?

■ single Gaussians ⇒ Gaussian mixture models (GMM’s)
■ isolated speech ⇒ continuous speech
■ word models ⇒ context-independent (CI) phone models
■ CI phone models ⇒ context-dependent (CD) phone models

SLIDE 50

Word Models

HMM/graph expansion

■ reference transcript

THE DOG

■ replace each word with its HMM

THE1 THE2 THE3 THE4 DOG1 DOG2 DOG3 DOG4 DOG5 DOG6

SLIDE 51

Context-Independent Phone Models

HMM/graph expansion

■ reference transcript

THE DOG

■ pronunciation dictionary

  • maps each word to a sequence of phonemes

DH AH D AO G

■ replace each phone with its HMM

DH1 DH2 AH1 AH2 D1 D2 AO1 AO2 G1 G2
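This two-step expansion (word → phones → HMM states) is easy to sketch; the toy lexicon and 2-state phone HMMs mirror the example above:

```python
def expand_to_hmm_states(words, lexicon, states_per_phone=2):
    """Expand a word sequence into CI-phone HMM states: look up each
    word's phone sequence in the pronunciation dictionary, then replace
    each phone with its numbered HMM states."""
    states = []
    for word in words:
        for phone in lexicon[word]:
            states += [f"{phone}{i}" for i in range(1, states_per_phone + 1)]
    return states

lexicon = {"THE": ["DH", "AH"], "DOG": ["D", "AO", "G"]}
seq = expand_to_hmm_states(["THE", "DOG"], lexicon)
```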

SLIDE 52

Word Models ⇒ Context-Independent Phone Models

Changes

■ need pronunciation of every word in training data
  • including pronunciation variants

      THE(01)  DH AH
      THE(02)  DH IY

  • listen to data? use automatic spelling-to-sound models?
■ how the HMM for each training utterance is created

SLIDE 53

Word Models ⇒ Context-Independent Phone Models

The training recipe so far

■ build pronunciation dictionary for all words in training set
■ train an initial GMM system
■ use initial system to refine reference transcripts
■ do more FB on initial system or retrain from scratch

SLIDE 54

Unit II: Acoustic Model Training for LVCSR

What’s next?

■ single Gaussians ⇒ Gaussian mixture models (GMM’s)
■ isolated speech ⇒ continuous speech
■ word models ⇒ context-independent (CI) phone models
■ CI phone models ⇒ context-dependent (CD) phone models

SLIDE 55

CI ⇒ CD Phone Models

■ context-independent phone models
  • there are ∼50 phonemes
  • each has a ∼3-state HMM ⇒ ∼150 CI HMM states
  • each CI HMM state has its own GMM ⇒ ∼150 GMM’s
■ context-dependent models
  • each of the ∼150 HMM states now has a set of 1–100 GMM’s attached to it
  • which of the 1–100 GMM’s to use is determined by the phonetic context . . .
  • by using a decision tree
  • e.g., for first state of phone AX, if DH to left and stop consonant to right, then use GMM37, else . . .
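As a toy illustration of that lookup (the question set, state name, and GMM numbers here are invented; real trees are grown from data, as covered in lecture 6):

```python
def cd_gmm_id(state, left, right):
    """Walk a hand-built phonetic decision tree for one HMM state and
    return the id of the GMM at the leaf we land in.  The questions and
    GMM numbering are hypothetical."""
    STOPS = {"P", "T", "K", "B", "D", "G"}   # stop consonants
    if state != "AX_1":
        raise ValueError("toy tree is only defined for state AX_1")
    if left == "DH" and right in STOPS:      # the slide's example question
        return 37                            # use GMM37
    if left == "DH":
        return 12
    return 1                                 # default leaf

gmm = cd_gmm_id("AX_1", "DH", "T")
```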

SLIDE 56

Context-Dependent Phone Models

Notes

■ not one decision tree per phoneme, but one per phoneme state

  • better model of reality
  • GMM for first state in HMM depends on left context mostly
  • GMM for last state in HMM depends on right context mostly

■ terminology

  • triphone model — look at ±1 phones of context
  • quinphone model — look at ±2 phones of context
  • also, septaphone and 11-phone models

SLIDE 57

Context-Dependent Phone Models

Typical model sizes:

  type      HMM        GMM’s/state  GMM’s    Gaussians
  word      per word   1            10–500   100–10k
  CI phone  per phone  1            ∼150     1k–3k
  CD phone  per phone  1–100        1k–10k   10k–300k

■ 39-dimensional feature vectors ⇒ ∼80 parameters/Gaussian
■ big models can have tens of millions of parameters

SLIDE 58

Building a Triphone Phonetic Decision Tree

■ in a CI model, consider the GMM for a state, e.g., AH1
  • this is a probability distribution p(x|AH1) . . .
  • over acoustic feature vectors x
■ context-dependent modeling assumes . . .
  • we can build a better model of the acoustic realizations of AH1 . . .
  • if we condition on the surrounding phones, e.g., for a triphone model, p(x|AH1, pL, pR)
■ what do we mean by better model?
■ how do we build this better model?

SLIDE 59

Building a Triphone Phonetic Decision Tree

■ what do we mean by better model?
  • maximum likelihood!?
  • the model p(x|AH1, pL, pR) should assign a higher total likelihood than p(x|AH1) to some data x1, x2, . . .
■ on what data?
  • all frames x in the training data . . .
  • that correspond to the state/sound AH1
■ how do we find this data?

SLIDE 60

Training Data for Decision Trees

■ forced alignment/Viterbi decoding!
■ where do we get the model to align with from?
  • use CI phone model or other pre-existing model

  states: DH1 DH2 AH1 AH2 D1 D2 AO1 AO2 G1 G2

  frame:  1    2    3    4    5   6   7   8   9   · · ·
  arc:    DH1  DH2  AH1  AH2  D1  D1  D2  D2  D2  AO1 · · ·

SLIDE 61

Building a Triphone Phonetic Decision Tree

■ build decision tree for AH1 to optimize likelihood of acoustic feature vectors aligned to AH1
  • predetermined question set
  • see lecture 6 slides, readings for gory details
■ the CD probability distribution: p(x|leaf(AH1, pL, pR))
  • there is a GMM at each leaf of the tree
  • context-independent ⇔ tree with a single leaf

SLIDE 62

Goldilocks and The Three Parameterizations

Perspective

■ one GMM per phone state
  • too few parameters; doesn’t model the many allophones of a phoneme
■ one GMM per phone state and triphone context (∼50 × 50)
  • too many parameters; sparse data issues
■ cluster triphone contexts using a decision tree
  • each leaf represents a cluster of triphone contexts . . .
  • with (hopefully) similar acoustic realizations that can be modeled with a single GMM
  • just right!

SLIDE 63

Training Context-Dependent Models

OK, let’s say we have decision trees; how to train our new GMM’s?

■ how can we seed the context-dependent GMM parameters?

  • e.g., what if we have a CI model?
  • what if we have an existing CD model but with a different tree?

■ once you have a good model for a domain

  • can use to quickly bootstrap other models
  • why might this be a bad idea?

SLIDE 64

Training Context-Dependent Models

HMM/graph expansion

THE DOG DH AH D AO G

DH1 DH2 AH1 AH2 D1 D2 AO1 AO2 G1 G2

DH1,3 DH2,7 AH1,2 AH2,4 D1,3 D2,9 AO1,1 AO2,1 G1,2 G2,7

SLIDE 65

CI ⇒ CD Phone Models

The training recipe so far

■ build CI model using previous recipe
■ use CI model to align training data
  • use alignment to build phonetic decision tree
■ use CI model to seed CD model
■ train CD model using FB

SLIDE 66

Whew, That Was Pretty Complicated!

Or not

■ adaptation (VTLN, fMLLR, mMLLR)
■ discriminative training (LDA, MMI, MPE, fMPE)
■ model combination (cross adaptation, ROVER)
■ iteration

  • repeat steps using better model for seeding
  • alignment is only as good as model that created it

SLIDE 67

Things Can Get Pretty Hairy

[diagram: multi-pass system combination with MFCC and PLP front ends, VTLN, ML/MMI speaker-adapted (SAT/AD) models, 4-gram and 100-best rescoring, consensus processing, and ROVER, annotated with per-stage WERs on Eval’01 and Eval’98 (SWB only)]

SLIDE 68

Unit II: Acoustic Model Training for LVCSR

■ take-home messages
  • hidden model training is fraught with local minima
  • seeding more complex models with simpler models helps avoid terrible local minima
  • people have developed recipes/heuristics to try to improve the minimum you end up in
  • no one best recipe
  • training is insanely complicated for state-of-the-art research models
■ the good news is . . .
  • I just saved a bunch of money on my car insurance by switching to GEICO

SLIDE 69

Unit III: Decoding for LVCSR (Inefficient)

class(x) = arg max_ω P(ω|x) = arg max_ω P(ω)P(x|ω) / P(x) = arg max_ω P(ω)P(x|ω)

■ now that we know how to build models for LVCSR . . .

  • CD acoustic models via complex recipes
  • n-gram models via counting and smoothing

■ how can we use them for decoding?

  • let’s ignore memory and speed constraints for now

SLIDE 70

Decoding

What did we do for small vocabulary tasks?

■ take graph/FSA representing the language model
  • i.e., all allowed word sequences

[figure: word graph over LIKE, UH]

■ expand to underlying HMM

[figure: the same graph with each word replaced by its HMM]

■ run the Viterbi algorithm!

SLIDE 71

Decoding

Well, can we do the same thing for LVCSR?

■ Issue 1: Can we express an n-gram model as an FSA?

  • yup

[figures: bigram FSA with one state per history (h=w1, h=w2) and arcs w/P(w|h); trigram FSA with states h=w1,w1 . . . h=w2,w2 and arcs w/P(w|h1,h2)]
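As a sketch, the bigram case falls straight out of the definition: one state per history word, one arc per (history, word) pair (the vocabulary and probabilities below are made up):

```python
from itertools import product

def bigram_fsa(vocab, bigram_prob):
    """Bigram LM as an FSA: one state per history word h, and for every
    word w an arc h -> w labeled w with weight P(w | h).
    Returns the arc list [(src_state, dst_state, label, prob)]."""
    return [(h, w, w, bigram_prob[(h, w)]) for h, w in product(vocab, vocab)]

vocab = ["w1", "w2"]
probs = {("w1", "w1"): 0.3, ("w1", "w2"): 0.7,
         ("w2", "w1"): 0.6, ("w2", "w2"): 0.4}
arcs = bigram_fsa(vocab, probs)   # |V| states, |V|^2 arcs
```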

SLIDE 72

n-Gram Models as HMM’s

■ probability assigned to path is LM probability of words along

that path

■ do bigram example on board

SLIDE 73

Pop Quiz

■ how many states in the FSA representing an n-gram model with vocabulary size |V|?
■ how many arcs?

SLIDE 74

Decoding

Issue 2: How can we expand a word graph to its underlying HMM?

■ word models

  • replace each word with its HMM

■ CI phone models

  • replace each word with its phone sequence(s)
  • replace each phone with its HMM

[figure: bigram LM FSA over LIKE, UH]

SLIDE 75

Graph Expansion with Context-Dependent Models

DH D AH AO G

■ how can we do context-dependent expansion?

  • handling branch points is tricky

■ example of triphone expansion

G_D_AO D_AO_G AO_G_D AO_G_DH G_DH_AH DH_AH_DH DH_AH_D AH_DH_AH AH_D_AO

■ other tricky cases

  • words consisting of a single phone
  • quinphone models

SLIDE 76

Word-Internal Acoustic Models

Simplify acoustic model to simplify graph expansion

■ word-internal models

  • don’t let decision trees ask questions across word boundaries
  • pad contexts with the unknown phone
  • hurts performance (e.g., coarticulation across words)

■ in graph expansion, just replace each word with its HMM

[figure: word graph over LIKE, UH and its word-internal HMM expansion]

SLIDE 77

Graph Expansion with Context-Dependent Models

Is there a better way?

■ is there some elegant theoretical framework . . .
■ that makes it easy to do this type of expansion . . .
■ and also makes it easy to do lots of other graph operations useful in ASR?
■ ⇒ finite-state transducers (FST’s)! (Unit IV)

SLIDE 78

Unit III: Decoding for LVCSR (Inefficient)

Recap

■ can do same thing we do for small vocabulary decoding
  • start with LM represented as word graph
  • expand to underlying HMM
  • Viterbi
■ how to do the graph expansion? FST’s (Unit IV)
■ how to make decoding efficient? search (Unit V)

SLIDE 79

Unit IV: Introduction to Finite-State Transducers

Overview

■ FST’s closely related to finite-state automata (FSA)

  • an FSA is a graph
  • an FST . . .
  • takes an FSA as input . . .
  • and produces a new FSA

■ natural technology for graph expansion . . .

  • and much, much more

■ FST’s for ASR pioneered by AT&T in late 1990’s

SLIDE 80

Review: What is a Finite-State Acceptor?

■ it has states

  • exactly one initial state; one or more final states

■ it has arcs
  • each arc has a label, which may be empty (ε)
■ ignore probabilities for now

[figure: example FSA with arc labels a, b, c and an ε-arc]

SLIDE 81

Pop Quiz

■ What are the differences between the following:

  • HMM’s with discrete output distributions
  • FSA’s with arc probabilities

■ Can they express the same class of models?

SLIDE 82

What is a Finite-State Transducer?

■ it’s like a finite-state acceptor, except . . .
■ each arc has two labels instead of one
  • an input label (possibly empty)
  • an output label (possibly empty)

[figure: example FST with arc labels a:ε, c:c, b:a, a:a, ε:b]

SLIDE 83

Terminology

■ finite-state acceptor (FSA): one label on each arc
■ finite-state transducer (FST): input and output label on each arc
■ finite-state machine (FSM): FSA or FST
  • also, finite-state automaton
■ incidentally, an FSA can act like an FST
  • duplicate label to be both input and output label

SLIDE 84

How Can We Apply an FST to an FSA?

Composition operation

■ perspective: rewriting/transforming token sequences

A:     1 --a--> 2 --b--> 3 --d--> 4

T:     1 --a:A--> 2 --b:B--> 3 --d:D--> 4

A ◦ T: 1 --A--> 2 --B--> 3 --D--> 4

SLIDE 85

Composition

Another example

A:     1 --a--> 2 --b--> 3 --d--> 4

T:     one state, with self-loops a:A, b:B, c:C, d:D

A ◦ T: 1 --A--> 2 --B--> 3 --D--> 4

SLIDE 86

Composition

Rewriting many paths at once

[figure: FSA A with many branching paths over a, b, c, d; composing with the one-state transducer T (a:A, b:B, c:C, d:D) capitalizes the labels along every path]

SLIDE 87

Composition

Formally, if composing FSA A with FST T to get FSA A ◦ T:

■ for every complete path (from initial to final state) in A . . .
  • with input labels i1 · · · iN (ignoring ε labels) . . .
■ and for every complete path in T . . .
  • with input labels i1 · · · iN and . . .
  • with output labels o1 · · · oM . . .
■ there is a complete path in A ◦ T . . .
  • with input labels o1 · · · oM (ignoring ε labels)
■ we will discuss how to construct A ◦ T shortly

SLIDE 88

Composition

Many graph expansion operations can be represented as FST’s

■ example 1: optional silence insertion in training graphs

A:     1 --C--> 2 --A--> 3 --B--> 4

T:     one state, with self-loops ε:~SIL, A:A, B:B, C:C

A ◦ T: 1 --C--> 2 --A--> 3 --B--> 4, with an optional ~SIL loop at each state

SLIDE 89

Example 2: Rewriting Words as Phone Sequences

THE(01) → DH AH
THE(02) → DH IY

A:     1 --THE--> 2 --DOG--> 3

T:     rewrites THE as DH followed by AH or IY, and DOG as D AO G (via ε-output arcs)

A ◦ T: 1 --DH--> 2 --(AH or IY)--> 3 --D--> 4 --AO--> 5 --G--> 6

SLIDE 90

Example 3: Rewriting CI Phones as HMM’s

A:     1 --D--> 2 --AO--> 3 --G--> 4

T:     rewrites each CI phone as its 2-state HMM (D ⇒ D1 D2, AO ⇒ AO1 AO2, G ⇒ G1 G2), with self-loops

A ◦ T: 1 --D1--> 2 --D2--> 3 --AO1--> 4 --AO2--> 5 --G1--> 6 --G2--> 7, each HMM state keeping its self-loop

SLIDE 91

Computing Composition

■ for now, pretend there are no ε-labels
■ for every state s ∈ A, t ∈ T, create state (s, t) ∈ A ◦ T
■ create an arc from (s1, t1) to (s2, t2) with label o iff . . .

  • there is an arc from s1 to s2 in A with label i, and
  • there is an arc from t1 to t2 in T with input label i and output label o

■ (s, t) is initial iff s and t are initial; similarly for final states
■ (remove arcs and states that cannot reach both an initial and a final state)
■ efficient
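The construction above is short enough to sketch directly. This is a minimal ε-free version (the tuple formats and function name are my own); the initial/final-state bookkeeping and the trimming of dead state pairs are omitted:

```python
def compose(a_arcs, t_arcs):
    """Epsilon-free composition of FSA arcs (src, dst, label) with
    FST arcs (src, dst, in_label, out_label): match the FSA label
    against the FST input label, keep the FST output label, and pair
    up the states."""
    arcs = []
    for (s1, s2, i) in a_arcs:
        for (t1, t2, t_in, t_out) in t_arcs:
            if t_in == i:
                arcs.append(((s1, t1), (s2, t2), t_out))
    return arcs

# the example from slide 92: A accepts "a b", T rewrites it to "A B"
A = [(1, 2, "a"), (2, 3, "b")]
T = [(1, 2, "a", "A"), (2, 3, "b", "B")]
print(compose(A, T))  # [((1, 1), (2, 2), 'A'), ((2, 2), (3, 3), 'B')]
```

A real implementation builds outward from the initial state pair, as the next slide notes, rather than enumerating the full cross product.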

SLIDE 92

Computing Composition

Example:

A:  1 --a--> 2 --b--> 3

T:  1 --a:A--> 2 --b:B--> 3

A ◦ T:  (1,1) --A--> (2,2) --B--> (3,3); the other six state pairs (1,2), (1,3), (2,1), (2,3), (3,1), (3,2) are useless

■ optimization: start from the initial state and build outward, so useless pairs are never created

SLIDE 93

Computing Composition

Another example (see board):

[diagram: A and T both contain loops over {a, b}; T passes some labels through (a:a, b:b) and rewrites others (a:A, b:B); the composition pairs states of A and T, yielding arcs labeled A, B, a, and b between pairs such as (1,1), (2,2), and (3,2)]

SLIDE 94

Composition and ε-Transitions

■ basic idea: can take an ε-transition in one FSM without moving in the other FSM

  • a little tricky to do exactly right
  • do the readings if you care: (Pereira, Riley, 1997)

[diagram: A contains an ε-arc and T contains an ε-input arc ε:B; the composition interleaves the ε-moves of A and T across state pairs]

SLIDE 95

How to Express CD Expansion via FST’s?

■ step 1: rewrite each phone as a triphone

  • e.g., rewrite AX as DH_AX_R if DH is to its left and R to its right

■ step 2: rewrite each triphone with the correct context-dependent HMM for its center phone

  • just like rewriting a CI phone as its HMM
  • need to precompute the HMM for each possible triphone (∼50³ of them)
  • example on board: CI phones ⇒ CD phones ⇒ HMM's
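Step 1 can be sketched as a plain function over one phone sequence rather than as an FST (the `sil` boundary padding is my assumption here; the lecture's FST formulation handles context correctly across an arbitrary graph, not just a single string):

```python
def to_triphones(phones):
    """Rewrite each phone as left_center_right, using the neighboring
    phones as context and (an assumed) "sil" at the sequence boundaries."""
    out = []
    for k, p in enumerate(phones):
        left = phones[k - 1] if k > 0 else "sil"
        right = phones[k + 1] if k + 1 < len(phones) else "sil"
        out.append(f"{left}_{p}_{right}")
    return out

print(to_triphones(["DH", "AX", "R"]))
# ['sil_DH_AX', 'DH_AX_R', 'AX_R_sil']
```

Note the middle entry matches the slide's example: AX becomes DH_AX_R because DH is to its left and R to its right.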

SLIDE 96

How to Express CD Expansion via FST’s?

A:  1 --x--> 2 --y--> 3 --y--> 4 --x--> 5 --y--> 6

T:  each state remembers the last two phones seen (x_x, x_y, y_x, y_y); each arc consumes the next phone and outputs the triphone of the previous phone, now that its right context is known (e.g., from state x_y, the arc y:x_y_y consumes y and outputs the triphone for the center y)

A ◦ T:  1 --x_x_y / y_x_y--> 2 --x_y_y--> 3 --y_y_x--> 4 --y_x_y--> 5 --x_y_y / x_y_x--> 6

SLIDE 97

How to Express CD Expansion via FST’s?

Example:

1 --x_x_y / y_x_y--> 2 --x_y_y--> 3 --y_y_x--> 4 --y_x_y--> 5 --x_y_y / x_y_x--> 6

■ point: composition automatically expands the FSA to correctly handle context!

  • it makes multiple copies of states in the original FSA . . .
  • that can exist in different triphone contexts
  • (and makes multiple copies of only these states)

SLIDE 98

Unit IV: Introduction to Finite-State Transducers

What we’ve learned so far:

■ graph expansion can be expressed as a series of composition operations

  • need to build an FST to represent each expansion step (e.g., the lexicon FST rewriting the words THE DOG as phones)
  • with the composition operation, we’re done!

■ composition is efficient
■ context-dependent expansion can be handled effortlessly

SLIDE 99

What About Those Probability Thingies?

■ e.g., to hold language model probs, transition probs, etc.
■ FSM’s ⇒ weighted FSM’s

  • WFSA’s, WFST’s

■ each arc has a score or cost

  • so do final states

[diagram: a WFSA with weighted arcs such as a/0.3, b/1.3, c/0.4, a/0.2, and ε/0.6, and final states with costs 2/1 and 3/0.4]

SLIDE 100

How Are Arc Costs and Probabilities Related?

■ typically, we take costs to be negative log probabilities

  • costs can move back and forth along a path
  • the cost of a path is the sum of its arc costs plus the final-state cost

  e.g., 1 --a/1--> 2 --b/2--> 3/3 is equivalent to 1 --a/0--> 2 --b/0--> 3/6 (total cost 6 either way)

■ if two paths have the same labels, they can be combined into one

  • typically, use the min operator to compute the new cost

  e.g., parallel arcs a/1 and a/2 from state 1 to state 2 combine into the single arc a/1

■ the operations (+, min) form a semiring (the tropical semiring)

  • other semirings are possible
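The two tropical-semiring operations can be checked numerically. A small sketch (the probabilities are illustrative, not from the lecture): adding negative-log costs along a path corresponds to multiplying probabilities, and taking the min of two same-label paths keeps the more probable one.

```python
import math

def cost(p):
    """Arc cost as a negative log probability."""
    return -math.log(p)

# "multiplying" probabilities along a path = adding costs
path_probs = [0.5, 0.25]
path_cost = sum(cost(p) for p in path_probs)
assert abs(path_cost - cost(0.5 * 0.25)) < 1e-12

# combining two paths with the same labels keeps the min-cost
# (i.e., most probable) one
assert min(cost(0.5), cost(0.25)) == cost(0.5)
```

In semiring terms, + plays the role of "times" (extending a path) and min the role of "plus" (combining alternatives); swapping min for log-sum gives the log semiring instead.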

SLIDE 101

Which Of These Is Different From the Others?

■ FSM’s are equivalent if they accept the same label sequences with the same costs

[diagram: four small WFSA’s; all but one accept the single string “a” with total cost 1]

SLIDE 102

Weighted Composition

Just add the matching arc costs.

A:  1 --a/1--> 2 --b/0--> 3 --d/2--> 4/0

T:  1/1 (self-loops a:A/2, b:B/1, c:C/0, d:D/0)

A ◦ T:  1 --A/3--> 2 --B/1--> 3 --D/2--> 4/1
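Weighted composition is the unweighted state-pair construction with one extra step: the two matching arcs' costs are added. A sketch using ad-hoc tuple formats of my own (final-state costs are omitted; on the slide, the final cost 4/1 is likewise the sum of A's final cost 0 and T's final cost 1):

```python
def compose_weighted(a_arcs, t_arcs):
    """Weighted epsilon-free composition. WFSA arcs are
    (src, dst, label, cost); WFST arcs are (src, dst, in, out, cost).
    Matching arcs pair up states, keep the output label, and add costs
    (tropical semiring)."""
    out = []
    for (s1, s2, i, wa) in a_arcs:
        for (t1, t2, t_in, t_out, wt) in t_arcs:
            if t_in == i:
                out.append(((s1, t1), (s2, t2), t_out, wa + wt))
    return out

# the slide's example: T is one state with weighted self-loops
A = [(1, 2, "a", 1), (2, 3, "b", 0), (3, 4, "d", 2)]
T = [(1, 1, "a", "A", 2), (1, 1, "b", "B", 1),
     (1, 1, "c", "C", 0), (1, 1, "d", "D", 0)]
print(compose_weighted(A, T))
# costs come out as A/3, B/1, D/2, matching the slide
```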

SLIDE 103

Weighted Graph Expansion

■ start with a weighted FSA representing the language model
■ use composition to apply an FST for each level of expansion

  • scores/logprobs will be accumulated
  • logprobs may move around along paths
  • all that matters for Viterbi is the total score of each path

SLIDE 104

Unit IV: Introduction to Finite-State Transducers

Recap

■ WFSA’s and WFST’s can represent many important structures in ASR

■ composition can do lots of useful things, including . . .

  • transforming arc labels
  • context-dependent expansion
  • adding in new arc scores
  • restricting the set of allowed paths

SLIDE 105

Road Map

Where are we going?

■ Unit I: you do not talk about Unit I
■ Unit II: acoustic model training for LVCSR
■ Unit III: decoding for LVCSR (inefficient)
■ Unit IV: introduction to finite-state transducers
■ Unit V: search (lecture 8)

  • making decoding for LVCSR efficient

SLIDE 106

Course Feedback

1. Was this lecture mostly clear or unclear? What was the muddiest topic?
2. Other feedback (pace, content, atmosphere)?
