

  1. Phonetic Modeling in ASR, Chuck Wooters, 3/16/05, EECS 225d

  2. Introduction
     VARIATION: the central issue in Automatic Speech Recognition

  3. Many Types of Variation
     - channel/microphone type
     - environmental noise
     - speaking style
     - vocal anatomy
     - gender
     - accent
     - health
     - etc.

  4. Focus Today
     “You say pot[ey]to, I say pot[a]to...”
     How can we model variation in pronunciation?

  5. Pronunciation Variation
     A careful transcription of conversational speech by trained linguists has revealed...

  6. 80 Ways To Say “and”
     From “Speaking in Shorthand - A Syllable-Centric Perspective for Understanding Pronunciation Variation” by Steve Greenberg

  7. Outline
     - Phonetic Modeling
     - Sub-Word models
       - Phones (mono-, bi-, di-, and triphones)
       - Syllables
       - Data-driven units
       - Cross-word modeling
     - Whole-word models
     - Lexicons (Dictionaries) for ASR

  8. Phonetic Modeling

  9. Phonetic Modeling
     How do we select the basic units for recognition?
     - Units should be accurate
     - Units should be trainable
     - Units should be generalizable
     We often have to balance these against each other.

  10. Sub-Word Models

  11. Sub-Word Models
     - Phones
       - Context Independent
       - Context Dependent
     - Syllables
     - Data-driven units
     - Cross-word modeling

  12. Phones

  13. Phones
     Note: “phones” != “phonemes” (see G&M, p. 310)
     E.g., a phoneme is the abstract unit, like the letter “A” as ASCII 65, while phones are its many concrete realizations, like the different printed forms of “A”.

  14. “Flavors” of Phones
     - Context Independent: Monophones
     - Context Dependent: Biphones, Diphones, Triphones

  15. Context Independent Phones

  16. Context Independent “Monophones”
     - “cat” = [k ae t]
     - Easy to train: only about 40 monophones for English
     - The basis of other sub-word units
     - Easy to add new pronunciations to the lexicon
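
As a rough illustration (not from the slides), a monophone lexicon is just a map from words to phone strings; the symbols below follow the ARPAbet-style set on the next slide, and the Python representation is an assumption for illustration only.

```python
# Minimal sketch: a monophone lexicon maps words to phone strings.
monophone_lexicon = {
    "cat": ["k", "ae", "t"],
    "key": ["k", "iy"],
    "coo": ["k", "uw"],
}

# Adding a new pronunciation only needs a new phone string; no new acoustic
# models are required, since every entry reuses the same ~40 monophones.
monophone_lexicon["dog"] = ["d", "ao", "g"]
```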

  17. Typical English Phone Set
     Vowels: iy (feel), ih (fill), ae (gas), aa (father), ah (bud), ao (caught), ay (bite), ax (comply), ey (day), eh (ten), er (turn), ow (tone), aw (how), oy (coin), uh (book), uw (tool)
     Consonants: b (big), p (pig), d (dig), t (sat), g (gut), k (cut), f (fork), v (vat), s (sit), z (zap), th (thin), dh (then), sh (she), zh (genre), l (lid), r (red), y (yacht), w (with), hh (help), m (mat), n (no), ng (sing), ch (chin), jh (edge)
     Adapted from “Spoken Language Processing” by Xuedong Huang et al.

  18. Monophones: Major Drawback
     Not very powerful for modeling variation.
     Example: “key” vs. “coo” (the [k] is articulated differently before [iy] than before [uw], but a single monophone model must cover both)

  19. Context Dependent Phones

  20. Biphones
     Take into account the context (what sounds are to the right or left) in which the phone occurs.
     - Left biphone of [ae] in “cat”: k_ae
     - Right biphone of [ae] in “cat”: ae_t
     - “key” = k_iy iy_#
     - “coo” = k_uw uw_#
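
A minimal sketch, assuming the right-context notation used on this slide (with “#” marking the word boundary), of how a phone string expands into biphone units:

```python
# Expand a phone string into right-context biphones, e.g. "key" -> k_iy, iy_#.
def right_biphones(phones):
    units = []
    for i, p in enumerate(phones):
        right = phones[i + 1] if i + 1 < len(phones) else "#"  # "#" = word end
        units.append(f"{p}_{right}")
    return units

print(right_biphones(["k", "iy"]))   # ['k_iy', 'iy_#']
print(right_biphones(["k", "uw"]))   # ['k_uw', 'uw_#']
```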

  21. Biphones
     - More difficult to train than monophones: roughly 40^2 left + 40^2 right biphones for English
     - If there is not enough training data for a biphone model, we can back off to the monophone

  22. Triphones
     - Consider the sounds to the left AND right
     - Good modeling of variation
     - Most widely used in ASR systems
     - “key” = #_k_iy k_iy_#
     - “coo” = #_k_uw k_uw_#

  23. Triphones
     - Can be difficult to train: there are LOTS of possible triphones (roughly 40^3 = 64,000), and not all of them occur
     - If there is not enough data to train a triphone, typically back off to the left or right biphone
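
A minimal sketch (hypothetical helper names) of triphone expansion in the #_left_phone_right notation of slide 22, plus the back-off from triphone to biphone to monophone described above:

```python
# Expand a word's phone string into triphones, "#" marking word boundaries.
def triphones(phones):
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "#"
        right = phones[i + 1] if i + 1 < len(phones) else "#"
        units.append(f"{left}_{p}_{right}")
    return units

# Fall back from triphone -> right biphone -> monophone when a unit is unseen.
def backoff(unit, trained_units):
    if unit in trained_units:
        return unit
    left, phone, right = unit.split("_")
    right_biphone = f"{phone}_{right}"
    if right_biphone in trained_units:
        return right_biphone
    return phone  # the monophone is always available

print(triphones(["k", "iy"]))                            # ['#_k_iy', 'k_iy_#']
print(backoff("#_k_iy", trained_units={"k_iy", "k"}))    # 'k_iy'
```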

  24. Triphones
     - Don’t always capture variation: “that rock” vs. “theatrical” both contain the triphone ae_t_r, yet the [t] is realized differently across the word boundary.
     - Sometimes it helps to cluster similar triphones.

  25. Diphones
     - Model the transitions between phones
     - Extend from the middle of one phone to the middle of the next
     - “key” = #_k k_iy iy_#
     - “coo” = #_k k_uw uw_#
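
A minimal sketch of diphone expansion, again with “#” standing in for silence or the word boundary as on this slide:

```python
# Diphone units span the transition from the middle of one phone to the next.
def diphones(phones):
    padded = ["#"] + list(phones) + ["#"]
    return [f"{a}_{b}" for a, b in zip(padded, padded[1:])]

print(diphones(["k", "iy"]))  # ['#_k', 'k_iy', 'iy_#']
print(diphones(["k", "uw"]))  # ['#_k', 'k_uw', 'uw_#']
```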

  26. Syllables

  27. Syllables
     Structure: Syllable = [Onset] + Rime, where Rime = Nucleus + [Coda] (brackets mark optional parts)
     Example: “strengths” = onset [s t r] + nucleus [eh] + coda [ng th s]
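
The syllable structure above can be sketched as a small data type (hypothetical names, for illustration only), with an optional onset and coda around an obligatory nucleus:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Syllable:
    nucleus: List[str]                               # vowel core, required
    onset: List[str] = field(default_factory=list)   # optional leading consonants
    coda: List[str] = field(default_factory=list)    # optional trailing consonants

    @property
    def rime(self):
        return self.nucleus + self.coda

    def phones(self):
        return self.onset + self.nucleus + self.coda

strengths = Syllable(onset=["s", "t", "r"], nucleus=["eh"], coda=["ng", "th", "s"])
print(strengths.phones())  # ['s', 't', 'r', 'eh', 'ng', 'th', 's']
print(strengths.rime)      # ['eh', 'ng', 'th', 's']
```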

  28. Syllables
     - Good modeling of variation
     - Somewhere between triphones and whole-word models
     - Can be difficult to train (like triphones)
     - Practical experiments have not shown improvements over triphone-based systems.

  29. Data-driven Sub-Word Units

  30. Data-driven Sub-Word Units
     Basic Idea:
     - More accurate modeling of acoustic variation
     - Cluster the data into homogeneous “groups”: sounds with similar acoustics should group together
     - Use these automatically derived units instead of linguistically based sub-word units
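
One way to sketch the clustering step (not the author's method; the random features, the number of units, and the use of scikit-learn k-means are all illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 13))   # stand-in for MFCC-like acoustic frames

n_units = 64                           # number of automatically derived units
kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(frames)

# Each frame now carries a unit ID; these cluster labels, not phone symbols,
# define the inventory of data-driven sub-word units.
unit_ids = kmeans.labels_
print(unit_ids[:10])
```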

  31. Data-driven Sub-Word Units
     Difficulties:
     - Can have problems with training, depending on the number of units
     - The real problem is generalizability: how do we add words to the system when we don’t know what the units “mean”? Create a mapping from phones?

  32. Cross-word Modeling

  33. Cross-word Modeling
     Co-articulation spans word boundaries:
     - “Did you eat yet?” -> “jeatyet”
     - “could you” -> “couldja”
     - “I don’t know” -> “idunno”
     We can achieve better modeling by looking across word boundaries.
     More difficult to implement: what would the dictionary look like?
     Usually use lattices when doing cross-word modeling.
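
A minimal sketch of what "looking across word boundaries" means for context-dependent units: the word-final phone takes the first phone of the next word as its right context (the pronunciations below are illustrative assumptions).

```python
# Build cross-word triphones for a sequence of word pronunciations.
def cross_word_triphones(word_prons):
    phones = [p for pron in word_prons for p in pron]  # concatenate the words
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "#"
        right = phones[i + 1] if i + 1 < len(phones) else "#"
        units.append(f"{left}_{p}_{right}")
    return units

# "did you": the final d of "did" now sees y as its right context.
print(cross_word_triphones([["d", "ih", "d"], ["y", "uw"]]))
# ['#_d_ih', 'd_ih_d', 'ih_d_y', 'd_y_uw', 'y_uw_#']
```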

  34. Whole-word Models

  35. Whole-word Models
     - In some sense, the most “natural” unit
     - Good modeling of coarticulation within the word
     - If context dependent, good modeling across words
     - Good when the vocabulary is small, e.g. digits: 10 words
     - Context dependent: 10 x 10 x 10 = 1000 models; not a huge problem for training
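
A quick check of the model count above (illustrative only, with a hypothetical left-word+right naming scheme):

```python
from itertools import product

digits = [str(d) for d in range(10)]
# One context-dependent model per (previous digit, digit, next digit) triple.
cd_models = [f"{left}-{word}+{right}" for left, word, right in product(digits, repeat=3)]
print(len(digits), len(cd_models))  # 10 1000
```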

  36. Whole-word Models
     Problems:
     - Difficult to train: needs lots of examples of *every* word
     - Not generalizable: adding new words requires more data collection

  37. Lexicons

  38. Lexicons for ASR
     Contains:
     - words
     - pronunciations
     - optionally: alternate pronunciations, pronunciation probabilities
     - No definitions
     Example entries:
       cat: k ae t
       key: k iy
       coo: k uw
       the: 0.6 dh iy
            0.4 dh ax
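
A minimal sketch of loading such a lexicon, assuming a hypothetical one-pronunciation-per-line text format ("word: [prob] phone phone ...") rather than any particular toolkit's format:

```python
def load_lexicon(lines):
    lexicon = {}
    for line in lines:
        word, rest = line.split(":", 1)
        fields = rest.split()
        try:
            prob = float(fields[0])      # optional pronunciation probability
            phones = fields[1:]
        except ValueError:
            prob = 1.0                   # no probability given
            phones = fields
        lexicon.setdefault(word.strip(), []).append((prob, phones))
    return lexicon

lex = load_lexicon([
    "cat: k ae t",
    "the: 0.6 dh iy",
    "the: 0.4 dh ax",
])
print(lex["the"])  # [(0.6, ['dh', 'iy']), (0.4, ['dh', 'ax'])]
```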

  39. Lexicon Generation
     Where do lexical entries come from?
     - Hand labeling
     - Rule generated
     Not too bad for English, but it can be a big expense when building a recognizer for a new language.
     For a small task, you may want to consider whole-word models to bypass lexicon generation.
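
To make "rule generated" concrete, here is a toy letter-to-sound sketch (not a real grapheme-to-phoneme system; the rules, names, and longest-match strategy are illustrative assumptions, and real rule sets are far larger and context-sensitive):

```python
RULES = [            # (grapheme, phones), longest match tried first
    ("ee", ["iy"]),
    ("oo", ["uw"]),
    ("c",  ["k"]),
    ("k",  ["k"]),
    ("t",  ["t"]),
    ("a",  ["ae"]),
]

def rule_pronounce(word):
    phones, i = [], 0
    while i < len(word):
        for grapheme, ph in RULES:
            if word.startswith(grapheme, i):
                phones.extend(ph)
                i += len(grapheme)
                break
        else:
            i += 1   # skip letters with no rule in this toy set
    return phones

print(rule_pronounce("cat"))  # ['k', 'ae', 't']
print(rule_pronounce("coo"))  # ['k', 'uw']
```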
