
Joint Learning of Phonetic Units and Word Pronunciations for ASR

Chia-ying (Jackie) Lee, Yu Zhang and James Glass
Spoken Language Systems Group, MIT Computer Science and Artificial Intelligence Lab, Cambridge, MA


  1-14. Generative Process • Step 1 - Generate the number of phones n_i that each letter l_i maps to
     - n_i takes a value in {0, 1, 2} and is drawn from a 3-dim categorical distribution 𝜚_{l_i}, where 𝜚_{l_i} ~ Dir(η)
     - Example, built up one letter at a time for "red sox" (the space is written _):
       l_i : r  e  d  _  s  o  x
       n_i : 1  1  1  0  1  1  2
     - Diagram labels: n_i, π_{l_i}, phone units 1 2 3 ... K with HMMs θ_1 θ_2 θ_3 ... θ_K, speech x_t
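
Step 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `sample_dirichlet` and `sample_phone_counts`, the symmetric value chosen for η, and the use of one distribution per single letter (ignoring context) are all assumptions made for the example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from Dir(alphas) by normalising Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_phone_counts(letters, eta=1.0, seed=0):
    """For each letter l_i, draw n_i in {0, 1, 2} from a letter-specific
    3-dim categorical distribution that has a Dir(eta) prior."""
    rng = random.Random(seed)
    count_dist = {}  # hypothetical layout: one 3-dim distribution per distinct letter
    counts = []
    for letter in letters:
        if letter not in count_dist:
            count_dist[letter] = sample_dirichlet([eta] * 3, rng)
        n_i = rng.choices([0, 1, 2], weights=count_dist[letter])[0]
        counts.append(n_i)
    return counts

# "_" stands for the space in "red sox", as on the slides.
counts = sample_phone_counts("red_sox")
```

The draw is random, so the result will generally differ from the 1 1 1 0 1 1 2 shown on the slides; the sketch only shows where η and the per-letter distribution enter.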

  15-26. Generative Process • Step 2 - Generate the phone label c_{i,p} for every phone that a letter maps to, 1 ≤ p ≤ n_i
     - c_{i,p} is drawn from a categorical distribution π_{l_i} over the phone units 1 2 3 ... K, where π_{l_i} ~ Dir(γ)
     - Example, continued for "red sox" (one letter per slide):
       l_i     : r | e | d  | _ | s | o  | x
       n_i     : 1 | 1 | 1  | 0 | 1 | 1  | 2
       c_{i,p} : 3 | 1 | 17 |   | 2 | 19 | 56 2
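
A minimal sketch of Step 2, assuming the phone counts n_i from the slides are already given. The name `sample_phone_labels`, the inventory size `num_units=100`, and the symmetric γ are assumptions for illustration; keying each distribution by (letter, n, p) follows the π_{l,n,p} notation of the graphical model slide.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from Dir(alphas) by normalising Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_phone_labels(letters, counts, num_units=100, gamma=1.0, seed=0):
    """For each letter l_i with n_i phones, draw c_{i,p} (p = 1..n_i) from a
    K-dim categorical distribution pi_{l,n,p} that has a Dir(gamma) prior."""
    rng = random.Random(seed)
    label_dist = {}  # (letter, n, p) -> K-dim distribution, created lazily
    labels = []
    for letter, n_i in zip(letters, counts):
        labels_i = []
        for p in range(1, n_i + 1):
            key = (letter, n_i, p)
            if key not in label_dist:
                label_dist[key] = sample_dirichlet([gamma] * num_units, rng)
            c_ip = rng.choices(range(1, num_units + 1), weights=label_dist[key])[0]
            labels_i.append(c_ip)
        labels.append(labels_i)
    return labels

# n_i values taken from the "red sox" example on the slides.
labels = sample_phone_labels("red_sox", [1, 1, 1, 0, 1, 1, 2])
```

A letter with n_i = 0 (here the word boundary) receives an empty label list, and the letter x receives two labels.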

  27-36. Generative Process • Step 3 - Generate speech x_t
     - Each phone label c_{i,p} selects one of the HMMs θ_1 θ_2 θ_3 ... θ_K, which generates the speech frames x_t for that phone
     - Phone labels for "red sox": 3 1 17 2 19 56 2
     - Hierarchy shown in the diagram: 𝜚_{l_i} ~ Dir(η) → n_i; π_{l_i} ~ Dir(γ) → c_{i,p}; θ_{c_{i,p}} → x_t
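
Step 3 can be illustrated with a toy emission model. The slides only say that each phone unit is an HMM; the three-state left-to-right topology, the 1-D unit-variance Gaussian emissions, the fixed `stay_prob`, and the made-up state means below are all assumptions chosen to keep the sketch short.

```python
import random

def sample_hmm_frames(state_means, stay_prob, rng):
    """Sample frames from a left-to-right HMM: every state emits at least one
    1-D Gaussian frame, then stays with probability stay_prob or moves on."""
    frames = []
    for mean in state_means:
        while True:
            frames.append(rng.gauss(mean, 1.0))
            if rng.random() >= stay_prob:
                break
    return frames

def generate_speech(phone_labels, hmms, stay_prob=0.5, seed=0):
    """Generate one frame segment per phone label c by sampling from hmms[c]."""
    rng = random.Random(seed)
    return [sample_hmm_frames(hmms[c], stay_prob, rng) for c in phone_labels]

# Toy inventory: unit id -> three state means (hypothetical numbers).
hmms = {c: [float(c), c + 0.5, c + 1.0] for c in (1, 2, 3, 17, 19, 56)}
segments = generate_speech([3, 1, 17, 2, 19, 56, 2], hmms)
durations = [len(s) for s in segments]  # plays the role of the phone durations d_i
```

The length of each segment corresponds to the phone duration d_i that appears in the graphical model.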

  37-42. Context-dependent L2S Rules • Take context into account for learning letter-to-sound (L2S) mapping rules
     - More specific rules: for the o in "red sox", c_i is drawn from the context-dependent distribution π_sox instead of the context-independent π_o
     - Natural back-off mechanism through a hierarchy: β ~ Dir(𝛿), π_o ~ Dir(λβ), π_sox ~ Dir(απ_o)
     • View π_o as the prior of π_sox
     - If sox appears frequently, π_sox ≈ the empirical distribution
     - If sox is rarely observed, π_sox ≈ π_o
     - Diagram labels: π_o and π_sox over the phone units 1 2 3 ... K, c_i, 𝜚_sox ~DP( γ , ) 𝜚_o, θ_1 θ_2 θ_3 θ_4 ...
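
The back-off behaviour follows from Dirichlet-categorical conjugacy: with prior Dir(απ_o) and observed label counts n_k for a context, the posterior mean of π_sox is (n_k + α·π_o[k]) / (N + α), where N is the total count. The sketch below works this out for a rare and a frequent context; the three-unit inventory, the values in `pi_o`, and `alpha = 10` are made-up numbers for illustration.

```python
def posterior_mean(counts, parent, alpha):
    """Posterior mean of a categorical distribution with prior Dir(alpha * parent)
    after observing the given label counts."""
    total = sum(counts)
    return [(n + alpha * p) / (total + alpha) for n, p in zip(counts, parent)]

pi_o = [0.7, 0.2, 0.1]   # hypothetical context-independent distribution for "o"
alpha = 10.0             # hypothetical concentration parameter

# "sox" seen once: the estimate stays close to the parent pi_o.
rare = posterior_mean([0, 1, 0], pi_o, alpha)

# "sox" seen 1000 times: the estimate approaches the empirical distribution.
frequent = posterior_mean([0, 1000, 0], pi_o, alpha)
```

A larger α makes the context-dependent distribution cling to π_o for longer; a smaller α lets the counts take over sooner.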

  43. Graphical Model
     • Variables
     - G : the set of graphemes
     - \vec{l} : sequence of three graphemes
     - l : observed graphemes
     - x : observed speech
     - d : phone duration
     - c : phone id
     - n : number of phones a grapheme maps to
     - L : total number of graphemes
     - K : total number of HMMs
     • Parameters
     - 𝜚_{\vec{l}} : 3-dim categorical distribution
     - π_{\vec{l},n,p}, π_{l,n,p}, β : K-dim categorical distributions
     - θ_k : an HMM
     - θ_0 : HMM prior
     - 𝛿, λ, α : concentration parameters
     - η, γ : hyperparameters shown in the diagram
     • Plates and index ranges: G × {n,p}; G × G × G; G × G; K; 1 ≤ n ≤ 2; 1 ≤ p ≤ n; t = 1...d_i; p = 1...n_i; i = 1...L

  44-51. Inference
     - Diagram: the graphical model of slide 43, with its nodes split into two groups, "Latent model parameters" and "Regular latent variables"
     • Procedure
     - 20,000 iterations (slides 46-50); 10,000 iterations (slide 51)
     - Annotations added one per slide: "Sample from prior", "Sample given a" (twice), "Block-sampling"

  52-57. Induce Lexicon and Acoustic Model • n_i and c_i define word pronunciations and phone transcriptions
     - Example for "red sox":
       l_i : r  e  d  _  s  o  x
       n_i : 1  1  1  0  1  1  2
       c_i : 3  1  17 2  19 56 2
     - Induced lexicon:
       red : 3 1 17
       sox : 2 19 56 2
     - Acoustic model: the HMMs θ_1 θ_2 θ_3 ... θ_K
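
Reading the lexicon off the sampled variables amounts to grouping the flat phone-label sequence by letter (using n_i) and then by word. A minimal sketch, assuming the word boundary is written `_` and that any phones assigned to the boundary are ignored; the function name `induce_lexicon` is hypothetical.

```python
def induce_lexicon(letters, counts, labels, boundary="_"):
    """Split the flat label sequence into per-letter groups of size n_i and
    concatenate the groups of each word into its pronunciation."""
    lexicon = {}
    word, pron = "", []
    pos = 0
    for letter, n_i in zip(letters, counts):
        phones = labels[pos:pos + n_i]
        pos += n_i
        if letter == boundary:
            if word:
                lexicon[word] = pron
            word, pron = "", []
        else:
            word += letter
            pron = pron + phones
    if word:
        lexicon[word] = pron
    return lexicon

# Values from the "red sox" example on the slides.
lexicon = induce_lexicon("red_sox", [1, 1, 1, 0, 1, 1, 2], [3, 1, 17, 2, 19, 56, 2])
# lexicon == {"red": [3, 1, 17], "sox": [2, 19, 56, 2]}
```

The same grouping, kept in utterance order instead of being stored per word, gives the phone transcription of the speech.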
