Unsupervised neural and Bayesian models for zero-resource speech processing



  1. Unsupervised neural and Bayesian models for zero-resource speech processing. MIT CSAIL, 15 Nov. 2016. Herman Kamper, University of Edinburgh; TTI at Chicago. http://www.kamperh.com


  7. Speech recognition success [Xiong et al., arXiv’16]
     • Google Voice: English, Spanish, German, …, Zulu (~50 languages)
     • Data: 2000 hours of labelled speech audio; ~350M words of text
     • But: can we do this for all 7000 languages spoken in the world?


  10. Unsupervised speech processing
     Developing unsupervised methods that can learn structure directly from raw speech audio, i.e. zero-resource technology
     Criticism: there is always some data, so this is really a semi-supervised problem
     Reasons for studying the purely unsupervised case:
     • Modelling infant language acquisition [Räsänen, SpecCom’12]
     • Language acquisition in robotics [Renkens and Van hamme, IS’15]
     • Analysis of audio for unwritten languages [Besacier et al., SpecCom’14]
     • New insights and models for speech processing [Jansen et al., ICASSP’13]


  15. Unsupervised speech processing: Two problems
     1. Unsupervised frame-level representation learning: learn a feature extractor f_a(·) from unlabelled speech
     2. Unsupervised segmentation and clustering: how do we discover meaningful units in unlabelled speech?

  16. Unsupervised term discovery (UTD) [Park and Glass, TASLP’08]


  20. Full-coverage segmentation and clustering



  24. Unsupervised speech processing: Two problems
     1. Unsupervised frame-level representation learning: a model learns a feature extractor f_a(·)
     2. Unsupervised segmentation and clustering: we focus on full-coverage segmentation and clustering
     Our claim: unsupervised speech processing benefits from both top-down and bottom-up modelling

  25. Top-down and bottom-up modelling [Feldman et al., CCSS’09]
     Top-down: use knowledge of higher-level units to learn about lower-level parts
     Bottom-up: piece together lower-level parts to get more complex higher-level structures


  27. Unsupervised frame-level representation learning: The correspondence autoencoder (with Micha Elsner, Daniel Renshaw, Aren Jansen and Sharon Goldwater)


  30. Supervised representation learning using a DNN
     Input: speech frame(s), e.g. MFCCs, filterbanks
     Output: predicted phone states (ay, ey, k, v, …)
     The feature extractor f_a(·) and the phone classifier are learned jointly from data
     Unsupervised modelling: no phone class targets to train the network on


  33. Autoencoder (AE) neural network [Badino et al., ICASSP’14]
     Reconstructs its input speech frame
     • Completely unsupervised
     • But purely bottom-up
     • Can we use top-down information?
     • Idea: unsupervised term discovery


  35. Unsupervised term discovery (UTD): can we use these discovered word pairs to give weak top-down supervision?

  36. Weak top-down supervision: Align frames [Jansen et al., ICASSP’13]

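The frame-alignment step can be sketched with a minimal dynamic time warping (DTW) implementation in numpy; the function name and toy data below are illustrative, not from the original work:

```python
import numpy as np

def dtw_align(A, B):
    """Align two frame sequences (n x d and m x d) with dynamic time warping;
    return the list of (i, j) frame pairs on the lowest-cost warping path."""
    n, m = len(A), len(B)
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise costs
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1],
                                               D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))   # toy stand-in for one discovered word
B = np.repeat(A, 2, axis=0)       # the same "word", spoken at half speed
pairs = dtw_align(A, B)
```

With this toy pair, every aligned frame pair is acoustically identical; on real discovered words, the aligned pairs provide the weak top-down supervision.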

  39. Autoencoder (AE): reconstruct the input speech frame


  42. Correspondence autoencoder (cAE)
     Input: a frame from one word in a discovered pair; target: the aligned frame from the other word
     Combines top-down and bottom-up information in an unsupervised feature extractor f_a(·)

  43. Correspondence autoencoder training [Kamper et al., ICASSP’15]
     (1) Train a stacked autoencoder on the speech corpus (pretraining)
     (2) Run unsupervised term discovery and align the frames of each discovered word pair
     (3) Initialise the cAE weights from the pretrained autoencoder
     (4) Train the correspondence autoencoder on the aligned frame pairs, giving the unsupervised feature extractor f_a(·)
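The cAE objective can be sketched in numpy with a single hidden layer: the network takes a frame from one word and is trained to reconstruct the aligned frame from the other word. The toy data, network size and learning rate are illustrative, and the pretraining/initialisation steps of the actual pipeline are omitted:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical DTW-aligned frame pairs from discovered words:
# X_a[k] and X_b[k] come from two instances of the same word type.
X_a = rng.standard_normal((300, 13))
X_b = X_a + 0.1 * rng.standard_normal((300, 13))  # noisy "other instance"

d, h = 13, 10
W1 = rng.standard_normal((d, h)) * 0.1; b1 = np.zeros(h)
W2 = rng.standard_normal((h, d)) * 0.1; b2 = np.zeros(d)
lr = 0.05

losses = []
for epoch in range(300):
    H = np.tanh(X_a @ W1 + b1)          # hidden code: the learned features
    Y = H @ W2 + b2                     # linear reconstruction
    err = Y - X_b                       # key difference from a plain AE:
    losses.append(float(np.mean(err ** 2)))  # the target is the OTHER frame
    # Backpropagation through the two layers (0.5 * MSE loss).
    dW2 = H.T @ err / len(X_a); db2 = err.mean(axis=0)
    dH = err @ W2.T * (1 - H ** 2)      # tanh derivative
    dW1 = X_a.T @ dH / len(X_a); db1 = dH.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

codes = np.tanh(X_a @ W1 + b1)          # f_a(x): instance-invariant features
```

Because the target differs from the input, the hidden code is pushed to keep only what the two instances of a word share, discarding nuisance variation.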

  45. Intrinsic evaluation: Isolated word query task
     [Bar chart: average precision for the Autoencoder, UBM-GMM, TopUBM and cAE features]
     Extended: [Renshaw et al., IS’15] and [Yuan et al., IS’16]


  47. Unsupervised segmentation and clustering: The segmental Bayesian model (with Aren Jansen and Sharon Goldwater)

  48. Full-coverage segmentation and clustering



  51. Segmental modelling for full-coverage segmentation
     Previous models use explicit subword discovery directly on speech features, e.g. [Lee et al., 2015]
     Our approach uses whole-word segmental representations, i.e. acoustic word embeddings [Kamper et al., IS’15; Kamper et al., TASLP’16]


  54. Acoustic word embeddings
     An embedding function f_e(·) maps a variable-duration segment Y_i to a fixed-dimensional point x_i ∈ R^d, so that f_e(Y_1) and f_e(Y_2) can be compared directly in the same d-dimensional space
     Dynamic programming alignment has quadratic complexity, while embedding comparison is linear time; standard clustering methods can then be used
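The speed advantage can be illustrated with a simple downsampling embedding of the kind used as a baseline f_e(·): sample a fixed number of frames at equal intervals and concatenate them. The segment sizes and names below are hypothetical:

```python
import numpy as np

def downsample_embed(Y, k=10):
    """Map a variable-length frame sequence Y (n x d) to a fixed k*d-dimensional
    vector by taking k frames at equal intervals and flattening them."""
    n, d = Y.shape
    idx = np.linspace(0, n - 1, k).round().astype(int)
    return Y[idx].ravel()

rng = np.random.default_rng(3)
Y1 = rng.standard_normal((62, 13))   # two segments of different duration
Y2 = np.repeat(Y1, 2, axis=0)        # a time-stretched version of the "word"
x1, x2 = downsample_embed(Y1), downsample_embed(Y2)

# Comparing embeddings is a single O(k*d) distance computation,
# versus O(n*m) alignment cost for DTW on the raw frame sequences.
dist = np.linalg.norm(x1 - x2)
```

Because both segments map into the same fixed-dimensional space, any standard clustering algorithm can operate on the embeddings directly.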


  60. Unsupervised segmental Bayesian model
     Speech waveform → acoustic frames y_{1:M}, extracted with f_a(·)
     → embeddings x_i = f_e(y_{t1:t2}) for hypothesised word segments
     → acoustic modelling: a Bayesian Gaussian mixture model gives p(x_i | h⁻)
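The acoustic-modelling layer can be sketched as collapsed Gibbs sampling for a finite Bayesian Gaussian mixture over toy embeddings, assuming a fixed segmentation, a known spherical component variance and a conjugate Gaussian prior on the component means; all hyperparameters and data here are illustrative, and the joint sampling of the segmentation is omitted:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 2-D "embeddings": two well-separated clusters standing in for word types.
X = np.vstack([rng.normal(-3.0, 0.5, (30, 2)), rng.normal(3.0, 0.5, (30, 2))])
N, D = X.shape
K, alpha = 4, 1.0                      # truncation level, Dirichlet concentration
sigma2, sigma02 = 0.5 ** 2, 5.0 ** 2   # component variance, prior mean variance

z = rng.integers(0, K, N)              # random initial component assignments

def log_predictive(x, members):
    """log p(x | other points in the component), with the component mean
    integrated out under its conjugate Gaussian prior."""
    n = len(members)
    post_var = 1.0 / (1.0 / sigma02 + n / sigma2)
    post_mean = post_var * members.sum(axis=0) / sigma2
    var = post_var + sigma2            # predictive variance
    diff = x - post_mean
    return -0.5 * (D * np.log(2 * np.pi * var) + diff @ diff / var)

for sweep in range(20):                # collapsed Gibbs sweeps
    for i in range(N):
        z[i] = -1                      # hold point i out of its component
        logp = np.array([np.log(np.sum(z == k) + alpha / K)
                         + log_predictive(X[i], X[z == k]) for k in range(K)])
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())

mode_a = np.bincount(z[:30], minlength=K).argmax()  # dominant component, cluster 1
mode_b = np.bincount(z[30:], minlength=K).argmax()  # dominant component, cluster 2
```

Each point is reassigned by combining a count-based prior term with the posterior predictive p(x_i | h⁻), exactly the quantity the slide names; with well-separated toy clusters the sampler settles on one component per cluster.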
