  1. Unsupervised neural feature learning for speech using weak top-down constraints Maties Machine Learning (MML), Oct. 2017 Herman Kamper Stellenbosch University http://www.kamperh.com/

  2. Success in speech recognition
     "i had to think of some example speech since speech recognition is really cool" (example recognition output)
     • Google Voice: English, Spanish, German, ..., Zulu (~50 languages)
     • An addiction to labels: 2000 hours of transcribed speech audio; ~350M/560M words of text
     • But there are around 7000 languages spoken in the world today
     [Xiong et al., arXiv'16]; [Saon et al., arXiv'17]

  3. Why learn without labels?
     • Get insight into human language acquisition [Räsänen and Rasilo, '15]
     • Language acquisition in robots [Roy, '99]; [Renkens and Van hamme, '15]
     • Analysis of audio for unwritten languages [Besacier et al., '14]
     • New insights and models for speech processing [Jansen et al., '13]

  4. Unsupervised term discovery (UTD) [Park and Glass, TASLP'08]
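
The UTD figures on this slide were lost in conversion. Conceptually, UTD scans unlabelled speech for pairs of segments that are acoustically similar; Park and Glass do this efficiently with segmental DTW. The sketch below illustrates the idea with plain DTW over pre-cut candidate segments; `dtw_cost`, `discover_pairs`, the brute-force pair loop and the match threshold are all illustrative, not the actual system.

```python
import numpy as np

def dtw_cost(x, y):
    """Length-normalised DTW alignment cost between two MFCC
    sequences (frames x dims). Plain O(len(x)*len(y)) DP."""
    nx, ny = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((nx + 1, ny + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[nx, ny] / (nx + ny)

def discover_pairs(segments, threshold=0.4):
    """Return index pairs of candidate segments whose DTW cost falls
    below a match threshold (hypothetical value; tuned per corpus)."""
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if dtw_cost(segments[i], segments[j]) < threshold:
                pairs.append((i, j))
    return pairs
```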

  5. Example: Query-by-example search
     Spoken query: [audio example]
     A useful speech system that does not require any transcribed speech
     [Jansen and Van Durme, IS'12; Saeb et al., IS'17; Settle et al., IS'17]
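
Given any sequence distance such as `dtw_cost` above, query-by-example reduces to ranking the search collection by distance to the spoken query. A minimal sketch assuming numpy feature arrays; `rank_by_query` is a hypothetical helper, and practical systems add indexing to avoid a linear scan:

```python
import numpy as np

def rank_by_query(query_feats, segment_feats, dtw_cost):
    """Rank unlabelled search segments by alignment cost to a spoken
    query; lower cost = better match. `dtw_cost` is any sequence
    distance, e.g. the routine in the UTD sketch above."""
    costs = np.array([dtw_cost(query_feats, s) for s in segment_feats])
    order = np.argsort(costs)  # best matches first
    return order, costs[order]
```

No transcriptions enter anywhere: the ranking quality depends only on how speaker- and noise-invariant the frame-level features are, which is what the rest of the talk targets.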

  6. Unsupervised speech processing: Two problems
     1. Unsupervised frame-level representation learning: learn a feature extractor f_a(·) from unlabelled speech
     2. Unsupervised segmentation and clustering: how do we discover meaningful units in unlabelled speech?

  7. Unsupervised frame-level representation learning: The Correspondence Autoencoder
     (with Micha Elsner, Daniel Renshaw, Aren Jansen and Sharon Goldwater)

  8. Supervised representation learning using DNNs
     Input: speech frame(s), e.g. MFCCs, filterbanks
     Output: predict phone states (ay, ey, k, v, ...)
     The feature extractor f_a(·) is learned from data, jointly with the phone classifier.
     Unsupervised modelling: no phone class targets to train the network on.
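
A minimal sketch of this supervised setup, assuming PyTorch (the talk does not name a toolkit); the layer sizes and the four phone-state targets shown on the slide are placeholders:

```python
import torch.nn as nn

N_MFCC, N_HIDDEN, N_PHONE_STATES = 39, 100, 4  # placeholder sizes

# Feature extractor f_a(.): maps an input frame to a learned representation.
feature_extractor = nn.Sequential(
    nn.Linear(N_MFCC, N_HIDDEN), nn.Tanh(),
    nn.Linear(N_HIDDEN, N_HIDDEN), nn.Tanh(),
)
# Phone classifier, learned jointly on top of f_a(.).
classifier = nn.Linear(N_HIDDEN, N_PHONE_STATES)

def predict_phone_state(frames):  # frames: (batch, N_MFCC) tensor
    return classifier(feature_extractor(frames))

# Training this requires phone-state labels -- exactly what is missing
# in the unsupervised setting.
loss_fn = nn.CrossEntropyLoss()
```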

  9. Autoencoder (AE) neural network [Badino et al., ICASSP'14]
     Input: speech frame; output: reconstruct the input
     • Completely unsupervised
     • But purely bottom-up
     • Can we use top-down information?
     • Idea: unsupervised term discovery
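
A matching sketch of the autoencoder, again assuming PyTorch with illustrative sizes; the training target is the input frame itself, which is why the signal is purely bottom-up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_MFCC, N_HIDDEN = 39, 100  # illustrative sizes

autoencoder = nn.Sequential(
    nn.Linear(N_MFCC, N_HIDDEN), nn.Tanh(),  # encoder
    nn.Linear(N_HIDDEN, N_MFCC),             # decoder
)
ae_opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

def ae_step(frames):  # frames: (batch, N_MFCC) tensor
    """One training step: reconstruct the input frame itself."""
    ae_opt.zero_grad()
    loss = F.mse_loss(autoencoder(frames), frames)
    loss.backward()
    ae_opt.step()
    return loss.item()
```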

  10. Unsupervised term discovery (UTD)
      Can we use these discovered word pairs to give weak top-down supervision?

  11. Weak top-down supervision: Align frames [Jansen et al., ICASSP'13]
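
The alignment figure is lost; the mechanism is to align the two segments of each discovered word pair and read frame-level correspondences off the alignment path. A sketch assuming numpy MFCC arrays, using DTW (the standard choice here); the dynamic program is the same as in `dtw_cost` above, with backtracking added:

```python
import numpy as np

def dtw_align(x, y):
    """Return the DTW alignment path between segments x and y
    (each frames x dims) as a list of (i, j) frame-index pairs."""
    nx, ny = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((nx + 1, ny + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], nx, ny
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin(
            [acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Each (i, j) on the path yields a training pair: input frame x[i],
# target frame y[j] -- the weak top-down supervision for the cAE.
```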

  12. Autoencoder (AE): input speech frame, reconstruct the input

  13. Correspondence autoencoder (cAE)
      Input: a frame from one word of a discovered pair
      Output target: the aligned frame from the other word in the pair
      The hidden layers form an unsupervised feature extractor f_a(·) that combines top-down and bottom-up information.
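
A sketch of the cAE training step, assuming PyTorch and frame pairs from `dtw_align` above; it differs from the plain autoencoder only in its target, which is the aligned frame from the other word rather than the input itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_MFCC, N_HIDDEN = 39, 100  # illustrative sizes

cae = nn.Sequential(
    nn.Linear(N_MFCC, N_HIDDEN), nn.Tanh(),  # encoder: read f_a(.) here
    nn.Linear(N_HIDDEN, N_MFCC),             # decoder
)
cae_opt = torch.optim.Adam(cae.parameters(), lr=1e-3)

def cae_step(frames_a, frames_b):  # DTW-aligned frame batches
    """One training step: input a frame from one word, reconstruct
    the aligned frame from the other word in the pair."""
    cae_opt.zero_grad()
    loss = F.mse_loss(cae(frames_a), frames_b)  # target: the *other* frame
    loss.backward()
    cae_opt.step()
    return loss.item()
```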

  14. Correspondence autoencoder (cAE): training scheme [Kamper et al., ICASSP'15]
      (1) Train a stacked autoencoder on the speech corpus (pretraining)
      (2) Run unsupervised term discovery and align the frames of each word pair
      (3) Use the stacked autoencoder to initialize the cAE weights
      (4) Train the correspondence autoencoder; its hidden layers give the unsupervised feature extractor
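
Hypothetical glue for the four steps, reusing the `autoencoder`, `cae`, `ae_step` and `cae_step` sketches above (the actual system pretrains the stack layer by layer and takes features from an intermediate layer):

```python
def train_cae_pipeline(corpus_batches, aligned_pair_batches, n_epochs=5):
    """Sketch of the four-step training scheme."""
    # (1) Pretrain the (stacked) autoencoder on the whole speech corpus.
    for _ in range(n_epochs):
        for frames in corpus_batches:
            ae_step(frames)
    # (2) Unsupervised term discovery plus dtw_align (earlier sketches)
    #     is assumed to have produced `aligned_pair_batches` of
    #     (frames_a, frames_b) tensors.
    # (3) Initialize the cAE from the pretrained autoencoder weights
    #     (identical architectures in this sketch, so the keys match).
    cae.load_state_dict(autoencoder.state_dict())
    # (4) Train the correspondence autoencoder on aligned frame pairs.
    for _ in range(n_epochs):
        for frames_a, frames_b in aligned_pair_batches:
            cae_step(frames_a, frames_b)
```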

  15. Evaluation: Query-by-example search
      Spoken query: [audio example]
      [Jansen and Van Durme, IS'12; Saeb et al., IS'17; Settle et al., IS'17]

  16. Evaluation: Isolated word query-by-example
      [Bar chart: average precision (y-axis, 0.0 to 0.5) for Autoencoder, UBM-GMM, TopUBM and cAE features]
      Extended: [Renshaw et al., IS'15] and [Yuan et al., IS'16]

  17. Summary and conclusion
      • Introduced the correspondence autoencoder (cAE) for unsupervised frame-level representation learning
      • Uses top-down information from an unsupervised term discovery system
      • Uses bottom-up initialization on a large speech corpus
      • An unsupervised neural network model that combines top-down and bottom-up information yields large intrinsic improvements
      • Links with language acquisition research
      • Future: more analysis; different domains; practical search systems

  18. http://www.kamperh.com/ https://github.com/kamperh

  19. Evaluation of features: same-different task
      Given spoken word instances ("apple", "pie", "grape", "apple", "apple", "like"): treat each instance in turn as a query ("apple") and the remaining instances as terms to search; good features should rank the other instances of the same word above the different words.
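
A sketch of how such an evaluation can be scored, assuming numpy-style segment features, scikit-learn, and any segment distance such as `dtw_cost` above; the actual same-different protocol differs in some details (e.g. how same-speaker pairs are treated):

```python
from sklearn.metrics import average_precision_score

def same_different_ap(segments, labels, dist_fn):
    """Average precision for a same-different style evaluation:
    score every segment pair by negative distance and measure how
    well that ranking separates same-word from different-word pairs."""
    scores, targets = [], []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            scores.append(-dist_fn(segments[i], segments[j]))
            targets.append(int(labels[i] == labels[j]))
    return average_precision_score(targets, scores)
```

With better features (e.g. cAE representations instead of raw MFCCs feeding the distance), same-word pairs get smaller distances and the average precision rises.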
