  1. Unsupervised neural network based feature extraction using weak top-down constraints
     Herman Kamper¹,², Micha Elsner³, Aren Jansen⁴, Sharon Goldwater²
     ¹CSTR and ²ILCC, School of Informatics, University of Edinburgh, UK
     ³Department of Linguistics, The Ohio State University, USA
     ⁴HLTCOE and CLSP, Johns Hopkins University, USA
     ICASSP 2015

  3. Introduction
     ◮ Huge amounts of speech audio data are becoming available online.
     ◮ Even for severely under-resourced and endangered languages (e.g. unwritten ones), data is being collected.
     ◮ Generally this data is unlabelled.
     ◮ We want to build speech technology on the available unlabelled data.
     ◮ This calls for unsupervised speech processing techniques.

  11. Example application: query-by-example search
      [Figure: a spoken query is matched against utterances in a collection of unlabelled speech.]
      What features should we use to represent the speech for such unsupervised tasks?

  16. Supervised neural network feature extraction
      Input: speech frame(s), e.g. MFCCs or filterbanks
      Feature extractor (learned from data)
      Phone classifier (learned jointly)
      Output: predicted phone states (e.g. ay, ey, k, v)
      But what if we do not have phone class targets to train our network?
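The split on this slide between a feature extractor and a phone classifier can be sketched as follows. This is a minimal illustration, not the paper's network: the layer sizes, the random weights, and the four phone states are placeholder assumptions, and a real system would learn the weights by backpropagation on phone-labelled frames.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 39-D MFCC input, 100-D learned features,
# 4 phone-state targets (ay, ey, k, v) as on the slide.
n_in, n_hidden, n_phones = 39, 100, 4

# Random weights stand in for parameters that would be trained jointly.
W1 = rng.standard_normal((n_in, n_hidden)) * 0.1    # feature extractor
W2 = rng.standard_normal((n_hidden, n_phones)) * 0.1  # phone classifier

def extract_features(frame):
    """The hidden-layer activations serve as the learned features."""
    return np.tanh(frame @ W1)

def classify_phone(frame):
    """Softmax over phone states, stacked on top of the feature extractor."""
    h = extract_features(frame)
    logits = h @ W2
    e = np.exp(logits - logits.max())
    return e / e.sum()

frame = rng.standard_normal(n_in)  # one (synthetic) speech frame
probs = classify_phone(frame)
assert probs.shape == (n_phones,) and abs(probs.sum() - 1.0) < 1e-9
```

The point of the sketch is the factorisation: once trained, `extract_features` can be kept and the classifier head discarded, which is exactly the part that breaks down when no phone targets exist.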

  23. Weak supervision: unsupervised term discovery
      [Figure: an unsupervised term discovery system finds pairs of speech segments that appear to be repetitions of the same word or phrase.]
      Can we use these discovered word pairs to provide us with weak supervision?

  27. Weak supervision: align the discovered word pairs
      Use the correspondence idea from [Jansen et al., 2013]: align the frames of the two words in each discovered pair, so that every frame of one word is matched with a corresponding frame of the other.
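One natural way to produce this frame-level alignment is dynamic time warping (DTW), the same distance the evaluation slides use for comparing word tokens. A minimal sketch, assuming squared-Euclidean frame distances and illustrative dimensions:

```python
import numpy as np

def dtw_align(X, Y):
    """Align two sequences of feature frames with dynamic time warping.

    Returns the (i, j) frame-index pairs on the minimum-cost warping
    path; the aligned frames become input-output pairs for training.
    """
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)  # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((X[i - 1] - Y[j - 1]) ** 2)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

X = np.random.default_rng(1).standard_normal((5, 13))
pairs = dtw_align(X, X)  # a sequence aligns to itself along the diagonal
assert pairs == [(k, k) for k in range(5)]
```

Because DTW allows many-to-one matches, a pair of word tokens with different durations still yields a frame pair for every step of the path.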

  29. Autoencoder (AE) neural network
      Input: a speech frame. Output: the same frame.
      A normal autoencoder neural network is trained to reconstruct its input.

  30. Autoencoder (AE) neural network
      This reconstruction criterion can be used to pretrain a deep neural network.
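A single autoencoder layer trained with this reconstruction criterion can be sketched in plain NumPy. The layer sizes, learning rate, and single training frame are illustrative assumptions; the paper stacks several such layers for pretraining, whereas this shows just one gradient step on the squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 39, 100  # hypothetical: 39-D frames, 100 hidden units

W_enc = rng.standard_normal((n_in, n_hidden)) * 0.1
W_dec = rng.standard_normal((n_hidden, n_in)) * 0.1

def ae_step(x, lr=0.01):
    """One gradient-descent step on the loss ||x_hat - x||^2."""
    global W_enc, W_dec
    h = np.tanh(x @ W_enc)   # encoder
    x_hat = h @ W_dec        # linear decoder
    err = x_hat - x
    grad_dec = np.outer(h, err)
    grad_enc = np.outer(x, (W_dec @ err) * (1 - h ** 2))
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
    return float(np.sum(err ** 2))

x = rng.standard_normal(n_in)  # one synthetic speech frame
losses = [ae_step(x) for _ in range(50)]
assert losses[-1] < losses[0]  # reconstruction of this frame improves
```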

  31. The correspondence autoencoder (cAE)
      The correspondence autoencoder (cAE) takes a frame from one word and tries to reconstruct the corresponding frame from the other word in the pair.

  32. The correspondence autoencoder (cAE)
      In this way we learn an unsupervised feature extractor using the weak word-pair supervision.
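The cAE differs from a normal autoencoder only in its target: the input frame comes from one word and the target is the aligned frame from the other word in the discovered pair. A minimal sketch with hypothetical dimensions; a real system would loop such updates over all aligned frame pairs in the corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 39, 100  # hypothetical dimensions

W_enc = rng.standard_normal((n_in, n_hidden)) * 0.1
W_dec = rng.standard_normal((n_hidden, n_in)) * 0.1

def cae_step(x, y, lr=0.01):
    """One cAE update: reconstruct y (the OTHER word's frame) from x."""
    global W_enc, W_dec
    h = np.tanh(x @ W_enc)   # hidden activations = candidate features
    y_hat = h @ W_dec
    err = y_hat - y          # error against the corresponding frame
    grad_dec = np.outer(h, err)
    grad_enc = np.outer(x, (W_dec @ err) * (1 - h ** 2))
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
    return float(np.sum(err ** 2))

def extract_features(x):
    """After training, the encoder is the unsupervised feature extractor."""
    return np.tanh(x @ W_enc)

x, y = rng.standard_normal(n_in), rng.standard_normal(n_in)  # aligned pair
losses = [cae_step(x, y) for _ in range(50)]
assert losses[-1] < losses[0]
assert extract_features(x).shape == (n_hidden,)
```

The design intuition: two tokens of the same word differ in speaker and channel, so features that predict one token's frames from the other's must discard that nuisance variation.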

  33. Complete unsupervised cAE training algorithm
      (1) Run unsupervised term discovery on the speech corpus.
      (2) Align the frames of each discovered word pair.
      (3) Train a stacked autoencoder on the corpus (pretraining) and use it to initialize the cAE weights.
      (4) Train the correspondence autoencoder on the aligned frame pairs.
      The result is an unsupervised feature extractor.

  51. Evaluation of features: the same-different task
      Word tokens: “apple”, “pie”, “grape”, “apple”, “apple”, “like”. Treat one “apple” token as the query and the rest as terms to search. For each term, compute the DTW distance d_i to the query and predict “same” if d_i < threshold, otherwise “different”:
      “pie”: d_1 → different ✓
      “grape”: d_2 → same ✗
      “apple”: d_3 → same ✓
      “apple”: d_4 → different ✗
      …
      “like”: d_N → different ✓
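The decision rule on this slide can be sketched as follows. For brevity the per-pair DTW distance is replaced by a Euclidean distance between mean frames, and a single fixed threshold is used; the toy tokens and threshold are assumptions, not the paper's data, and in practice the threshold is swept over all values rather than fixed.

```python
import numpy as np
from itertools import combinations

def same_different(tokens, labels, threshold):
    """Same-different evaluation (sketch).

    tokens: list of (frames x dims) arrays, one per word token.
    labels: the true word type of each token.
    A pairwise distance below the threshold predicts "same word";
    each prediction is scored against the true labels.
    """
    correct = total = 0
    for i, j in combinations(range(len(tokens)), 2):
        # Stand-in for a DTW alignment cost between the two tokens.
        d = np.linalg.norm(tokens[i].mean(axis=0) - tokens[j].mean(axis=0))
        predicted_same = d < threshold
        actual_same = labels[i] == labels[j]
        correct += int(predicted_same == actual_same)
        total += 1
    return correct / total

rng = np.random.default_rng(0)
# Two hypothetical word types whose frames cluster around different means,
# so good features should place same-word tokens close together.
labels = ["apple", "apple", "pie", "pie"]
tokens = [rng.standard_normal((10, 13)) + (0.0 if w == "apple" else 5.0)
          for w in labels]
acc = same_different(tokens, labels, threshold=5.0)
assert acc == 1.0  # well-separated clusters are classified perfectly
```

Better features pull same-word distances below different-word distances, so sweeping the threshold over all pairwise distances summarises feature quality independently of any one operating point.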
