  1. Multilingual acoustic word embedding models for processing zero-resource languages
  ICASSP 2020
  Herman Kamper¹, Yevgen Matusevych², Sharon Goldwater²
  ¹Stellenbosch University, South Africa  ²University of Edinburgh, UK
  http://www.kamperh.com/

  2. Background: Why acoustic word embeddings?
  • Current speech recognition methods require large labelled data sets
  • Zero-resource speech processing aims to develop methods that can discover linguistic structure from unlabelled speech [Dunbar et al., ASRU’17]
  • Example applications: unsupervised term discovery, query-by-example search
  • Problem: need to compare speech segments of variable duration

  3. Acoustic word embeddings
  [Figure: two speech segments of different durations, X(1) and X(2), are mapped to fixed-dimensional embeddings z(1) and z(2) in an embedding space with z ∈ R^M]
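Below is a minimal sketch of this mapping, assuming PyTorch and illustrative hyperparameters (13-dimensional MFCC frames, a single GRU layer, M = 130). It is not the authors' exact architecture, only an encoder that turns a variable-duration feature sequence into one fixed-dimensional vector z.

```python
# Illustrative only: architecture and hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn

class AcousticWordEncoder(nn.Module):
    def __init__(self, feat_dim=13, hidden_dim=400, embed_dim=130):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x):
        # x: (batch, T, feat_dim); T can differ between calls
        _, h_n = self.rnn(x)        # final hidden state: (1, batch, hidden_dim)
        return self.proj(h_n[-1])   # z: (batch, embed_dim)

# Segments of different durations map to embeddings of the same size:
encoder = AcousticWordEncoder()
z1 = encoder(torch.randn(1, 80, 13))  # an 80-frame segment
z2 = encoder(torch.randn(1, 55, 13))  # a 55-frame segment
```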

  4. Example application: Query-by-example search
  [Figure: the spoken query is embedded as z(q), all segments/utterances in the search database are embedded as z(1), ..., z(N), and hits are retrieved by nearest-neighbour search in the embedding space]
  [Levin et al., ICASSP’15]
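A sketch of that search loop in NumPy, with `embed` standing in as a placeholder for any acoustic word embedding function (for example the encoder sketched above); database segments are ranked by cosine distance to the embedded query.

```python
# Sketch only: `embed` is a placeholder for whatever embedding model is used.
import numpy as np

def cosine_distances(query_emb, db_embs):
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    return 1.0 - d @ q

def query_by_example(query_segment, db_segments, embed, top_k=5):
    z_q = embed(query_segment)                        # z(q)
    z_db = np.stack([embed(s) for s in db_segments])  # z(1), ..., z(N)
    dists = cosine_distances(z_q, z_db)
    return np.argsort(dists)[:top_k]                  # indices of the nearest hits
```

Because every segment becomes a single vector, search reduces to one matrix-vector product over the database rather than a costly alignment (such as DTW) per candidate pair.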

  5. Supervised and unsupervised acoustic embeddings
  • Growing body of work on acoustic word embeddings
  • Supervised and unsupervised methods
  • Unsupervised methods can be applied in zero-resource settings
  • But there is still a large performance gap

  6. Supervised and unsupervised acoustic embeddings
  • Growing body of work on acoustic word embeddings
  • Supervised and unsupervised methods
  • Unsupervised methods can be applied in zero-resource settings
  • But there is still a large performance gap
  [Bar chart: average precision (%) of the CAE-RNN when trained unsupervised vs. supervised, illustrating the gap]
  [Kamper, ICASSP’19]

  7. Unsupervised monolingual acoustic word embeddings
  [Figure: an encoder-decoder RNN reads the input frames x_1, x_2, ..., x_T of a segment X and is trained to reconstruct output frames f_1, f_2, ..., f_T of the same segment]
  [Chung et al., Interspeech’16; Kamper, ICASSP’19]

  8. Unsupervised monolingual acoustic word embeddings
  [Figure: the correspondence variant (CAE-RNN) reads the frames x_1, ..., x_T of a segment X but is trained to reconstruct the frames f_1, ..., f_T' of X', the other segment in a discovered pair]
  [Chung et al., Interspeech’16; Kamper, ICASSP’19]
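A simplified sketch of that training signal, assuming PyTorch, GRU encoder and decoder, and the embedding repeated as the decoder input at every output frame (the paper's exact conditioning may differ): the encoder embeds one segment of a discovered pair and the decoder is trained to reconstruct the other segment rather than the input itself.

```python
# Sketch under simplified assumptions; shows only the CAE-RNN training signal.
import torch
import torch.nn as nn

class CAERNN(nn.Module):
    def __init__(self, feat_dim=13, hidden_dim=400, embed_dim=130):
        super().__init__()
        self.enc = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_z = nn.Linear(hidden_dim, embed_dim)
        self.dec = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_feat = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x, target_len):
        _, h = self.enc(x)                                  # encode segment X
        z = self.to_z(h[-1])                                # embedding: (batch, embed_dim)
        dec_in = z.unsqueeze(1).repeat(1, target_len, 1)    # feed z at every output step
        out, _ = self.dec(dec_in)
        return self.to_feat(out), z                         # predicted frames f_1, ..., f_T'

model = CAERNN()
x = torch.randn(1, 80, 13)        # a discovered segment X
x_pair = torch.randn(1, 72, 13)   # its discovered pair X'
pred, z = model(x, x_pair.shape[1])
loss = nn.functional.mse_loss(pred, x_pair)   # reconstruct the pair, not the input itself
loss.backward()
```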

  9. Supervised multilingual acoustic word embeddings
  [Figure: labelled word segments from well-resourced languages (Russian яблоки 'apples' and бежать 'to run', Polish jabłka 'apples' and biec 'to run', French pommes 'apples' and courir 'to run') are used to train a single acoustic word embedding model that maps a segment X with frames x_1, ..., x_T to an embedding z]
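One way to realise such a supervised multilingual model is a word classifier (the ClassifierRNN in the results below). The sketch here assumes PyTorch and a hypothetical vocabulary of 10,000 word types pooled over the training languages; in this sketch the embedding is taken from the layer just before the softmax.

```python
# Sketch only: vocabulary size and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ClassifierRNN(nn.Module):
    def __init__(self, feat_dim=13, hidden_dim=400, embed_dim=130, n_word_types=10000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_z = nn.Linear(hidden_dim, embed_dim)
        # one output per word type, pooled over all well-resourced training languages
        self.classifier = nn.Linear(embed_dim, n_word_types)

    def forward(self, x):
        _, h = self.rnn(x)
        z = self.to_z(h[-1])           # acoustic word embedding
        return self.classifier(z), z   # logits for training; z for downstream use

model = ClassifierRNN()
x = torch.randn(8, 60, 13)                # a batch of labelled word segments
logits, z = model(x)
labels = torch.randint(0, 10000, (8,))    # word-type labels from the labelled languages
loss = nn.functional.cross_entropy(logits, labels)
# At test time only z is used, to embed segments from a zero-resource language.
```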

  10. Experimental setup
  • Training data: six well-resourced languages
    Czech (CS), French (FR), Polish (PL), Portuguese (PT), Russian (RU), Thai (TH)
  • Test data: six languages treated as zero-resource
    Spanish (ES), Hausa (HA), Croatian (HR), Swedish (SV), Turkish (TR), Mandarin (ZH)
  • Evaluation: same-different isolated word discrimination
  • Embeddings: M = 130 for all models
  • Baselines:
    — Downsampling: 10 equally spaced MFCC frames, flattened
    — Dynamic time warping (DTW) alignment cost between test segments
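Two small sketches of this setup, with assumptions noted in the comments: the downsampling baseline (10 equally spaced frames of 13-dimensional MFCCs flattened into a 130-dimensional vector) and a simplified same-different evaluation that scores every pair of test embeddings by cosine distance and reports average precision. The paper's exact protocol may differ in details.

```python
# Sketch only; a simplified version of the baseline and the evaluation metric.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import average_precision_score

def downsample_embedding(mfccs, n=10):
    # mfccs: (T, 13) array of frames; keep n equally spaced frames and flatten
    # them into one 130-dimensional vector (13 coefficients x 10 frames)
    idx = np.linspace(0, len(mfccs) - 1, n).astype(int)
    return mfccs[idx].flatten()

def same_different_ap(embeddings, word_labels):
    # embeddings: (N, M) array; word_labels: length-N sequence of word identities
    labels = np.asarray(word_labels)
    same = labels[:, None] == labels[None, :]       # True where a pair is the same word
    iu = np.triu_indices(len(labels), k=1)          # same pair ordering as pdist uses
    y_true = same[iu].astype(int)
    dists = pdist(embeddings, metric="cosine")      # condensed pairwise cosine distances
    return average_precision_score(y_true, -dists)  # smaller distance = higher score
```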

  11. 1. Is multilingual supervised > monolingual unsupervised?
  Test results on Spanish
  [Bar chart: average precision (%) for the DTW and downsampling baselines, the unsupervised CAE-RNN (UTD), and the multilingual CAE-RNN and ClassifierRNN]

  12. 1. Is multilingual supervised > monolingual unsupervised?
  Test results on Hausa
  [Bar chart: average precision (%) for the DTW and downsampling baselines, the unsupervised CAE-RNN (UTD), and the multilingual CAE-RNN and ClassifierRNN]

  13. 2. Does training on more languages help?
  Development results on Croatian
  [Bar chart: average precision (%) of the CAE-RNN and ClassifierRNN as the training set grows: HR (UTD), RU, RU+CS, RU+CS+FR, and the full multilingual set]

  14. 3. Is the choice of training language important?
  Average precision (%) for each training language (rows) on each evaluation language (columns):

  Training    ES     HA     HR     SV     TR     ZH
  CS          41.6   51.1   41.0   28.7   37.0   42.6
  FR          42.6   41.8   30.4   25.3   32.5   35.8
  PL          41.1   43.7   35.8   25.5   33.7   39.5
  PT          45.9   46.2   36.4   26.6   34.1   39.6
  RU          35.0   39.7   31.3   22.3   29.7   37.1
  TH          28.5   44.5   29.9   17.9   23.6   36.2

  15. Conclusions and future work
  Conclusions:
  • Proposed to train a supervised multilingual acoustic word embedding model on well-resourced languages and then apply it to zero-resource languages
  • The multilingual CAE-RNN and ClassifierRNN consistently outperform unsupervised models trained on the zero-resource languages

  16. Conclusions and future work
  Conclusions:
  • Proposed to train a supervised multilingual acoustic word embedding model on well-resourced languages and then apply it to zero-resource languages
  • The multilingual CAE-RNN and ClassifierRNN consistently outperform unsupervised models trained on the zero-resource languages
  Future work:
  • Different models, both for multilingual and unsupervised training
  • Analysis to understand the difference between the CAE-RNN and ClassifierRNN
  • Does language conditioning help during decoding?

  17. https://arxiv.org/abs/2002.02109 https://github.com/kamperh/globalphone_awe
