  1. Improving Unsupervised Acoustic Word Embeddings using Speaker and Gender Information
     Lisa van Staden, Herman Kamper
     31 January 2020

  2. Zero-Resource Speech Processing
     Popular methods for speech processing rely on transcribed speech. Obtaining transcriptions is expensive and not always possible.


  3. Tasks in Zero-Resource Processing
     We don’t always need to predict text labels:
     • Query-by-Example Search: search speech using speech.
     • Unsupervised Term Discovery: discover repeating patterns in speech.


  4. Speech Segment Comparison
     These tasks require comparing speech segments of different durations. The conventional method is Dynamic Time Warping (DTW):
     • Computationally expensive; see the sketch below.

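To make the cost concrete, here is a minimal NumPy sketch of DTW between two segments, with frame-level cosine cost as an illustrative choice; the dynamic programme is O(T1 * T2) per pair, which is what makes comparing many segments expensive:

```python
import numpy as np

def dtw_distance(a, b):
    """Align segments a (T1, d) and b (T2, d) with dynamic time warping.
    The dynamic programme costs O(T1 * T2) per pair of segments."""
    T1, T2 = len(a), len(b)
    # Frame-level cosine distances between all pairs of frames.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    cost = 1.0 - a_n @ b_n.T
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[T1, T2]

x = np.random.randn(60, 13)   # e.g. 60 frames of 13-dimensional MFCCs
y = np.random.randn(75, 13)
print(dtw_distance(x, y))
```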

  5. Acoustic Word Embeddings
     We want to map variable-length speech segments to fixed-dimensional vector representations without using labels, so that segments can be compared cheaply (a comparison sketch follows).

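By contrast with DTW, comparing two fixed-dimensional embeddings is a single O(d) vector operation; a small sketch, with cosine distance as the usual choice in this line of work:

```python
import numpy as np

def cosine_distance(u, v):
    """O(d) comparison of two fixed-dimensional embeddings."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

e1 = np.random.randn(130)     # embeddings of two speech segments
e2 = np.random.randn(130)
print(cosine_distance(e1, e2))
```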

  6. Speaker and Gender Information
     Acoustic properties of speech from different speakers and genders differ. We want embeddings that are robust to this variation.
     [Illustration: tokens of “cat”, “pan”, “pun” and “bat” from Speaker A and Speaker B, labelled female/male.]

  7. RNN (Correspondence) Autoencoder
     [Diagram: a GRU encoder reads input frames x_1, x_2, …, x_T and produces a fixed-dimensional embedding; a GRU decoder then generates output frames x'_1, …, x'_T (autoencoder) or y'_1, …, y'_T (correspondence autoencoder).]
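A minimal PyTorch sketch of such an encoder-decoder; the layer sizes, MFCC input dimension and embedding dimension are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class RNNAutoencoder(nn.Module):
    """GRU encoder-decoder: trained to reconstruct the input x (AE), or,
    given target frames y from a matched segment, as a correspondence AE."""

    def __init__(self, input_dim=13, hidden_dim=256, embed_dim=130):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.to_embed = nn.Linear(hidden_dim, embed_dim)   # embedding layer
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_frame = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):                  # x: (batch, T, input_dim)
        _, h = self.encoder(x)             # h: (1, batch, hidden_dim)
        z = self.to_embed(h[-1])           # z: (batch, embed_dim)
        # Feed the embedding to the decoder at every time step.
        z_rep = z.unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.decoder(z_rep)
        return self.to_frame(out), z       # reconstruction and embedding

model = RNNAutoencoder()
x = torch.randn(8, 50, 13)                 # 8 segments, 50 frames, 13 MFCCs
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)    # AE loss; a CAE targets y instead
```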

  8. Speaker/Gender Conditioning
     [Diagram: the same encoder-decoder, but a speaker/gender vector is supplied as an additional input, so the embedding itself need not capture that information.]
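One way to realise this conditioning is to feed a learned speaker (or gender) vector to the decoder at every step; a minimal sketch, with the lookup table and all sizes as assumptions:

```python
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    """Decoder that receives a speaker/gender vector at every step,
    leaving the embedding free to drop speaker information."""

    def __init__(self, embed_dim=130, cond_dim=16, hidden_dim=256,
                 output_dim=13, n_speakers=100):
        super().__init__()
        self.cond = nn.Embedding(n_speakers, cond_dim)  # speaker lookup
        self.decoder = nn.GRU(embed_dim + cond_dim, hidden_dim,
                              batch_first=True)
        self.to_frame = nn.Linear(hidden_dim, output_dim)

    def forward(self, z, speaker_id, T):
        # Concatenate embedding and speaker vector, repeat over time.
        c = self.cond(speaker_id)                       # (batch, cond_dim)
        zc = torch.cat([z, c], dim=-1).unsqueeze(1).expand(-1, T, -1)
        out, _ = self.decoder(zc)
        return self.to_frame(out)
```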

  9. Adversarial Training
     [Diagram: training alternates between two turns. Turn A: X → Encoder → embedding → Decoder → X'/Y'. Turn B: embedding → Classifier → prediction p of the speaker/gender.]
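A schematic sketch of such alternating training, reusing the RNNAutoencoder above and the classifier sketched after the next slide; the toy loader, optimiser settings and the weight `lam` on the adversarial term are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Assumes RNNAutoencoder and make_classifier from the neighbouring sketches.
model = RNNAutoencoder()
clf = make_classifier(n_classes=5)
opt_model = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_clf = torch.optim.Adam(clf.parameters(), lr=1e-3)
lam = 1.0   # assumed weight on the adversarial term

# Toy batch: 8 segments of 50 frames, 5 possible speakers.
loader = [(torch.randn(8, 50, 13), torch.randn(8, 50, 13),
           torch.randint(0, 5, (8,)))]

for x, y, speaker in loader:
    # Turn B: update the classifier to predict the speaker from z.
    opt_clf.zero_grad()
    with torch.no_grad():
        _, z = model(x)
    F.cross_entropy(clf(z), speaker).backward()
    opt_clf.step()

    # Turn A: update the encoder-decoder to reconstruct the target
    # while *fooling* the classifier (its loss is subtracted).
    opt_model.zero_grad()
    x_hat, z = model(x)
    loss = F.mse_loss(x_hat, y) - lam * F.cross_entropy(clf(z), speaker)
    loss.backward()
    opt_model.step()
```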

  10. Speaker/Gender Classifier
      [Diagram: the classifier maps an embedding z through Linear, ReLU, Dropout and Softmax layers to a prediction p.]
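A minimal PyTorch version of this classifier; the hidden size, dropout rate and class count are assumptions, and the diagram's Softmax is folded into the cross-entropy loss used during training, which expects logits:

```python
import torch.nn as nn

def make_classifier(embed_dim=130, hidden_dim=256, n_classes=100,
                    p_drop=0.5):
    """Embedding z -> Linear -> ReLU -> Dropout -> class scores p."""
    return nn.Sequential(
        nn.Linear(embed_dim, hidden_dim),   # Linear
        nn.ReLU(),                          # ReLU
        nn.Dropout(p_drop),                 # Dropout
        nn.Linear(hidden_dim, n_classes),   # scores; Softmax is in the loss
    )
```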

  11. Evaluating Quality of AWEs
      Use the same-different task to evaluate AWEs:
      • Decide whether two segments are the same word by thresholding the distance between their AWEs.
      • Calculate the area under the precision-recall curve (average precision); a sketch of this computation follows.
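A minimal sketch of this evaluation using scikit-learn, assuming embeddings and word labels are given; cosine distance as the similarity measure is an assumption:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import average_precision_score

def same_different_ap(embeddings, labels):
    """Average precision on the same-different task: rank all segment
    pairs by embedding distance and score against same-word labels."""
    dists = pdist(embeddings, metric="cosine")   # condensed pairwise dists
    n = len(labels)
    same = np.array([labels[i] == labels[j]      # same order as pdist
                     for i in range(n) for j in range(i + 1, n)])
    # Smaller distance should mean "same word", so negate for ranking.
    return average_precision_score(same, -dists)

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 130))                 # toy embeddings
words = rng.choice(["cat", "pan", "bat"], size=20)
print(same_different_ap(emb, words))
```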

  12. Results
      [Bar chart: average precision (%) on English and Xitsonga for six model types: AE-Baseline, AE-Top-1, AE-Top-2, CAE-Baseline, CAE-Top-1 and CAE-Top-2.]

  13. Evaluate Speaker and Gender Predictability
      Analyse whether the speaker and gender information in the embeddings has decreased:
      • Train the speaker/gender classifier model on the embeddings.
      • Evaluate its accuracy (a probing sketch follows).
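As a stand-in for that probe, a minimal sketch with scikit-learn; logistic regression replaces the speaker/gender classifier purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def predictability(embeddings, speaker_ids):
    """Train a probe on the embeddings and report held-out accuracy:
    high accuracy means speaker information is still present."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, speaker_ids, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 130))          # toy embeddings
spk = rng.integers(0, 5, size=200)         # toy speaker labels
print(predictability(emb, spk))
```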

  14. Average Precision vs Speaker/Gender Predictability
      [Scatter plots: average precision against speaker predictability (top row) and gender predictability (bottom row) for the AE (left) and CAE (right) models.]

  15. Conclusions
      • On English data, incorporating speaker information gives only a marginal improvement.
      • The best Xitsonga model shows a 22% improvement.
      • It’s difficult to remove speaker and gender information from the embeddings.
      • Future work ...

