
AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss. Kaizhi Qian* (1), Yang Zhang* (2, 3), Shiyu Chang (2, 3), Xuesong Yang (1), Mark Hasegawa-Johnson (1). (1) University of Illinois at Urbana-Champaign; (2) MIT-IBM Watson AI Lab; (3) IBM Research


  1. AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
     Kaizhi Qian* (1), Yang Zhang* (2, 3), Shiyu Chang (2, 3), Xuesong Yang (1), Mark Hasegawa-Johnson (1)
     (1) University of Illinois at Urbana-Champaign; (2) MIT-IBM Watson AI Lab; (3) IBM Research Cambridge

  2-3. Motivation
     • Voice conversion aims to modify source speech so that it sounds as if it were uttered by another speaker.
     • Existing voice style transfer techniques:
        - Use complex architectures and training schemes but do not work well for speech
        - Only convert between seen speakers

  4-13. AutoVC (one diagram, built up across slides 4 to 13)
     [Diagram, training: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_1$ to $S_1$; the decoder outputs the self-reconstruction $\hat{X}_{1\to1}$, which the content encoder maps to $\tilde{C}_{1\to1}$.]
     [Diagram, conversion: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_2$ to $S_2$; the decoder outputs $\hat{X}_{1\to2}$.]
     • Train only on self-reconstruction. Loss: $\mathbb{E}\big[\|\hat{X}_{1\to1} - X_1\|_2^2 + \lambda\,\|\tilde{C}_{1\to1} - C_1\|_1\big]$
     • With bottleneck tuning, AutoVC can match the distribution!

  14-15. AutoVC
     [Diagram, conversion: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_2$, from a seen or unseen speaker, to $S_2$; the decoder outputs $\hat{X}_{1\to2}$.]
     • Speaker encoder is pretrained
     • Can generalize to unseen speakers: zero-shot conversion

  16. Conversion Between Seen Speakers
     [Figure: Source, Target, and Converted examples; bar charts of MOS and Similarity (axis 0 to 4) for M2M, F2F, M2F, and F2M conversion, comparing AutoVC, AutoVC-one-hot, StarGAN [1], and Chou et al. [2]]
     [1] StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
     [2] Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

  17. Conversion Between Unseen Speakers
     • The first zero-shot voice conversion framework
     [Figure: Source, Target, and Converted examples; bar charts of MOS and Similarity (axis 0 to 4) for M2M, F2F, M2F, and F2M conversion, grouped as seen to seen, seen to unseen, unseen to seen, and unseen to unseen]

  18-19. Take Away
     • Autoencoder is all you need to achieve theoretically ideal voice conversion
     • AutoVC generalizes well to unseen speakers

  20. Thank you! Poster #225
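
The self-reconstruction objective on slides 8 to 13 and the conversion step on slides 14 and 15 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the AutoVC implementation: `toy_content_encoder`, `toy_speaker_encoder`, and `toy_decoder` are hypothetical stand-ins for the neural networks (the slides do not specify them), features are plain lists of floats rather than spectrograms, the expectation is replaced by a single utterance, and the default weight `lam=1.0` is an assumed value.

```python
def l2_squared(a, b):
    """Squared L2 distance between two equal-length feature lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def l1(a, b):
    """L1 distance between two equal-length feature lists."""
    return sum(abs(p - q) for p, q in zip(a, b))


def autovc_loss(x1, content_encoder, speaker_encoder, decoder, lam=1.0):
    """Single-utterance version of the loss: reconstruction error plus
    lam times the content-code error. Only x1 itself is needed; no
    parallel data and no second speaker are involved in training."""
    c1 = content_encoder(x1)            # content code of the source
    s1 = speaker_encoder(x1)            # speaker embedding of the SAME utterance
    x_hat = decoder(c1, s1)             # self-reconstruction
    c_tilde = content_encoder(x_hat)    # content code of the reconstruction
    return l2_squared(x_hat, x1) + lam * l1(c_tilde, c1)


def convert(x1, x2, content_encoder, speaker_encoder, decoder):
    """Conversion: content from x1, speaker embedding from x2."""
    return decoder(content_encoder(x1), speaker_encoder(x2))


# Hypothetical toy modules (assumption): an utterance is modeled as
# zero-mean "content" plus a constant per-speaker offset.
def toy_speaker_encoder(x):
    return sum(x) / len(x)              # the offset plays the role of the speaker embedding


def toy_content_encoder(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]        # removing the offset leaves the content


def toy_decoder(c, s):
    return [v + s for v in c]           # recombine content and speaker offset


x1 = [1.0, 2.0, 3.0]      # source utterance (speaker offset 2.0)
x2 = [10.0, 10.0, 13.0]   # target utterance (speaker offset 11.0)

print(autovc_loss(x1, toy_content_encoder, toy_speaker_encoder, toy_decoder))  # 0.0
print(convert(x1, x2, toy_content_encoder, toy_speaker_encoder, toy_decoder))  # [10.0, 11.0, 12.0]
```

In this toy setting the encoders separate content from speaker exactly, so the loss is zero and swapping in the target's speaker embedding moves the source content to the target's offset. In the real system that separation has to come from the trained networks and the tuned bottleneck.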
