AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
Kaizhi Qian*¹, Yang Zhang*²³, Shiyu Chang²³, Xuesong Yang¹, Mark Hasegawa-Johnson¹
¹ University of Illinois at Urbana-Champaign  ² MIT-IBM Watson AI Lab  ³ IBM Research Cambridge
Motivation
• Voice conversion aims to modify source speech so that it sounds as if uttered by another speaker.
• Existing voice style transfer techniques:
  - use complex architectures and training schemes, yet still do not work well for speech
  - can only convert between speakers seen during training
AutoVC

[Architecture diagram. Training: the content encoder maps source speech Y₁ to a content code D₁, and the speaker encoder maps Y₁ to a speaker embedding T₁; the decoder combines D₁ and T₁ into the reconstruction Ŷ₁→₁, which is re-encoded into D̃₁→₁. Conversion: the decoder instead combines D₁ with the target speaker embedding T₂, extracted from target speech Y₂, to produce the converted speech Ŷ₁→₂.]

• Train only on the self-reconstruction loss:
  𝔼‖Ŷ₁→₁ − Y₁‖² + μ 𝔼‖D̃₁→₁ − D₁‖
• With bottleneck tuning, AutoVC can match the distribution!
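The self-reconstruction objective on the slide can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the authors' implementation: the module names (`content_enc`, `speaker_enc`, `decoder`) and the choice of L1 for the content term are assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def autovc_loss(content_enc, speaker_enc, decoder, y1, mu=1.0):
    """Self-reconstruction loss sketch: E||Y_hat_{1->1} - Y_1||^2 + mu * E||D_tilde_{1->1} - D_1||.

    y1 is a batch of source-speaker features (e.g. mel-spectrogram frames).
    All module names are hypothetical placeholders for the three networks.
    """
    d1 = content_enc(y1)            # content code D_1 (passed through the bottleneck)
    t1 = speaker_enc(y1)            # speaker embedding T_1 (pretrained, typically frozen)
    y_hat = decoder(d1, t1)         # reconstruction Y_hat_{1->1}
    d_hat = content_enc(y_hat)      # re-encoded content code D_tilde_{1->1}

    recon_term = F.mse_loss(y_hat, y1)     # reconstruction term
    content_term = F.l1_loss(d_hat, d1)    # content-code consistency term
    return recon_term + mu * content_term
```

Note that no conversion pairs appear anywhere in the loss; only the source speaker's own speech is used, which is the point of the slide.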
AutoVC

[Architecture diagram, conversion: the content encoder extracts D₁ from source speech Y₁; the speaker encoder extracts T₂ from the target speaker's speech Y₂, where the target speaker may be seen OR unseen; the decoder combines D₁ and T₂ into the converted speech Ŷ₁→₂.]

• The speaker encoder is pretrained
• Generalizes to unseen speakers: zero-shot conversion
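Conversion is just the trained autoencoder with the speaker embedding swapped: content from the source utterance, identity from the target utterance. A minimal sketch, with the same hypothetical module names as above:

```python
def convert(content_enc, speaker_enc, decoder, y_src, y_tgt):
    """Zero-shot conversion sketch: Y_hat_{1->2} = decoder(D_1, T_2).

    Module names are illustrative placeholders, not the paper's code.
    """
    d_src = content_enc(y_src)   # content code D_1 of the source utterance
    t_tgt = speaker_enc(y_tgt)   # speaker embedding T_2 of the target speaker,
                                 # who may be unseen during training
    return decoder(d_src, t_tgt) # converted features Y_hat_{1->2}
```

Because the speaker encoder is pretrained on a large speaker-verification corpus, T₂ can be computed for a speaker never seen by the decoder, which is what makes the conversion zero-shot.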
Conversion Between Seen Speakers

[Bar charts: MOS and similarity scores for M2M, F2F, M2F, and F2M conversion, comparing AutoVC, AutoVC-one-hot, StarGAN¹, and Chou et al.²; audio samples: source, converted, target]

¹ StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
² Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
Conversion Between Unseen Speakers
• The first zero-shot voice conversion framework

[Bar charts: MOS and similarity scores for M2M, F2F, M2F, and F2M conversion across the seen-to-seen, seen-to-unseen, unseen-to-seen, and unseen-to-unseen settings; audio samples: source, converted, target]
Take Away
• An autoencoder is all you need to achieve theoretically ideal voice conversion
• AutoVC generalizes well to unseen speakers
Thank you! Poster #225