AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Kaizhi Qian*¹, Yang Zhang*²,³, Shiyu Chang²,³, Xuesong Yang¹, Mark Hasegawa-Johnson¹
¹ University of Illinois at Urbana-Champaign
² MIT-IBM Watson AI Lab
³ IBM Research Cambridge

Motivation
• Voice conversion aims to modify source speech so that it sounds as if it were uttered by another speaker.
• Existing voice style transfer techniques:
  - Use complex architectures and training schemes, but do not work well for speech
  - Only convert between seen speakers

AutoVC
[Figure: AutoVC architecture, two panels.
Training: the content encoder maps the input speech $X_1$ to the content code $C_1$; the speaker encoder maps $X_1$ to the speaker embedding $S_1$; the decoder combines $C_1$ and $S_1$ into the reconstruction $\hat{X}_{1\to1}$, which is passed through the content encoder again to give $\tilde{C}_{1\to1}$.
Conversion: the content encoder maps the source speech $X_1$ to $C_1$; the speaker encoder maps the target speaker's speech $X_2$ to $S_2$; the decoder combines $C_1$ and $S_2$ into the converted speech $\hat{X}_{1\to2}$.]
• Train only on self-reconstruction. Loss:
  $\mathbb{E}\left[ \lVert \hat{X}_{1\to1} - X_1 \rVert_2^2 + \lambda \lVert \tilde{C}_{1\to1} - C_1 \rVert_1 \right]$
• With a properly tuned bottleneck, the converted speech from AutoVC can match the distribution of the target speaker's real speech!
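The self-reconstruction loss on this slide has two terms: a squared L2 reconstruction error between the decoder output and the input speech features, and a weighted L1 error between the content code of the reconstruction and the content code of the input. A minimal NumPy sketch of that arithmetic follows. The function name, argument names, array shapes, and the default weight of 1.0 are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np

def self_reconstruction_loss(x, x_hat, c, c_rec, lam=1.0):
    """Toy version of the slide's loss (names and shapes are hypothetical).

    x, x_hat : (batch, frames, bins) input features and their reconstruction
    c, c_rec : (batch, code_dim) content codes of the input and of the reconstruction
    lam      : weight on the content term (placeholder value)
    """
    # Reconstruction term: squared L2 norm per utterance, averaged over the batch.
    recon = np.mean(np.sum((x_hat - x) ** 2, axis=(1, 2)))
    # Content term: L1 norm between content codes, averaged over the batch.
    content = np.mean(np.sum(np.abs(c_rec - c), axis=1))
    return recon + lam * content

# Tiny worked example with hand-checkable numbers.
x = np.zeros((2, 3, 4))
x_hat = np.ones((2, 3, 4))      # every entry off by 1 -> 12 per utterance
c = np.zeros((2, 5))
c_rec = np.full((2, 5), 2.0)    # every entry off by 2 -> 10 per utterance
loss = self_reconstruction_loss(x, x_hat, c, c_rec, lam=0.5)  # 12 + 0.5 * 10
```

Note that both terms compare a signal with itself after a round trip through the model, so no parallel recordings of different speakers saying the same sentence are needed to compute it.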

AutoVC
[Figure: conversion panel. The content encoder maps the source speech $X_1$ to $C_1$; the speaker encoder maps the target speaker's speech $X_2$ to $S_2$, where the target speaker is either seen or unseen during training; the decoder combines $C_1$ and $S_2$ into the converted speech $\hat{X}_{1\to2}$.]
• The speaker encoder is pretrained
• It can generalize to unseen speakers, which enables zero-shot conversion
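The conversion data flow described above can be sketched in a few lines of NumPy: content comes from the source utterance, the speaker embedding comes from any utterance of the target speaker, and the decoder combines the two. Everything in this sketch is a hypothetical stand-in. The three components are untrained random linear maps rather than AutoVC's actual networks, and the sizes, the stride-based bottleneck, and all names are assumptions chosen only to make the shapes concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: feature bins, content code width, embedding width, time stride.
N_BINS, CODE_DIM, SPK_DIM, STRIDE = 80, 8, 16, 4

# Untrained random weights standing in for the three trained networks.
W_content = rng.standard_normal((N_BINS, CODE_DIM))
W_speaker = rng.standard_normal((N_BINS, SPK_DIM))
W_decoder = rng.standard_normal((CODE_DIM + SPK_DIM, N_BINS))

def content_encoder(x):
    # Bottleneck: project to a narrow code and keep every STRIDE-th frame.
    return (x @ W_content)[::STRIDE]

def speaker_encoder(x):
    # One fixed-size, unit-norm embedding per utterance (mean over time).
    e = (x @ W_speaker).mean(axis=0)
    return e / np.linalg.norm(e)

def decoder(code, spk, n_frames):
    # Upsample the code to n_frames and attach the speaker embedding to every frame.
    up = np.repeat(code, STRIDE, axis=0)[:n_frames]
    spk_tiled = np.tile(spk, (n_frames, 1))
    return np.concatenate([up, spk_tiled], axis=1) @ W_decoder

def convert(source, target):
    # Content from the source, speaker identity from the target.
    return decoder(content_encoder(source), speaker_encoder(target), len(source))

source = rng.standard_normal((100, N_BINS))  # 100 frames from the source speaker
target = rng.standard_normal((60, N_BINS))   # any utterance from the target speaker
converted = convert(source, target)          # same length as the source
```

Because the target speaker enters only through an embedding computed from an utterance, `convert` never needs a speaker ID, which is what lets the same call work for a speaker that was not in the training set.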

Conversion Between Seen Speakers
[Audio samples: Source, Target, Converted]
[Figure: bar charts of MOS and similarity (vertical axis 0 to 4) for M2M, F2F, M2F, and F2M conversion (M = male, F = female), comparing AutoVC, AutoVC-one-hot, StarGAN [1], and Chou et al. [2]]
[1] StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
[2] Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

Conversion Between Unseen Speakers
• The first zero-shot voice conversion framework
[Audio samples: Source, Target, Converted]
[Figure: bar charts of MOS and similarity (vertical axis 0 to 4) for M2M, F2F, M2F, and F2M conversion under four conditions: seen to seen, seen to unseen, unseen to seen, and unseen to unseen]

Take Away
• An autoencoder is all you need to achieve theoretically ideal voice conversion
• AutoVC generalizes well to unseen speakers

Thank you! Poster #225
