Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, Lin-shan Lee
Speech Processing Laboratory, National Taiwan University
Nominated for the best student paper award at Interspeech 2018.
Outline
● Introduction
○ Conventional: supervised with paired data
○ This work: unsupervised with non-parallel data
○ This work: multi-target with non-parallel data
● Multi-target scenario (our contribution)
○ Model
○ Experiments
Voice conversion
● Change the characteristics of an utterance while keeping the linguistic content the same.
● Characteristics: accent, speaker identity, emotion, ...
● This work: focus on speaker identity conversion.
[Diagram: "How are you" from Speaker A → model → "How are you" in Speaker 1's voice.]
Conventional: supervised with paired data
● Same sentences, different signals from two speakers.
● Problem: requires paired data, which is hard to collect.
[Diagram: paired utterances — "How are you", "Nice to meet you", "I am fine" — spoken by both Speaker A and Speaker 1.]
This work: unsupervised with non-parallel data
● Trained on a non-parallel corpus, which is more attainable.
● Actively investigated in recent work.
● Prior work: utilize deep generative models, e.g. VAE, GAN, CycleGAN [1].
[Diagram: corpora from Speaker 1 and Speaker A — the speakers do not have to speak the same sentences.]
[1] CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks. Kaneko et al. EUSIPCO 2018.
This work: multi-target, unsupervised, with non-parallel data
● Single-target approaches need one model per source–target pair: 3 models for 3 target speakers, i.e. O(N²) models for N speakers.
● This work: only one model is needed for all target speakers.
[Diagram: Speaker 1 converted to Speakers A, B, C requires Model-A, Model-B, Model-C, vs. a single model covering all targets.]
Outline
● Introduction
○ Conventional: supervised with paired data
○ This work: unsupervised with non-parallel data
○ This work: multi-target with non-parallel data
● Multi-target scenario (our contribution)
○ Model
○ Experiments
Multi-target Scenario (main contribution)
● Intuition: speech signals inherently carry both phonetic and speaker information.
● Learn the phonetic and speaker representations separately.
● Synthesize the target voice by combining the source phonetic representation with the target speaker representation.
[Diagram: the encoder extracts the phonetic representation ("How are you") from the source utterance; the decoder combines it with the target speaker representation to produce "How are you" in the target voice.]
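A minimal sketch of this idea in PyTorch (not the paper's exact architecture; the layer types, sizes, and names such as `n_bins`, `d_phonetic`, and `d_spk` are illustrative): the encoder maps a spectrogram to a phonetic representation, and the decoder conditions on a learned per-speaker embedding, so conversion is just "source content + target embedding".

```python
# Illustrative sketch, not the authors' exact model.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_bins=513, d_phonetic=128):
        super().__init__()
        self.conv = nn.Conv1d(n_bins, d_phonetic, kernel_size=5, padding=2)
        self.rnn = nn.GRU(d_phonetic, d_phonetic, batch_first=True)

    def forward(self, spec):                  # spec: (batch, n_bins, frames)
        h = torch.relu(self.conv(spec))       # (batch, d_phonetic, frames)
        out, _ = self.rnn(h.transpose(1, 2))  # (batch, frames, d_phonetic)
        return out                            # phonetic representation enc(x)

class Decoder(nn.Module):
    def __init__(self, n_speakers, n_bins=513, d_phonetic=128, d_spk=64):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_spk)    # speaker representation
        self.rnn = nn.GRU(d_phonetic + d_spk, 256, batch_first=True)
        self.out = nn.Linear(256, n_bins)

    def forward(self, enc_x, speaker_id):
        spk = self.spk_emb(speaker_id)                     # (batch, d_spk)
        spk = spk.unsqueeze(1).expand(-1, enc_x.size(1), -1)
        h, _ = self.rnn(torch.cat([enc_x, spk], dim=-1))
        return self.out(h).transpose(1, 2)                 # (batch, n_bins, frames)

# Conversion: source phonetic content + target speaker embedding.
enc, dec = Encoder(), Decoder(n_speakers=20)
src_spec = torch.randn(1, 513, 120)                        # dummy source utterance
converted = dec(enc(src_spec), torch.tensor([3]))          # in speaker 3's voice
```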
Stage 1: disentanglement between phonetic and speaker representations
● Goal of classifier-1: maximize the likelihood of the true speaker given enc(x), i.e. identify the speaker from the phonetic representation.
[Diagram (training): encoder → phonetic representation enc(x) → decoder conditioned on the speaker representation, trained with a reconstruction loss; classifier-1 tries to identify the speaker from enc(x), so the encoder can learn to remove speaker information.]
Stage 1: disentanglement between phonetic and speaker representations
● Goal of classifier-1: maximize the likelihood of the true speaker given enc(x).
● Goal of encoder: minimize that likelihood, removing speaker information from enc(x).
● Classifier-1 and the encoder/decoder are trained iteratively (adversarially).
[Diagram (testing): the same encoder/decoder, but the decoder is fed the target speaker representation instead of the source speaker's.]
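A sketch of one stage-1 iteration under the assumptions above (the optimizers, `lambda_adv`, and a `classifier1` that pools over time and returns one logit vector per utterance are all illustrative): first the classifier is updated to predict the speaker from enc(x), then the encoder/decoder are updated with the reconstruction loss plus the negated classifier likelihood, so the encoder learns to hide speaker identity.

```python
# Stage-1 sketch: alternate classifier and encoder/decoder updates.
import torch
import torch.nn.functional as F

def stage1_step(enc, dec, classifier1, opt_ae, opt_clf, spec, speaker_id,
                lambda_adv=0.01):
    # (1) Classifier-1: maximize likelihood of the true speaker given enc(x).
    with torch.no_grad():
        code = enc(spec)
    clf_loss = F.cross_entropy(classifier1(code), speaker_id)
    opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()

    # (2) Encoder/decoder: reconstruct the input, while the encoder
    #     *minimizes* the classifier's likelihood (negated cross-entropy).
    code = enc(spec)
    recon = dec(code, speaker_id)
    recon_loss = F.l1_loss(recon, spec)
    adv_loss = -F.cross_entropy(classifier1(code), speaker_id)
    ae_loss = recon_loss + lambda_adv * adv_loss
    opt_ae.zero_grad(); ae_loss.backward(); opt_ae.step()
    return recon_loss.item(), clf_loss.item()
```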
Problem of stage 1: over-smoothed spectra
● Stage 1 alone can synthesize the target voice to some extent.
● But the reconstruction loss encourages the model to generate the average of the plausible targets, which leads to over-smoothed spectra and buzzy synthesized speech.
[Diagram (training): encoder → enc(x) → decoder conditioned on speaker y, with reconstruction loss; classifier-1 predicts speaker y; the decoder output may be over-smoothed.]
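To make the averaging argument concrete (a standard derivation, not taken from the slides): with a pointwise L2 reconstruction loss, the loss-minimizing prediction for each time-frequency bin is the conditional mean of all spectra consistent with the input,

```latex
\hat{y}^{*}(x) \;=\; \arg\min_{\hat{y}}\; \mathbb{E}_{y \mid x}\!\left[\lVert y - \hat{y} \rVert_2^2\right] \;=\; \mathbb{E}\left[\, y \mid x \,\right],
```

so whenever several sharp spectra are all plausible for the same phonetic content and speaker, the decoder outputs their average, an over-smoothed spectrum.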
Stage 2: patch the output with a residual signal
● Train another generator to produce a residual signal that is added to the stage-1 output, making it more natural.
[Diagram (training): the encoder and decoder from stage 1 are fixed; the generator takes the phonetic representation enc(x) and the speaker representation and outputs a residual signal.]
Stage 2: patch the output with a residual signal
● The discriminator learns to tell whether data is real or synthesized.
● The generator learns to fool the discriminator.
[Diagram: the patched output (decoder output + residual) and real data are both fed to the discriminator, which predicts real or generated.]
Stage 2: patch the output with a residual signal
● Classifier-2 learns to identify the speaker.
● The generator also tries to make classifier-2 predict the correct (target) speaker.
[Diagram: classifier-2 is added next to the discriminator; both see real data and the patched output.]
Stage 2: patch the output with a residual signal
● The generator and the discriminator/classifier-2 are trained iteratively.
[Diagram (testing): the target speaker representation is fed to both the fixed decoder and the generator to produce the patched, converted output.]
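A sketch of one stage-2 update under the same illustrative names (the stage-1 encoder/decoder are frozen; a hypothetical `generator(code, speaker_id)` outputs a residual, `discriminator` returns a real/fake logit, `classifier2` returns speaker logits, and `opt_d` is assumed to hold both discriminator and classifier-2 parameters; loss weights are omitted for brevity):

```python
# Stage-2 sketch: patch the decoder output with a generated residual.
import torch
import torch.nn.functional as F

def stage2_step(enc, dec, generator, discriminator, classifier2,
                opt_g, opt_d, spec, speaker_id, real_spec, real_speaker_id):
    with torch.no_grad():                       # stage-1 modules are fixed
        code = enc(spec)
        coarse = dec(code, speaker_id)          # possibly over-smoothed output

    # (1) Discriminator: real vs. generated; classifier-2: speaker of real data.
    patched = (coarse + generator(code, speaker_id)).detach()
    real_logit, fake_logit = discriminator(real_spec), discriminator(patched)
    d_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
              + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))
    c_loss = F.cross_entropy(classifier2(real_spec), real_speaker_id)
    opt_d.zero_grad(); (d_loss + c_loss).backward(); opt_d.step()

    # (2) Generator: fool the discriminator *and* make classifier-2
    #     predict the intended (target) speaker of the patched output.
    patched = coarse + generator(code, speaker_id)
    fake_logit = discriminator(patched)
    g_adv = F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    g_cls = F.cross_entropy(classifier2(patched), speaker_id)
    opt_g.zero_grad(); (g_adv + g_cls).backward(); opt_g.step()
```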
Experiments - setting
● Feature: Short-time Fourier Transform (STFT) spectrograms.
● Corpus: 20 speakers from the CSTR VCTK corpus (designed for TTS); 90% training, 10% testing.
● Vocoder: Griffin-Lim (a non-parametric phase-estimation method).
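For reference, a minimal feature-extraction and waveform-reconstruction pipeline matching this setting, using librosa (the sample rate, FFT size, hop length, and number of Griffin-Lim iterations here are assumptions, not values from the paper):

```python
# STFT magnitude features in, Griffin-Lim phase reconstruction out.
import numpy as np
import librosa

def wav_to_spectrogram(path, n_fft=1024, hop_length=256):
    y, sr = librosa.load(path, sr=16000)                   # assumed sample rate
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    return np.log1p(mag), sr                               # log magnitude as model input

def spectrogram_to_wav(log_mag, n_fft=1024, hop_length=256, n_iter=100):
    mag = np.expm1(log_mag)                                # undo the log compression
    # Griffin-Lim iteratively estimates a phase consistent with the magnitude.
    return librosa.griffinlim(mag, n_iter=n_iter,
                              n_fft=n_fft, hop_length=hop_length)
```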
Experiments – spectrogram visualization
● Is stage 2 helpful?
● The sharpness of the spectrogram is improved by stage 2.
Experiments – subjective preference
● Users were asked which output they prefer in terms of naturalness and similarity.
● Stage 2 improves over stage 1 alone.
● Comparable to the baseline approach [1].
[Charts: preference between "stage 1 + stage 2" and "stage 1 alone" ("stage 1 + stage 2" preferred), and between "stage 1 + stage 2" and CycleGAN-VC [1] (largely indistinguishable).]
[1] CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks. Kaneko et al. EUSIPCO 2018.
Demo
● Male to female: source / target / converted samples.
● Female to female: source / target / converted samples.
● Advisor (male, never seen in training data) to female: source / target / converted samples.
● Audio samples: https://jjery2243542.github.io/voice_conversion_demo/
Conclusion
● A multi-target, unsupervised approach for voice conversion is proposed.
● Stage 1: disentanglement between phonetic and speaker representations.
● Stage 2: patch the output with a residual signal to generate more natural speech.
Thanks for listening
Experiments – sharpness evaluation
● Speech signals have a diverse distribution, so natural (sharp) spectra should have high variance.
● The model with stage 2 training has the highest variance.
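The sharpness metric here is essentially a variance measure over the spectrogram; a sketch of how such a number could be computed (per-frequency-bin variance of the log magnitude over time, averaged over bins; this is my reading of the slide, not code from the paper):

```python
# Sharpness proxy: variance of the log-magnitude spectrogram across time.
import numpy as np

def spectral_variance(log_mag):
    """log_mag: (freq_bins, frames) log-magnitude spectrogram."""
    per_bin_var = log_mag.var(axis=1)   # variance over time, per frequency bin
    return per_bin_var.mean()           # average over frequency bins

# Over-smoothed output -> low variance; real speech / stage-2 output -> higher.
```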
Network architecture
● CNN + DNN + RNN.
● A recurrent layer generates variable-length output.
● Dropout after each layer provides noise for GAN training.
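A sketch of the kind of stack described here (convolutional front-end, fully connected layer, recurrent layer for variable-length output, dropout after each layer; sizes and the class name are illustrative). The design note is that the dropout layers double as the stochastic noise source during GAN training, instead of an explicit noise vector.

```python
import torch
import torch.nn as nn

class ConvRNNStack(nn.Module):
    """Illustrative CNN + DNN + RNN stack with dropout after each layer."""
    def __init__(self, d_in=128, d_hidden=256, d_out=513, p=0.5):
        super().__init__()
        self.conv = nn.Conv1d(d_in, d_hidden, kernel_size=5, padding=2)
        self.fc = nn.Linear(d_hidden, d_hidden)
        self.rnn = nn.GRU(d_hidden, d_hidden, batch_first=True)  # variable length
        self.out = nn.Linear(d_hidden, d_out)
        self.drop = nn.Dropout(p)          # dropout doubles as the GAN noise source

    def forward(self, code):                              # code: (batch, frames, d_in)
        h = self.drop(torch.relu(self.conv(code.transpose(1, 2))))
        h = self.drop(torch.relu(self.fc(h.transpose(1, 2))))
        h, _ = self.rnn(h)
        return self.out(self.drop(h))                     # (batch, frames, d_out)
```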
Problem - training-testing mismatch
● Training: the decoder always reconstructs with the same speaker y as the input utterance, while classifier-1 predicts speaker y from enc(x).
● Testing: the decoder is given a different speaker y', a combination of enc(x) and speaker representation never seen during training.