Unsupervised Voice Conversion by Separately Embedding Speaker and Content Information with Deep Generative Model
Speaker: Ju-Chieh Chou (周儒杰)
Advisor: Lin-shan Lee (李琳山)
Outline
1. Introduction
   • Voice Conversion
   • Branches
   • Motivation
2. Proposed Approach
   • Multi-target Model
     • Model
     • Experiments
   • One-shot Model
     • Model
     • Experiments
3. Conclusion
Voice conversion
● Change the characteristics of an utterance while keeping the language content the same.
● Characteristics: accent, speaker identity, emotion, …
● This work focuses on speaker identity conversion.
(Figure: a model converts "How are you" spoken by Speaker 1 into "How are you" in Speaker A's voice.)
Conventional: supervised with parallel data
● Parallel data: the same sentences recorded as different signals by 2 speakers (e.g., "How are you", "Nice to meet you", "I am fine" from both Speaker 1 and Speaker A).
● Train a model to map from speaker 1 to speaker A.
● Problem: requires parallel data, which is hard to collect.
This work: unsupervised with non-parallel data
● Trained on a non-parallel corpus, which is more attainable: the two speakers do not have to speak the same sentences.
● Actively investigated.
● Prior work utilizes deep generative models, e.g., VAE, GAN, CycleGAN.
Voice Conversion Branches
(Diagram: branches of voice conversion arranged along a data-efficiency axis — parallel data vs. non-parallel data, and with vs. without transcription (phonemes); labels include single-target voice conversion (ch. 3), multi-target (ch. 4), one-shot (this work), and Yeh et al.)
Motivation
● Intuition: speech signals inherently carry both content and speaker information.
● Learn the content and speaker representations separately.
● Synthesize the target voice by combining the source content representation with the target speaker representation.
(Figure: an encoder maps the source utterance "How are you" to a content representation; a decoder combines it with the target speaker representation, instead of the source speaker representation, to produce "How are you" in the target voice.)
Outline
1. Introduction
   • Voice Conversion
   • Branches
   • Motivation
2. Proposed Approach
   • Multi-target Model
     • Model
     • Experiments
   • One-shot Model
     • Model
     • Experiments
3. Conclusion
Multi-target unsupervised with non-parallel data
● Prior single-target approaches need one model per conversion pair: 3 models (Model-A, Model-B, Model-C) just to convert Speaker 1 into Speakers A, B, and C, i.e., O(N²) models to cover all pairs among N speakers.
● The proposed multi-target approach: only one model is needed for all target speakers.
Stage 1: disentanglement between content and speaker representation
● Training: the encoder maps the utterance x to a content representation enc(x); the decoder reconstructs x from enc(x) and the speaker id (reconstruction loss).
● Goal of classifier-1: maximize the likelihood of the true speaker given enc(x).
● Goal of the encoder: minimize that likelihood, i.e., remove speaker information from enc(x).
● The encoder/decoder and classifier-1 are trained iteratively.
● Testing: feed the target speaker id to the decoder instead of the source speaker id.
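A minimal PyTorch sketch of the stage-1 training loop described above, not the thesis's exact implementation: `enc`, `dec`, `clf1`, `spk_emb`, the L1 reconstruction loss, and the weight `lam` are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical modules: enc (content encoder), dec (decoder), clf1 (speaker
# classifier-1), spk_emb (learnable speaker-embedding table).

def classifier1_step(enc, clf1, x, y, opt_clf):
    # Classifier-1: maximize the likelihood of the true speaker given enc(x).
    with torch.no_grad():
        z_c = enc(x)                              # content representation, encoder frozen here
    loss_clf = F.cross_entropy(clf1(z_c), y)
    opt_clf.zero_grad(); loss_clf.backward(); opt_clf.step()

def autoencoder_step(enc, dec, clf1, spk_emb, x, y, opt_ae, lam=0.01):
    # Encoder/decoder: reconstruct x from (enc(x), speaker embedding) while the
    # encoder minimizes the speaker likelihood under classifier-1 (adversarial term).
    z_c = enc(x)
    x_hat = dec(z_c, spk_emb(y))
    loss_rec = F.l1_loss(x_hat, x)
    loss_adv = -F.cross_entropy(clf1(z_c), y)     # push classifier-1 to be wrong
    loss = loss_rec + lam * loss_adv
    opt_ae.zero_grad(); loss.backward(); opt_ae.step()

# The two steps above are alternated ("trained iteratively").
```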
Problem of stage 1: training-testing mismatch
● Training: the decoder is always conditioned on the same speaker id as the input utterance (speaker y together with enc(x) of speaker y).
● Testing: the decoder is conditioned on a different speaker id y', a combination never seen during training.
Problem of stage 1: over-smoothed spectra
● Stage 1 alone can synthesize the target voice to some extent.
● However, the reconstruction loss encourages the model to generate the average value of the target, so the output lacks details.
● This leads to over-smoothed spectra and results in buzzy synthesized speech.
Stage 2: patch the output with a residual signal
● Randomly sample a speaker id as the condition.
● Train another generator to produce a residual signal (spectral details), making the output more natural.
(Figure: the stage-1 encoder/decoder are fixed; the generator takes the content enc(x) and the randomly sampled speaker id, and its residual signal is added to the decoder output.)
Stage 2: patch the output with a residual signal
● The discriminator learns to discriminate synthesized from real data.
● The generator tries to fool the discriminator.
Stage 2: patch the output with a residual signal
● Classifier-2 learns to identify the speaker.
● The generator also tries to make classifier-2 predict the correct speaker.
Stage 2: patch the output with a residual signal
● The generator and the discriminator/classifier-2 are trained iteratively.
● Testing: the decoder and the generator are conditioned on the target speaker id.
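A hedged PyTorch sketch of the stage-2 updates, assuming the stage-1 encoder/decoder are frozen; `gen`, `disc`, `clf2`, and the particular BCE/CE losses are illustrative choices, not necessarily the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hypothetical modules: enc / dec / spk_emb are the frozen stage-1 networks,
# gen is the residual generator, disc the real/fake discriminator,
# and clf2 the speaker classifier-2.

def generator_step(enc, dec, spk_emb, gen, disc, clf2, x, n_speakers, opt_g):
    y = torch.randint(n_speakers, (x.size(0),), device=x.device)  # random speaker id
    with torch.no_grad():                         # stage-1 encoder/decoder are fixed
        z_c = enc(x)
        coarse = dec(z_c, spk_emb(y))             # possibly over-smoothed output
    fake = coarse + gen(z_c, spk_emb(y))          # patch with the residual signal

    d_out = disc(fake)
    loss_g = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out)) \
           + F.cross_entropy(clf2(fake), y)       # fool disc, satisfy classifier-2
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return fake.detach()

def discriminator_step(disc, clf2, real_x, real_y, fake, opt_d):
    d_real, d_fake = disc(real_x), disc(fake)
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)) \
           + F.cross_entropy(clf2(real_x), real_y)  # classifier-2 learns on real data
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator and discriminator/classifier-2 steps are alternated during training.
```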
Experiments – spectrogram visualization
● Is stage 2 helpful?
● The sharpness of the spectrogram is improved by stage 2.
(Figure: spectrogram comparison.)
Experiments – subjective preference
● Subjects were asked to choose their preference in terms of naturalness and similarity.
● Stage 2 improved the results.
● Comparable to the baseline approach, CycleGAN-VC (Kaneko et al., EUSIPCO 2018) [1].
(Charts: "Is stage 2 helpful?" — options were "stage 1 + stage 2 is better", "stage 1 alone is better", and "indistinguishable"; "Comparison to baseline [1]" — options were "stage 1 + stage 2 is better", "CycleGAN-VC [1] is better", and "indistinguishable".)
Demo
Demo page: https://jjery2243542.github.io/voice_conversion_demo/
● Male to Female: source, target, and converted samples
● Female to Female: source, target, and converted samples
● Prof. Hung-yi Lee (male, never seen in training data) to Female: source, target, and converted samples
Outline
1. Introduction
   • Voice Conversion
   • Branches
   • Motivation
2. Proposed Approach
   • Multi-target Model
     • Model
     • Experiments
   • One-shot Model
     • Model
     • Experiments
3. Conclusion
One-shot unsupervised with non-parallel data
● Prior work: can only convert to speakers seen in training; the target speaker is specified by a one-hot speaker id, so the training data must include utterances of all target speakers.
● This work: source/target speakers can be unseen during training; the target speaker is specified by a single reference utterance (one-shot).
Idea
● Speaker information: invariant within an utterance.
● Content information: varying within an utterance.
Specially designed layers (for a feature map $M_c[t]$ with channel $c$ and time $t = 1, \dots, T$):
● Instance Normalization (IN) layer: normalizes the speaker information ($\mu_c$, $\sigma_c$) out while preserving the content information: $M'_c[t] = \frac{M_c[t] - \mu_c}{\sigma_c}$
● Average Pooling (AVG) layer: calculates the speaker information by pooling over time: $M'_c = \frac{1}{T}\sum_{t=1}^{T} M_c[t]$
● Adaptive Instance Normalization (AdaIN) layer: provides the speaker information ($\gamma_c$, $\beta_c$): $M'_c[t] = \gamma_c \frac{M_c[t] - \mu_c}{\sigma_c} + \beta_c$
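A minimal PyTorch sketch of the three layers above, assuming feature maps of shape (batch, channels, time); the layer names and the epsilon constant are illustrative.

```python
import torch
import torch.nn as nn

class InstanceNorm1d(nn.Module):
    """IN: remove the per-channel mean/std (speaker-like global statistics),
    keeping how the features change across time (content)."""
    def forward(self, x, eps=1e-5):                  # x: (B, C, T)
        mu = x.mean(dim=2, keepdim=True)
        sigma = x.std(dim=2, keepdim=True)
        return (x - mu) / (sigma + eps)

def average_pool(x):
    """AVG: collapse the time axis to get per-channel statistics,
    used as the speaker information."""
    return x.mean(dim=2)                             # (B, C)

class AdaIN(nn.Module):
    """AdaIN: normalize the content features, then re-inject the speaker
    information as a new scale (gamma) and shift (beta)."""
    def forward(self, x, gamma, beta, eps=1e-5):     # gamma, beta: (B, C)
        mu = x.mean(dim=2, keepdim=True)
        sigma = x.std(dim=2, keepdim=True)
        x_norm = (x - mu) / (sigma + eps)
        return gamma.unsqueeze(2) * x_norm + beta.unsqueeze(2)
```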
Intuition
● Normalize the global information out (e.g., high frequency), while retaining the changes across time.
Model – training
● Problem: how to factorize the representations?
● The speaker encoder E_s applies average pooling (AVG) to extract the speaker representation z_s from the utterance x.
● The content encoder E_c applies IN to extract the content representation z_c, normalizing the speaker information out.
● The decoder D combines z_c with z_s through AdaIN to reconstruct x.
Model – testing
● The speaker encoder E_s takes the target speaker's utterance and extracts z_s.
● The content encoder E_c (with IN) takes the source speaker's utterance and extracts z_c.
● The decoder D combines z_c with z_s through AdaIN to generate the converted utterance.
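A sketch of how these pieces could fit together, reusing the layer ideas above; Es, Ec, D, and the linear maps from the speaker code to the AdaIN parameters are hypothetical stand-ins for the actual networks.

```python
import torch
import torch.nn as nn

class OneShotVC(nn.Module):
    # Es: speaker encoder (with AVG pooling), Ec: content encoder (with IN),
    # D: decoder (with AdaIN inside). All three are assumed, not defined here.
    def __init__(self, Es, Ec, D, channels):
        super().__init__()
        self.Es, self.Ec, self.D = Es, Ec, D
        self.to_gamma = nn.Linear(channels, channels)  # speaker code -> AdaIN scale
        self.to_beta = nn.Linear(channels, channels)   # speaker code -> AdaIN shift

    def forward(self, src, tgt):
        z_s = self.Es(tgt)                  # speaker representation from one target utterance
        z_c = self.Ec(src)                  # content representation, speaker info removed by IN
        gamma, beta = self.to_gamma(z_s), self.to_beta(z_s)
        return self.D(z_c, gamma, beta)     # AdaIN inside D injects the speaker information

# Training: reconstruct the same utterance, model(x, x) ~ x (reconstruction loss).
# Testing:  convert, model(source_utterance, target_utterance).
```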
Experiments – effect of IN
● Train another speaker classifier to measure how much speaker information remains in the content representation z_c.
● The lower the accuracy, the less speaker information z_c contains.
● The content encoder with IN leaves less speaker information:
   E_c with IN: accuracy 0.375
   E_c without IN: accuracy 0.658
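A sketch of this probing setup, assuming a frozen content encoder and a simple classifier trained with cross-entropy; the data loaders and the time-average pooling before classification are assumptions.

```python
import torch
import torch.nn.functional as F

def probe_speaker_info(content_encoder, classifier, train_loader, test_loader,
                       optimizer, epochs=5):
    """Train a classifier to predict the speaker from frozen content codes;
    a lower test accuracy means less speaker information leaks into z_c."""
    content_encoder.eval()                        # content encoder stays frozen
    for _ in range(epochs):
        for x, speaker_id in train_loader:
            with torch.no_grad():
                z_c = content_encoder(x)          # (B, C, T) content representation
            logits = classifier(z_c.mean(dim=2))  # average over time, then classify
            loss = F.cross_entropy(logits, speaker_id)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    correct, total = 0, 0
    with torch.no_grad():
        for x, speaker_id in test_loader:
            pred = classifier(content_encoder(x).mean(dim=2)).argmax(dim=1)
            correct += (pred == speaker_id).sum().item()
            total += speaker_id.numel()
    return correct / total                        # lower accuracy = less speaker information
```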
Experiments – speaker embedding visualization
● Does the speaker encoder learn meaningful representations?
● One color represents one speaker's utterances.
● The speaker representations z_s extracted from different (unseen) speakers' utterances are well separated.
(Figure: visualization of z_s for unseen speakers' utterances.)
Experiments – subjective
● Subjects were asked to score the similarity between 2 utterances on a 4-point scale.
● Our model is able to generate voices similar to the target speaker's.