Unsupervised Voice Conversion by Separately Embedding Speaker and Content Information with Deep Generative Model
Speaker: Ju-Chieh Chou (周儒杰)
Advisor: Lin-shan Lee (李琳山)
Outline
1. Introduction
   • Voice Conversion
   • Branches
   • Motivation
2. Proposed Approach
   • Multi-target Model
     • Model
     • Experiments
   • One-shot Model
     • Model
     • Experiments
3. Conclusion
Voice conversion
● Change the characteristics of an utterance while keeping the language content the same.
● Characteristics: accent, speaker identity, emotion, …
● This work focuses on speaker identity conversion.
(Figure: a model converts "How are you" spoken by Speaker 1 into "How are you" in Speaker A's voice.)
Conventional: supervised with parallel data
● Parallel data: the same sentences recorded as different signals by 2 speakers (e.g., "How are you", "Nice to meet you", "I am fine" from both Speaker 1 and Speaker A).
● Train a model to map from speaker 1 to speaker A.
● Problem: requires parallel data, which is hard to collect.
This work: unsupervised with non-parallel data
● Trained on a non-parallel corpus, which is more attainable: the two speakers do not have to speak the same sentences.
● Actively investigated.
● Prior work utilizes deep generative models, e.g., VAE, GAN, CycleGAN.
Voice Conversion Branches
(Diagram: branches of voice conversion arranged along a data-efficiency axis — parallel data vs. non-parallel data, and with vs. without transcription (phonemes); labels include single-target voice conversion (ch. 3), multi-target (ch. 4), one-shot (this work), and Yeh et al.)
Motivation
● Intuition: speech signals inherently carry both content and speaker information.
● Learn the content and speaker representations separately.
● Synthesize the target voice by combining the source content representation with the target speaker representation.
(Figure: an encoder maps the source utterance "How are you" to a content representation; a decoder combines it with the target speaker representation, instead of the source speaker representation, to produce "How are you" in the target voice.)
Outline
1. Introduction
   • Voice Conversion
   • Branches
   • Motivation
2. Proposed Approach
   • Multi-target Model
     • Model
     • Experiments
   • One-shot Model
     • Model
     • Experiments
3. Conclusion
Multi-target unsupervised with non-parallel data
● Prior single-target approaches need one model per conversion pair: 3 models (Model-A, Model-B, Model-C) just to convert Speaker 1 into Speakers A, B, and C, i.e., O(N²) models to cover all pairs among N speakers.
● The proposed multi-target approach: only one model is needed for all target speakers.
Stage 1: disentanglement between content and speaker representation
● Training: the encoder maps the utterance x to a content representation enc(x); the decoder reconstructs x from enc(x) and the speaker id (reconstruction loss).
● Goal of classifier-1: maximize the likelihood of the true speaker given enc(x).
● Goal of the encoder: minimize that likelihood, i.e., remove speaker information from enc(x).
● The encoder/decoder and classifier-1 are trained iteratively.
● Testing: feed the target speaker id to the decoder instead of the source speaker id.
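A minimal PyTorch sketch of the stage-1 training loop described above, not the thesis's exact implementation: `enc`, `dec`, `clf1`, `spk_emb`, the L1 reconstruction loss, and the weight `lam` are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical modules: enc (content encoder), dec (decoder), clf1 (speaker
# classifier-1), spk_emb (learnable speaker-embedding table).

def classifier1_step(enc, clf1, x, y, opt_clf):
    # Classifier-1: maximize the likelihood of the true speaker given enc(x).
    with torch.no_grad():
        z_c = enc(x)                              # content representation, encoder frozen here
    loss_clf = F.cross_entropy(clf1(z_c), y)
    opt_clf.zero_grad(); loss_clf.backward(); opt_clf.step()

def autoencoder_step(enc, dec, clf1, spk_emb, x, y, opt_ae, lam=0.01):
    # Encoder/decoder: reconstruct x from (enc(x), speaker embedding) while the
    # encoder minimizes the speaker likelihood under classifier-1 (adversarial term).
    z_c = enc(x)
    x_hat = dec(z_c, spk_emb(y))
    loss_rec = F.l1_loss(x_hat, x)
    loss_adv = -F.cross_entropy(clf1(z_c), y)     # push classifier-1 to be wrong
    loss = loss_rec + lam * loss_adv
    opt_ae.zero_grad(); loss.backward(); opt_ae.step()

# The two steps above are alternated ("trained iteratively").
```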
Problem of stage 1: training-testing mismatch
● Training: the decoder is always conditioned on the same speaker id as the input utterance (speaker y together with enc(x) of speaker y).
● Testing: the decoder is conditioned on a different speaker id y', a combination never seen during training.
Problem of stage 1: over-smoothed spectra
● Stage 1 alone can synthesize the target voice to some extent.
● However, the reconstruction loss encourages the model to generate the average value of the target, so the output lacks details.
● This leads to over-smoothed spectra and results in buzzy synthesized speech.
Stage 2: patch the output with a residual signal
● Randomly sample a speaker id as the condition.
● Train another generator to produce a residual signal (spectral details), making the output more natural.
(Figure: the stage-1 encoder/decoder are fixed; the generator takes the content enc(x) and the randomly sampled speaker id, and its residual signal is added to the decoder output.)
Stage 2: patch the output with a residual signal
● The discriminator learns to discriminate synthesized from real data.
● The generator tries to fool the discriminator.
Stage 2: patch the output with a residual signal
● Classifier-2 learns to identify the speaker.
● The generator also tries to make classifier-2 predict the correct speaker.
Stage 2: patch the output with a residual signal
● The generator and the discriminator/classifier-2 are trained iteratively.
● Testing: the decoder and the generator are conditioned on the target speaker id.
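A hedged PyTorch sketch of the stage-2 updates, assuming the stage-1 encoder/decoder are frozen; `gen`, `disc`, `clf2`, and the particular BCE/CE losses are illustrative choices, not necessarily the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hypothetical modules: enc / dec / spk_emb are the frozen stage-1 networks,
# gen is the residual generator, disc the real/fake discriminator,
# and clf2 the speaker classifier-2.

def generator_step(enc, dec, spk_emb, gen, disc, clf2, x, n_speakers, opt_g):
    y = torch.randint(n_speakers, (x.size(0),), device=x.device)  # random speaker id
    with torch.no_grad():                         # stage-1 encoder/decoder are fixed
        z_c = enc(x)
        coarse = dec(z_c, spk_emb(y))             # possibly over-smoothed output
    fake = coarse + gen(z_c, spk_emb(y))          # patch with the residual signal

    d_out = disc(fake)
    loss_g = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out)) \
           + F.cross_entropy(clf2(fake), y)       # fool disc, satisfy classifier-2
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return fake.detach()

def discriminator_step(disc, clf2, real_x, real_y, fake, opt_d):
    d_real, d_fake = disc(real_x), disc(fake)
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)) \
           + F.cross_entropy(clf2(real_x), real_y)  # classifier-2 learns on real data
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator and discriminator/classifier-2 steps are alternated during training.
```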
Experiments – spectrogram visualization
● Is stage 2 helpful?
● The sharpness of the spectrogram is improved by stage 2.
(Figure: spectrogram comparison.)
Experiments – subjective preference
● Subjects were asked to choose their preference in terms of naturalness and similarity.
● Stage 2 improved the results.
● Comparable to the baseline approach, CycleGAN-VC (Kaneko et al., EUSIPCO 2018) [1].
(Charts: "Is stage 2 helpful?" — options were "stage 1 + stage 2 is better", "stage 1 alone is better", and "indistinguishable"; "Comparison to baseline [1]" — options were "stage 1 + stage 2 is better", "CycleGAN-VC [1] is better", and "indistinguishable".)
Demo
Demo page: https://jjery2243542.github.io/voice_conversion_demo/
● Male to Female: source, target, and converted samples
● Female to Female: source, target, and converted samples
● Prof. Hung-yi Lee (male, never seen in training data) to Female: source, target, and converted samples
Outline
1. Introduction
   • Voice Conversion
   • Branches
   • Motivation
2. Proposed Approach
   • Multi-target Model
     • Model
     • Experiments
   • One-shot Model
     • Model
     • Experiments
3. Conclusion
One-shot unsupervised with non-parallel data
● Prior work: can only convert to speakers seen in training; the target speaker is specified by a one-hot speaker id, so the training data must include utterances of all target speakers.
● This work: source/target speakers can be unseen during training; the target speaker is specified by a single reference utterance (one-shot).
Idea
● Speaker information: invariant within an utterance.
● Content information: varying within an utterance.
Specially designed layers (for a feature map $M_c[t]$ with channel $c$ and time $t = 1, \dots, T$):
● Instance Normalization (IN) layer: normalizes the speaker information ($\mu_c$, $\sigma_c$) out while preserving the content information: $M'_c[t] = \frac{M_c[t] - \mu_c}{\sigma_c}$
● Average Pooling (AVG) layer: calculates the speaker information by pooling over time: $M'_c = \frac{1}{T}\sum_{t=1}^{T} M_c[t]$
● Adaptive Instance Normalization (AdaIN) layer: provides the speaker information ($\gamma_c$, $\beta_c$): $M'_c[t] = \gamma_c \frac{M_c[t] - \mu_c}{\sigma_c} + \beta_c$
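A minimal PyTorch sketch of the three layers above, assuming feature maps of shape (batch, channels, time); the layer names and the epsilon constant are illustrative.

```python
import torch
import torch.nn as nn

class InstanceNorm1d(nn.Module):
    """IN: remove the per-channel mean/std (speaker-like global statistics),
    keeping how the features change across time (content)."""
    def forward(self, x, eps=1e-5):                  # x: (B, C, T)
        mu = x.mean(dim=2, keepdim=True)
        sigma = x.std(dim=2, keepdim=True)
        return (x - mu) / (sigma + eps)

def average_pool(x):
    """AVG: collapse the time axis to get per-channel statistics,
    used as the speaker information."""
    return x.mean(dim=2)                             # (B, C)

class AdaIN(nn.Module):
    """AdaIN: normalize the content features, then re-inject the speaker
    information as a new scale (gamma) and shift (beta)."""
    def forward(self, x, gamma, beta, eps=1e-5):     # gamma, beta: (B, C)
        mu = x.mean(dim=2, keepdim=True)
        sigma = x.std(dim=2, keepdim=True)
        x_norm = (x - mu) / (sigma + eps)
        return gamma.unsqueeze(2) * x_norm + beta.unsqueeze(2)
```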
Intuition
● Normalize the global information out (e.g., high frequency), while retaining the changes across time.
Model – training
● Problem: how to factorize the representations?
● The speaker encoder E_s applies average pooling (AVG) to extract the speaker representation z_s from the utterance x.
● The content encoder E_c applies IN to extract the content representation z_c, normalizing the speaker information out.
● The decoder D combines z_c with z_s through AdaIN to reconstruct x.
Model – testing
● The speaker encoder E_s takes the target speaker's utterance and extracts z_s.
● The content encoder E_c (with IN) takes the source speaker's utterance and extracts z_c.
● The decoder D combines z_c with z_s through AdaIN to generate the converted utterance.
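A sketch of how these pieces could fit together, reusing the layer ideas above; Es, Ec, D, and the linear maps from the speaker code to the AdaIN parameters are hypothetical stand-ins for the actual networks.

```python
import torch
import torch.nn as nn

class OneShotVC(nn.Module):
    # Es: speaker encoder (with AVG pooling), Ec: content encoder (with IN),
    # D: decoder (with AdaIN inside). All three are assumed, not defined here.
    def __init__(self, Es, Ec, D, channels):
        super().__init__()
        self.Es, self.Ec, self.D = Es, Ec, D
        self.to_gamma = nn.Linear(channels, channels)  # speaker code -> AdaIN scale
        self.to_beta = nn.Linear(channels, channels)   # speaker code -> AdaIN shift

    def forward(self, src, tgt):
        z_s = self.Es(tgt)                  # speaker representation from one target utterance
        z_c = self.Ec(src)                  # content representation, speaker info removed by IN
        gamma, beta = self.to_gamma(z_s), self.to_beta(z_s)
        return self.D(z_c, gamma, beta)     # AdaIN inside D injects the speaker information

# Training: reconstruct the same utterance, model(x, x) ~ x (reconstruction loss).
# Testing:  convert, model(source_utterance, target_utterance).
```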
Experiments – effect of IN
● Train another speaker classifier to measure how much speaker information remains in the content representation z_c.
● The lower the accuracy, the less speaker information z_c contains.
● The content encoder with IN leaves less speaker information:
   E_c with IN: accuracy 0.375
   E_c without IN: accuracy 0.658
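A sketch of this probing setup, assuming a frozen content encoder and a simple classifier trained with cross-entropy; the data loaders and the time-average pooling before classification are assumptions.

```python
import torch
import torch.nn.functional as F

def probe_speaker_info(content_encoder, classifier, train_loader, test_loader,
                       optimizer, epochs=5):
    """Train a classifier to predict the speaker from frozen content codes;
    a lower test accuracy means less speaker information leaks into z_c."""
    content_encoder.eval()                        # content encoder stays frozen
    for _ in range(epochs):
        for x, speaker_id in train_loader:
            with torch.no_grad():
                z_c = content_encoder(x)          # (B, C, T) content representation
            logits = classifier(z_c.mean(dim=2))  # average over time, then classify
            loss = F.cross_entropy(logits, speaker_id)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    correct, total = 0, 0
    with torch.no_grad():
        for x, speaker_id in test_loader:
            pred = classifier(content_encoder(x).mean(dim=2)).argmax(dim=1)
            correct += (pred == speaker_id).sum().item()
            total += speaker_id.numel()
    return correct / total                        # lower accuracy = less speaker information
```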
Experiments – speaker embedding visualization
● Does the speaker encoder learn meaningful representations?
● One color represents one speaker's utterances.
● The speaker representations z_s extracted from different (unseen) speakers' utterances are well separated.
(Figure: visualization of z_s for unseen speakers' utterances.)
Experiments – subjective
● Subjects were asked to score the similarity between 2 utterances on a 4-point scale.
● Our model is able to generate voices similar to the target speaker's.