S2S ASR Advanced Issues

  1. S2S ASR Advanced Issues
     - Tight coupling
       - ASR should output N-best
       - Translate all (lattice)
       - Choose best translation
       - (MT as a LM for ASR)
     - Remove disfluencies/hesitations
     - Add more relevant data
       - Automatically convert past tense/third person data to present tense/first+second person …
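
The tight-coupling idea on this slide (translate every N-best hypothesis, then choose the best translation) can be sketched as a simple rescoring step. This is a minimal sketch, not the method used in the course: the `Hypothesis` class, the `pick_best` function, the `mt_weight` parameter and all scores and strings below are hypothetical, and it assumes both the ASR and the MT system expose log-probability scores that can be added.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    source_text: str    # one ASR N-best entry (hypothetical example data)
    asr_logprob: float  # ASR log score for that entry
    translation: str    # MT output for that entry
    mt_logprob: float   # MT log score for that translation


def pick_best(nbest, mt_weight=1.0):
    """Return the entry with the highest combined ASR + weighted MT score.

    Assumption: scores are log-probabilities, so they combine by addition.
    """
    return max(nbest, key=lambda h: h.asr_logprob + mt_weight * h.mt_logprob)


# Toy N-best list: the ASR prefers the first entry, but the MT system
# scores its translation poorly, so the combined score picks the second.
nbest = [
    Hypothesis("where is the bus top", -4.0, "translation A", -9.0),
    Hypothesis("where is the bus stop", -4.5, "translation B", -5.0),
]
best = pick_best(nbest)
print(best.source_text, "->", best.translation)
```

With `mt_weight=0.0` the choice falls back to the ASR 1-best, which is the loosely coupled baseline the slide argues against.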

  2. S2S TTS Advanced Issues
     - MT output isn’t grammatical
       - TTS doesn’t care and just says it
       - TTS should try to say MT output with more breaks
     - TTS (unit selection)
       - As a LM on MT output
       - Choose the best translation on what is said best

  3. Speech Processing 15-492/18-492 Voice Conversion

  4. Voice Conversion
     - Live (or offline)
     - Convert an existing voice to another
     - Use only a small amount of target speech
     - Uses:
       - Synthesis without collecting lots of data
       - Disguising voices
       - Emotional voices without full synthesis support
     - Also called
       - Voice transformation, voice morphing

  5. Voice Identity
     - What makes a voice identity
       - Lexical choice: “Woo-hoo”, “I pity the fool …”
       - Phonetic choice
       - Intonation and duration
       - Spectral qualities (vocal tract shape)
       - Excitation

  6. Voice Conversion techniques
     - Full ASR and TTS
       - Much too hard to do reliably
     - Codebook transformation
       - ASR HMM state to HMM state transformation
     - GMM based transformation
       - Build a mapping function between frames
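
To make "a mapping function between frames" concrete, here is a minimal sketch of one common formulation, the joint-density GMM regression, reduced to scalar (1-D) features so it runs with the standard library only. The slide does not say which formulation is meant, so this choice is an assumption, and the `gaussian_pdf` and `convert_frame` names and the dictionary layout of each component are hypothetical. Each component predicts the target from the source by a linear regression, and the predictions are blended by the component posteriors given the source frame.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert_frame(x, components):
    """Map one scalar source feature x to a target feature.

    components: list of dicts with keys weight, mean_x, mean_y, var_x, cov_yx
    (hypothetical layout; real systems use vectors and full matrices).
    """
    # Unnormalised posterior of each component given the source frame.
    likelihoods = [
        c["weight"] * gaussian_pdf(x, c["mean_x"], c["var_x"]) for c in components
    ]
    total = sum(likelihoods)
    y = 0.0
    for lik, c in zip(likelihoods, components):
        posterior = lik / total
        # Per-component linear regression from source to target.
        y += posterior * (c["mean_y"] + c["cov_yx"] / c["var_x"] * (x - c["mean_x"]))
    return y


# Toy model with a single component, so the posterior is 1 and the
# mapping is a single regression line.
toy_gmm = [{"weight": 1.0, "mean_x": 0.0, "mean_y": 10.0, "var_x": 1.0, "cov_yx": 0.5}]
print(convert_frame(2.0, toy_gmm))
```

A source frame far from every component mean can make `total` underflow to zero; a practical implementation would work in the log domain.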

  7. Learning VC models
     - First need to get parallel speech
       - Source and target say same thing
       - Use DTW to align (in the spectral domain)
       - Trying to learn a functional mapping
       - 20-50 utterances
     - “Text-independent” VC
       - Means no parallel speech available
       - Use some form of synthesis to generate it
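
The DTW alignment step can be sketched as follows. This is a minimal textbook dynamic time warping over lists of feature vectors, assuming Euclidean distance between frames and the basic step pattern (diagonal, vertical, horizontal); the `frame_distance` and `dtw_align` names are hypothetical, and the one-dimensional frames in the example stand in for real spectral features such as MFCCs.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dtw_align(source, target):
    """Return (total_cost, path), where path pairs source and target frame indices."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning source[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the frame-to-frame alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy example: the target holds its second frame for one extra step.
source = [[0.0], [1.0], [2.0]]
target = [[0.0], [1.0], [1.0], [2.0]]
total_cost, path = dtw_align(source, target)
print(total_cost, path)
```

The aligned index pairs are what a mapping function would then be trained on: each pair gives one source frame and the target frame it corresponds to.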

  8. VC Training process
     - Extract F0, power and MFCC from source and target utterances
     - DTW align source and target
     - Loop until convergence
       - Build GMM to map between source/target
       - DTW source/target using GMM mapping

  9. VC Training process

  10. VC Run-time

  11. Voice Transformation
     - Festvox GMM transformation suite (Toda)
     - (Grid of source/target voice pairs: awb, bdl, jmk, slt)

  12. VC in Synthesis
     - Can be used as a post filter in synthesis
       - Build kal_diphone to target VC
       - Use on all output of kal_diphone
     - Can be used to convert a full DB
       - Convert a full DB and rebuild a voice

  13. Style/Emotion Conversion
     - Unit Selection (or SPS)
       - Require lots of data in desired style/emotion
     - VC technique
       - Use as filter to main voice (same speaker)
       - Convert neutral to angry, sad, happy …

  14. Can you say that again?
     - Voice conversion for speaking in noise
     - Different quality when you repeat things
     - Different quality when you speak in noise
       - Lombard effect (when very loud)
       - “Speech-in-noise” in regular noise

  15. Speaking in Noise (Langner)
     - Collect data
       - Randomly play noise in person’s ears
       - Normal
       - In noise
       - Collect 500 of each type
     - Build VC model
       - Normal -> in-noise
     - Actually
       - Spectral, duration, F0 and power differences

  16. Synthesis in Noise
     - For bus information task
     - Play different synthesis information utts
       - With SIN synthesizer
       - With SWN synthesizer
       - With VC (SWN->SIN) synthesizer
     - Measure their understanding
       - SIN synthesizer better (in noise)
       - SIN synthesizer better (without noise for elderly)

  17. Transterpolation
     - Incrementally transform a voice X%
       - BDL-SLT by 10%
       - SLT-BDL by 10%
     - Count when you think it changes from M-F
     - Fun but what are the uses …
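
One way to picture "transform a voice X%" is as a blend between a source frame and its fully converted counterpart. The slide does not say how the percentage is realised, so the linear interpolation in feature space below is purely an assumption for illustration, and the `transterpolate` name and the toy feature values are hypothetical.

```python
def transterpolate(source_frame, converted_frame, fraction):
    """Blend a source frame toward its converted version.

    fraction = 0.0 returns the source frame, 1.0 the fully converted frame.
    Assumption: a plain linear mix of feature values.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return [
        (1.0 - fraction) * s + fraction * c
        for s, c in zip(source_frame, converted_frame)
    ]


# Step from one voice toward the other in 10% increments, as on the slide.
source_frame = [1.0, 2.0]      # toy features for the source voice
converted_frame = [3.0, 6.0]   # toy features after full conversion
steps = [transterpolate(source_frame, converted_frame, k / 10) for k in range(11)]
print(steps[5])
```

Listening to the eleven versions in order is then the experiment the slide describes: note the step at which the perceived speaker changes.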

  18. De-identification
     - Remove speaker identity
       - But keep it still human like
     - Health Records
       - HIPAA laws require this
       - Not just removing names and SSNs
     - Use voice conversion to get “new” voices

  19. VC and SPS
     - Becoming closely related
       - Small amount of target speaker
       - Use larger background models

  20. Cross Lingual Voice Conversion
     - Use phonetic mapping synthesis
       - Sounds like very accented speech
     - Use VC to convert the output
       - Require only small amount of target language
