Voice Conversion and Anti-spoofing of Speaker Verification
Haizhou Li
Acknowledgement: Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Xiaohai Tian

Agenda
• Spoofing Attacks
• Voice Conversion
• Artifacts
• ASVspoof 2015

Speaker Verification
[Figure: a speaker claims "This is John!"; the speaker verification system answers either "Yes, John!" or "Reject!"]

Spoofing Attacks
[Figure: four spoofing attacks (impersonation, replay, speech synthesis, voice conversion) presenting the claim "This is John!" to a speaker verification system; possible outcomes "Reject!" and "Yes, John!"]

Spoofing Attacks

Spoofing attack    Accessibility    Effectiveness (risk)                 Countermeasure
                                    Text-independent   Text-dependent    availability
Impersonation      Low              Low/unknown        Low/unknown       N.A.
Replay             High             Low                Low to high       Low
Speech synthesis   Medium to high   High               High              Medium
Voice conversion   Medium to high   High               High              Medium

Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: a survey,” Speech Communication, vol. 66, pp. 130–153, 2015.

Impersonation
• Y. Lau, D. Tran, and M. Wagner, “Testing voice mimicry with the YOHO speaker verification corpus,” in Knowledge-Based Intelligent Information and Engineering Systems. Springer, 2005, pp. 907–907.
• J. Mariethoz and S. Bengio, “Can a professional imitator fool a GMM based speaker verification system?” IDIAP Research Report (No. Idiap-RR-61-2005), 2005.
• R. G. Hautamaki, T. Kinnunen, V. Hautamaki, T. Leino, and A.-M. Laukkanen, “I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry,” in Interspeech 2013

Replay
Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent speaker verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2014.

Traits of Replay
• J. Villalba and E. Lleida, “Preventing replay attack on speaker verification systems”, IEEE ICCST 2011
• L. Cuccovillo, P. Aichroth, “Open-set microphone classification via blind channel analysis”, ICASSP 2016

Audio Fingerprinting
[Figure: spectrograms of genuine speech and replay speech; frequency 0 to 8000 Hz against time 0 to 3.0 seconds]
1. A. Wang, “An industrial strength audio search algorithm,” in Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2003, pp. 7–13.
2. Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent speaker verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2014.

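A rough illustration of the peak-pair hashing idea behind the audio search algorithm in reference 1, applied to replay detection: a test utterance whose fingerprint overlaps heavily with a previously stored utterance is suspicious. This is a minimal sketch, not the cited implementations. It assumes the spectral peaks have already been picked from the spectrogram and are given as (frame, frequency bin) pairs; the function names, the parameters `fan_out` and `max_dt`, and the toy peak lists are all hypothetical, and the time-offset consistency check of the full algorithm is left out.

```python
def pair_hashes(peaks, fan_out=3, max_dt=5):
    """Hash nearby peak pairs as (f1, f2, dt); peaks are (frame, freq_bin) tuples."""
    peaks = sorted(peaks)
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # later peaks are even further away in time
            if dt > 0:
                hashes.add((f1, f2, dt))
                paired += 1
                if paired == fan_out:
                    break
    return hashes


def similarity(peaks_a, peaks_b):
    """Fraction of the smaller fingerprint's hashes that also occur in the other."""
    a, b = pair_hashes(peaks_a), pair_hashes(peaks_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Toy data (hypothetical): a stored utterance, a replay of it shifted by
# 7 frames with one spurious peak added, and an unrelated fresh utterance.
stored = [(0, 10), (1, 20), (2, 15), (4, 30), (5, 12)]
replayed = [(7, 10), (8, 20), (9, 15), (10, 40), (11, 30), (12, 12)]
fresh = [(0, 11), (1, 25), (3, 18), (4, 33), (6, 14)]

print("stored vs replayed:", similarity(stored, replayed))
print("stored vs fresh:   ", similarity(stored, fresh))
```

Because each hash stores only the time difference between two peaks, the score does not change when the whole recording is shifted in time, which is why the shifted replay still matches the stored utterance.
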
Speaker Verification: Robust Features
• Modeling the human voice production system
• Modeling the peripheral auditory system
Tomi Kinnunen and Haizhou Li, “An Overview of Text-Independent Speaker Recognition: from Features to Supervectors”, Speech Communication 52(1): 12–40, January 2010

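More Robust = More Vulnerable
[Figure: synthetic speech detection placed in front of speaker verification; the claim "This is John!" is rejected if detected as synthetic, otherwise ("No") it is passed to speaker verification, which outputs "Reject!" or "Yes, John!"]
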
Agenda
• Spoofing Attacks
• Voice Conversion
• Artifacts
• ASVspoof 2015

Voice Conversion: Vocoder
[Figure: source speaker → analysis → feature conversion → synthesis → target speaker]

Vocoder: Analysis-Synthesis
[Figure: the analysis and synthesis stages (source → analysis, synthesis → target) within the pipeline source speaker → analysis → feature conversion → synthesis → target speaker]

Vocoder
Sinusoidal vocoders
• Harmonic plus noise model (HNM) vocoder
• Harmonic and stochastic vocoder
• Adaptive harmonic vocoder
Source-filter model
• Linear predictive vocoder
• Mel-generalised cepstral vocoder
• STRAIGHT
• Glottal vocoder

Vocoder: Copy Synthesis
[Figure: analysis followed directly by synthesis, from source to target]

Feature     EER (%)
MFCC        10.98
MGDCC       1.25
MGDCC+PM    0.89

Z. Wu, X. Xiao, E.S. Chng, H. Li, “Synthetic Speech Detection Using Temporal Modulation Feature”, ICASSP 2013

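The results on this slide are equal error rates (EER), the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of how an EER can be estimated from detector scores follows. It assumes that higher scores mean "more likely genuine"; the function name `equal_error_rate` and the toy scores are illustrative and are not taken from the cited paper.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Return (eer, threshold) at the point where FAR and FRR are closest.

    Assumes higher scores mean "more likely genuine".
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores (hypothetical): one genuine trial scores low, one spoof scores high.
genuine = [0.9, 0.8, 0.7, 0.6, 0.35]
spoof = [0.1, 0.2, 0.3, 0.4, 0.65]
eer, threshold = equal_error_rate(genuine, spoof)
print(f"EER = {100 * eer:.1f}% at threshold {threshold}")
```

With only a handful of trials the two error rates rarely cross exactly, so the sketch reports the average of the two at the closest threshold; evaluation toolkits typically interpolate instead.
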
Voice Conversion: Feature Conversion
[Figure: source speaker → analysis → feature conversion → synthesis → target speaker]

Differences between Speakers
[Figure: comparison of Speaker A and Speaker B]
Z. Wu, Spectral Mapping for Voice Conversion, Ph.D Thesis, Nanyang Technological University, 2015

Basics of Voice Conversion
[Figure: the training phase and the conversion phase]

Chronological Map of Voice Conversion
Z. Wu et al, Tutorial Notes, APSIPA ASC 2015

Voice Conversion: Codebook Mapping
[Figure: mapping from source to target]
• Z. Wu, Spectral mapping for voice conversion, Ph.D Thesis, Nanyang Technological University, 2015
• Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara. "Voice conversion through vector quantization." ICASSP 1988

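The core idea of codebook mapping can be sketched as a nearest-neighbour lookup: quantize each source frame to its closest source codeword, then emit the target codeword paired with it. This is a simplified sketch, not the method of the cited papers. It assumes the two codebooks are already given and paired entry by entry, whereas a real system has to learn the mapping from aligned training data; the function names and the toy 2-D vectors are hypothetical.

```python
def nearest_codeword(frame, codebook):
    """Return the index of the codeword closest to frame (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(frame, codebook[i]))


def convert(frames, source_codebook, target_codebook):
    """Replace each source frame by the target codeword paired with its nearest source codeword."""
    return [target_codebook[nearest_codeword(f, source_codebook)] for f in frames]


# Toy 2-D "spectral" codebooks (hypothetical); entry i of one is paired with entry i of the other.
source_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
target_codebook = [[0.5, 0.2], [1.5, 1.4], [2.5, 0.1]]

frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
converted = convert(frames, source_codebook, target_codebook)
print(converted)
```

Every converted frame is one of a finite set of target codewords, so the output trajectory jumps between codewords; this hard quantization is one source of the discontinuities that later, smoother mapping methods try to avoid.
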
Voice Conversion: Joint Density GMM
[Figure: mapping from source to target]
• Alexander Kain, and Michael W. Macon. "Spectral voice conversion for text-to-speech synthesis." ICASSP 1998

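The slide gives no formula, so as a reminder, the conversion function commonly associated with a joint-density GMM is written out here. The notation is an assumption made for this note: x is a source feature vector, the GMM with M components is trained on joint vectors [x; y], w_m are the mixture weights, and the superscripts pick out the source (x) and target (y) blocks of each component's mean and covariance.

```latex
\hat{y} = F(x) = \sum_{m=1}^{M} p(m \mid x)\,
  \left[ \mu_m^{(y)} + \Sigma_m^{(yx)} \left( \Sigma_m^{(xx)} \right)^{-1}
         \left( x - \mu_m^{(x)} \right) \right],
\qquad
p(m \mid x) = \frac{w_m \, \mathcal{N}\!\left(x;\, \mu_m^{(x)}, \Sigma_m^{(xx)}\right)}
                   {\sum_{k=1}^{M} w_k \, \mathcal{N}\!\left(x;\, \mu_k^{(x)}, \Sigma_k^{(xx)}\right)}
```

Each component contributes a linear regression from source to target, and the posterior p(m | x) blends them. The averaging over components is one reason GMM-converted spectra tend to be over-smoothed.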
Voice Conversion: Frequency Warping
[Figure: source spectrum and target spectrum]
Use partially the source spectrum information
• Daniel Erro, Asunción Moreno, and Antonio Bonafonte. "Voice conversion based on weighted frequency warping." IEEE Transactions on Audio, Speech, and Language Processing, 18, no. 5 (2010): 922-931.
• Xiaohai Tian, Zhizheng Wu, Siu Wa Lee, Nguyen Quy Hy, Eng Siong Chng, Minghui Dong, "Sparse representation for frequency warping based voice conversion", ICASSP 2015

Voice Conversion: Frame/Unit Selection
• Thierry Dutoit, Andre Holzapfel, Matthieu Jottrand, Alexis Moinet, J. M. Perez, and Yannis Stylianou. "Towards a voice conversion system based on frame selection." ICASSP 2007.
• Zhizheng Wu, Tuomas Virtanen, Tomi Kinnunen, Eng Siong Chng, Haizhou Li, "Exemplar-based unit selection for voice conversion utilizing temporal information", Interspeech 2013

Unit Selection Synthesis
• Source symbol – Target segment costs: suitability of unit for target
• Target segment – Target segment costs: acoustic continuity of two adjacent units
[Figure: lattice of candidate units for the phone sequence "# dh ax c ae t s ae t #"]
Z. Wu et al, Tutorial Notes, APSIPA ASC 2015

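The two costs on this slide are usually combined by a dynamic-programming (Viterbi) search over the candidate lattice. The sketch below shows that search in its simplest form. It is an illustration under assumptions, not the system from the tutorial: units are reduced to single integers (think of a pitch value), `target_cost` and `join_cost` are hypothetical stand-ins for the "suitability" and "acoustic continuity" costs, and the weights are arbitrary.

```python
def select_units(candidates, target_cost, join_cost):
    """Viterbi search: pick one candidate per position minimising the total cost.

    candidates: one list of candidate units per target position.
    target_cost(position, unit): how well a unit suits that position.
    join_cost(prev_unit, unit): acoustic discontinuity between adjacent units.
    """
    # best[i][j] = (cheapest cost of a path ending in candidates[i][j], backpointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, u), back))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total


# Toy lattice (hypothetical): desired values per position and candidate units.
targets = [10, 12, 14]
candidates = [[10], [12, 11], [14, 11]]

path, total = select_units(
    candidates,
    target_cost=lambda i, u: abs(u - targets[i]),
    join_cost=lambda p, u: 2 * abs(u - p),
)
print(path, total)
```

In this toy lattice the units that match the targets exactly (10, 12, 14) are not selected, because the heavily weighted join cost makes the smoother sequence cheaper overall. That trade-off between suitability and continuity is the point of searching the lattice jointly rather than choosing each unit independently.
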
Evaluation of Synthetic Voice
Subjective Analysis
Objective Analysis
“Spoofing Analysis”
1. Spectral distortion
2. Temporal (magnitude/phase) discontinuity
3. Spectro-temporal artifacts
4. Pitch pattern
5. ASVspoof 2015?

Agenda
• Spoofing Attacks
• Voice Conversion
• Artifacts
• ASVspoof 2015

Artifacts
Magnitude
• Short-time Fourier transform (STFT)
• Smoothing effect (local vs global optimization)
• Temporal magnitude discontinuity
Phase
• Minimum phase vocoding
• Phase distortion
• Temporal phase discontinuity
… that are common to synthetic speech
… that are different from natural speech

Magnitude: STFT
• Time-Frequency resolution
• Spectral leakage
• Windowing tradeoffs

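The three bullets can be made concrete with a single analysis frame. The sketch below is a small numerical illustration using only the Python standard library, with assumed values: a 16 kHz sampling rate, a 64-sample frame and a test tone placed halfway between two DFT bins. It computes the bin spacing fs/N and compares how much energy leaks into a distant bin with a rectangular window and with a Hann window. The helper names are hypothetical.

```python
import cmath
import math


def dft_magnitude(x):
    """Magnitude of the DFT of x (direct O(N^2) evaluation; fine for short frames)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]


def leakage_db(mag, far_bin=25):
    """Level at a bin far from the tone, relative to the spectral peak, in dB."""
    return 20 * math.log10(mag[far_bin] / max(mag))


fs = 16000          # sampling rate in Hz (assumed)
n = 64              # frame length in samples (assumed)
bin_width = fs / n  # spacing between DFT bins; a longer frame gives finer spacing

# A tone at 10.5 bins falls between two DFT bins, so its energy leaks into other bins.
tone = [math.sin(2 * math.pi * 10.5 * t / n) for t in range(n)]
rect_mag = dft_magnitude(tone)
hann_mag = dft_magnitude([a * b for a, b in zip(tone, hann(n))])

print(f"bin width: {bin_width} Hz")
print(f"leakage at bin 25, rectangular: {leakage_db(rect_mag):.1f} dB")
print(f"leakage at bin 25, Hann:        {leakage_db(hann_mag):.1f} dB")
```

Doubling the frame length halves the bin width but also halves the time resolution, and the Hann window buys its lower far-off leakage with a wider main lobe. These are the windowing tradeoffs the slide refers to.
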
Magnitude: Smoothing in Vocoder
Hideki Kawahara, “STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds”, Acoust. Sci. & Tech. 27, 6 (2006)

Magnitude: Smoothing in synthesized/converted speech
• A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in ICASSP 1998.
• Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura, “Speech Synthesis Based on Hidden Markov Models”, Proceedings of the IEEE, 2013

Magnitude: Log Magnitude Spectrum
[Figure: log magnitude spectra of natural speech and copy synthetic speech, and their absolute difference]
X. Tian, Z. Wu, X. Xiao, E. S. Chng, H. Li, "Spoofing detection from a feature representation perspective", ICASSP 2016