Combining Speech and Speaker Recognition - A Joint Modeling Approach
Dissertation Talk
Hang Su
Supervised by: Prof. N. Morgan, Dr. S. Wegmann
EECS, University of California, Berkeley, CA, USA
International Computer Science Institute, Berkeley, CA, USA
August 16, 2018
Table of contents
1. Introduction and Motivation
2. Backgrounds on Speech and Speaker Recognition
3. Connecting Speech and Speaker Recognition
4. Joint Modeling of Speech and Speaker
5. Conclusion and Future Work
Joint modeling of speech and speaker
The brief idea
- Automatic speech recognition (ASR): translate speech to text automatically
- Speaker recognition (or speaker identification): identify speakers from characteristics of voice
- Combining speech and speaker recognition: capture speech and speaker characteristics together
Why speech / speaker recognition
Application of speech & speaker recognition
- Human-Computer Interface
- Automatic speech recognition: in-car system, smart home, speech search...
- Speaker recognition: authentication, safety, personalization...
A problem
- They are handled separately
  - Different datasets / evaluations
  - Different models / methods
- But they are closely related to each other
  - Take speech as input
  - Similar features / models
  - (Same group of researchers :)
An ideal AI agent for speech (figure)
Backgrounds on Speech and Speaker Recognition
- Automatic Speech Recognition
- Speaker Recognition
Automatic Speech Recognition (ASR)
- Transcribe speech into text
- Frame-by-frame approach (frames of 10-30 ms)
- Components of a traditional ASR system:
  - Feature extraction
  - Acoustic modeling (GMM-HMM)
  - Lexicon
  - Language modeling (LM)
- Alternatively, use an end-to-end approach: discard the HMM, and optionally discard the lexicon or language model
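The frame-by-frame idea can be illustrated with a minimal sketch that slices a waveform into overlapping frames. The 25 ms window, 10 ms shift, and 16 kHz sampling rate are assumptions chosen for illustration (the slide only gives a 10-30 ms range), and `frame_signal` is a hypothetical helper, not part of any toolkit.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a sequence of samples into overlapping fixed-length frames.

    All default values are illustrative assumptions, not taken from the talk.
    """
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of silence at 16 kHz as a stand-in for real audio.
one_second = [0.0] * 16000
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))
```

Each resulting frame would then go through feature extraction before acoustic modeling.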
Traditional ASR pipeline (figure)

Gaussian Mixture Model - HMM [9, 3] (figure)

Deep Neural Network - HMM [1, 11] (figure)
Long Short-Term Memory - HMM [8] (figure)
Speaker Recognition
- Speaker recognition: identify speakers from speech
- Components:
  - Feature extraction
  - Acoustic modeling
  - Speaker modeling
  - Scoring
- Make utterance-level predictions
Text-independent speaker recognition (figure)
Factor analysis approach [2]

\[
x_t \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mu_k + A_k z_i, \Sigma_k) \tag{1}
\]
\[
z_i \sim \mathcal{N}(0, I), \qquad \sum_{k=1}^{K} \pi_k = 1
\]

- $x_t$ is the $p$-dimensional speech feature for frame $t$
- $\pi_k$ is the prior for mixture $k$
- $z_i$ is a $q$-dimensional speaker-specific latent factor (i.e. the i-vector)
- $A_k$ is a $p$-by-$q$ projection matrix for mixture $k$
- $\mu_k$ and $\Sigma_k$ are Gaussian parameters
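To make the generative reading of the factor-analysis model concrete, here is a minimal sketch that draws frames for one speaker: sample the latent factor once, then for each frame pick a mixture and sample from a Gaussian whose mean is shifted by the projected factor. The function name `sample_frames`, the toy parameter values, and the diagonal covariance are all assumptions made for illustration. In practice the model is used in the other direction, to infer the latent factor from observed frames.

```python
import random


def sample_frames(pi, mu, A, sigma_diag, num_frames, rng):
    """Draw frames x_t for one speaker from the factor-analysis mixture.

    pi:         K mixture priors (summing to 1)
    mu:         K mean vectors of length p
    A:          K projection matrices, each p rows by q columns
    sigma_diag: K variance vectors of length p (diagonal covariance assumed)
    """
    p = len(mu[0])
    q = len(A[0][0])
    # One latent factor per speaker: z_i ~ N(0, I).
    z = [rng.gauss(0.0, 1.0) for _ in range(q)]
    frames = []
    for _ in range(num_frames):
        # Choose mixture k with probability pi_k.
        k = rng.choices(range(len(pi)), weights=pi)[0]
        # Speaker-shifted mean: mu_k + A_k z_i.
        mean = [mu[k][d] + sum(A[k][d][j] * z[j] for j in range(q))
                for d in range(p)]
        # Sample each dimension independently (diagonal Sigma_k).
        frames.append([rng.gauss(mean[d], sigma_diag[k][d] ** 0.5)
                       for d in range(p)])
    return z, frames


# Hypothetical toy setup: K = 2 mixtures, p = 2 feature dims, q = 1 latent dim.
pi = [0.5, 0.5]
mu = [[0.0, 0.0], [3.0, 3.0]]
A = [[[1.0], [0.5]], [[-0.5], [1.0]]]
sigma_diag = [[1.0, 1.0], [0.5, 0.5]]

z, frames = sample_frames(pi, mu, A, sigma_diag, num_frames=5,
                          rng=random.Random(0))
print(z, frames[0])
```

Because the same factor is shared by every frame of a speaker, it shifts all mixture means together, which is what lets it act as a compact speaker representation.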
Post-processing of i-vectors
- The factor-analysis model is an unsupervised model.
- Supervised methods can be used to improve i-vectors:
  - Linear Discriminant Analysis [6]
  - Probabilistic Linear Discriminant Analysis [6, 5]
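As a rough illustration of how supervision can sharpen speaker vectors, the sketch below computes a two-class Fisher discriminant direction, the simplest case of Linear Discriminant Analysis. The 2-D vectors, the two-speaker setup, and the helper names are hypothetical stand-ins for real i-vectors, which have far more dimensions and speakers.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def within_scatter(groups):
    """2x2 within-class scatter: outer products of deviations from class means."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for vectors in groups:
        m = mean(vectors)
        for v in vectors:
            dev = [v[0] - m[0], v[1] - m[1]]
            for r in range(2):
                for c in range(2):
                    s[r][c] += dev[r] * dev[c]
    return s


def fisher_direction(class_a, class_b):
    """Two-class LDA direction w = S_w^{-1} (m_a - m_b) for 2-D vectors."""
    s = within_scatter([class_a, class_b])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    ma, mb = mean(class_a), mean(class_b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(w, v):
    """Scalar projection of v onto direction w."""
    return w[0] * v[0] + w[1] * v[1]


# Hypothetical vectors: speakers differ mainly in dimension 0,
# while dimension 1 carries large within-speaker variation.
speaker_a = [[1.0, 2.0], [1.2, 0.0], [0.8, 1.0]]
speaker_b = [[2.0, 2.1], [2.2, 0.1], [1.8, 0.9]]

w = fisher_direction(speaker_a, speaker_b)
proj_a = [project(w, v) for v in speaker_a]
proj_b = [project(w, v) for v in speaker_b]
print(w, proj_a, proj_b)
```

The learned direction weights the speaker-discriminative dimension heavily and down-weights the noisy one, so the two speakers separate cleanly after projection.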