Combining Speech and Speaker Recognition - A Joint Modeling Approach

Hang Su
Supervised by: Prof. N. Morgan, Dr. S. Wegmann
EECS, University of California, Berkeley, CA, USA
International Computer Science Institute, Berkeley, CA, USA
August 16, 2018
Dissertation Talk

Table of contents
1 Introduction and Motivation
2 Backgrounds on Speech and Speaker Recognition
3 Connecting Speech and Speaker Recognition
4 Joint Modeling of Speech and Speaker
5 Conclusion and Future Work

Joint modeling of speech and speaker
The brief idea
  Automatic speech recognition (ASR): translates speech to text automatically
  Speaker recognition (or speaker identification): identifies speakers from the characteristics of their voices
  Combining speech and speaker recognition: captures speech and speaker characteristics together

Why speech / speaker recognition
Applications of speech & speaker recognition
  Human-computer interface
  Automatic speech recognition: in-car systems, smart home, speech search...
  Speaker recognition: authentication, safety, personalization...

A problem
They are handled separately
  Different datasets / evaluations
  Different models / methods
But they are closely related to each other
  Take speech as input
  Similar features / models
  (Same group of researchers :)

An ideal AI agent for speech

Automatic Speech Recognition (ASR)
Transcribe speech into text
Frame-by-frame approach (10-30 ms)
Components (for a traditional ASR system):
  Feature extraction
  Acoustic modeling (GMM-HMM)
  Lexicon
  Language modeling (LM)
Or use an end-to-end approach: discard the HMM, and optionally discard the lexicon or language model
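The frame-by-frame approach mentioned above can be illustrated with a minimal sketch of the framing step that precedes feature extraction. The 25 ms window and 10 ms shift are assumptions chosen from the 10-30 ms range on the slide, and the function name `frame_signal` is hypothetical; the talk does not specify these values.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a list of audio samples into overlapping fixed-length frames.

    frame_ms and shift_ms are illustrative defaults, not values from the talk.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of silence sampled at 16 kHz, used only as toy input.
signal = [0.0] * 16000
frames = frame_signal(signal, 16000)
```

Each frame would then be passed to a feature extractor, and the acoustic model scores the resulting feature vectors one frame at a time.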

Traditional ASR pipeline

Gaussian Mixture Model - HMM [9, 3]

Deep Neural Network - HMM [1, 11]

Long Short-Term Memory - HMM [8]

Speaker Recognition
Speaker recognition: identify speakers from speech
Components:
  Feature extraction
  Acoustic modeling
  Speaker modeling
  Scoring
Make utterance-level predictions

Text-independent speaker recognition

Factor analysis approach [2]

\[
x_t \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mu_k + A_k z_i, \Sigma_k) \tag{1}
\]
\[
z_i \sim \mathcal{N}(0, I), \qquad \sum_{k=1}^{K} \pi_k = 1
\]

$x_t$ is the $p$-dimensional speech feature for frame $t$
$\pi_k$ is the prior for mixture $k$
$z_i$ is a $q$-dimensional speaker-specific latent factor (i.e. the i-vector)
$A_k$ is a $p$-by-$q$ projection matrix for mixture $k$
$\mu_k$ and $\Sigma_k$ are Gaussian parameters
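To make the generative story in Eq. (1) concrete, here is a minimal sketch that draws a speaker factor and then one frame from the mixture. It assumes diagonal covariances (the slide does not say whether the covariances are full or diagonal), and all names and toy parameter values (`sample_ivector`, `sample_frame`, `pi`, `mu`, `A`, `sigma_diag`) are hypothetical.

```python
import random


def sample_ivector(q, rng):
    """Draw z_i ~ N(0, I) as a list of q independent standard normals."""
    return [rng.gauss(0.0, 1.0) for _ in range(q)]


def sample_frame(pi, mu, A, sigma_diag, z, rng):
    """Draw one frame x_t from the mixture in Eq. (1).

    pi[k]         : prior of mixture k (should sum to 1)
    mu[k]         : p-dim mean of mixture k
    A[k]          : p-by-q projection matrix of mixture k (list of rows)
    sigma_diag[k] : p-dim diagonal of the covariance (assumption: diagonal)
    z             : q-dim speaker factor shared by all frames of one speaker
    """
    # Pick a mixture component according to the priors.
    k = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    p = len(mu[k])
    # Speaker-dependent mean: mu_k + A_k z.
    mean = [mu[k][d] + sum(A[k][d][j] * z[j] for j in range(len(z)))
            for d in range(p)]
    # Independent Gaussian noise per dimension (diagonal covariance).
    x = [rng.gauss(mean[d], sigma_diag[k][d] ** 0.5) for d in range(p)]
    return k, x


# Toy parameters: K = 2 mixtures, p = 3 feature dims, q = 2 latent dims.
rng = random.Random(0)
pi = [0.6, 0.4]
mu = [[0.0, 1.0, 2.0], [5.0, 4.0, 3.0]]
A = [[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
     [[0.2, 0.1], [0.3, 0.0], [0.0, 0.4]]]
sigma_diag = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]

z = sample_ivector(2, rng)                      # one speaker factor
k, x = sample_frame(pi, mu, A, sigma_diag, z, rng)  # one frame for that speaker
```

In practice the direction of interest is the reverse one: given the frames of an utterance, infer the posterior of the latent factor. This sketch only shows the forward model that inference inverts.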

Post-processing of i-vectors
The factor-analysis model is an unsupervised model.
Supervised methods can be used to improve i-vectors:
  Linear Discriminant Analysis [6]
  Probabilistic Linear Discriminant Analysis [6, 5]
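As a rough illustration of how a supervised projection can use speaker labels, the sketch below computes the two-class Fisher discriminant direction, a special case of Linear Discriminant Analysis, on hypothetical 2-dimensional vectors. Real i-vectors have hundreds of dimensions and many speakers, so this is not the configuration used in the talk; the function names and toy data are assumptions.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def fisher_direction(class_a, class_b):
    """Two-class Fisher discriminant for 2-D inputs: w = S_w^{-1} (m_a - m_b)."""
    m_a, m_b = mean_vector(class_a), mean_vector(class_b)
    # Within-class scatter matrix S_w, accumulated over both classes.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for vectors, m in ((class_a, m_a), (class_b, m_b)):
        for v in vectors:
            d0, d1 = v[0] - m[0], v[1] - m[1]
            s[0][0] += d0 * d0
            s[0][1] += d0 * d1
            s[1][0] += d1 * d0
            s[1][1] += d1 * d1
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0.0:
        raise ValueError("within-class scatter matrix is singular")
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    # Closed-form 2x2 inverse applied to the mean difference.
    return [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
            (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det]


def project(w, v):
    """Scalar projection of v onto the direction w."""
    return w[0] * v[0] + w[1] * v[1]


# Toy "i-vectors" for two speakers (hypothetical values).
speaker_a = [[1.0, 0.0], [1.0, 4.0], [2.0, 2.0], [0.0, 2.0]]
speaker_b = [[5.0, 1.0], [5.0, 5.0], [6.0, 3.0], [4.0, 3.0]]

w = fisher_direction(speaker_a, speaker_b)
proj_a = [project(w, v) for v in speaker_a]
proj_b = [project(w, v) for v in speaker_b]
```

The direction down-weights the second dimension, where the within-speaker variation is larger, so the two speakers separate cleanly after projection.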
