Language, Dialect, and Speaker Recognition Using Gaussian Mixture Models on the Cell Processor
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
{nmalyska, smohindra, karen.lauro, reynolds, kepner}@ll.mit.edu
MIT Lincoln Laboratory
This work is sponsored by the United States Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Outline
• Introduction
• Recognition for speech applications using GMMs
• Parallel implementation of the GMM
• Performance model
• Conclusions and future work

Introduction: Automatic Recognition Systems
• In this presentation, we will discuss technology that can be applied to different kinds of recognition systems
  – Language recognition: What language are they speaking?
  – Dialect recognition: What dialect are they using?
  – Speaker recognition: Who is the speaker?

Introduction: The Scale Challenge
• Speech processing problems are often described as one person interacting with a single computer system and receiving a response

Introduction: The Scale Challenge
• Real speech applications, however, often involve data from multiple talkers and use multiple networked multicore machines
  – Interactive voice response systems
  – Voice portals
  – Large corpus evaluations with hundreds of hours of data
[Figure label: Information About Speaker, Dialect, or Language]

Introduction: The Computational Challenge
• Speech-processing algorithms are computationally expensive
• Large amounts of data need to be available for these applications
  – Must cache required data efficiently so that it is quickly available
• Algorithms must be parallelized to maximize throughput
  – Conventional approaches focus on parallel solutions over multiple networked computers
  – Existing packages not optimized for high-performance-per-watt machines with multiple cores, required in embedded systems with power, thermal, and size constraints
  – Want highly-responsive “real-time” systems in many applications, including in embedded systems

Recognition Systems: Summary
• A modern language, dialect, or speaker recognition system is composed of two main stages
  – Front-end processing
  – Pattern recognition
[Figure: Speech → Front End → Pattern Recognition → Decision on the identity, dialect, or language of speaker]
• We will show how a speech signal is processed by modern recognition systems
  – Focus on a recognition technology called Gaussian mixture models

Recognition Systems: Frame-Based Processing
• The first step in modern speech systems is to convert incoming speech samples into frames
• A typical frame rate for a speech stream is 100 frames per second
[Figure: speech samples over time, divided into speech frames by frame number]
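
The framing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 8 kHz sample rate, the 20 ms frame length, and the function name `split_into_frames` are assumptions chosen for the example; only the 100 frames-per-second rate comes from the slide.

```python
def split_into_frames(samples, sample_rate_hz, frame_rate_hz=100, frame_len_s=0.020):
    """Split a sequence of samples into fixed-length, possibly overlapping frames."""
    # Number of samples between the starts of consecutive frames.
    hop = sample_rate_hz // frame_rate_hz
    # Number of samples in each frame (assumed 20 ms window).
    frame_len = int(round(frame_len_s * sample_rate_hz))
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# One second of dummy samples at an assumed 8 kHz sample rate.
samples = list(range(8000))
frames = split_into_frames(samples, sample_rate_hz=8000)
print(len(frames), len(frames[0]))
```

With these assumed values, a new frame starts every 80 samples and each frame holds 160 samples, so consecutive frames overlap by half.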

Recognition Systems: Front-End Processing
• Front-end processing converts observed speech frames into an alternative representation, features
  – Lower dimensionality
  – Carries information relevant to the problem
[Figure: Speech Frames → Front End → Feature Vectors $X = \{x_1, x_2, \ldots, x_K\}$; axes: feature number (Dim 1, Dim 2) and frame number]

Recognition Systems: Pattern Recognition Training
• A recognition system makes decisions about observed data based on knowledge of past data
• During training, the system learns about the data it uses to make decisions
  – A set of features is collected from a certain language, dialect, or speaker
  – A model is generated to represent the data
[Figure: training features $x_1, x_2, x_3, x_4$ plotted over Dim 1 and Dim 2, and the resulting model $p(x)$ over the same axes]

Recognition Systems: Gaussian Mixture Models
• A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions
• Each Gaussian state i has a
  – Mean $\mu_i$
  – Covariance $\Sigma_i$
  – Weight $w_i$
[Figure: model $\lambda$ with density $p(x \mid \lambda)$ over Dim 1 and Dim 2]

Recognition Systems: Gaussian Mixture Models
[Figure: density $p(x)$ over Dim 1 and Dim 2, annotated with the parameters $w_i$, $\mu_i$, $\Sigma_i$ and the model states]
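
The weighted-sum definition can be illustrated with a short sketch. It assumes diagonal covariances (each state stores only per-dimension variances), which the slides do not state; the two-state model values and the function names are hypothetical.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (var holds the diagonal)."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        # Each dimension contributes an independent one-dimensional Gaussian term.
        log_p += -0.5 * math.log(2.0 * math.pi * vd) - 0.5 * (xd - md) ** 2 / vd
    return math.exp(log_p)


def gmm_pdf(x, weights, means, variances):
    """p(x | lambda) = sum_i w_i * N(x; mu_i, Sigma_i)."""
    return sum(w * gaussian_pdf_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Hypothetical two-state model in two dimensions (values chosen for illustration).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]

print(gmm_pdf([0.0, 0.0], weights, means, variances))
```

Evaluated at the first state's mean, the density is dominated by that state's weighted peak, because the second state is far away relative to its variance.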

Recognition Systems: Language, Speaker, and Dialect Models
• In language, dialect, and speaker identification (LID, DID, and SID), we train a set of target models $\lambda_C$ for each dialect, language, or speaker
[Figure: languages, dialects, or speakers represented by models $\lambda_1$, $\lambda_2$, $\lambda_3$, each a density $p(x \mid \lambda_C)$ over Dim 1 and Dim 2 with its own parameters and model states]

Recognition Systems: Universal Background Model
• We also train a universal background model $\lambda_{\bar{C}}$ representing all speech
[Figure: background model $\lambda_{\bar{C}}$ with density $p(x \mid \lambda_{\bar{C}})$ over Dim 1 and Dim 2, with its parameters and model states]

Recognition Systems: Hypothesis Test
• Given a set of test observations $X_{test} = \{x_1, x_2, \ldots, x_K\}$, we perform a hypothesis test to determine whether a certain class produced it
  – $H_0$: $X_{test}$ is from the hypothesized class
  – $H_1$: $X_{test}$ is not from the hypothesized class
[Figure: test observations over Dim 1 and Dim 2 compared against $p(x \mid \lambda_1)$ ($H_0$?) and $p(x \mid \lambda_{\bar{C}})$ ($H_1$?)]

Recognition Systems: Hypothesis Test
• Given a set of test observations $X_{test} = \{x_1, x_2, \ldots, x_K\}$, we perform a hypothesis test to determine whether a certain class produced it
[Figure: test observations over Dim 1 and Dim 2 compared against $p(x \mid \lambda_1)$ (English?) and $p(x \mid \lambda_{\bar{C}})$ (Not English?)]

Recognition Systems: Log-Likelihood Ratio Score
• We determine which hypothesis is true using the ratio:
  $\frac{p(X \mid H_0)}{p(X \mid H_1)} \begin{cases} \geq \text{threshold}, & \text{accept } H_0 \\ < \text{threshold}, & \text{reject } H_0 \end{cases}$
• We use the log-likelihood ratio score to decide whether an observed speaker, language, or dialect is the target
  $\Lambda(X) = \log[p(X \mid \lambda_C)] - \log[p(X \mid \lambda_{\bar{C}})]$
  $\Lambda(X) \begin{cases} \geq \text{threshold}, & X \text{ generated by } \lambda_C \\ < \text{threshold}, & X \text{ generated by } \lambda_{\bar{C}} \end{cases}$
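
A minimal sketch of the scoring rule: each target model's log-likelihood is compared against the background model's, and the difference is thresholded. The log-likelihood values, the model names, the threshold of 0.0, and the helper names are all hypothetical; in practice the threshold would be tuned for the application.

```python
def llr_scores(target_log_liks, background_log_lik):
    """Lambda(X) for each target: log p(X | target) - log p(X | background)."""
    return {name: ll - background_log_lik for name, ll in target_log_liks.items()}


def accept(score, threshold):
    """Accept H0 (the target model generated X) when the score reaches the threshold."""
    return score >= threshold


# Hypothetical log-likelihoods for one test utterance (illustrative values only).
target_log_liks = {"english": -41.2, "spanish": -44.9}
background_log_lik = -43.0

scores = llr_scores(target_log_liks, background_log_lik)
decisions = {name: accept(s, threshold=0.0) for name, s in scores.items()}
print(decisions)
```

Working in the log domain turns the likelihood ratio into a subtraction, so a ratio threshold of 1 corresponds to a score threshold of 0.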

Recognition Systems: Log-Likelihood Computation
• The observation log-likelihood given a model $\lambda$ is:
  $\log[p(X \mid \lambda)] = \frac{1}{K} \sum_{k=1}^{K} \log\left( \sum_{i=1}^{M} C_i \exp\left( -\frac{1}{2} (x_k - \mu_i)^T \Sigma_i^{-1} (x_k - \mu_i) \right) \right)$
  – $(x_k - \mu_i)^T \Sigma_i^{-1} (x_k - \mu_i)$: dot product
  – $C_i$: constant derived from weight and covariance
[Figure: test observations over Dim 1 and Dim 2 evaluated against the model density $p(x \mid \lambda)$ to obtain $\log[p(X \mid \lambda)]$]
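
The computation can be sketched in plain Python as follows. Several choices here are assumptions rather than details from the slides: covariances are taken to be diagonal, so the constant becomes $\log C_i = \log w_i - \frac{1}{2} \sum_d \log(2\pi\sigma_{id}^2)$, and the inner sum is evaluated with the log-sum-exp trick for numerical stability. The function names are hypothetical, and this serial sketch says nothing about how the work is parallelized on the Cell processor.

```python
import math


def log_constants(weights, variances):
    """log C_i = log w_i - 0.5 * sum_d log(2*pi*var_id), for diagonal covariances."""
    return [math.log(w) - 0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
            for w, var in zip(weights, variances)]


def avg_log_likelihood(frames, weights, means, variances):
    """(1/K) * sum_k log sum_i C_i * exp(-0.5 * quadratic term for x_k and state i)."""
    log_c = log_constants(weights, variances)
    total = 0.0
    for x in frames:
        terms = []
        for lc, mean, var in zip(log_c, means, variances):
            # Quadratic form (x - mu)^T Sigma^-1 (x - mu) with a diagonal Sigma.
            quad = sum((xd - md) ** 2 / vd for xd, md, vd in zip(x, mean, var))
            terms.append(lc - 0.5 * quad)
        # Log-sum-exp: factor out the largest term so exp() cannot underflow to zero.
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)


# Hypothetical single-state, one-dimensional model scored on two frames.
print(avg_log_likelihood([[0.0], [1.0]], [1.0], [[0.0]], [[1.0]]))
```

Precomputing $\log C_i$ once per model leaves only the quadratic term to evaluate for every frame and state, which is where most of the arithmetic is spent.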
