Language, Dialect, and Speaker Recognition Using Gaussian Mixture Models on the Cell Processor
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
{nmalyska, smohindra, karen.lauro, reynolds, kepner}@ll.mit.edu
MIT Lincoln Laboratory
This work is sponsored by the United States Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Outline
• Introduction
• Recognition for speech applications using GMMs
• Parallel implementation of the GMM
• Performance model
• Conclusions and future work

Introduction: Automatic Recognition Systems
• In this presentation, we will discuss technology that can be applied to different kinds of recognition systems
  – Language recognition: What language are they speaking?
  – Dialect recognition: What dialect are they using?
  – Speaker recognition: Who is the speaker?

Introduction: The Scale Challenge
• Speech processing problems are often described as one person interacting with a single computer system and receiving a response

Introduction: The Scale Challenge
• Real speech applications, however, often involve data from multiple talkers and use multiple networked multicore machines
  – Interactive voice response systems
  – Voice portals
  – Large corpus evaluations with hundreds of hours of data
[Figure label: Information About Speaker, Dialect, or Language]

Introduction: The Computational Challenge
• Speech-processing algorithms are computationally expensive
• Large amounts of data need to be available for these applications
  – Required data must be cached efficiently so that it is quickly available
• Algorithms must be parallelized to maximize throughput
  – Conventional approaches focus on parallel solutions over multiple networked computers
  – Existing packages are not optimized for the high-performance-per-watt multicore machines required in embedded systems with power, thermal, and size constraints
  – Many applications, including embedded systems, need highly responsive "real-time" behavior

Recognition Systems: Summary
• A modern language, dialect, or speaker recognition system is composed of two main stages
  – Front-end processing
  – Pattern recognition
[Figure: Speech → Front End → Pattern Recognition → Decision on the identity, dialect, or language of speaker]
• We will show how a speech signal is processed by modern recognition systems
  – Focus on a recognition technology called Gaussian mixture models

Recognition Systems: Frame-Based Processing
• The first step in modern speech systems is to convert incoming speech samples into frames
• A typical frame rate for a speech stream is 100 frames per second
[Figure: speech samples plotted against time are grouped into speech frames indexed by frame number]

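As a rough illustration of the framing step, the sketch below slices a sample stream into overlapping frames at 100 frames per second. Only the frame rate comes from the slides; the 8 kHz sample rate, the 20 ms window, and the function name `frame_signal` are assumptions chosen for the example.

```python
def frame_signal(samples, sample_rate=8000, frame_rate=100, window_ms=20.0):
    """Split a sequence of samples into overlapping frames.

    frame_rate sets the hop between frame starts (100 frames/s -> 10 ms).
    sample_rate and window_ms are illustrative assumptions, not values
    taken from the slides.
    """
    hop = sample_rate // frame_rate                    # samples between frame starts
    win = int(round(sample_rate * window_ms / 1000))   # samples per frame
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += hop
    return frames


# One second of placeholder "audio": the sample index stands in for the waveform.
frames = frame_signal(list(range(8000)))
print(len(frames), len(frames[0]))  # prints: 99 160
```

One second of input yields 99 rather than 100 frames here because the last partial window is dropped; padding the tail is an equally reasonable choice.
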
Recognition Systems: Front-End Processing
• Front-end processing converts observed speech frames into an alternative representation, features
  – Lower dimensionality
  – Carries information relevant to the problem
[Figure: Speech Frames → Front End → Feature Vectors $X = \{x_1, x_2, \ldots, x_K\}$; axes are feature number (Dim 1, Dim 2) and frame number ($x_1, x_2, x_3, x_4$)]

Recognition Systems: Pattern Recognition Training
• A recognition system makes decisions about observed data based on knowledge of past data
• During training, the system learns about the data it uses to make decisions
  – A set of features is collected from a certain language, dialect, or speaker
  – A model is generated to represent the data
[Figure: training features $x_1, x_2, x_3, x_4$ plotted over Dim 1 and Dim 2, and the resulting model $p(x)$ over the same dimensions]

Recognition Systems: Gaussian Mixture Models
• A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions
• Each Gaussian state $i$ has a
  – Mean $\mu_i$
  – Covariance $\Sigma_i$
  – Weight $w_i$
[Figure: model $\lambda$ with density $p(x \mid \lambda)$ over Dim 1 and Dim 2]

Recognition Systems: Gaussian Mixture Models
[Figure: the density $p(x)$ over Dim 1 and Dim 2, annotated with the model states and their parameters $w_i$, $\mu_i$, $\Sigma_i$]

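To make the weighted-sum definition concrete, here is a minimal sketch that evaluates a GMM density at one feature vector. It assumes diagonal covariances (each state stores per-dimension variances), which the slides do not state; the two-state model and the names `gaussian_pdf_diag` and `gmm_pdf` are hypothetical.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Gaussian density with a diagonal covariance (assumption).

    var holds the per-dimension variances, i.e. the diagonal of Sigma_i.
    """
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * vd) - 0.5 * (xd - md) ** 2 / vd
    return math.exp(log_p)


def gmm_pdf(x, weights, means, variances):
    """p(x | lambda) = sum over states i of w_i * N(x; mu_i, Sigma_i)."""
    return sum(w * gaussian_pdf_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Hypothetical two-state model in two dimensions (Dim 1, Dim 2).
weights = [0.6, 0.4]                  # w_i, must sum to 1
means = [[0.0, 0.0], [3.0, 3.0]]      # mu_i
variances = [[1.0, 1.0], [1.0, 1.0]]  # diagonal of Sigma_i

p = gmm_pdf([0.0, 0.0], weights, means, variances)
print(p)
```

Evaluating densities directly like this can underflow for high-dimensional features, which is one reason scoring is usually done in the log domain.
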
Recognition Systems: Language, Speaker, and Dialect Models
• In LID, DID, and SID, we train a set of target models $\lambda_C$ for each dialect, language, or speaker
[Figure: languages, dialects, or speakers are each represented by a model ($\lambda_1$, $\lambda_2$, $\lambda_3$) with its own parameters and model states; density $p(x \mid \lambda_C)$ over Dim 1 and Dim 2]

Recognition Systems: Universal Background Model
• We also train a universal background model $\lambda_{\bar{C}}$ representing all speech
[Figure: model $\lambda_{\bar{C}}$ with its parameters and model states; density $p(x \mid \lambda_{\bar{C}})$ over Dim 1 and Dim 2]

Recognition Systems: Hypothesis Test
• Given a set of test observations $X_{test} = \{x_1, x_2, \ldots, x_K\}$, we perform a hypothesis test to determine whether a certain class produced it
  – $H_0$: $X_{test}$ is from the hypothesized class
  – $H_1$: $X_{test}$ is not from the hypothesized class
[Figure: the test observations over Dim 1 and Dim 2 are compared with the target density $p(x \mid \lambda_1)$ ($H_0$?) and the background density $p(x \mid \lambda_{\bar{C}})$ ($H_1$?)]

Recognition Systems: Hypothesis Test
• Given a set of test observations $X_{test} = \{x_1, x_2, \ldots, x_K\}$, we perform a hypothesis test to determine whether a certain class produced it
[Figure: the test observations over Dim 1 and Dim 2 are compared with the target density $p(x \mid \lambda_1)$ (English?) and the background density $p(x \mid \lambda_{\bar{C}})$ (Not English?)]

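Recognition Systems: Log-Likelihood Ratio Score
• We determine which hypothesis is true using the ratio:

$$\frac{p(X \mid H_0)}{p(X \mid H_1)} \begin{cases} \geq \text{threshold}, & \text{accept } H_0 \\ < \text{threshold}, & \text{reject } H_0 \end{cases}$$

• We use the log-likelihood ratio score to decide whether an observed speaker, language, or dialect is the target, where $\lambda_C$ is the target model and $\lambda_{\bar{C}}$ is the universal background model:

$$\Lambda(X) = \log[p(X \mid \lambda_C)] - \log[p(X \mid \lambda_{\bar{C}})]$$

$$\Lambda(X) \begin{cases} \geq \text{threshold}, & X \text{ generated by } \lambda_C \\ < \text{threshold}, & X \text{ generated by } \lambda_{\bar{C}} \end{cases}$$
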
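The decision rule above can be sketched in a few lines. This example assumes the per-model log-likelihoods have already been computed; the function name `llr_scores`, the numeric values, and the threshold of 0 are placeholders for illustration, not figures from the presentation.

```python
def llr_scores(target_logliks, background_loglik, threshold=0.0):
    """Score each target model against the universal background model.

    target_logliks maps a class name to log p(X | lambda_C);
    background_loglik is log p(X | lambda_C-bar).
    Returns {name: (score, accepted)} where accepted means score >= threshold.
    """
    results = {}
    for name, loglik in target_logliks.items():
        score = loglik - background_loglik  # Lambda(X)
        results[name] = (score, score >= threshold)
    return results


# Hypothetical log-likelihoods for one test utterance.
decisions = llr_scores({"english": -41.2, "spanish": -44.5},
                       background_loglik=-43.0)
for name, (score, accepted) in decisions.items():
    print(name, round(score, 3), accepted)
```

In practice the threshold is tuned on held-out data to trade off false accepts against false rejects; 0 is used here only to keep the example simple.
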
Recognition Systems: Log-Likelihood Computation
• The observation log-likelihood given a model $\lambda$ is:

$$\log[p(X \mid \lambda)] = \frac{1}{K} \sum_{k=1}^{K} \log\left( \sum_{i=1}^{M} C_i \exp\left( -\frac{1}{2} (x_k - \mu_i)^T \Sigma_i^{-1} (x_k - \mu_i) \right) \right)$$

  – The term $(x_k - \mu_i)^T \Sigma_i^{-1} (x_k - \mu_i)$ is computed as a dot product
  – $C_i$ is a constant derived from the weight and covariance
[Figure: the test observations over Dim 1 and Dim 2 are evaluated against the density $p(x \mid \lambda)$ to obtain $\log[p(X \mid \lambda)]$]

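A minimal sketch of this computation follows, under two assumptions that go beyond the slides: covariances are diagonal, and $C_i$ takes the standard Gaussian form $w_i / ((2\pi)^{D/2} |\Sigma_i|^{1/2})$ (the slides say only that it is derived from the weight and covariance). The function names are hypothetical, and the inner sum is evaluated with a log-sum-exp shift for numerical stability, an implementation choice not taken from the source.

```python
import math


def log_constants(weights, variances):
    """log C_i = log w_i - 0.5 * sum_d log(2*pi*var_id), one value per state.

    Assumes diagonal covariances; depends only on the model, so it can be
    precomputed once and reused for every frame.
    """
    return [math.log(w) - 0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
            for w, var in zip(weights, variances)]


def avg_log_likelihood(frames, means, variances, log_c):
    """Frame-averaged log p(X | lambda) for a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        # Per-state exponent: log C_i - 0.5 * (x - mu_i)^T Sigma_i^-1 (x - mu_i)
        terms = [lc - 0.5 * sum((xd - md) ** 2 / vd
                                for xd, md, vd in zip(x, m, var))
                 for lc, m, var in zip(log_c, means, variances)]
        # log(sum_i exp(term_i)), shifted by the max to avoid underflow.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total / len(frames)


# Hypothetical single-state model and two test frames.
weights, means, variances = [1.0], [[0.0, 0.0]], [[1.0, 1.0]]
log_c = log_constants(weights, variances)
score = avg_log_likelihood([[0.0, 0.0], [1.0, 0.0]], means, variances, log_c)
print(score)
```

With diagonal covariances the quadratic form reduces to a sum of squared differences scaled by inverse variances, which is the dot-product structure the slide points out.
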