ELEN E6884 - Topics in Signal Processing



1. ELEN E6884 - Topics in Signal Processing
Topic: Speech Recognition (Lecture 3)
Stanley F. Chen, Michael A. Picheny, and Bhuvana Ramabhadran
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
stanchen@us.ibm.com, picheny@us.ibm.com, bhuvana@us.ibm.com
22 September 2009

Outline of Today's Lecture
■ Recap
■ Gaussian Mixture Models - A
■ Gaussian Mixture Models - B
■ Introduction to Hidden Markov Models

Where are We?
■ Can extract feature vectors over time - LPC, MFCC, or PLPs - that characterize the information in a speech signal in a relatively compact form.
■ Can perform simple speech recognition by
  ● Building templates consisting of sequences of feature vectors extracted from a set of words
  ● Comparing the feature vectors for a new utterance against all the templates using DTW and picking the best scoring template (see the sketch after this recap)
■ Learned about some basic concepts (e.g., graphs, distance measures, shortest paths) that will appear over and over again throughout the course

Administrivia
■ Main feedback from last lecture
  ● EEs: speed OK
  ● CSs: hard to follow
■ Remedy: only one more lecture will have serious signal processing content, so don't worry!
■ Lab 1 due Sept 30 (don't wait until the last minute!)
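The DTW-based template matching recapped above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the course: the Euclidean local distance, the simple path moves, and the helper names dtw_distance and recognize are assumptions made here for concreteness.

    import numpy as np

    def dtw_distance(template, utterance):
        """Dynamic-time-warping distance between two feature-vector sequences.

        template, utterance: arrays of shape (T1, D) and (T2, D).
        Uses a Euclidean local distance and the basic diagonal /
        vertical / horizontal path moves.
        """
        T1, T2 = len(template), len(utterance)
        D = np.full((T1 + 1, T2 + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, T1 + 1):
            for j in range(1, T2 + 1):
                local = np.linalg.norm(template[i - 1] - utterance[j - 1])
                D[i, j] = local + min(D[i - 1, j - 1],   # advance both sequences
                                      D[i - 1, j],       # repeat the utterance frame
                                      D[i, j - 1])       # repeat the template frame
        return D[T1, T2]

    def recognize(utterance, templates):
        """Pick the template word with the lowest DTW distance to the utterance."""
        return min(templates, key=lambda word: dtw_distance(templates[word], utterance))

A typical call would be recognize(new_feats, {"yes": yes_feats, "no": no_feats}), where each value is a (frames x dimensions) array of MFCC or PLP vectors.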

2. What are the Pros and Cons of DTW?

Pros
■ Easy to implement and compute
■ Lots of freedom - can model arbitrary time warpings

Cons
■ Distance measures completely heuristic.
  ● Why Euclidean? Are all dimensions of the feature vector created equal?
■ Warping paths heuristic.
  ● Too much freedom is not always a good thing for robustness
  ● Allowable path moves all hand-derived
■ No guarantees of optimality or convergence

How can we Do Better?
■ Key insight 1: Learn as much as possible from data - the distance measure, the weights on the graph, even the graph structure itself (future research)
■ Key insight 2: Use well-described theories and models from probability, statistics, and computer science to describe the data, rather than developing new heuristics with ill-defined mathematical properties
■ Start by modeling the behavior of the distribution of feature vectors associated with different speech sounds, leading to a particular set of models called Gaussian Mixture Models - a formalism of the concept of the distance measure
■ Then derive models for describing the time evolution of feature vectors for speech sounds and words, called Hidden Markov Models - a generalization of the template idea in DTW

3. Gaussian Mixture Model Overview
■ Motivation for using Gaussians
■ Univariate Gaussians
■ Multivariate Gaussians
■ Estimating parameters for Gaussian Distributions
■ Need for Mixtures of Gaussians
■ Estimating parameters for Gaussian Mixtures
■ Initialization Issues
■ How many Gaussians?

How do we Capture Variability?

Data Models

The Gaussian Distribution
A lot of different types of data are distributed like a "bell-shaped curve".

4. Mathematically we can represent this by what is called a Gaussian or Normal distribution:

N(\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-(O-\mu)^2 / 2\sigma^2}

\mu is called the mean and \sigma^2 is called the variance. The value at a particular point O is called the likelihood. The integral of the above distribution is 1:

\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-(O-\mu)^2 / 2\sigma^2} \, dO = 1

It is often easier to work with the logarithm of the above:

-\ln\left(\sqrt{2\pi}\,\sigma\right) - \frac{(O-\mu)^2}{2\sigma^2}

which looks suspiciously like a weighted Euclidean distance!

Advantages of Gaussian Distributions
■ Central Limit Theorem: sums of large numbers of identically distributed random variables tend to Gaussian
■ The sums and differences of Gaussian random variables are also Gaussian
■ If X is distributed as N(\mu, \sigma) then aX + b is distributed as N(a\mu + b, (a\sigma)^2)

Gaussians in Two Dimensions

N(\mu_1, \mu_2, \sigma_1, \sigma_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}} \exp\left\{ -\frac{1}{2(1-r^2)} \left[ \frac{(O_1-\mu_1)^2}{\sigma_1^2} - \frac{2r(O_1-\mu_1)(O_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(O_2-\mu_2)^2}{\sigma_2^2} \right] \right\}

where r is the correlation between the two dimensions. If r = 0 we can write the above as

\frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-(O_1-\mu_1)^2 / 2\sigma_1^2} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-(O_2-\mu_2)^2 / 2\sigma_2^2}
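A small numerical illustration of the "weighted Euclidean distance" remark (this sketch is an addition, not from the slides; the specific observation, mean, and variances are arbitrary):

    import numpy as np

    def gaussian_log_likelihood(o, mu, sigma):
        """Log likelihood of a scalar observation o under N(mu, sigma)."""
        return -np.log(np.sqrt(2 * np.pi) * sigma) - (o - mu) ** 2 / (2 * sigma ** 2)

    # Apart from the constant -log(sqrt(2*pi)*sigma), this is (o - mu)^2 scaled by
    # 1/(2*sigma^2): a squared distance in which low-variance dimensions count more.
    for sigma in (0.5, 1.0, 2.0):
        print(sigma, gaussian_log_likelihood(1.0, 0.0, sigma))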

5. If we write the following matrix:

\Sigma = \begin{pmatrix} \sigma_1^2 & r\sigma_1\sigma_2 \\ r\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}

then, using the notation of linear algebra, we can write

N(\mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(O-\mu)^T \Sigma^{-1} (O-\mu)}

where O = (O_1, O_2) and \mu = (\mu_1, \mu_2). More generally, \mu and \Sigma can have arbitrary numbers of components, in which case the above is called a multivariate Gaussian. We can write the logarithm of the multivariate likelihood of the Gaussian as:

-\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(O-\mu)^T \Sigma^{-1} (O-\mu)

For most problems we will encounter in speech recognition, we will assume that \Sigma is diagonal, so we may write the above as:

-\frac{n}{2}\ln(2\pi) - \sum_{i=1}^{n} \ln\sigma_i - \frac{1}{2}\sum_{i=1}^{n} \frac{(O_i-\mu_i)^2}{\sigma_i^2}

Again, note the similarity to a weighted Euclidean distance.

Estimating Gaussians
Given a set of observations O_1, O_2, \ldots, O_N, it can be shown that \mu and \Sigma can be estimated as:

\mu = \frac{1}{N}\sum_{i=1}^{N} O_i

and

\Sigma = \frac{1}{N}\sum_{i=1}^{N} (O_i-\mu)^T (O_i-\mu)

How do we actually derive these formulas?

Maximum-Likelihood Estimation
For simplicity, we will assume a univariate Gaussian. We can write the likelihood of a string of observations O_1^N = O_1, O_2, \ldots, O_N as the product of the individual likelihoods:

L(O_1^N \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-(O_i-\mu)^2 / 2\sigma^2}

It is much easier to work with \mathcal{L} = \ln L:

\mathcal{L}(O_1^N \mid \mu, \sigma) = -\frac{N}{2}\ln 2\pi\sigma^2 - \frac{1}{2}\sum_{i=1}^{N} \frac{(O_i-\mu)^2}{\sigma^2}
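Before deriving these estimates, here is a quick numerical sketch of the sample-mean/covariance formulas and of the diagonal-covariance log likelihood above. It is an illustration added here, not course code; the toy data, random seed, and the helper name diag_gaussian_log_likelihood are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    O = rng.normal(loc=[1.0, -2.0], scale=[0.5, 2.0], size=(1000, 2))  # toy 2-D "frames"

    # ML estimates from the slides: sample mean and (1/N) sample covariance.
    mu = O.mean(axis=0)
    Sigma = (O - mu).T @ (O - mu) / len(O)

    def diag_gaussian_log_likelihood(o, mu, var):
        """Diagonal-covariance Gaussian log likelihood of a single frame o:
        -n/2 * ln(2*pi) - sum_i ln(sigma_i) - 1/2 * sum_i (o_i - mu_i)^2 / sigma_i^2
        (var holds the variances sigma_i^2)."""
        n = len(o)
        return (-0.5 * n * np.log(2 * np.pi)
                - 0.5 * np.sum(np.log(var))            # = -sum_i ln(sigma_i)
                - 0.5 * np.sum((o - mu) ** 2 / var))

    print("estimated mean:", mu)                    # close to [1.0, -2.0]
    print("estimated variances:", np.diag(Sigma))   # close to [0.25, 4.0]
    print("log likelihood of first frame:",
          diag_gaussian_log_likelihood(O[0], mu, np.diag(Sigma)))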

6. To find \mu and \sigma we can take the partial derivatives of the log likelihood \mathcal{L}(O_1^N \mid \mu, \sigma):

\frac{\partial \mathcal{L}(O_1^N \mid \mu, \sigma)}{\partial \mu} = \sum_{i=1}^{N} \frac{O_i - \mu}{\sigma^2}   (1)

\frac{\partial \mathcal{L}(O_1^N \mid \mu, \sigma)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \sum_{i=1}^{N} \frac{(O_i - \mu)^2}{2\sigma^4}   (2)

By setting the above terms equal to zero and solving for \mu and \sigma, we obtain the classic formulas for estimating the means and variances. Since we are setting the parameters by maximizing the likelihood of the observations, this process is called Maximum-Likelihood Estimation, or just ML estimation.

Problems with the Gaussian Assumption
Not all data are well modeled by a single Gaussian. What can we do? Well, in this case, we can try modeling the data with two Gaussians:

L(O) = p_1 \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-(O-\mu_1)^2 / 2\sigma_1^2} + p_2 \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-(O-\mu_2)^2 / 2\sigma_2^2}

where p_1 + p_2 = 1.

More generally, we can use an arbitrary number of Gaussians:

\sum_i p_i \frac{1}{\sqrt{2\pi}\,\sigma_i} e^{-(O-\mu_i)^2 / 2\sigma_i^2}

This is generally referred to as a Mixture of Gaussians or a Gaussian Mixture Model (GMM). Essentially any distribution of interest can be modeled with GMMs.

Issues with ML Estimation of GMMs
■ How many Gaussians? (to be discussed later)
■ Infinite solutions: for the two-mixture case above, we can write the overall log likelihood of the data as:

\sum_{i=1}^{N} \ln\left[ p_1 \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-(O_i-\mu_1)^2 / 2\sigma_1^2} + p_2 \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-(O_i-\mu_2)^2 / 2\sigma_2^2} \right]

Say we set \mu_1 = O_1 (and, say, p_1 = p_2 = 1/2). We can then write the above as

\ln\left[ \frac{1}{2}\frac{1}{\sqrt{2\pi}\,\sigma_1} + \frac{1}{2}\frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-(O_1-\mu_2)^2 / 2\sigma_2^2} \right] + \sum_{i=2}^{N} \ln[\cdots]

which clearly goes to \infty as \sigma_1 \to 0. Empirically we can restrict our attention to the finite local maxima of the likelihood function.
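To make the degenerate-solution argument concrete (an added illustration, not from the slides; the data values and the helper name gmm_log_likelihood are made up here), the sketch below evaluates the two-component mixture log likelihood with \mu_1 pinned to the first data point and shows it growing without bound as \sigma_1 shrinks:

    import numpy as np

    def gmm_log_likelihood(O, p, mu, sigma):
        """Total log likelihood of 1-D data O under a Gaussian mixture model.

        p, mu, sigma: per-component weights, means, and standard deviations
        (the weights are assumed to sum to 1).
        """
        O = np.asarray(O)[:, None]                      # shape (N, 1)
        comp = p / (np.sqrt(2 * np.pi) * sigma) * np.exp(
            -(O - mu) ** 2 / (2 * sigma ** 2))          # shape (N, K)
        return np.sum(np.log(comp.sum(axis=1)))

    O = np.array([0.1, 0.3, 2.9, 3.2, 3.0])
    p = np.array([0.5, 0.5])
    mu = np.array([O[0], 3.0])          # mu_1 = O_1, as in the argument above

    # Shrinking sigma_1 toward zero drives the log likelihood toward infinity:
    for s1 in (1.0, 0.1, 0.01, 0.001):
        print(s1, gmm_log_likelihood(O, p, mu, np.array([s1, 1.0])))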
