EECS E6870 - Speech Recognition
Lecture 11
Stanley F. Chen, Michael A. Picheny and Bhuvana Ramabhadran
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Columbia University
stanchen@us.ibm.com, picheny@us.ibm.com, bhuvana@us.ibm.com
24 November 2009

Outline of Today's Lecture

■ Administrivia
■ Linear Discriminant Analysis
■ Maximum Mutual Information Training
■ ROVER
■ Consensus Decoding

Administrivia

See http://www.ee.columbia.edu/~stanchen/fall09/e6870/readings/project_f09.html for suggested readings and presentation guidelines for the final project.

Linear Discriminant Analysis

A way to achieve robustness is to extract features that emphasize sound discriminability and ignore irrelevant sources of information. LDA tries to achieve this via a linear transform of the feature data.

If the main sources of class variation lie along the coordinate axes, there is no need to do anything, even if we assume a diagonal covariance matrix (as in most HMM models).

Principal Component Analysis - Motivation

If the main sources of class variation lie along the main directions of variation, we may want to rotate the coordinate axes (if using diagonal covariances).

Linear Discriminant Analysis - Motivation

If the main sources of class variation do NOT lie along the main directions of variation, we need to find the best directions.

Eigenvectors and Eigenvalues

Key concepts in feature selection are the eigenvalues and eigenvectors of a matrix. The eigenvalues and eigenvectors of a matrix are defined by the following matrix equation:

$Ax = \lambda x$

For a given matrix $A$, the eigenvectors are defined as those vectors $x$ for which the above statement is true. Each eigenvector has an associated eigenvalue, $\lambda$. To solve this equation, we can rewrite it as

$(A - \lambda I)x = 0$

If $x$ is non-zero, the only way this equation can be solved is if the determinant of the matrix $(A - \lambda I)$ is zero. The determinant of this matrix is a polynomial in $\lambda$ (called the characteristic polynomial) $p(\lambda)$. The roots of this polynomial are the eigenvalues of $A$.

For example, let us say

$A = \begin{pmatrix} 2 & -4 \\ -1 & -1 \end{pmatrix}$

In such a case,

$p(\lambda) = \begin{vmatrix} 2-\lambda & -4 \\ -1 & -1-\lambda \end{vmatrix} = (2-\lambda)(-1-\lambda) - (-4)(-1) = \lambda^2 - \lambda - 6 = (\lambda-3)(\lambda+2)$

Therefore, $\lambda_1 = 3$ and $\lambda_2 = -2$ are the eigenvalues of $A$.

To find the eigenvectors, we simply plug the eigenvalues into $(A - \lambda I)x = 0$ and solve for $x$. For example, for $\lambda_1 = 3$ we get

$\begin{pmatrix} 2-3 & -4 \\ -1 & -1-3 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$

Solving this, we find that $x_1 = -4x_2$, so the eigenvector corresponding to $\lambda_1 = 3$ is any multiple of $[-4 \;\; 1]^T$. Similarly, we find that the eigenvector corresponding to $\lambda_2 = -2$ is a multiple of $[1 \;\; 1]^T$.
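As a quick numerical check (not part of the original slides), the eigenvalues and eigenvectors of this example can be verified with NumPy. numpy.linalg.eig returns unit-length eigenvectors, so they appear as scaled versions of $[-4 \;\; 1]^T$ and $[1 \;\; 1]^T$:

```python
import numpy as np

# The example matrix from the slide.
A = np.array([[2.0, -4.0],
              [-1.0, -1.0]])

# eig returns (eigenvalues, eigenvectors); column i of the second array
# is the eigenvector associated with the i-th eigenvalue.
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)  # eigenvalues 3 and -2 (order may vary)

for lam, v in zip(eigvals, eigvecs.T):
    # Each pair satisfies A v = lambda v, up to floating-point error.
    assert np.allclose(A @ v, lam * v)
    # Rescale so the last component is 1, for comparison with [-4 1] and [1 1].
    print(lam, v / v[-1])
```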

Principal Component Analysis - Derivation

First consider the problem of best representing a set of vectors $x_1, x_2, \ldots, x_N$ by a single vector $x_0$. More specifically, let us try to minimize the sum of the squared distances from $x_0$:

$J_0(x_0) = \sum_{k=1}^{N} |x_k - x_0|^2$

It is easy to show that the sample mean $m$ minimizes $J_0$, where $m$ is given by

$m = x_0 = \frac{1}{N} \sum_{k=1}^{N} x_k$

Now, let $e$ be a unit vector in an arbitrary direction. In such a case, we can express a vector $x$ as

$x = m + a e$

For the vectors $x_k$ we can find a set of $a_k$'s that minimizes the mean square error:

$J_1(a_1, a_2, \ldots, a_N, e) = \sum_{k=1}^{N} |x_k - (m + a_k e)|^2$

If we differentiate the above with respect to $a_k$ we get

$a_k = e^T (x_k - m)$

i.e., we project $x_k$ onto the line in the direction of $e$ that passes through the sample mean $m$.

How do we find the best direction $e$? If we substitute the above solution for $a_k$ into the formula for the overall mean square error, after some manipulation we get

$J_1(e) = -e^T S e + \sum_{k=1}^{N} |x_k - m|^2$

where $S$ is called the scatter matrix and is given by

$S = \sum_{k=1}^{N} (x_k - m)(x_k - m)^T$

Notice that the scatter matrix just looks like $N$ times the sample covariance matrix of the data.

To minimize $J_1$ we want to maximize $e^T S e$ subject to the constraint that $|e| = e^T e = 1$. Using Lagrange multipliers we write $u = e^T S e - \lambda e^T e$. Differentiating $u$ with respect to $e$ and setting the result to zero, we get

$2Se - 2\lambda e = 0 \quad \text{or} \quad Se = \lambda e$

So to maximize $e^T S e$ we want to select the eigenvector of $S$ corresponding to the largest eigenvalue of $S$.

If we now want to find the best $d$ directions, the problem is to express $x$ as

$x = m + \sum_{i=1}^{d} a_i e_i$

In this case, we can write the mean square error as

$J_d = \sum_{k=1}^{N} \left| \left( m + \sum_{i=1}^{d} a_{ki} e_i \right) - x_k \right|^2$

and it is not hard to show that $J_d$ is minimized when the vectors $e_1, e_2, \ldots, e_d$ are the eigenvectors of the scatter matrix $S$ corresponding to its $d$ largest eigenvalues.
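A minimal sketch of the procedure just derived, on synthetic made-up data: center the vectors, form the scatter matrix, and keep the eigenvectors with the $d$ largest eigenvalues as the projection directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N vectors of dimension D (purely illustrative).
N, D, d = 500, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated features

# Sample mean m and scatter matrix S = sum_k (x_k - m)(x_k - m)^T.
m = X.mean(axis=0)
Xc = X - m
S = Xc.T @ Xc

# S is symmetric, so eigh applies; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)

# The d principal directions are the eigenvectors with the largest eigenvalues.
E = eigvecs[:, ::-1][:, :d]

# Coefficients a_k = E^T (x_k - m): the projection of each vector onto
# the d best directions.
A = Xc @ E
print(A.shape)  # (500, 2)
```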

Linear Discriminant Analysis - Derivation

Let us say we have vectors corresponding to $c$ classes of data. We can define a set of scatter matrices as above:

$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$

where $m_i$ is the mean of class $i$. In this case we can define the within-class scatter (essentially the average scatter across the classes relative to the mean of each class) as just

$S_W = \sum_{i=1}^{c} S_i$

Another useful scatter matrix is the between-class scatter matrix, defined as

$S_B = \sum_{i=1}^{c} (m_i - m)(m_i - m)^T$

We would like to determine a set of projection directions $V$ such that the $c$ classes are maximally discriminable in the new coordinate space given by

$\tilde{x} = V x$

A reasonable measure of discriminability is the ratio of the volumes represented by the scatter matrices. Since the determinant of a matrix is a measure of the corresponding volume, we can use the ratio of determinants as a measure:

$J = \frac{|S_B|}{|S_W|}$

So we want to find a set of directions that maximizes this expression. In the new space, we can write the between-class scatter as

$\tilde{S}_B = \sum_{i=1}^{c} (\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^T = \sum_{i=1}^{c} V (m_i - m)(m_i - m)^T V^T = V S_B V^T$

and similarly for $S_W$, so the discriminability measure becomes

$J(V) = \frac{|V S_B V^T|}{|V S_W V^T|}$

With a little bit of manipulation similar to that in PCA, it turns out that the solutions are the eigenvectors of the matrix

$S_W^{-1} S_B$

which can be computed by most common mathematical packages.
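As another illustrative sketch (again with synthetic labelled data, not from the slides), the scatter matrices defined above can be accumulated directly and the projection directions taken from the eigenvectors of $S_W^{-1} S_B$; this is a sketch under those assumptions, not a production front end.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic labelled data: c classes in D dimensions (illustrative only).
c, D, per_class = 3, 6, 200
means = rng.normal(scale=3.0, size=(c, D))
X = np.vstack([rng.normal(size=(per_class, D)) + means[i] for i in range(c)])
y = np.repeat(np.arange(c), per_class)

m = X.mean(axis=0)          # global mean
S_W = np.zeros((D, D))      # within-class scatter
S_B = np.zeros((D, D))      # between-class scatter (as defined above)
for i in range(c):
    Xi = X[y == i]
    mi = Xi.mean(axis=0)
    S_W += (Xi - mi).T @ (Xi - mi)
    S_B += np.outer(mi - m, mi - m)

# Directions = eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
V = eigvecs.real[:, order[:c - 1]].T   # S_B has rank <= c-1 useful directions

X_lda = X @ V.T                        # projected features x~ = V x
print(X_lda.shape)                     # (600, 2)
```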

Linear Discriminant Analysis in Speech Recognition

The most successful uses of LDA in speech recognition are achieved in an interesting fashion.

■ Speech recognition training data are aligned against the underlying words using the Viterbi alignment algorithm described in Lecture 4.
■ Using this alignment, each cepstral vector is tagged with a different phone or sub-phone. For English this typically results in a set of 156 (52x3) classes.
■ For each time $t$ the cepstral vector $x_t$ is spliced together with $N/2$ vectors on the left and right to form a "supervector" of $N$ cepstral vectors. ($N$ is typically 5-9 frames.) Call this "supervector" $y_t$.
■ The LDA procedure is applied to the supervectors $y_t$.
■ The top $M$ directions (usually 40-60) are chosen and the supervectors $y_t$ are projected into this lower-dimensional space.
■ The recognition system is retrained on these lower-dimensional vectors.
■ Performance improvements of 10%-15% are typical.

Training via Maximum Mutual Information

The Fundamental Equation of Speech Recognition states that

$p(S|O) = \frac{p(O|S)\, p(S)}{p(O)}$

where $S$ is the sentence and $O$ are our observations. We model $p(O|S)$ using Hidden Markov Models (HMMs). The HMMs themselves have a set of parameters $\theta$ that are estimated from a set of training data, so it is convenient to write this dependence explicitly: $p_\theta(O|S)$.

We estimate the parameters $\theta$ to maximize the likelihood of the training data. Although this seems to make some intuitive sense, is this what we are after?

Not really! (Why?) So then, why is ML estimation a good thing?
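To make the Fundamental Equation concrete, here is a toy sketch (all numbers below are invented) that combines acoustic log-likelihoods $\log p_\theta(O|S)$ and language-model log-priors $\log p(S)$ into posteriors $p(S|O)$, with $p(O)$ approximated by a sum over a small candidate list. Note that recognition decisions depend on this posterior, i.e., on how the correct sentence scores relative to its competitors, not just on its own likelihood.

```python
import math

# Toy example: three candidate sentences for a single utterance O.
# log_acoustic stands in for log p_theta(O|S) from the HMMs, and
# log_prior for log p(S) from the language model (all values invented).
candidates = {
    "the dog ate":   {"log_acoustic": -120.0, "log_prior":  -8.0},
    "the dog eight": {"log_acoustic": -119.5, "log_prior": -14.0},
    "a dog ate":     {"log_acoustic": -121.0, "log_prior":  -9.0},
}

# Unnormalized log scores: log p(O|S) + log p(S).
log_scores = {s: v["log_acoustic"] + v["log_prior"] for s, v in candidates.items()}

# Approximate log p(O) by summing over the candidate list
# (log-sum-exp for numerical stability).
top = max(log_scores.values())
log_p_O = top + math.log(sum(math.exp(ls - top) for ls in log_scores.values()))

for s, ls in sorted(log_scores.items(), key=lambda kv: -kv[1]):
    posterior = math.exp(ls - log_p_O)  # p(S|O) = p(O|S) p(S) / p(O)
    print(f"{s:15s}  p(S|O) = {posterior:.3f}")
```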
