
Lecture 13: How to train Observation Probability Densities, Mark Hasegawa-Johnson (PowerPoint PPT Presentation)



  1. Lecture 13: How to train Observation Probability Densities
Mark Hasegawa-Johnson
All content CC-SA 4.0 unless otherwise specified.
ECE 417: Multimedia Signal Processing, Fall 2020

  2. Outline
1. Review: Hidden Markov Models
2. Softmax Observation Probabilities
3. Gaussian Observation Probabilities
4. Discrete Observation Probabilities
5. Summary


  4. Hidden Markov Model

[Figure: three-state HMM with transition probabilities $a_{ij}$ and observation pdfs $b_1(\vec{x})$, $b_2(\vec{x})$, $b_3(\vec{x})$]

1. Start in state $q_t = i$ with pmf $\pi_i$.
2. Generate an observation, $\vec{x}$, with pdf $b_i(\vec{x})$.
3. Transition to a new state, $q_{t+1} = j$, according to pmf $a_{ij}$.
4. Repeat.

  5. The Forward Algorithm

Definition: $\alpha_t(i) \equiv p(\vec{x}_1, \ldots, \vec{x}_t, q_t = i \,|\, \Lambda)$. Computation:

1. Initialize:
   $$\alpha_1(i) = \pi_i b_i(\vec{x}_1), \quad 1 \le i \le N$$
2. Iterate:
   $$\alpha_t(j) = \sum_{i=1}^N \alpha_{t-1}(i)\, a_{ij}\, b_j(\vec{x}_t), \quad 1 \le j \le N,\; 2 \le t \le T$$
3. Terminate:
   $$p(X|\Lambda) = \sum_{i=1}^N \alpha_T(i)$$
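The three steps above can be sketched in numpy (a minimal sketch, not from the slides: the function name `forward` and the convention that the observation likelihoods `B[t, i]` = $b_i(\vec{x}_t)$ are precomputed are assumptions):

```python
import numpy as np

def forward(pi, A, B):
    """Forward algorithm: alpha[t, i] = p(x_1..x_t, q_t = i | Lambda).

    pi : (N,)   initial state pmf, pi_i
    A  : (N, N) transition matrix, A[i, j] = a_ij
    B  : (T, N) observation likelihoods, B[t, i] = b_i(x_t)
    Returns (alpha, p_X) where p_X = p(X | Lambda).
    """
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                      # initialize
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]  # iterate
    return alpha, alpha[-1].sum()             # terminate
```

For long utterances this recursion is usually scaled or run in the log domain, since the raw probabilities underflow.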

  6. The Backward Algorithm

Definition: $\beta_t(i) \equiv p(\vec{x}_{t+1}, \ldots, \vec{x}_T \,|\, q_t = i, \Lambda)$. Computation:

1. Initialize:
   $$\beta_T(i) = 1, \quad 1 \le i \le N$$
2. Iterate:
   $$\beta_t(i) = \sum_{j=1}^N a_{ij}\, b_j(\vec{x}_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\; 1 \le t \le T-1$$
3. Terminate:
   $$p(X|\Lambda) = \sum_{i=1}^N \pi_i\, b_i(\vec{x}_1)\, \beta_1(i)$$
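The backward recursion admits the same kind of sketch (array conventions as assumed for the forward pass: `B[t, i]` = $b_i(\vec{x}_t)$; the function name is illustrative):

```python
import numpy as np

def backward(pi, A, B):
    """Backward algorithm: beta[t, i] = p(x_{t+1}..x_T | q_t = i, Lambda).

    Returns (beta, p_X); p_X should match the forward algorithm's terminate step.
    """
    T, N = B.shape
    beta = np.ones((T, N))                      # initialize: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])  # iterate
    return beta, (pi * B[0] * beta[0]).sum()    # terminate
```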

  7. The Baum-Welch Algorithm

1. Initial State Probabilities:
   $$\pi_i' = \frac{\sum_{\text{sequences}} \gamma_1(i)}{\#\,\text{sequences}}$$
2. Transition Probabilities:
   $$a_{ij}' = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{j=1}^N \sum_{t=1}^{T-1} \xi_t(i,j)}$$
3. Observation Probabilities: minimize
   $$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t)$$
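The transition re-estimation formula above reduces to a sum and a row normalization (a sketch assuming `xi[t, i, j]` holds $\xi_t(i,j)$ for $t = 1 \ldots T-1$; the helper name is illustrative):

```python
import numpy as np

def reestimate_transitions(xi):
    """a'_ij = sum_t xi_t(i, j) / sum_j sum_t xi_t(i, j).

    xi : (T-1, N, N) array of pairwise state posteriors.
    Returns the (N, N) re-estimated transition matrix, rows summing to 1.
    """
    num = xi.sum(axis=0)                           # sum over t
    return num / num.sum(axis=1, keepdims=True)    # normalize each row i
```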


  9. Review: Conditional Probability

The relationship among posterior, prior, evidence and likelihood is
$$p(q|\vec{x})\, p(\vec{x}) = p(\vec{x}|q)\, p(q)$$
Since the softmax is normalized so that $1 = \sum_q \text{softmax}(e[q])$, it makes most sense to interpret $\text{softmax}(e[q]) = p(q|\vec{x})$. Therefore, the likelihood should be
$$b_q(\vec{x}) \equiv p(\vec{x}|q) = \frac{p(\vec{x})\, \text{softmax}(e[q])}{p(q)}$$

  10. Relationship between the Likelihood and the Posterior

Therefore, the likelihood should be
$$b_q(\vec{x}) \equiv p(\vec{x}|q) = \frac{p(\vec{x})\, \text{softmax}(e[q])}{p(q)}$$
However:

- If we choose training data with equal numbers of each phone, then we can assume $p(q) = 1/N$.
- $p(\vec{x})$ is independent of $q$, so it doesn't affect recognition. So let's assume that $p(\vec{x}) = 1/N$ also.

  11. Softmax Observation Probabilities

Given the assumptions that $p(q) = p(\vec{x}) = 1/N$,
$$b_q(\vec{x}) = p(\vec{x}|q) = p(q|\vec{x}) = \text{softmax}(e[q])$$
The assumptions are unrealistic: we sometimes need to adjust for low-frequency phones in order to get good-quality recognition. But let's first derive the solution given these assumptions, and then we'll see if the assumptions can be relaxed.

  12. Softmax Observation Probabilities

Given the assumptions that $p(q) = p(\vec{x}) = 1/N$,
$$b_q(\vec{x}) = \text{softmax}(e[q]) = \frac{\exp(e[q])}{\sum_{\ell=1}^N \exp(e[\ell])},$$
where $e[i]$ is the $i$th element of the output excitation row vector, $\vec{e} = \vec{h} W$, computed as the product of a weight matrix $W$ with the hidden layer activation row vector, $\vec{h}$.
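The softmax output layer above can be sketched as follows (a minimal sketch; the max-subtraction trick for numerical stability is standard practice, not something the slides specify):

```python
import numpy as np

def softmax_output(h, W):
    """b_q(x) = softmax(e[q]) with e = h W.

    h : (J,)   hidden layer activation row vector
    W : (J, K) output weight matrix
    Returns the (K,) vector of observation probabilities b_q(x).
    """
    e = h @ W
    e = e - e.max()        # subtract the max so exp() cannot overflow
    p = np.exp(e)
    return p / p.sum()     # normalize so the outputs sum to 1
```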

  13. Expected Negative Log Likelihood

The neural net is trained to minimize the expected negative log likelihood, a.k.a. the cross-entropy between $\gamma_t(i)$ and $b_i(\vec{x}_t)$:
$$\mathcal{L}_{CE} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t)$$
Remember that, since $\vec{e} = \vec{h} W$, the weight gradient is just
$$\frac{d\mathcal{L}_{CE}}{dw_{jk}} = \sum_{t=1}^T \frac{d\mathcal{L}_{CE}}{de_t[k]} \frac{\partial e_t[k]}{\partial w_{jk}} = \sum_{t=1}^T \frac{d\mathcal{L}_{CE}}{de_t[k]}\, h_t[j],$$
where $h_t[j]$ is the $j$th component of $\vec{h}$ at time $t$, and $e_t[k]$ is the $k$th component of $\vec{e}$ at time $t$.

  14. Back-Prop

Let's find the loss gradient w.r.t. $e_t[k]$. The loss is
$$\mathcal{L}_{CE} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t),$$
so its gradient is
$$\frac{d\mathcal{L}_{CE}}{de_t[k]} = -\frac{1}{T} \sum_{i=1}^N \frac{\gamma_t(i)}{b_i(\vec{x}_t)} \frac{\partial b_i(\vec{x}_t)}{\partial e_t[k]}$$

  15. Differentiating the Softmax

The softmax is
$$b_i(\vec{x}) = \frac{\exp(e[i])}{\sum_\ell \exp(e[\ell])} = \frac{A}{B}$$
Its derivative is
$$\frac{\partial b_i(\vec{x})}{\partial e[k]} = \frac{1}{B}\frac{\partial A}{\partial e[k]} - \frac{A}{B^2}\frac{\partial B}{\partial e[k]}
= \begin{cases}
\dfrac{\exp(e[i])}{\sum_\ell \exp(e[\ell])} - \dfrac{\exp(e[i])^2}{\left(\sum_\ell \exp(e[\ell])\right)^2} & i = k \\[2ex]
-\dfrac{\exp(e[i]) \exp(e[k])}{\left(\sum_\ell \exp(e[\ell])\right)^2} & i \ne k
\end{cases}
= \begin{cases}
b_i(\vec{x}) - b_i^2(\vec{x}) & i = k \\
-b_i(\vec{x})\, b_k(\vec{x}) & i \ne k
\end{cases}$$
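The case analysis above can be written as the single Jacobian $\partial b_i / \partial e_k = b_i(\delta_{ik} - b_k)$ and checked against finite differences; the sketch below does exactly that (helper names are illustrative):

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax of an excitation vector e."""
    p = np.exp(e - e.max())
    return p / p.sum()

def softmax_jacobian(e):
    """J[i, k] = db_i/de_k = b_i * (delta_ik - b_k).

    diag(b) supplies the i = k term (b_i - b_i^2 on the diagonal);
    outer(b, b) supplies the -b_i b_k term everywhere.
    """
    b = softmax(e)
    return np.diag(b) - np.outer(b, b)
```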

  16. The Loss Gradient

The loss gradient is
$$\begin{aligned}
\frac{d\mathcal{L}_{CE}}{de_t[k]} &= -\frac{1}{T} \sum_{i=1}^N \frac{\gamma_t(i)}{b_i(\vec{x}_t)} \frac{\partial b_i(\vec{x}_t)}{\partial e_t[k]} \\
&= -\frac{1}{T} \left( \gamma_t(k)\left(1 - b_k(\vec{x}_t)\right) - \sum_{i \ne k} \gamma_t(i)\, b_k(\vec{x}_t) \right) \\
&= -\frac{1}{T} \left( \gamma_t(k) - b_k(\vec{x}_t) \sum_{i=1}^N \gamma_t(i) \right) \\
&= -\frac{1}{T} \left( \gamma_t(k) - b_k(\vec{x}_t) \right)
\end{aligned}$$

  17. Summary: Softmax Observation Probabilities

Training $W$ to minimize the cross-entropy between $\gamma_t(i)$ and $b_i(\vec{x}_t)$,
$$\mathcal{L}_{CE} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t),$$
yields the following weight gradient:
$$\frac{d\mathcal{L}_{CE}}{dw_{jk}} = -\frac{1}{T} \sum_{t=1}^T h_t[j] \left( \gamma_t(k) - b_k(\vec{x}_t) \right),$$
which vanishes when the neural net estimates $b_k(\vec{x}_t) \to \gamma_t(k)$ as well as it can.
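Putting the pieces together, the weight gradient formula can be sketched and verified against a finite-difference evaluation of $\mathcal{L}_{CE}$ (a sketch; the array names `H` and `Gamma` and the row-vector conventions are assumptions):

```python
import numpy as np

def softmax_rows(E):
    """Row-wise softmax: B[t, k] = exp(E[t, k]) / sum_l exp(E[t, l])."""
    P = np.exp(E - E.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

def weight_gradient(H, W, Gamma):
    """dL_CE/dw_jk = -(1/T) sum_t h_t[j] (gamma_t(k) - b_k(x_t)).

    H     : (T, J) hidden activations, row t is h_t
    W     : (J, K) output weight matrix
    Gamma : (T, K) state posteriors gamma_t(k) from forward-backward
    """
    T = H.shape[0]
    B = softmax_rows(H @ W)            # B[t, k] = b_k(x_t)
    return -(H.T @ (Gamma - B)) / T    # shape (J, K), entry [j, k]
```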

  18. Summary: Softmax Observation Probabilities

The Baum-Welch algorithm alternates between two types of estimation, often called the E-step (expectation) and the M-step (maximization or minimization):

1. E-step: Use the forward-backward algorithm to re-estimate $\gamma_t(i) = p(q_t = i | X, \Lambda)$.
2. M-step: Train the neural net for a few iterations of gradient descent, so that $b_k(\vec{x}_t) \to \gamma_t(k)$.

  19. Final Note: Those Ridiculous Assumptions

As a final note, let's see if we can eliminate those ridiculous assumptions, $p(q) = p(\vec{x}) = 1/N$. How? Well, the weight gradient goes to zero when $\sum_{t=1}^T h_t[j](\gamma_t(k) - b_k(\vec{x}_t)) = 0$. There are at least two ways in which this can happen:

1. $b_k(\vec{x}_t) = \gamma_t(k)$. The neural net is successfully estimating the posterior. This is the best possible solution if $p(q = i) = p(\vec{x}) = \frac{1}{N}$.
2. $b_k(\vec{x}_t) - \gamma_t(k)$ is uncorrelated with $h_t[j]$, e.g., because it is zero mean and independent of $\vec{x}_t$.

  20. Final Note: Those Ridiculous Assumptions

The weight gradient goes to zero if $\gamma_t(k) - b_k(\vec{x}_t)$ is zero mean and independent of $\vec{x}_t$. For example:

- $b_k(\vec{x})$ might differ from $\gamma_t(k)$ by a global scale factor. Instead of softmax, we might use some other normalization, either because (a) it's scaled more like a likelihood, or (b) it has nice numerical properties. An example of (b) is:
  $$b_i(\vec{x}) = \frac{\exp(e[i])}{\max_j \exp(e[j])}$$
- $b_k(\vec{x})$ might differ from $\gamma_t(k)$ by a phone-dependent scale factor, e.g., we might choose
  $$b_i(\vec{x}) = \frac{p(q = i | \vec{x})}{p(q = i)} = \frac{1}{p(q = i)} \cdot \frac{\exp(e[i])}{\sum_{j=1}^N \exp(e[j])}$$


  22. Baum-Welch with Gaussian Probabilities

Baum-Welch asks us to minimize the cross-entropy between $\gamma_t(i)$ and $b_i(\vec{x}_t)$:
$$\mathcal{L}_{CE} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t)$$
In order to force $b_i(\vec{x}_t)$ to be a likelihood, rather than a posterior, one way is to use a function that is guaranteed to be a properly normalized pdf. For example, a Gaussian:
$$b_i(\vec{x}) = \mathcal{N}(\vec{x}; \vec{\mu}_i, \Sigma_i)$$

  23. Diagonal-Covariance Gaussian pdf

Let's assume the feature vector has $D$ dimensions, $\vec{x} = [x_1, \ldots, x_D]$. The Gaussian pdf is
$$\mathcal{N}(\vec{x}; \vec{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\vec{x} - \vec{\mu}) \Sigma^{-1} (\vec{x} - \vec{\mu})^T}$$
Let's assume a diagonal covariance matrix, $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_D^2)$, so that
$$\mathcal{N}(\vec{x}; \vec{\mu}, \Sigma) = \frac{1}{\sqrt{\prod_{d=1}^D 2\pi\sigma_d^2}}\, e^{-\frac{1}{2} \sum_{d=1}^D \frac{(x_d - \mu_d)^2}{\sigma_d^2}}$$
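Taking the log of the diagonal-covariance formula above gives a per-dimension sum that is cheap to evaluate and immune to underflow; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def diag_gaussian_logpdf(x, mu, var):
    """ln N(x; mu, diag(var)) for a D-dimensional diagonal-covariance Gaussian.

    x, mu, var : (D,) arrays; var[d] = sigma_d^2.
    ln N = -0.5 * sum_d [ ln(2 pi sigma_d^2) + (x_d - mu_d)^2 / sigma_d^2 ]
    """
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum()
```

Working in the log domain also matches the Baum-Welch loss, which needs $\ln b_i(\vec{x}_t)$ directly.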
