
Lecture 13: How to train Observation Probability Densities, Mark Hasegawa-Johnson (PowerPoint PPT Presentation)



  1. Lecture 13: How to train Observation Probability Densities
Mark Hasegawa-Johnson
All content CC-SA 4.0 unless otherwise specified.
ECE 417: Multimedia Signal Processing, Fall 2020

  2. Outline
1. Review: Hidden Markov Models
2. Softmax Observation Probabilities
3. Gaussian Observation Probabilities
4. Discrete Observation Probabilities
5. Summary


  4. Hidden Markov Model

[Figure: three-state HMM with transition probabilities $a_{ij}$ and observation pdfs $b_1(\vec{x})$, $b_2(\vec{x})$, $b_3(\vec{x})$]

1. Start in state $q_t = i$ with pmf $\pi_i$.
2. Generate an observation, $\vec{x}$, with pdf $b_i(\vec{x})$.
3. Transition to a new state, $q_{t+1} = j$, according to pmf $a_{ij}$.
4. Repeat.

  5. The Forward Algorithm

Definition: $\alpha_t(i) \equiv p(\vec{x}_1, \ldots, \vec{x}_t, q_t = i \,|\, \Lambda)$. Computation:

1. Initialize:
   $$\alpha_1(i) = \pi_i b_i(\vec{x}_1), \quad 1 \le i \le N$$
2. Iterate:
   $$\alpha_t(j) = \sum_{i=1}^N \alpha_{t-1}(i)\, a_{ij}\, b_j(\vec{x}_t), \quad 1 \le j \le N,\; 2 \le t \le T$$
3. Terminate:
   $$p(X|\Lambda) = \sum_{i=1}^N \alpha_T(i)$$
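The three steps above can be sketched in numpy (a minimal sketch, not from the slides: the function name `forward` and the convention that the observation likelihoods `B[t, i]` = $b_i(\vec{x}_t)$ are precomputed are assumptions):

```python
import numpy as np

def forward(pi, A, B):
    """Forward algorithm: alpha[t, i] = p(x_1..x_t, q_t = i | Lambda).

    pi : (N,)   initial state pmf, pi_i
    A  : (N, N) transition matrix, A[i, j] = a_ij
    B  : (T, N) observation likelihoods, B[t, i] = b_i(x_t)
    Returns (alpha, p_X) where p_X = p(X | Lambda).
    """
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                      # initialize
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]  # iterate
    return alpha, alpha[-1].sum()             # terminate
```

For long utterances this recursion is usually scaled or run in the log domain, since the raw probabilities underflow.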

  6. The Backward Algorithm

Definition: $\beta_t(i) \equiv p(\vec{x}_{t+1}, \ldots, \vec{x}_T \,|\, q_t = i, \Lambda)$. Computation:

1. Initialize:
   $$\beta_T(i) = 1, \quad 1 \le i \le N$$
2. Iterate:
   $$\beta_t(i) = \sum_{j=1}^N a_{ij}\, b_j(\vec{x}_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\; 1 \le t \le T-1$$
3. Terminate:
   $$p(X|\Lambda) = \sum_{i=1}^N \pi_i\, b_i(\vec{x}_1)\, \beta_1(i)$$
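The backward recursion admits the same kind of sketch (array conventions as assumed for the forward pass: `B[t, i]` = $b_i(\vec{x}_t)$; the function name is illustrative):

```python
import numpy as np

def backward(pi, A, B):
    """Backward algorithm: beta[t, i] = p(x_{t+1}..x_T | q_t = i, Lambda).

    Returns (beta, p_X); p_X should match the forward algorithm's terminate step.
    """
    T, N = B.shape
    beta = np.ones((T, N))                      # initialize: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])  # iterate
    return beta, (pi * B[0] * beta[0]).sum()    # terminate
```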

  7. The Baum-Welch Algorithm

1. Initial State Probabilities:
   $$\pi_i' = \frac{\sum_{\text{sequences}} \gamma_1(i)}{\#\,\text{sequences}}$$
2. Transition Probabilities:
   $$a_{ij}' = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{j=1}^N \sum_{t=1}^{T-1} \xi_t(i,j)}$$
3. Observation Probabilities: minimize
   $$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t)$$
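The transition re-estimation formula above reduces to a sum and a row normalization (a sketch assuming `xi[t, i, j]` holds $\xi_t(i,j)$ for $t = 1 \ldots T-1$; the helper name is illustrative):

```python
import numpy as np

def reestimate_transitions(xi):
    """a'_ij = sum_t xi_t(i, j) / sum_j sum_t xi_t(i, j).

    xi : (T-1, N, N) array of pairwise state posteriors.
    Returns the (N, N) re-estimated transition matrix, rows summing to 1.
    """
    num = xi.sum(axis=0)                           # sum over t
    return num / num.sum(axis=1, keepdims=True)    # normalize each row i
```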


  9. Review: Conditional Probability

The relationship among posterior, prior, evidence and likelihood is
$$p(q|\vec{x})\, p(\vec{x}) = p(\vec{x}|q)\, p(q)$$
Since the softmax is normalized so that $1 = \sum_q \text{softmax}(e[q])$, it makes most sense to interpret $\text{softmax}(e[q]) = p(q|\vec{x})$. Therefore, the likelihood should be
$$b_q(\vec{x}) \equiv p(\vec{x}|q) = \frac{p(\vec{x})\, \text{softmax}(e[q])}{p(q)}$$

  10. Relationship between the Likelihood and the Posterior

Therefore, the likelihood should be
$$b_q(\vec{x}) \equiv p(\vec{x}|q) = \frac{p(\vec{x})\, \text{softmax}(e[q])}{p(q)}$$
However:

- If we choose training data with equal numbers of each phone, then we can assume $p(q) = 1/N$.
- $p(\vec{x})$ is independent of $q$, so it doesn't affect recognition. So let's assume that $p(\vec{x}) = 1/N$ also.

  11. Softmax Observation Probabilities

Given the assumptions that $p(q) = p(\vec{x}) = 1/N$,
$$b_q(\vec{x}) = p(\vec{x}|q) = p(q|\vec{x}) = \text{softmax}(e[q])$$
The assumptions are unrealistic: we sometimes need to adjust for low-frequency phones in order to get good-quality recognition. But let's first derive the solution given these assumptions, and then we'll see if the assumptions can be relaxed.

  12. Softmax Observation Probabilities

Given the assumptions that $p(q) = p(\vec{x}) = 1/N$,
$$b_q(\vec{x}) = \text{softmax}(e[q]) = \frac{\exp(e[q])}{\sum_{\ell=1}^N \exp(e[\ell])},$$
where $e[i]$ is the $i$th element of the output excitation row vector, $\vec{e} = \vec{h} W$, computed as the product of a weight matrix $W$ with the hidden layer activation row vector, $\vec{h}$.
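The softmax output layer above can be sketched as follows (a minimal sketch; the max-subtraction trick for numerical stability is standard practice, not something the slides specify):

```python
import numpy as np

def softmax_output(h, W):
    """b_q(x) = softmax(e[q]) with e = h W.

    h : (J,)   hidden layer activation row vector
    W : (J, K) output weight matrix
    Returns the (K,) vector of observation probabilities b_q(x).
    """
    e = h @ W
    e = e - e.max()        # subtract the max so exp() cannot overflow
    p = np.exp(e)
    return p / p.sum()     # normalize so the outputs sum to 1
```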

  13. Expected Negative Log Likelihood

The neural net is trained to minimize the expected negative log likelihood, a.k.a. the cross-entropy between $\gamma_t(i)$ and $b_i(\vec{x}_t)$:
$$\mathcal{L}_{CE} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t)$$
Remember that, since $\vec{e} = \vec{h} W$, the weight gradient is just
$$\frac{d\mathcal{L}_{CE}}{dw_{jk}} = \sum_{t=1}^T \frac{d\mathcal{L}_{CE}}{de_t[k]} \frac{\partial e_t[k]}{\partial w_{jk}} = \sum_{t=1}^T \frac{d\mathcal{L}_{CE}}{de_t[k]}\, h_t[j],$$
where $h_t[j]$ is the $j$th component of $\vec{h}$ at time $t$, and $e_t[k]$ is the $k$th component of $\vec{e}$ at time $t$.

  14. Back-Prop

Let's find the loss gradient w.r.t. $e_t[k]$. The loss is
$$\mathcal{L}_{CE} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t),$$
so its gradient is
$$\frac{d\mathcal{L}_{CE}}{de_t[k]} = -\frac{1}{T} \sum_{i=1}^N \frac{\gamma_t(i)}{b_i(\vec{x}_t)} \frac{\partial b_i(\vec{x}_t)}{\partial e_t[k]}$$

  15. Differentiating the Softmax

The softmax is
$$b_i(\vec{x}) = \frac{\exp(e[i])}{\sum_\ell \exp(e[\ell])} = \frac{A}{B}$$
Its derivative is
$$\frac{\partial b_i(\vec{x})}{\partial e[k]} = \frac{1}{B}\frac{\partial A}{\partial e[k]} - \frac{A}{B^2}\frac{\partial B}{\partial e[k]}
= \begin{cases}
\dfrac{\exp(e[i])}{\sum_\ell \exp(e[\ell])} - \dfrac{\exp(e[i])^2}{\left(\sum_\ell \exp(e[\ell])\right)^2} & i = k \\[2ex]
-\dfrac{\exp(e[i]) \exp(e[k])}{\left(\sum_\ell \exp(e[\ell])\right)^2} & i \ne k
\end{cases}
= \begin{cases}
b_i(\vec{x}) - b_i^2(\vec{x}) & i = k \\
-b_i(\vec{x})\, b_k(\vec{x}) & i \ne k
\end{cases}$$
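The case analysis above can be written as the single Jacobian $\partial b_i / \partial e_k = b_i(\delta_{ik} - b_k)$ and checked against finite differences; the sketch below does exactly that (helper names are illustrative):

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax of an excitation vector e."""
    p = np.exp(e - e.max())
    return p / p.sum()

def softmax_jacobian(e):
    """J[i, k] = db_i/de_k = b_i * (delta_ik - b_k).

    diag(b) supplies the i = k term (b_i - b_i^2 on the diagonal);
    outer(b, b) supplies the -b_i b_k term everywhere.
    """
    b = softmax(e)
    return np.diag(b) - np.outer(b, b)
```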

  16. The Loss Gradient

The loss gradient is
$$\begin{aligned}
\frac{d\mathcal{L}_{CE}}{de_t[k]} &= -\frac{1}{T} \sum_{i=1}^N \frac{\gamma_t(i)}{b_i(\vec{x}_t)} \frac{\partial b_i(\vec{x}_t)}{\partial e_t[k]} \\
&= -\frac{1}{T} \left( \gamma_t(k)\left(1 - b_k(\vec{x}_t)\right) - \sum_{i \ne k} \gamma_t(i)\, b_k(\vec{x}_t) \right) \\
&= -\frac{1}{T} \left( \gamma_t(k) - b_k(\vec{x}_t) \sum_{i=1}^N \gamma_t(i) \right) \\
&= -\frac{1}{T} \left( \gamma_t(k) - b_k(\vec{x}_t) \right)
\end{aligned}$$

  17. Summary: Softmax Observation Probabilities

Training $W$ to minimize the cross-entropy between $\gamma_t(i)$ and $b_i(\vec{x}_t)$,
$$\mathcal{L}_{CE} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t),$$
yields the following weight gradient:
$$\frac{d\mathcal{L}_{CE}}{dw_{jk}} = -\frac{1}{T} \sum_{t=1}^T h_t[j] \left( \gamma_t(k) - b_k(\vec{x}_t) \right),$$
which vanishes when the neural net estimates $b_k(\vec{x}_t) \to \gamma_t(k)$ as well as it can.
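Putting the pieces together, the weight gradient formula can be sketched and verified against a finite-difference evaluation of $\mathcal{L}_{CE}$ (a sketch; the array names `H` and `Gamma` and the row-vector conventions are assumptions):

```python
import numpy as np

def softmax_rows(E):
    """Row-wise softmax: B[t, k] = exp(E[t, k]) / sum_l exp(E[t, l])."""
    P = np.exp(E - E.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

def weight_gradient(H, W, Gamma):
    """dL_CE/dw_jk = -(1/T) sum_t h_t[j] (gamma_t(k) - b_k(x_t)).

    H     : (T, J) hidden activations, row t is h_t
    W     : (J, K) output weight matrix
    Gamma : (T, K) state posteriors gamma_t(k) from forward-backward
    """
    T = H.shape[0]
    B = softmax_rows(H @ W)            # B[t, k] = b_k(x_t)
    return -(H.T @ (Gamma - B)) / T    # shape (J, K), entry [j, k]
```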

  18. Summary: Softmax Observation Probabilities

The Baum-Welch algorithm alternates between two types of estimation, often called the E-step (expectation) and the M-step (maximization or minimization):

1. E-step: Use the forward-backward algorithm to re-estimate $\gamma_t(i) = p(q_t = i | X, \Lambda)$.
2. M-step: Train the neural net for a few iterations of gradient descent, so that $b_k(\vec{x}_t) \to \gamma_t(k)$.

  19. Final Note: Those Ridiculous Assumptions

As a final note, let's see if we can eliminate those ridiculous assumptions, $p(q) = p(\vec{x}) = 1/N$. How? Well, the weight gradient goes to zero when $\sum_{t=1}^T h_t[j](\gamma_t(k) - b_k(\vec{x}_t)) = 0$. There are at least two ways in which this can happen:

1. $b_k(\vec{x}_t) = \gamma_t(k)$. The neural net is successfully estimating the posterior. This is the best possible solution if $p(q = i) = p(\vec{x}) = \frac{1}{N}$.
2. $b_k(\vec{x}_t) - \gamma_t(k)$ is uncorrelated with $h_t[j]$, e.g., because it is zero mean and independent of $\vec{x}_t$.

  20. Final Note: Those Ridiculous Assumptions

The weight gradient goes to zero if $\gamma_t(k) - b_k(\vec{x}_t)$ is zero mean and independent of $\vec{x}_t$. For example:

- $b_k(\vec{x})$ might differ from $\gamma_t(k)$ by a global scale factor. Instead of softmax, we might use some other normalization, either because (a) it's scaled more like a likelihood, or (b) it has nice numerical properties. An example of (b) is:
  $$b_i(\vec{x}) = \frac{\exp(e[i])}{\max_j \exp(e[j])}$$
- $b_k(\vec{x})$ might differ from $\gamma_t(k)$ by a phone-dependent scale factor, e.g., we might choose
  $$b_i(\vec{x}) = \frac{p(q = i | \vec{x})}{p(q = i)} = \frac{1}{p(q = i)} \cdot \frac{\exp(e[i])}{\sum_{j=1}^N \exp(e[j])}$$


  22. Baum-Welch with Gaussian Probabilities

Baum-Welch asks us to minimize the cross-entropy between $\gamma_t(i)$ and $b_i(\vec{x}_t)$:
$$\mathcal{L}_{CE} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^N \gamma_t(i) \ln b_i(\vec{x}_t)$$
In order to force $b_i(\vec{x}_t)$ to be a likelihood, rather than a posterior, one way is to use a function that is guaranteed to be a properly normalized pdf. For example, a Gaussian:
$$b_i(\vec{x}) = \mathcal{N}(\vec{x}; \vec{\mu}_i, \Sigma_i)$$

  23. Diagonal-Covariance Gaussian pdf

Let's assume the feature vector has $D$ dimensions, $\vec{x} = [x_1, \ldots, x_D]$. The Gaussian pdf is
$$\mathcal{N}(\vec{x}; \vec{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\vec{x} - \vec{\mu}) \Sigma^{-1} (\vec{x} - \vec{\mu})^T}$$
Let's assume a diagonal covariance matrix, $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_D^2)$, so that
$$\mathcal{N}(\vec{x}; \vec{\mu}, \Sigma) = \frac{1}{\sqrt{\prod_{d=1}^D 2\pi\sigma_d^2}}\, e^{-\frac{1}{2} \sum_{d=1}^D \frac{(x_d - \mu_d)^2}{\sigma_d^2}}$$
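Taking the log of the diagonal-covariance formula above gives a per-dimension sum that is cheap to evaluate and immune to underflow; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def diag_gaussian_logpdf(x, mu, var):
    """ln N(x; mu, diag(var)) for a D-dimensional diagonal-covariance Gaussian.

    x, mu, var : (D,) arrays; var[d] = sigma_d^2.
    ln N = -0.5 * sum_d [ ln(2 pi sigma_d^2) + (x_d - mu_d)^2 / sigma_d^2 ]
    """
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum()
```

Working in the log domain also matches the Baum-Welch loss, which needs $\ln b_i(\vec{x}_t)$ directly.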
