HMMs for Acoustic Modeling (Part II) Lecture 3 CS 753 Instructor: Preethi Jyothi
Recap: HMMs for Acoustic Modeling
What are (first-order) HMMs? What are the simplifying assumptions governing HMMs? What are the three fundamental problems related to HMMs?
1. What is the forward algorithm? What is it used to compute?
   Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
2. What is the Viterbi algorithm? What is it used to compute?
   Decoding: Given as input an HMM λ = (A, B) and a sequence of observations O = o_1, o_2, ..., o_T, find the most probable sequence of states Q = q_1 q_2 q_3 ... q_T.
Problem 3: Learning in HMMs
Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.
Standard algorithm for HMM training: the forward-backward or Baum-Welch algorithm.
Forward and Backward Probabilities
The Baum-Welch algorithm iteratively estimates the transition and observation probabilities, and uses these values to derive even better estimates.
Two probabilities are required to compute the estimates of the transition and observation probabilities:
1. Forward probability (recall): α_t(j) = P(o_1, o_2, ..., o_t, q_t = j | λ)
2. Backward probability: β_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = i, λ)
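For reference, a minimal numpy sketch of the forward recursion (the backward recursion is sketched after the next slide). The array layout (pi, A, B, integer-coded observations) is an assumption made for illustration; this version has no distinguished final state, so P(O | λ) is the sum of α_T(j) over j.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward pass: alpha[t, j] = P(o_1..o_t, q_t = j | lambda).
    pi:  (N,)  initial state probabilities
    A:   (N, N) transition probabilities, A[i, j] = a_ij
    B:   (N, V) discrete observation probabilities, B[j, k] = b_j(v_k)
    obs: length-T sequence of observation indices
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # recursion
    return alpha                                      # P(O|lambda) = alpha[-1].sum()
```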
Backward probability
1. Initialization: β_T(i) = 1, 1 ≤ i ≤ N
2. Recursion: β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(o_{t+1}) β_{t+1}(j), 1 ≤ i ≤ N, 1 ≤ t < T
3. Termination: P(O | λ) = Σ_{j=1}^{N} π_j b_j(o_1) β_1(j)
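A matching numpy sketch of the backward recursion, under the same illustrative array layout as the forward sketch above.

```python
import numpy as np

def backward(A, B, obs):
    """Backward pass: beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                 # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
    # termination check: P(O|lambda) = sum_j pi_j * b_j(o_1) * beta[0, j]
```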
Visualising backward probability computation
β_t(i) = Σ_j a_{ij} b_j(o_{t+1}) β_{t+1}(j)
[Figure A.11: the computation of β_t(i) for state q_i at time t, by summing over all successor states q_j at time t+1 the products of the transition probability a_{ij}, the observation probability b_j(o_{t+1}), and the backward probability β_{t+1}(j)]
1. Baum-Welch: Estimating â_{ij}
To estimate a_{ij}, we first define ξ_t(i, j), the probability of being in state i at time t and state j at time t+1, given the observations and the model:
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)
which works out to be
ξ_t(i, j) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / Σ_{j=1}^{N} α_t(j) β_t(j)
Then, â_{ij} = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} Σ_{k=1}^{N} ξ_t(i, k)
[Figure: the transition from state s_i at time t to state s_j at time t+1, with α_t(i) covering o_1...o_t, the arc weighted by a_{ij} b_j(o_{t+1}), and β_{t+1}(j) covering o_{t+2}...o_T]
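A small numpy sketch of this ξ computation, assuming α and β have already been computed (e.g., with the forward/backward sketches above); note that Σ_j α_t(j) β_t(j) equals P(O | λ) for every t, so a single likelihood value can be used as the denominator.

```python
import numpy as np

def xi(alpha, beta, A, B, obs):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda), for t = 0..T-2."""
    T, N = len(obs), A.shape[0]
    likelihood = alpha[-1].sum()      # P(O|lambda) = sum_j alpha_T(j) = sum_j alpha_t(j) beta_t(j)
    xi_ = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # numerator: alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        xi_[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
    return xi_ / likelihood
```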
2. Baum-Welch: Estimating b̂_j(v_k)
To estimate b_j(v_k), we define γ_t(j), the state occupancy probability:
γ_t(j) = P(q_t = j | O, λ)
which works out to be
γ_t(j) = α_t(j) β_t(j) / P(O | λ)
Then, for discrete outputs, b̂_j(v_k) = Σ_{t=1, s.t. o_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
[Figure: state s_j at time t, with α_t(j) covering o_1...o_t and β_t(j) covering o_{t+1}...o_T]
Bringing it all together: Baum-Welch
Estimate the HMM parameters iteratively using the EM algorithm. For each iteration, do:
E-step: For all time-state pairs, compute the state occupation probabilities γ_t(j) and ξ_t(i, j)
M-step: Re-estimate the HMM parameters, i.e. transition probabilities and observation probabilities, based on the estimates derived in the E-step
Baum-Welch algorithm (pseudocode)
function FORWARD-BACKWARD(observations of length T, output vocabulary V, hidden state set Q) returns HMM = (A, B)
  initialize A and B
  iterate until convergence:
    E-step:
      γ_t(j) = α_t(j) β_t(j) / α_T(q_F)   ∀ t and j
      ξ_t(i, j) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / α_T(q_F)   ∀ t, i, and j
    M-step:
      â_{ij} = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} Σ_{k=1}^{N} ξ_t(i, k)
      b̂_j(v_k) = Σ_{t=1, s.t. o_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
  return A, B
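A runnable sketch of this loop using the forward, backward, and xi helpers defined above. It is an illustration rather than a reference implementation: it assumes a single integer-coded observation sequence, no distinguished final state (P(O | λ) = Σ_j α_T(j)), randomly initialized parameters, and a fixed number of iterations; for long sequences one would rescale α/β or work in log space to avoid underflow.

```python
import numpy as np

def baum_welch(obs, N, V, n_iter=20, seed=0):
    """EM training of a discrete-output HMM on a single observation sequence.
    obs: length-T list of observation indices in {0, ..., V-1}
    N:   number of hidden states;  V: output vocabulary size
    Returns (pi, A, B).
    """
    rng = np.random.default_rng(seed)
    pi = np.full(N, 1.0 / N)
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V)); B /= B.sum(axis=1, keepdims=True)
    obs = np.asarray(obs)
    for _ in range(n_iter):
        # E-step: state occupation and transition posteriors
        alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
        gamma = alpha * beta / alpha[-1].sum()          # gamma[t, j]
        xi_ = xi(alpha, beta, A, B, obs)                # xi_[t, i, j]
        # M-step: re-estimate parameters from expected counts
        pi = gamma[0]
        A = xi_.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # sum_k xi_t(i,k) = gamma_t(i)
        for v in range(V):
            B[:, v] = gamma[obs == v].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```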
Discrete to continuous outputs We derived Baum-Welch updates for discrete outputs. However, HMMs in acoustic models emit real-valued vectors as observations. Before we understand how Baum-Welch works for acoustic modelling using HMMs, let’s look at an overview of the Expectation Maximization ( EM ) algorithm and establish some notation.
EM Algorithm: Fitting Parameters to Data
Observed data: i.i.d. samples x_i, i = 1, ..., N (x is observed and z is hidden)
Goal: Find argmax_θ L(θ), where L(θ) = Σ_{i=1}^{N} log Pr(x_i; θ)
Initial parameters: θ^0. Iteratively compute θ^l as follows:
Q(θ, θ^{l-1}) = Σ_{i=1}^{N} Σ_z Pr(z | x_i; θ^{l-1}) log Pr(x_i, z; θ)
θ^l = argmax_θ Q(θ, θ^{l-1})
The estimate θ^l cannot get worse over iterations because, for all θ:
L(θ) − L(θ^{l-1}) ≥ Q(θ, θ^{l-1}) − Q(θ^{l-1}, θ^{l-1})
EM is guaranteed to converge to a local optimum or saddle point [Wu83]
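A sketch of why this inequality holds (a standard argument, not spelled out on the slide): write Pr(x_i; θ) as a sum over z, introduce the posterior under θ^{l-1} using Pr(x_i, z; θ^{l-1}) = Pr(z | x_i; θ^{l-1}) Pr(x_i; θ^{l-1}), and apply Jensen's inequality to the concave log.

```latex
\begin{align*}
L(\theta) - L(\theta^{\ell-1})
&= \sum_{i=1}^{N} \log \frac{\Pr(x_i;\theta)}{\Pr(x_i;\theta^{\ell-1})}
 = \sum_{i=1}^{N} \log \sum_{z} \Pr(z \mid x_i;\theta^{\ell-1})
   \frac{\Pr(x_i, z;\theta)}{\Pr(x_i, z;\theta^{\ell-1})} \\
&\ge \sum_{i=1}^{N} \sum_{z} \Pr(z \mid x_i;\theta^{\ell-1})
   \log \frac{\Pr(x_i, z;\theta)}{\Pr(x_i, z;\theta^{\ell-1})}
   \qquad \text{(Jensen's inequality)} \\
&= Q(\theta, \theta^{\ell-1}) - Q(\theta^{\ell-1}, \theta^{\ell-1}).
\end{align*}
```

Since θ^l maximises Q(·, θ^{l-1}), the right-hand side is ≥ 0 at θ = θ^l, so the log likelihood never decreases.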
Coin example to illustrate EM
Three coins with bias parameters ρ_1 = Pr(H), ρ_2 = Pr(H), ρ_3 = Pr(H).
Repeat:
  Toss Coin 1 privately
  if it shows H: toss Coin 2 twice
  else: toss Coin 3 twice
The following sequence is observed: "HH, TT, HH, TT, HH"
How do you estimate ρ_1, ρ_2 and ρ_3?
Coin example to illustrate EM
Recall, for partially observed data, the log likelihood is given by:
L(θ) = Σ_{i=1}^{N} log Pr(x_i; θ) = Σ_{i=1}^{N} log Σ_z Pr(x_i, z; θ)
where, for the coin example:
• each observation x_i ∈ X = {HH, HT, TH, TT}
• the hidden variable z ∈ Z = {H, T}
Coin example to illustrate EM
Recall, for partially observed data, the log likelihood is given by:
L(θ) = Σ_{i=1}^{N} log Pr(x_i; θ) = Σ_{i=1}^{N} log Σ_z Pr(x_i, z; θ)
The joint probability factorizes as Pr(x, z; θ) = Pr(x | z; θ) Pr(z; θ), where
Pr(z; θ) = ρ_1 if z = H, and 1 − ρ_1 if z = T
Pr(x | z; θ) = ρ_2^h (1 − ρ_2)^t if z = H, and ρ_3^h (1 − ρ_3)^t if z = T
(h: number of heads in x, t: number of tails in x)
Coin example to illustrate EM
Our observed data is: {HH, TT, HH, TT, HH}. Let's use EM to estimate θ = (ρ_1, ρ_2, ρ_3).
[EM Iteration, E-step] Compute the quantities involved in
Q(θ, θ^{l-1}) = Σ_{i=1}^{N} Σ_z γ(z, x_i) log Pr(x_i, z; θ)
where γ(z, x) = Pr(z | x; θ^{l-1}), i.e., compute γ(z, x_i) for all z and all i.
Suppose θ^{l-1} is ρ_1 = 0.3, ρ_2 = 0.4, ρ_3 = 0.6:
What is γ(H, HH)? = 0.16
What is γ(H, TT)? = 0.49
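Spelling out these two values (a worked step using the model from the previous slides and Bayes' rule):

```latex
\gamma(\mathrm{H}, \mathrm{HH})
  = \frac{\rho_1 \rho_2^2}{\rho_1 \rho_2^2 + (1-\rho_1)\rho_3^2}
  = \frac{0.3 \times 0.16}{0.3 \times 0.16 + 0.7 \times 0.36}
  = \frac{0.048}{0.300} = 0.16,
\qquad
\gamma(\mathrm{H}, \mathrm{TT})
  = \frac{\rho_1 (1-\rho_2)^2}{\rho_1 (1-\rho_2)^2 + (1-\rho_1)(1-\rho_3)^2}
  = \frac{0.108}{0.220} \approx 0.49.
```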
Coin example to illustrate EM
Our observed data is: {HH, TT, HH, TT, HH}. Let's use EM to estimate θ = (ρ_1, ρ_2, ρ_3).
[EM Iteration, M-step] Find θ which maximises
Q(θ, θ^{l-1}) = Σ_{i=1}^{N} Σ_z γ(z, x_i) log Pr(x_i, z; θ)
ρ_1 = Σ_{i=1}^{N} γ(H, x_i) / N
ρ_2 = Σ_{i=1}^{N} γ(H, x_i) h_i / Σ_{i=1}^{N} γ(H, x_i)(h_i + t_i)
ρ_3 = Σ_{i=1}^{N} γ(T, x_i) h_i / Σ_{i=1}^{N} γ(T, x_i)(h_i + t_i)
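Putting the E-step and M-step together, a minimal numpy sketch of EM for this three-coin example; the initial values default to those used on the previous slide, and the two-toss observation length is assumed as in the example.

```python
import numpy as np

def coin_em(observations, rho=(0.3, 0.4, 0.6), n_iter=50):
    """EM for the three-coin example.
    observations: list of two-toss strings such as 'HH' or 'TT'
    rho: initial guesses for (rho1, rho2, rho3)
    """
    rho1, rho2, rho3 = rho
    heads = np.array([s.count('H') for s in observations])   # h_i
    tails = 2 - heads                                          # t_i (two tosses per observation)
    for _ in range(n_iter):
        # E-step: gamma(H, x_i) = Pr(z = H | x_i; theta^{l-1})
        p_h = rho1 * rho2**heads * (1 - rho2)**tails
        p_t = (1 - rho1) * rho3**heads * (1 - rho3)**tails
        gamma_h = p_h / (p_h + p_t)
        gamma_t = 1 - gamma_h
        # M-step: closed-form updates from the slide
        rho1 = gamma_h.mean()
        rho2 = (gamma_h * heads).sum() / (gamma_h * (heads + tails)).sum()
        rho3 = (gamma_t * heads).sum() / (gamma_t * (heads + tails)).sum()
    return rho1, rho2, rho3

# e.g. coin_em(['HH', 'TT', 'HH', 'TT', 'HH'])
```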
Coin example to illustrate EM
This was a very simple HMM, with observations emitted from 2 states.
[Figure: the initial state moves to state H with probability ρ_1 or to state T with probability 1 − ρ_1; after the first transition the state remains the same (self-loop with probability 1), emitting H/T with probabilities ρ_2 / 1 − ρ_2 in state H and ρ_3 / 1 − ρ_3 in state T]
γ estimated the distribution of this single hidden state.
More generally, we will need the distribution of the state at each time step.
EM for general HMMs: the Baum-Welch algorithm (1972), which predates the general formulation of EM (1977).
Baum-Welch Algorithm as EM
Observed data: N sequences x_i, i = 1, ..., N, where each x_i is a sequence over the output vocabulary V
Parameters θ: transition matrix A, observation probabilities B
[EM Iteration, E-step] Compute the quantities involved in Q(θ, θ^{l-1}):
γ_{i,t}(j) = Pr(z_t = j | x_i; θ^{l-1})
ξ_{i,t}(j, k) = Pr(z_t = j, z_{t+1} = k | x_i; θ^{l-1})
Baum-Welch Algorithm as EM
Observed data: N sequences x_i, i = 1, ..., N, where each x_i is a sequence over the output vocabulary V
Parameters θ: transition matrix A, observation probabilities B
[EM Iteration, M-step] Find θ which maximises Q(θ, θ^{l-1}):
A_{j,k} = Σ_{i=1}^{N} Σ_{t=1}^{T_i − 1} ξ_{i,t}(j, k) / Σ_{i=1}^{N} Σ_{t=1}^{T_i − 1} Σ_{k'} ξ_{i,t}(j, k')
B_{j,v} = Σ_{i=1}^{N} Σ_{t: x_{it} = v} γ_{i,t}(j) / Σ_{i=1}^{N} Σ_{t=1}^{T_i} γ_{i,t}(j)