Automatic Speech Recognition (CS753)
Lecture 7: Hidden Markov Models (Part III)
Instructor: Preethi Jyothi
Aug 14, 2017
Recap: Learning HMM Parameters

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.

Standard algorithm for HMM training: the forward-backward (Baum-Welch) algorithm
Baum-Welch: In summary

[Every EM iteration] Compute θ = { A_jk, (μ_jm, Σ_jm, c_jm) } for all j, k, m:

A_jk = [ Σ_{i=1..N} Σ_{t=2..T_i} ξ_{i,t}(j, k) ] / [ Σ_{i=1..N} Σ_{t=2..T_i} Σ_{k′} ξ_{i,t}(j, k′) ]

μ_jm = [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) x_{it} ] / [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) ]

Σ_jm = [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) (x_{it} − μ_jm)(x_{it} − μ_jm)ᵀ ] / [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) ]

c_jm = [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) ] / [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j) ]

How do we efficiently compute γ_t(j) and ξ_t(i, j)?
Forward/Backward Probabilities

Two probabilities are required to compute estimates for the transition and observation probabilities:
1. Forward probability (recall): α_t(j) = P(o_1, o_2 … o_t, q_t = j | λ)
2. Backward probability: β_t(i) = P(o_{t+1}, o_{t+2} … o_T | q_t = i, λ)
Backward probability

1. Initialization:
   β_T(i) = a_iF, 1 ≤ i ≤ N
2. Recursion (again, since states q_0 and q_F are non-emitting):
   β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j), 1 ≤ i ≤ N, 1 ≤ t < T
3. Termination:
   P(O | λ) = α_T(q_F) = β_1(q_0) = Σ_{j=1..N} a_0j b_j(o_1) β_1(j)
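The backward recursion above can be sketched in NumPy for a discrete-output HMM. As a simplification (an assumption, not the slide's exact setup), the non-emitting start and final states are replaced by an initial distribution π standing in for a_0j, and β_T(i) = 1 instead of a_iF:

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward probabilities beta[t, i] = P(o_{t+1} ... o_T | q_t = i, lambda).

    A:   (N, N) transition probabilities a_ij
    B:   (N, V) discrete emission probabilities b_j(v)
    pi:  (N,)   initial state probabilities (stands in for a_0j)
    obs: list of observation symbol indices o_1 ... o_T
    """
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0  # no explicit final state here: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # Termination: P(O | lambda) = sum_j pi_j * b_j(o_1) * beta_1(j)
    likelihood = np.sum(pi * B[:, obs[0]] * beta[0])
    return beta, likelihood
```

The returned likelihood should match the forward pass's α_T(q_F), since both compute P(O | λ).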
1. Baum-Welch: Estimating a_ij

To estimate a_ij, we first define ξ_t(i, j), the probability of being in state i at time t and state j at time t+1:
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)
which works out to be
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / α_T(q_F)

Then:
â_ij = [ Σ_{t=1..T−1} ξ_t(i, j) ] / [ Σ_{t=1..T−1} Σ_{k=1..N} ξ_t(i, k) ]

[Figure: trellis showing α_t(i) ending at state s_i, the arc a_ij b_j(o_{t+1}) from s_i to s_j, and β_{t+1}(j) leaving state s_j, over observations o_{t−1}, o_t, o_{t+1}, o_{t+2}]
2. Baum-Welch: Estimating b_j(o_t)

To estimate b_j(o_t), we define γ_t(j), the probability of being in state j at time t:
γ_t(j) = P(q_t = j | O, λ)
which works out to be
γ_t(j) = α_t(j) β_t(j) / P(O | λ)

Then, for discrete outputs:
b̂_j(v_k) = [ Σ_{t=1..T s.t. o_t = v_k} γ_t(j) ] / [ Σ_{t=1..T} γ_t(j) ]

[Figure: trellis showing α_t(j) and β_t(j) meeting at state s_j, over observations o_{t−1}, o_t, o_{t+1}]
Baum-Welch algorithm (pseudocode)

function FORWARD-BACKWARD(observations of length T, output vocabulary V, hidden state set Q) returns HMM = (A, B)
  initialize A and B
  iterate until convergence
    E-step:
      γ_t(j) = α_t(j) β_t(j) / α_T(q_F)   ∀ t and j
      ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / α_T(q_F)   ∀ t, i, and j
    M-step:
      â_ij = [ Σ_{t=1..T−1} ξ_t(i, j) ] / [ Σ_{t=1..T−1} Σ_{k=1..N} ξ_t(i, k) ]
      b̂_j(v_k) = [ Σ_{t=1..T s.t. O_t = v_k} γ_t(j) ] / [ Σ_{t=1..T} γ_t(j) ]
  return A, B
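The pseudocode above can be fleshed out as a minimal NumPy sketch for a discrete-output HMM. As assumptions beyond the slide: the non-emitting start/end states are replaced by an initial distribution π and β_T(i) = 1, parameters are randomly initialised, and a fixed number of EM iterations stands in for a convergence test (no log-space scaling, so this is only suitable for short sequences):

```python
import numpy as np

def baum_welch(obs, N, V, n_iter=20, seed=0):
    """EM re-estimation of a discrete-output HMM with N states and
    vocabulary size V, given one observation sequence obs (symbol indices)."""
    rng = np.random.default_rng(seed)
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N)
    obs = np.asarray(obs)
    T = len(obs)
    for _ in range(n_iter):
        # E-step: forward and backward passes
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[T - 1].sum()
        # gamma_t(j) = alpha_t(j) beta_t(j) / P(O | lambda)
        gamma = alpha * beta / likelihood
        # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
        # M-step: re-estimate pi, A, B from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for v in range(V):
            B[:, v] = gamma[obs == v].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return A, B, pi
```

After each M-step the rows of A and B remain valid probability distributions, since Σ_k ξ_t(i, k) = γ_t(i).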
ASR Framework: Acoustic Models

[Pipeline figure: Acoustic Features → Acoustic Model (H: triphones) → Context Transducer (monophones) → Pronunciation Model (words) → Language Model → Word Sequence]

• Acoustic models are estimated using training data { x_i, y_i }, i = 1…N, where x_i corresponds to a sequence of acoustic feature vectors and y_i corresponds to a sequence of words
• For each (x_i, y_i), a composite HMM is constructed using the HMMs that correspond to the triphone sequence in y_i

“Hello world” → “sil hh ah l ow w er l d sil” → “sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil”
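The monophone-to-triphone expansion illustrated on the slide can be sketched as a small helper (the function name and the convention of keeping “sil” context-independent are illustrative assumptions):

```python
def to_triphones(phones):
    """Expand a monophone sequence into left/center/right context-dependent
    triphones, keeping 'sil' as a context-independent unit."""
    tri = []
    for i, p in enumerate(phones):
        if p == "sil":
            tri.append("sil")
            continue
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        tri.append(f"{left}/{p}/{right}")
    return tri

phones = "sil hh ah l ow w er l d sil".split()
print(" ".join(to_triphones(phones)))
# -> sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil
```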
ASR Framework: Acoustic Models (contd.)

• Parameters of these composite HMMs are the parameters of the constituent triphone HMMs
• These parameters are fit to the acoustic data { x_i }, i = 1…N using the Baum-Welch algorithm (EM)
Triphone HMM Models

• Each phone is modelled in the context of its left and right neighbour phones
• Pronunciation of a phone is influenced by the preceding and succeeding phones. E.g. the phone [p] in the word “peek” (p iy k) vs. [p] in the word “pool” (p uw l)
• Number of triphones that appear in data ≈ 1000s or 10,000s
• If each triphone HMM has 3 states and each state generates an m-component GMM (m ≈ 64) over d-dimensional acoustic feature vectors (d ≈ 40), with each full covariance Σ having d² parameters:
  Hundreds of millions of parameters!
• Insufficient data to learn all triphone models reliably. What do we do? Share parameters across triphone models!
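Working through the slide's arithmetic at the lower end of its range (1000 triphones; these exact counts are illustrative) already lands in the hundreds of millions:

```python
n_triphones = 1000   # lower end of the slide's "1000s or 10,000s"
states = 3           # states per triphone HMM
m = 64               # GMM components per state
d = 40               # acoustic feature dimension

# each Gaussian: d-dim mean + d*d full covariance + 1 mixture weight
per_gaussian = d + d * d + 1
total = n_triphones * states * m * per_gaussian
print(f"{total:,}")  # -> 315,072,000
```

With 10,000 triphones the count rises into the billions, which is why parameter sharing is essential.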
Parameter Sharing

• Sharing of parameters (also referred to as “parameter tying”) can be done at any level:
• Parameters in HMMs corresponding to two triphones are said to be tied if they are identical:
  - Transition probabilities tied: t′_i = t_i (t′_1 = t_1, t′_2 = t_2, t′_3 = t_3, t′_4 = t_4, t′_5 = t_5)
  - State observation densities tied
• More parameter tying: tying variances of all Gaussians within a state, tying variances of all Gaussians in all states, tying individual Gaussians, etc.
1. Tied Mixture Models

• All states share the same Gaussians (i.e. same means and covariances)
• Mixture weights are specific to each state

[Figure: Triphone HMMs (no sharing) vs. Triphone HMMs (tied mixture models)]
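The tied-mixture idea can be sketched as follows: one shared pool of Gaussians for all states, with only the mixture weights differing per state. This is an illustrative implementation assuming diagonal covariances; all names are hypothetical:

```python
import numpy as np

def tied_mixture_loglik(x, means, variances, log_weights):
    """log b_j(x) for every state j under a tied-mixture model.

    means, variances: (M, d) shared Gaussian pool (diagonal covariance)
    log_weights:      (J, M) state-specific log mixture weights c_jm
    Returns a length-J vector of per-state log output probabilities.
    """
    # log N(x; mu_m, diag(var_m)) for each shared component m -> shape (M,)
    log_comp = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                       + np.sum((x - means) ** 2 / variances, axis=1))
    # log sum_m c_jm N(x; mu_m, Sigma_m), via log-sum-exp per state
    z = log_weights + log_comp                     # (J, M)
    zmax = z.max(axis=1, keepdims=True)
    return (zmax + np.log(np.exp(z - zmax).sum(axis=1, keepdims=True))).ravel()
```

Only the (J, M) weight matrix grows with the number of states; the Gaussian means and variances are stored once, which is the whole point of the tying.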
2. State Tying

• Observation probabilities are shared across states which generate acoustically similar data

[Figure: Triphone HMMs for b/a/k, p/a/k, b/a/g (no sharing) vs. the same triphones with tied states]
Tied state HMMs

Four main steps in building a tied-state HMM system:
1. Create and train 3-state monophone HMMs with single-Gaussian observation probability densities
2. Clone these monophone distributions to initialise a set of untied triphone models. Train them using Baum-Welch estimation. The transition matrix remains common across all triphones of each phone.
3. For all triphones derived from the same monophone, cluster states whose parameters should be tied together.
4. The number of mixture components in each tied state is increased and the models are re-estimated using Baum-Welch.

Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
Which states should be tied together? Use decision trees.
Decision Trees

Classification using a decision tree begins at the root node: which property is satisfied? Depending on the answer, traverse to different branches.

[Example tree:]
Shape?
├─ Leafy → Spinach
├─ Cylindrical → Color?
│   ├─ Green → Snakegourd
│   └─ White → Radish
└─ Oval → Taste?
    ├─ Sour → Tomato
    └─ Neutral → Color?
        ├─ White → Turnip
        └─ Purple → Brinjal
Decision Trees (contd.)

• Given the data at a node, either declare the node to be a leaf or find another property to split the node into branches.
• Important questions to be addressed for DTs:
  1. How many splits at a node? Chosen by the user.
  2. Which property should be used at a node for splitting? One which decreases the “impurity” of the nodes as much as possible.
  3. When is a node a leaf? Set a threshold on the reduction in impurity.
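The "decrease in impurity" criterion from point 2 can be made concrete with entropy as the impurity measure (one common choice among several; Gini impurity or, for state tying, Gaussian log-likelihood gain are alternatives):

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a class-count vector."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def impurity_reduction(parent, left, right):
    """Drop in entropy from splitting `parent` into `left` + `right`,
    with each child's entropy weighted by its share of the data."""
    n, nl, nr = sum(parent), sum(left), sum(right)
    return entropy(parent) - (nl / n) * entropy(left) - (nr / n) * entropy(right)

# A question that splits a 50/50 node perfectly removes the full 1 bit:
print(impurity_reduction([10, 10], [10, 0], [0, 10]))  # -> 1.0
```

At each node, the question with the largest impurity reduction is chosen; a node becomes a leaf once no question's reduction exceeds the threshold.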