Automatic Speech Recognition (CS753)
Lecture 7: Hidden Markov Models (Part III)
Instructor: Preethi Jyothi
Aug 14, 2017
Recap: Learning HMM Parameters

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.

Standard algorithm for HMM training: the forward-backward (Baum-Welch) algorithm
Baum-Welch: In summary

[Every EM iteration] Compute θ = { A_jk, (μ_jm, Σ_jm, c_jm) } for all j, k, m:

A_jk = [ Σ_{i=1..N} Σ_{t=2..T_i} ξ_{i,t}(j, k) ] / [ Σ_{i=1..N} Σ_{t=2..T_i} Σ_{k′} ξ_{i,t}(j, k′) ]

μ_jm = [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) x_{it} ] / [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) ]

Σ_jm = [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) (x_{it} − μ_jm)(x_{it} − μ_jm)ᵀ ] / [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) ]

c_jm = [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j, m) ] / [ Σ_{i=1..N} Σ_{t=1..T_i} γ_{i,t}(j) ]

How do we efficiently compute γ_t(j) and ξ_t(i, j)?
Forward/Backward Probabilities

Two probabilities are required to compute estimates for the transition and observation probabilities:
1. Forward probability (recall): α_t(j) = P(o_1, o_2 … o_t, q_t = j | λ)
2. Backward probability: β_t(i) = P(o_{t+1}, o_{t+2} … o_T | q_t = i, λ)
Backward probability

1. Initialization:
   β_T(i) = a_iF, 1 ≤ i ≤ N
2. Recursion (again, since states q_0 and q_F are non-emitting):
   β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j), 1 ≤ i ≤ N, 1 ≤ t < T
3. Termination:
   P(O | λ) = α_T(q_F) = β_1(q_0) = Σ_{j=1..N} a_0j b_j(o_1) β_1(j)
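The backward recursion above can be sketched in NumPy for a discrete-output HMM. As a simplification (an assumption, not the slide's exact setup), the non-emitting start and final states are replaced by an initial distribution π standing in for a_0j, and β_T(i) = 1 instead of a_iF:

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward probabilities beta[t, i] = P(o_{t+1} ... o_T | q_t = i, lambda).

    A:   (N, N) transition probabilities a_ij
    B:   (N, V) discrete emission probabilities b_j(v)
    pi:  (N,)   initial state probabilities (stands in for a_0j)
    obs: list of observation symbol indices o_1 ... o_T
    """
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0  # no explicit final state here: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # Termination: P(O | lambda) = sum_j pi_j * b_j(o_1) * beta_1(j)
    likelihood = np.sum(pi * B[:, obs[0]] * beta[0])
    return beta, likelihood
```

The returned likelihood should match the forward pass's α_T(q_F), since both compute P(O | λ).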
1. Baum-Welch: Estimating a_ij

To estimate a_ij, we first define ξ_t(i, j), the probability of being in state i at time t and state j at time t+1:
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)
which works out to be
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / α_T(q_F)

Then:
â_ij = [ Σ_{t=1..T−1} ξ_t(i, j) ] / [ Σ_{t=1..T−1} Σ_{k=1..N} ξ_t(i, k) ]

[Figure: trellis showing α_t(i) ending at state s_i, the arc a_ij b_j(o_{t+1}) from s_i to s_j, and β_{t+1}(j) leaving state s_j, over observations o_{t−1}, o_t, o_{t+1}, o_{t+2}]
2. Baum-Welch: Estimating b_j(o_t)

To estimate b_j(o_t), we define γ_t(j), the probability of being in state j at time t:
γ_t(j) = P(q_t = j | O, λ)
which works out to be
γ_t(j) = α_t(j) β_t(j) / P(O | λ)

Then, for discrete outputs:
b̂_j(v_k) = [ Σ_{t=1..T s.t. o_t = v_k} γ_t(j) ] / [ Σ_{t=1..T} γ_t(j) ]

[Figure: trellis showing α_t(j) and β_t(j) meeting at state s_j, over observations o_{t−1}, o_t, o_{t+1}]
Baum-Welch algorithm (pseudocode)

function FORWARD-BACKWARD(observations of length T, output vocabulary V, hidden state set Q) returns HMM = (A, B)
  initialize A and B
  iterate until convergence
    E-step:
      γ_t(j) = α_t(j) β_t(j) / α_T(q_F)   ∀ t and j
      ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / α_T(q_F)   ∀ t, i, and j
    M-step:
      â_ij = [ Σ_{t=1..T−1} ξ_t(i, j) ] / [ Σ_{t=1..T−1} Σ_{k=1..N} ξ_t(i, k) ]
      b̂_j(v_k) = [ Σ_{t=1..T s.t. O_t = v_k} γ_t(j) ] / [ Σ_{t=1..T} γ_t(j) ]
  return A, B
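The pseudocode above can be fleshed out as a minimal NumPy sketch for a discrete-output HMM. As assumptions beyond the slide: the non-emitting start/end states are replaced by an initial distribution π and β_T(i) = 1, parameters are randomly initialised, and a fixed number of EM iterations stands in for a convergence test (no log-space scaling, so this is only suitable for short sequences):

```python
import numpy as np

def baum_welch(obs, N, V, n_iter=20, seed=0):
    """EM re-estimation of a discrete-output HMM with N states and
    vocabulary size V, given one observation sequence obs (symbol indices)."""
    rng = np.random.default_rng(seed)
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N)
    obs = np.asarray(obs)
    T = len(obs)
    for _ in range(n_iter):
        # E-step: forward and backward passes
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[T - 1].sum()
        # gamma_t(j) = alpha_t(j) beta_t(j) / P(O | lambda)
        gamma = alpha * beta / likelihood
        # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
        # M-step: re-estimate pi, A, B from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for v in range(V):
            B[:, v] = gamma[obs == v].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return A, B, pi
```

After each M-step the rows of A and B remain valid probability distributions, since Σ_k ξ_t(i, k) = γ_t(i).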
ASR Framework: Acoustic Models

[Pipeline figure: Acoustic Features → Acoustic Model (H: triphones) → Context Transducer (monophones) → Pronunciation Model (words) → Language Model → Word Sequence]

• Acoustic models are estimated using training data { x_i, y_i }, i = 1…N, where x_i corresponds to a sequence of acoustic feature vectors and y_i corresponds to a sequence of words
• For each (x_i, y_i), a composite HMM is constructed using the HMMs that correspond to the triphone sequence in y_i

“Hello world” → “sil hh ah l ow w er l d sil” → “sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil”
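The monophone-to-triphone expansion illustrated on the slide can be sketched as a small helper (the function name and the convention of keeping “sil” context-independent are illustrative assumptions):

```python
def to_triphones(phones):
    """Expand a monophone sequence into left/center/right context-dependent
    triphones, keeping 'sil' as a context-independent unit."""
    tri = []
    for i, p in enumerate(phones):
        if p == "sil":
            tri.append("sil")
            continue
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        tri.append(f"{left}/{p}/{right}")
    return tri

phones = "sil hh ah l ow w er l d sil".split()
print(" ".join(to_triphones(phones)))
# -> sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil
```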
ASR Framework: Acoustic Models (contd.)

• Parameters of these composite HMMs are the parameters of the constituent triphone HMMs
• These parameters are fit to the acoustic data { x_i }, i = 1…N using the Baum-Welch algorithm (EM)
Triphone HMM Models

• Each phone is modelled in the context of its left and right neighbour phones
• Pronunciation of a phone is influenced by the preceding and succeeding phones. E.g. the phone [p] in the word “peek” (p iy k) vs. [p] in the word “pool” (p uw l)
• Number of triphones that appear in data ≈ 1000s or 10,000s
• If each triphone HMM has 3 states and each state generates an m-component GMM (m ≈ 64) over d-dimensional acoustic feature vectors (d ≈ 40), with each full covariance Σ having d² parameters:
  Hundreds of millions of parameters!
• Insufficient data to learn all triphone models reliably. What do we do? Share parameters across triphone models!
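Working through the slide's arithmetic at the lower end of its range (1000 triphones; these exact counts are illustrative) already lands in the hundreds of millions:

```python
n_triphones = 1000   # lower end of the slide's "1000s or 10,000s"
states = 3           # states per triphone HMM
m = 64               # GMM components per state
d = 40               # acoustic feature dimension

# each Gaussian: d-dim mean + d*d full covariance + 1 mixture weight
per_gaussian = d + d * d + 1
total = n_triphones * states * m * per_gaussian
print(f"{total:,}")  # -> 315,072,000
```

With 10,000 triphones the count rises into the billions, which is why parameter sharing is essential.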
Parameter Sharing

• Sharing of parameters (also referred to as “parameter tying”) can be done at any level:
• Parameters in HMMs corresponding to two triphones are said to be tied if they are identical:
  - Transition probabilities tied: t′_i = t_i (t′_1 = t_1, t′_2 = t_2, t′_3 = t_3, t′_4 = t_4, t′_5 = t_5)
  - State observation densities tied
• More parameter tying: tying variances of all Gaussians within a state, tying variances of all Gaussians in all states, tying individual Gaussians, etc.
1. Tied Mixture Models

• All states share the same Gaussians (i.e. same means and covariances)
• Mixture weights are specific to each state

[Figure: Triphone HMMs (no sharing) vs. Triphone HMMs (tied mixture models)]
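The tied-mixture idea can be sketched as follows: one shared pool of Gaussians for all states, with only the mixture weights differing per state. This is an illustrative implementation assuming diagonal covariances; all names are hypothetical:

```python
import numpy as np

def tied_mixture_loglik(x, means, variances, log_weights):
    """log b_j(x) for every state j under a tied-mixture model.

    means, variances: (M, d) shared Gaussian pool (diagonal covariance)
    log_weights:      (J, M) state-specific log mixture weights c_jm
    Returns a length-J vector of per-state log output probabilities.
    """
    # log N(x; mu_m, diag(var_m)) for each shared component m -> shape (M,)
    log_comp = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                       + np.sum((x - means) ** 2 / variances, axis=1))
    # log sum_m c_jm N(x; mu_m, Sigma_m), via log-sum-exp per state
    z = log_weights + log_comp                     # (J, M)
    zmax = z.max(axis=1, keepdims=True)
    return (zmax + np.log(np.exp(z - zmax).sum(axis=1, keepdims=True))).ravel()
```

Only the (J, M) weight matrix grows with the number of states; the Gaussian means and variances are stored once, which is the whole point of the tying.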
2. State Tying

• Observation probabilities are shared across states which generate acoustically similar data

[Figure: Triphone HMMs for b/a/k, p/a/k, b/a/g (no sharing) vs. the same triphones with tied states]
Tied state HMMs

Four main steps in building a tied-state HMM system:
1. Create and train 3-state monophone HMMs with single-Gaussian observation probability densities
2. Clone these monophone distributions to initialise a set of untied triphone models. Train them using Baum-Welch estimation. The transition matrix remains common across all triphones of each phone.
3. For all triphones derived from the same monophone, cluster states whose parameters should be tied together.
4. The number of mixture components in each tied state is increased and the models are re-estimated using Baum-Welch.

Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
Which states should be tied together? Use decision trees.
Decision Trees

Classification using a decision tree begins at the root node: which property is satisfied? Depending on the answer, traverse to different branches.

[Example tree:]
Shape?
├─ Leafy → Spinach
├─ Cylindrical → Color?
│   ├─ Green → Snakegourd
│   └─ White → Radish
└─ Oval → Taste?
    ├─ Sour → Tomato
    └─ Neutral → Color?
        ├─ White → Turnip
        └─ Purple → Brinjal
Decision Trees (contd.)

• Given the data at a node, either declare the node to be a leaf or find another property to split the node into branches.
• Important questions to be addressed for DTs:
  1. How many splits at a node? Chosen by the user.
  2. Which property should be used at a node for splitting? One which decreases the “impurity” of the nodes as much as possible.
  3. When is a node a leaf? Set a threshold on the reduction in impurity.
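The "decrease in impurity" criterion from point 2 can be made concrete with entropy as the impurity measure (one common choice among several; Gini impurity or, for state tying, Gaussian log-likelihood gain are alternatives):

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a class-count vector."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def impurity_reduction(parent, left, right):
    """Drop in entropy from splitting `parent` into `left` + `right`,
    with each child's entropy weighted by its share of the data."""
    n, nl, nr = sum(parent), sum(left), sum(right)
    return entropy(parent) - (nl / n) * entropy(left) - (nr / n) * entropy(right)

# A question that splits a 50/50 node perfectly removes the full 1 bit:
print(impurity_reduction([10, 10], [10, 0], [0, 10]))  # -> 1.0
```

At each node, the question with the largest impurity reduction is chosen; a node becomes a leaf once no question's reduction exceeds the threshold.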