Pre-midsem Revision Lecture 11 CS 753 Instructor: Preethi Jyothi
Tied-state Triphone Models
State Tying: Observation probabilities are shared across triphone states that generate acoustically similar data. (Figure: triphone HMMs for b/a/k, p/a/k and b/a/g, shown first with no sharing and then with acoustically similar states tied across the three models.)
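To see why sharing matters, here is a rough back-of-the-envelope parameter count (the phone inventory size, feature dimension and number of tied states below are illustrative assumptions, not figures from the lecture): fully untied triphone states need on the order of fifteen million Gaussian parameters, while a few thousand tied states bring that down by well over an order of magnitude.

```python
# Rough parameter count with vs. without state tying.
# Assumed (illustrative) numbers: ~40 phones, 3 emitting states per triphone,
# 39-dim features, single diagonal-covariance Gaussians, ~5000 tied states.
n_phones = 40
n_triphones = n_phones ** 3            # 64,000 logical triphones
states_per_hmm = 3
dim = 39
params_per_gaussian = 2 * dim          # mean + diagonal covariance

untied = n_triphones * states_per_hmm * params_per_gaussian
tied = 5000 * params_per_gaussian

print(f"untied: {n_triphones * states_per_hmm:,} states, {untied:,} parameters")
print(f"tied:   5,000 states, {tied:,} parameters")
```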
Tied state HMMs
Four main steps in building a tied-state HMM system:
1. Create and train 3-state monophone HMMs with single-Gaussian observation probability densities.
2. Clone these monophone distributions to initialise a set of untied triphone models. Train them using Baum-Welch estimation. The transition matrix remains common across all triphones of each phone.
3. For all triphones derived from the same monophone, cluster states whose parameters should be tied together.
4. Increase the number of mixture components in each tied state and re-estimate the models using Baum-Welch.
Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
Tied state HMMs: Step 2 Clone these monophone distributions to initialise a set of untied triphone models Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
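Below is a minimal sketch of this cloning step, using assumed data structures rather than any real toolkit API: the single-Gaussian output parameters of the base monophone are deep-copied into each triphone (so they can diverge during subsequent Baum-Welch training), while a single transition-matrix object is shared by all triphones of the same phone.

```python
import copy
import numpy as np

class HMM:
    def __init__(self, means, variances, trans):
        self.means = means          # per-state mean vectors
        self.variances = variances  # per-state diagonal variances
        self.trans = trans          # transition matrix (shared across triphones)

def clone_triphones(monophones, triphone_names):
    """monophones: dict phone -> HMM; triphone_names: strings like 'b-aa+g'."""
    triphones = {}
    for name in triphone_names:
        center = name.split("-")[1].split("+")[0]
        proto = monophones[center]
        # Output distributions are copied so they can be re-estimated per
        # triphone; the transition matrix object stays shared per base phone.
        triphones[name] = HMM(copy.deepcopy(proto.means),
                              copy.deepcopy(proto.variances),
                              proto.trans)
    return triphones

mono = {"aa": HMM([np.zeros(3)] * 3, [np.ones(3)] * 3, np.eye(3))}
tri = clone_triphones(mono, ["b-aa+g", "p-aa+k"])
assert tri["b-aa+g"].trans is tri["p-aa+k"].trans   # shared transitions
```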
Tied state HMMs: Step 3 Use decision trees to determine which states should be tied together Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
Example: Phonetic Decision Tree (DT). One tree is constructed for each state of each monophone to cluster all the corresponding triphone states. (Figure: DT for the center state ow2 of [ow]. The head node uses all training data tagged with *-ow2+*, e.g. aa2/ow2/f2, aa2/ow2/s2, aa2/ow2/d2, h2/ow2/p2, aa2/ow2/n2, aa2/ow2/g2, …)
Training data for DT nodes
• Align each training instance x = (x_1, …, x_T), where x_t ∈ ℝ^d, with a set of triphone HMMs.
• Use the Viterbi algorithm to find the best HMM triphone state sequence corresponding to each x.
• Tag each x_t with the ID of the current phone along with its left context and right context. (Figure: frames aligned against the triphones sil-b+aa, b-aa+g, aa-g+sil; a frame x_t aligned with the second state of the 3-state HMM for the triphone b-aa+g is tagged with the ID b_2-aa_2+g_2.)
• Training data corresponding to state j of phone p: gather all x_t's that are tagged with ID *-p_j+*.
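As a concrete sketch of the last step, the frames can simply be bucketed by centre phone and state index; the per-frame tag format used here (“b-aa+g/2”, meaning state 2 of triphone b-aa+g) is an assumption made for illustration, not the lecture's notation.

```python
from collections import defaultdict

def root_node_data(tagged_frames):
    """tagged_frames: list of (feature_vector, tag) pairs, where a tag like
    "b-aa+g/2" (assumed format) means state 2 of the triphone b-aa+g."""
    buckets = defaultdict(list)          # (center_phone, state) -> frames
    for x_t, tag in tagged_frames:
        triphone, state = tag.split("/")
        left, rest = triphone.split("-")
        center, right = rest.split("+")
        # All frames tagged *-p+* for state j feed the tree for (p, j).
        buckets[(center, int(state))].append((x_t, left, right))
    return buckets

frames = [([0.1, 0.2], "b-aa+g/2"), ([0.3, 0.1], "sil-b+aa/2"),
          ([0.0, 0.4], "p-aa+k/2")]
data = root_node_data(frames)
print(sorted(data))   # keys: [('aa', 2), ('b', 2)]
```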
Example: Phonetic Decision Tree (DT). One tree is constructed for each state of each monophone (ow1, ow2, ow3 for [ow]) to cluster all the corresponding triphone states. (Figure: DT for the center state ow2 of [ow]. The head node uses all training data tagged as *-ow2+*, e.g. aa2/ow2/f2, aa2/ow2/s2, aa2/ow2/d2, h2/ow2/p2, aa2/ow2/n2, aa2/ow2/g2, … The root question is “Is left ctxt a vowel?”; its children ask “Is right ctxt a nasal?” and “Is right ctxt a fricative?”, with a further question “Is right ctxt a glide?” one level down. The resulting leaves A–E group acoustically similar triphone states, e.g. {aa2/ow2/n2, aa2/ow2/m2}, {aa2/ow2/d2, aa2/ow2/g2}, {aa2/ow2/f2, aa2/ow2/s2}, {h2/ow2/l2, b2/ow2/r2}, {h2/ow2/p2, b2/ow2/k2}.)
How do we build these phone DTs?
1. What questions are used? Linguistically-inspired binary questions: “Does the left or right phone come from a broad class of phones such as vowels, stops, etc.?”, “Is the left or right phone [k] or [m]?”
2. What is the training data for each phone state p_j (the root node of the DT)? All speech frames that align with the j-th state of every triphone HMM that has p as the middle phone.
3. What criterion is used at each node to find the best question to split the data on? Find the question which partitions the states in the parent node so as to give the maximum increase in log likelihood.
Likelihood of a cluster of states
• If a cluster of HMM states S = {s_1, s_2, …, s_M} consists of M states and a total of K acoustic observation vectors {x_1, x_2, …, x_K} are associated with S, then the log likelihood associated with S is:
L(S) = \sum_{i=1}^{K} \sum_{s \in S} \log \Pr(x_i; \mu_S, \Sigma_S) \, \gamma_s(x_i)
• For a question q that splits S into S_yes and S_no, compute the following quantity:
\Delta_q = L(S^q_{yes}) + L(S^q_{no}) - L(S)
• Go through all questions, find Δ_q for each question q, and choose the question for which Δ_q is the largest.
• Terminate when the final Δ_q is below a threshold, or the data associated with a split falls below a threshold.
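A small sketch of this greedy criterion, under simplifying assumptions: diagonal-covariance Gaussians are refit to each candidate split, hard Viterbi counts are used (so γ_s(x_i) is 1 for every frame assigned to the cluster), and the question set and data are toy values invented for illustration.

```python
import numpy as np

def log_likelihood(frames):
    """L(S): log-likelihood of the frames under one diagonal Gaussian fit to them."""
    X = np.asarray(frames)
    mu, var = X.mean(axis=0), X.var(axis=0) + 1e-6
    ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
    return ll.sum()

def best_question(frames, contexts, questions):
    """frames[i] is a feature vector tagged with contexts[i] = (left_phone, right_phone)."""
    base = log_likelihood(frames)
    best = None
    for name, test in questions.items():
        yes = [f for f, c in zip(frames, contexts) if test(c)]
        no  = [f for f, c in zip(frames, contexts) if not test(c)]
        if not yes or not no:
            continue
        delta = log_likelihood(yes) + log_likelihood(no) - base
        if best is None or delta > best[1]:
            best = (name, delta)
    return best

rng = np.random.default_rng(0)
frames = list(rng.normal(0, 1, (10, 3))) + list(rng.normal(3, 1, (10, 3)))
contexts = [("aa", "n")] * 10 + [("b", "k")] * 10
questions = {"left ctxt is a vowel":  lambda c: c[0] in {"aa", "ae", "ih", "ow"},
             "right ctxt is a nasal": lambda c: c[1] in {"n", "m", "ng"}}
print(best_question(frames, contexts, questions))
```

In the full procedure the same Δ_q computation is applied recursively at every new node, and splitting stops when the best Δ_q or the amount of data reaching a child falls below its threshold.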
WFSTs for ASR
WFST-based ASR System
(Figure: pipeline of models and transducers: Acoustic Models (H) → Context Transducer (C) → Pronunciation Model (L) → Language Model (G), mapping acoustic indices → triphones → monophones → words → word sequence.)
WFST-based ASR System: H (Acoustic Models)
(Figure: one 3-state HMM transducer per triphone a-a+b, a-b+b, …, y-x+z, with arcs labelled f_i : ε and the first arc carrying the triphone symbol, e.g. f_0 : a-a+b. Taking the union of these FSTs followed by closure gives the resulting H.)
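A sketch of what H contains, written as plain Python arc lists rather than a real FST toolkit (the state numbering and label names are illustrative): one small left-to-right HMM transducer per triphone whose entry arc outputs the triphone symbol and whose remaining arcs output ε; H is the closure of the union of these machines.

```python
# Arcs are tuples (src, dst, input_label, output_label); labels f0, f1, ...
# index distinct triphone HMM states, as in the figure above.

def hmm_fst(triphone, state_ids):
    """Left-to-right 3-state HMM for one triphone: the entry arc outputs the
    triphone symbol, every other arc outputs epsilon."""
    f0, f1, f2 = state_ids
    eps = "<eps>"
    arcs = [(0, 1, f0, triphone),   # first frame of state 1 emits the triphone label
            (1, 1, f0, eps),        # state-1 self-loop
            (1, 2, f1, eps),
            (2, 2, f1, eps),
            (2, 3, f2, eps),
            (3, 3, f2, eps)]
    return {"start": 0, "finals": {3}, "arcs": arcs}

# H = closure(union of all per-triphone FSTs): it maps sequences of HMM-state
# labels f_i to sequences of triphone symbols.
H_parts = {t: hmm_fst(t, (f"f{3*i}", f"f{3*i+1}", f"f{3*i+2}"))
           for i, t in enumerate(["a-a+b", "a-b+b"])}
```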
WFST-based ASR System: C (Context Transducer)
(Figure: fragment of C mapping triphones to monophones. States remember the surrounding phone pair (labels such as ab, bc, ca, cx), with arcs like a-b+c : a from state ab to state bc, b-c+x : b from bc to cx, and b-c+a : b from bc to ca, plus ε-labelled arcs from the initial states.)
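Conceptually, C realises (the inverse of) the expansion sketched below with states that remember the last two phones; this toy function, assuming simple sil padding at the utterance edges, shows the mapping C encodes between monophone strings and triphone strings.

```python
def to_triphones(phones):
    """Expand a monophone string into its triphone string (sil-padded)."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["b", "aa", "g"]))
# ['sil-b+aa', 'b-aa+g', 'aa-g+sil']  (matches the alignment example earlier)
```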
WFST-based ASR System: L (Pronunciation Model)
(Figure: pronunciation transducer for the words “data” and “dew”: d:data/1, then ey:ε/0.5 or ae:ε/0.5, then t:ε/0.3 or dx:ε/0.7, then ax:ε/1; and d:dew/1 followed by uw:ε/1.)
Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002
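For reference, the arcs of this lexicon fragment can be written out directly; the plain tuple representation below is assumed for illustration, and whether the weights are probabilities or -log costs depends on the semiring, which the figure does not restate.

```python
# Lexicon arcs as tuples (src, dst, in_phone, out_word, weight),
# copied from the figure above.
L_arcs = [
    (0, 1, "d",  "data",  1.0),
    (1, 2, "ey", "<eps>", 0.5), (1, 2, "ae", "<eps>", 0.5),
    (2, 3, "t",  "<eps>", 0.3), (2, 3, "dx", "<eps>", 0.7),
    (3, 4, "ax", "<eps>", 1.0),        # "data" = d {ey|ae} {t|dx} ax
    (0, 5, "d",  "dew",   1.0),
    (5, 6, "uw", "<eps>", 1.0),        # "dew" = d uw
]
final_states = {4, 6}
```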
WFST-based ASR System: G (Language Model)
(Figure: word-level grammar accepting sentences such as “the birds/animals are/were walking” and “the boy is walking”, with weights like birds/0.404, animals/1.789, boy/1.789, are/0.693, were/0.693.)
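Reading the figure as a weighted acceptor over words (the weights shown are -log probabilities; the state numbering below is a guess), the cost of a sentence is the sum of arc weights along its accepting path, as in this small sketch.

```python
import math

# Grammar arcs as tuples (src, dst, word, weight), weights in -log space.
G_arcs = [
    (0, 1, "the",     0.0),
    (1, 2, "birds",   0.404),   # ~ -ln(2/3)
    (1, 2, "animals", 1.789),   # ~ -ln(1/6)
    (1, 3, "boy",     1.789),
    (2, 4, "are",     0.693),   # ~ -ln(1/2)
    (2, 4, "were",    0.693),
    (3, 4, "is",      0.0),
    (4, 5, "walking", 0.0),
]

def sentence_cost(words, arcs, start=0, final=5):
    """Total cost (-log probability) of a word sequence accepted by G."""
    state, cost = start, 0.0
    for w in words:
        matches = [(dst, wt) for (src, dst, lab, wt) in arcs
                   if src == state and lab == w]
        if not matches:
            return math.inf
        state, cost = matches[0][0], cost + matches[0][1]
    return cost if state == final else math.inf

print(sentence_cost(["the", "birds", "are", "walking"], G_arcs))  # ~1.097
```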
Decoding
Carefully construct a decoding graph D from H, C, L and G using optimization algorithms:
D = min(det(H ∘ det(C ∘ det(L ∘ G))))
Given a test utterance O, how do I decode it? Assuming ample compute, first construct the following machine X from O: for every frame O_t there is one arc per label f_i, where each f_i maps to a distinct triphone HMM state, and the weight of the f_i arc is -log(b_j(O_t)) if f_i maps to HMM state j. (Figure: a table with one column per frame and one row per label f_0, f_1, …, f_500, …, f_1000, e.g. f_0 : 19.12, f_1 : 12.33, …)
“Weighted Finite State Transducers in Speech Recognition”, Mohri et al., Computer Speech & Language, 2002
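A minimal sketch of building X, assuming a T×N matrix of per-frame log output probabilities for the N distinct tied triphone HMM states; decoding then reduces to finding the lowest-cost path through the composition X ∘ D.

```python
import numpy as np

def build_X(frame_log_likes):
    """frame_log_likes: (T, N) array of log b_i(O_t) for N tied HMM states.
    Returns a linear transducer with one "time" state per frame and, between
    consecutive time states, one arc per label f_i weighted by -log b_i(O_t)."""
    T, N = frame_log_likes.shape
    arcs = []
    for t in range(T):
        for i in range(N):
            # input = output = f_i; cost = -log b_i(O_t)
            arcs.append((t, t + 1, f"f{i}", f"f{i}", -frame_log_likes[t, i]))
    return {"start": 0, "finals": {T}, "arcs": arcs}

log_likes = np.log(np.full((4, 3), 1.0 / 3))   # toy: 4 frames, 3 tied states
X = build_X(log_likes)
# In practice X is never fully materialised: the decoder composes X with D
# on the fly and runs Viterbi beam search over the composed machine.
```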