
Pre-midsem Revision, Lecture 11, CS 753. Instructor: Preethi Jyothi


  1. Pre-midsem Revision Lecture 11 CS 753 Instructor: Preethi Jyothi

  2. Tied-state Triphone Models

  3. State Tying: Observation probabilities are shared across triphone states which generate acoustically similar data. [Figure: triphone HMMs for b/a/k, p/a/k and b/a/g, shown first with no sharing and then with state tying.]

  4. Tied state HMMs. Four main steps in building a tied state HMM system:
     1. Create and train 3-state monophone HMMs with single Gaussian observation probability densities.
     2. Clone these monophone distributions to initialise a set of untied triphone models. Train them using Baum-Welch estimation. The transition matrix remains common across all triphones of each phone.
     3. For all triphones derived from the same monophone, cluster states whose parameters should be tied together.
     4. Increase the number of mixture components in each tied state and re-estimate the models using Baum-Welch (BW).
     Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994

  6. Tied state HMMs: Step 2. Clone these monophone distributions to initialise a set of untied triphone models. Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
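
The cloning in Step 2 can be sketched as follows. This is a minimal illustration, not the procedure of any particular toolkit: the dict-based model layout and the names `center_phone` and `clone_to_triphones` are assumptions made for the example. Each triphone receives its own copy of the monophone's state Gaussians, while the transition matrix is shared by reference across all triphones of the same phone.

```python
import copy


def center_phone(triphone):
    """Return the middle phone of an 'l-p+r' triphone name, e.g. 'b-aa+g' -> 'aa'."""
    return triphone.split("-")[1].split("+")[0]


def clone_to_triphones(monophones, triphone_names):
    """Initialise untied triphone models from trained monophone models.

    monophones: dict phone -> {"trans": matrix, "states": [state params, ...]}
                (hypothetical layout, assumed for this sketch).
    triphone_names: iterable of 'l-p+r' strings.
    """
    triphones = {}
    for name in triphone_names:
        mono = monophones[center_phone(name)]
        triphones[name] = {
            # Untied: every triphone gets an independent copy of the Gaussians.
            "states": copy.deepcopy(mono["states"]),
            # Common: the same transition matrix object for all triphones of the phone.
            "trans": mono["trans"],
        }
    return triphones
```

After cloning, Baum-Welch re-estimation would update each triphone's copy of the state parameters separately, which is what makes the models "untied" before clustering.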

  8. Tied state HMMs: Step 3. Use decision trees to determine which states should be tied together. Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994

  9. Example: Phonetic Decision Tree (DT). One tree is constructed for each state of each monophone to cluster all the corresponding triphone states. [Figure: DT for the center state ow2 of [ow]. The head node uses all training data tagged with *-ow2+*: aa2/ox2/f2, aa2/ox2/s2, aa2/ox2/d2, h2/ox2/p2, aa2/ox2/n2, aa2/ox2/g2, …]

  10. Training data for DT nodes
     • Align each training instance x = (x_1, …, x_T), where x_i ∈ ℝ^d, with a set of triphone HMMs.
     • Use the Viterbi algorithm to find the best HMM triphone state sequence corresponding to each x.
     • Tag each x_t with the ID of the current phone along with its left context and right context.
     [Figure: frames of x aligned to the triphone sequence sil-b+aa, b-aa+g, aa-g+sil.] A frame x_t tagged with ID b2-aa2+g2 is aligned with the second state of the 3-state HMM corresponding to the triphone b-aa+g.
     • Training data corresponding to state j in phone p: gather all x_t's that are tagged with ID *-p_j+*.
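
The gathering step can be sketched as below. The sketch assumes the Viterbi alignment has already been computed and is available as one `(triphone, state)` pair per frame; that representation and the function names are illustrative assumptions, not part of the lecture.

```python
def center_phone(triphone):
    """Return the middle phone of an 'l-p+r' triphone name, e.g. 'sil-b+aa' -> 'b'."""
    return triphone.split("-")[1].split("+")[0]


def gather_state_data(frames, alignment, phone, state):
    """Collect all frames tagged *-p_j+* for the given phone p and state j.

    frames: list of feature vectors x_1 .. x_T.
    alignment: list of (triphone, state_index) tags, one per frame,
               assumed to come from a Viterbi alignment.
    """
    return [
        x
        for x, (triphone, j) in zip(frames, alignment)
        if center_phone(triphone) == phone and j == state
    ]
```

For example, `gather_state_data(frames, alignment, "aa", 2)` returns every frame aligned with the second state of any triphone whose middle phone is aa, regardless of its left and right context.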

  11. Example: Phonetic Decision Tree (DT). One tree is constructed for each state of each monophone to cluster all the corresponding triphone states. [Figure: DT for the center state ow2 of [ow], whose HMM has states ow1, ow2, ow3. The head node uses all training data tagged as *-ow2+* (aa2/ox2/f2, aa2/ox2/s2, aa2/ox2/d2, h2/ox2/p2, aa2/ox2/n2, aa2/ox2/g2, …) and asks “Is left ctxt a vowel?”. Further Yes/No questions in the tree: “Is right ctxt a fricative?”, “Is right ctxt nasal?”, “Is right ctxt a glide?”. Leaves:
     Leaf A: aa2/ox2/f2, aa2/ox2/s2, …
     Leaf B: aa2/ox2/d2, aa2/ox2/g2, …
     Leaf C: h2/ox2/l2, b2/ox2/r2, …
     Leaf D: h2/ox2/p2, b2/ox2/k2, …
     Leaf E: aa2/ox2/n2, aa2/ox2/m2, …]

  12. How do we build these phone DTs?
     1. What questions are used? Linguistically-inspired binary questions: “Does the left or right phone come from a broad class of phones such as vowels, stops, etc.?” “Is the left or right phone [k] or [m]?”
     2. What is the training data for each phone state, p_j (the root node of the DT)? All speech frames that align with the j-th state of every triphone HMM that has p as the middle phone.
     3. What criterion is used at each node to find the best question to split the data on? Find the question which partitions the states in the parent node so as to give the maximum increase in log likelihood.

  13. Likelihood of a cluster of states
     • If a cluster of HMM states S = {s_1, s_2, …, s_M} consists of M states, and a total of K acoustic observation vectors {x_1, x_2, …, x_K} are associated with S, then the log likelihood associated with S is:
       $$L(S) = \sum_{i=1}^{K} \sum_{s \in S} \log \Pr(x_i; \mu_S, \Sigma_S)\, \gamma_s(x_i)$$
     • For a question q that splits S into S^q_yes and S^q_no, compute the following quantity:
       $$\Delta_q = L(S^q_{\text{yes}}) + L(S^q_{\text{no}}) - L(S)$$
     • Go through all questions, find Δ_q for each question q, and choose the question for which Δ_q is the biggest.
     • Terminate when the final Δ_q is below a threshold or the data associated with a split falls below a threshold.
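
The split criterion can be sketched in a few lines. The sketch makes simplifying assumptions that are not in the slides: each frame is hard-assigned to one state (so γ_s(x_i) is 1 for its aligned state and 0 otherwise), the pooled Gaussian has a diagonal covariance estimated by maximum likelihood, and a variance floor `var_floor` is applied. The function names and the question format (a dict of predicates on triphone names) are hypothetical.

```python
import math


def pooled_log_likelihood(frames, var_floor=1e-3):
    """L(S): log likelihood of pooled frames under one ML-fitted diagonal Gaussian."""
    k = len(frames)
    if k == 0:
        return 0.0
    total = 0.0
    for dim in range(len(frames[0])):
        vals = [x[dim] for x in frames]
        mean = sum(vals) / k
        var = max(sum((v - mean) ** 2 for v in vals) / k, var_floor)
        total += sum(
            -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)
            for v in vals
        )
    return total


def best_question(state_frames, questions):
    """Return (question name, delta_q) with the largest likelihood gain, or None.

    state_frames: dict triphone name -> list of frames for the state being clustered.
    questions: dict question name -> predicate taking a triphone name.
    """
    parent = pooled_log_likelihood([x for fs in state_frames.values() for x in fs])
    best = None
    for name, pred in questions.items():
        yes = [x for t, fs in state_frames.items() if pred(t) for x in fs]
        no = [x for t, fs in state_frames.items() if not pred(t) for x in fs]
        if not yes or not no:
            continue  # the question does not actually split the cluster
        delta = pooled_log_likelihood(yes) + pooled_log_likelihood(no) - parent
        if best is None or delta > best[1]:
            best = (name, delta)
    return best
```

Growing a tree would apply `best_question` recursively to the yes and no subsets until the best Δ_q, or the amount of data in a split, falls below a threshold.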

  15. WFSTs for ASR

  16. WFST-based ASR System. [Figure: Acoustic Indices → Acoustic Models → Triphones → Context Transducer → Monophones → Pronunciation Model → Words → Language Model → Word Sequence]

  17. WFST-based ASR System: H (acoustic models). [Figure: one 3-state HMM for each triphone (a-a+b, a-b+b, …, y-x+z), with arc labels such as f0:a-a+b, f1:ε, f2:ε, f3:ε, f4:ε, f6:ε. FST union + closure of these HMMs gives the resulting FST H.]

  18. WFST-based ASR System: C (context transducer). [Figure: states labelled by phone pairs, including ab, bc, ca and cx, with arcs a-b+c:a, b-c+x:b, b-c+a:b, ϵ:b and ϵ:c.]

  19. WFST-based ASR System: L (pronunciation model). [Figure: pronunciation transducer over states 0 to 6, with arcs d:data/1, ey:ε/0.5, ae:ε/0.5, t:ε/0.3, dx:ε/0.7, ax:ε/1 for “data” and d:dew/1, uw:ε/1 for “dew”.] Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002

  20. WFST-based ASR System: G (language model). [Figure: word-level acceptor starting at state 0, with arcs the, birds/0.404, animals/1.789, boy/1.789, are/0.693, were/0.693, is, walking.]

  21. Decoding. [Pipeline components, in order: H, C, L, G.] Carefully construct a decoding graph D using optimization algorithms:
     D = min(det(H ⚬ det(C ⚬ det(L ⚬ G))))
     Given a test utterance O, how do I decode it? Assuming ample compute, first construct the following machine X from O.
     [Figure: X is a chain with one step per frame of O. At each step there is one arc per acoustic index f_0, f_1, …, f_500, …, f_1000, each carrying a weight (e.g. f_0:19.12, f_1:12.33, f_500:20.21, f_1000:11.11 at the first step). Each f_i maps to a distinct triphone HMM state. If f_i maps to state j, the weight on its arc at frame t is -log(b_j(O_t)).]
     “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., Computer Speech & Language, 2002
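
The arc weights of X can be sketched as below. The sketch assumes, purely for illustration, that each HMM state j has a single diagonal-covariance Gaussian observation density b_j and that acoustic index f_i maps to the i-th entry of `state_models`. The list-of-dicts representation of X and the function names are hypothetical; a real system would build an FST with a toolkit.

```python
import math


def neg_log_gaussian(x, mean, var):
    """-log N(x; mean, diag(var)) for one frame x."""
    return sum(
        0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def build_x(observations, state_models):
    """Return X as a list with one dict per frame: X[t]['f<i>'] = -log b_i(O_t).

    observations: list of feature vectors O_1 .. O_T.
    state_models: list of {"mean": [...], "var": [...]}, one per triphone HMM state
                  (assumed single-Gaussian densities for this sketch).
    """
    return [
        {
            f"f{i}": neg_log_gaussian(o_t, model["mean"], model["var"])
            for i, model in enumerate(state_models)
        }
        for o_t in observations
    ]
```

Each dict corresponds to one step of the chain in the figure: a lower weight means the frame is better explained by that state.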
