1. Automatic Speech Recognition (CS753), Lecture 18: Search & Decoding (Part I). Instructor: Preethi Jyothi. Mar 23, 2017


2. Recall ASR Decoding

W^* = \arg\max_{W} \Pr(O_A \mid W)\,\Pr(W)

W^* = \arg\max_{w_1^N,\,N} \left\{ \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \sum_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right\}

\approx \arg\max_{w_1^N,\,N} \left\{ \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \max_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right\} \quad \text{(Viterbi approximation)}

• The Viterbi approximation divides the above optimisation problem into sub-problems that allow the efficient application of dynamic programming
• Even so, an exact search using Viterbi is infeasible for large vocabulary tasks!
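
To make the dynamic-programming structure concrete, the per-state recurrence that Viterbi decoding solves can be written as follows (standard textbook notation, not taken from the slide: v_t(j) is the best-path score for state j at time t, a_{ij} a transition probability, b_j(O_t) an emission probability):

    v_t(j) = \max_{i} \; v_{t-1}(i)\, a_{ij}\, b_j(O_t), \qquad bp_t(j) = \arg\max_{i} \; v_{t-1}(i)\, a_{ij}\, b_j(O_t)

The trellis values on the next slide (e.g. v_2(2) = .038) are instances of exactly this recurrence.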

3. Recall Viterbi search
• Viterbi search finds the most probable path through a trellis with time on the X-axis and states on the Y-axis
[Figure: trellis for a two-state (H/C) toy HMM over the observation sequence 3 1 3, with start and end states. Each cell keeps only the best incoming path, e.g. v_1(2) = .32, v_1(1) = .02, v_2(2) = max(.32*.12, .02*.08) = .038, v_2(1) = max(.32*.15, .02*.25) = .048, where the arc weights are products such as P(H|H)*P(1|H) = .6*.2.]
• Viterbi algorithm: only needs to maintain information about the most probable path at each state
Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
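
A minimal sketch of this recursion in Python, using the H/C probabilities read off the figure (the viterbi function itself is a generic textbook implementation, not code from the course):

    # Minimal Viterbi over the two-state (H/C) toy HMM from the figure.
    # Probabilities are read off the figure; the remaining transition mass
    # (e.g. P(end|H)) goes to the end state and is omitted here.

    def viterbi(obs, states, start_p, trans_p, emit_p):
        """Return (best score, best state path) over the time-state trellis."""
        # v[t][s]: probability of the best path ending in state s at time t
        v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
        backptr = [{}]
        for t in range(1, len(obs)):
            v.append({})
            backptr.append({})
            for s in states:
                # Keep only the most probable incoming path into (t, s).
                best_prev = max(states, key=lambda r: v[t - 1][r] * trans_p[r][s])
                v[t][s] = v[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][obs[t]]
                backptr[t][s] = best_prev
        last = max(states, key=lambda s: v[-1][s])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(backptr[t][path[-1]])
        return v[-1][last], list(reversed(path))

    states = ["H", "C"]
    start_p = {"H": 0.8, "C": 0.2}
    trans_p = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
    emit_p = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}
    print(viterbi([3, 1, 3], states, start_p, trans_p, emit_p))
    # -> roughly (0.009216, ['H', 'H', 'H'])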

4. ASR Search Network
[Figure: the decoding network is built in layers: a network of words (e.g. "the", "birds", "are", "walking", "is", "boy"), each word expanded into a network of phones (e.g. b oy), and each phone expanded into a network of HMM states.]
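
A toy sketch of this layered expansion (the pronunciations, the three-states-per-phone convention, and all names below are illustrative assumptions, not the course's network-building code):

    # Toy sketch of the layered search network: words -> phones -> HMM states.

    lexicon = {
        "boy":   ["b", "oy"],
        "birds": ["b", "er", "d", "z"],
    }

    def phone_to_hmm_states(phone, n_states=3):
        # Each phone is typically modelled by a small left-to-right HMM.
        return [f"{phone}_{i}" for i in range(n_states)]

    def expand_word(word):
        """Expand a word into the sequence of HMM states the decoder searches over."""
        states = []
        for phone in lexicon[word]:
            states.extend(phone_to_hmm_states(phone))
        return states

    print(expand_word("boy"))   # ['b_0', 'b_1', 'b_2', 'oy_0', 'oy_1', 'oy_2']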

5. Time-state trellis
[Figure: a trellis with the HMM states of word 1, word 2 and word 3 stacked along the vertical state axis and time t running along the horizontal axis.]

6. Viterbi search over the large trellis
• Exact search is infeasible for large vocabulary tasks
  • Unknown word boundaries
  • Ngram language models greatly increase the search space
• Solutions
  • Compactly represent the search space using WFST-based optimisations
  • Beam search: prune away parts of the search space that aren't promising

7. Viterbi search over the large trellis (outline repeated from slide 6; the next slides cover the WFST-based optimisations)

8. Two main WFST Optimizations
• Use determinization to reduce/eliminate redundancy
  • Recall: not all weighted transducers are determinizable
  • To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon:
      read : r eh d #0
      red  : r eh d #1
  • Propagate the disambiguation symbols as self-loops back through C and H; the resulting machines are H̃, C̃ and L̃

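A small sketch of how such disambiguation symbols might be assigned automatically (plain Python; the extra lexicon entry "bob" and the helper names are illustrative):

    from collections import defaultdict

    # Toy lexicon: word -> phone sequence. read/red are homophones.
    lexicon = {
        "read": ["r", "eh", "d"],
        "red":  ["r", "eh", "d"],
        "bob":  ["b", "aa", "b"],
    }

    # Group words by pronunciation and append #0, #1, ... to entries that
    # share one, so that L (and hence L o G) becomes determinizable.
    by_pron = defaultdict(list)
    for word, phones in lexicon.items():
        by_pron[tuple(phones)].append(word)

    disambig_lexicon = {}
    for phones, words in by_pron.items():
        if len(words) == 1:
            disambig_lexicon[words[0]] = list(phones)
        else:
            for i, word in enumerate(sorted(words)):
                disambig_lexicon[word] = list(phones) + [f"#{i}"]

    print(disambig_lexicon["read"])  # ['r', 'eh', 'd', '#0']
    print(disambig_lexicon["red"])   # ['r', 'eh', 'd', '#1']
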
9. Two main WFST Optimizations
• Use determinization to reduce/eliminate redundancy
• Use minimization to reduce space requirements
  • Minimization ensures that the final composed machine has the minimum number of states
• Final optimization cascade:
      N = π_ε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))
  • π_ε replaces the disambiguation symbols with ε in the input alphabet of H̃

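As a rough illustration of the det/min steps (not the course's tooling), here is a sketch using Pynini, the Python wrapper around OpenFst, assuming it is installed. Byte-level strings stand in for phone symbols and G is a trivial two-word acceptor, so this only mirrors the shape of the cascade, not a real H̃ ○ C̃ ○ L̃ ○ G build:

    # Rough sketch with Pynini (OpenFst's Python wrapper).
    # Byte-level strings stand in for phone symbols; a real build uses
    # symbol tables and the full cascade with H~ and C~ as well.
    import pynini

    # Toy L~: phone strings (with disambiguation symbols #0/#1) -> words.
    L = pynini.union(
        pynini.cross("r eh d #0", "read"),
        pynini.cross("r eh d #1", "red"),
    )

    # Toy G: an unweighted acceptor over the two words (a real G is a
    # weighted Ngram model as on the surrounding slides).
    G = pynini.union(pynini.accep("read"), pynini.accep("red"))

    # Compose, determinize, minimize: the inner part of
    # N = pi_eps(min(det(H~ o det(C~ o det(L~ o G))))).
    LG = pynini.compose(L.arcsort("olabel"), G.arcsort("ilabel")).rmepsilon()
    det_LG = pynini.determinize(LG)
    min_LG = det_LG.copy().minimize()

    print(LG.num_states(), det_LG.num_states(), min_LG.num_states())

On this toy input, determinization merges the shared r eh d prefix and minimization can merge equivalent suffix states, which mirrors the effect shown on the example slides below.
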
10. Example G
[Figure: a toy two-word grammar G with states 0 → 1 → 2; arcs 0→1 labelled bob:bob, bond:bond and rob:rob, and arcs 1→2 labelled slept:slept, read:read and ate:ate.]

11. Compact language models (G)
• Use backoff Ngram language models for G
[Figure: WFST fragment of a backoff trigram model. The history state (a,b) has a word arc c / Pr(c|a,b) to state (b,c) and a backoff arc ε / α(a,b) to the bigram state (b); state (b,c) similarly has a backoff arc ε / α(b,c). State (b) has a word arc c / Pr(c|b) to state (c) and a backoff arc ε / α(b) to the unigram state, which has the arc c / Pr(c).]
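
The figure encodes the usual backoff recursion; written out explicitly (standard backoff notation, not reproduced from the slide):

    \Pr(c \mid a, b) =
    \begin{cases}
      \widehat{\Pr}(c \mid a, b) & \text{if the trigram } (a, b, c) \text{ was observed} \\
      \alpha(a, b)\,\Pr(c \mid b) & \text{otherwise (follow the } \varepsilon \text{ backoff arc)}
    \end{cases}

Pr(c | b) backs off to α(b) Pr(c) in the same way; in the WFST, each backoff step is the ε-arc carrying the weight α(·).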

12. Example G
[Figure: the toy grammar G from slide 10, shown again.]

13. Example L̃: lexicon with disambiguation symbols
[Figure: the lexicon transducer L̃. Each word is a phone path that outputs the word on its first arc and ε (drawn as "-") on the remaining arcs, e.g. b:bob aa:- b:- #0:-, r:rob aa:- b:-, b:bond aa:- n:- d:-, s:slept l:- eh:- p:- t:-, r:read eh:- d:-, ey:ate t:-.]

14. L̃ ○ G and det(L̃ ○ G)
[Figure: top, the composition L̃ ○ G, in which each first word (bob, bond, rob) leaves the initial state on its own phone path; bottom, det(L̃ ○ G), in which determinization merges shared phone prefixes (e.g. the common b and aa arcs of bob and bond) and delays the output word labels until the inputs diverge.]

15. det(L̃ ○ G) and min(det(L̃ ○ G))
[Figure: top, det(L̃ ○ G) repeated from the previous slide; bottom, min(det(L̃ ○ G)), in which minimization additionally merges equivalent suffix states, yielding a machine with fewer states.]

16. Viterbi search over the large trellis (outline repeated from slide 6; the next slides cover beam search)

17. Beam pruning
• At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path
• Given active nodes from the last time-step:
  • Examine nodes in the current time-step that are reachable from active nodes in the previous time-step
  • Get active nodes for the current time-step by retaining only nodes whose hypotheses score close to the score of the best hypothesis

18. Beam search
• Beam search at each node keeps only hypotheses with scores that fall within a threshold of the current best hypothesis
• Hypotheses with Q(t, s) < δ ⋅ max_{s'} Q(t, s') are pruned
  • Here, δ controls the beam width
• Search errors can occur if the most probable hypothesis gets pruned
• There is a trade-off between search errors and decoding speed
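
A minimal sketch of one beam-pruned time-step in Python, keeping scores as probabilities so the slide's test Q(t, s) < δ ⋅ max Q(t, s') applies directly (the data structures, the toy HMM and the function name are illustrative):

    # One beam-pruned step over the time-state trellis.

    def beam_step(active, transitions, emit, obs, delta):
        """active: dict state -> Q(t-1, state). Returns the pruned dict for time t."""
        # Expand only states reachable from currently active states.
        candidates = {}
        for prev, q_prev in active.items():
            for state, p_trans in transitions.get(prev, {}).items():
                score = q_prev * p_trans * emit[state].get(obs, 0.0)
                # Viterbi: keep only the best incoming path per state.
                if score > candidates.get(state, 0.0):
                    candidates[state] = score
        # Prune everything that falls outside the beam.
        best = max(candidates.values(), default=0.0)
        return {s: q for s, q in candidates.items() if q >= delta * best}

    # Toy usage with the two-state HMM from slide 3 (delta is the beam width).
    transitions = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
    emit = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}
    active = {"H": 0.32, "C": 0.02}          # Q(1, s) after the first observation
    print(beam_step(active, transitions, emit, obs=1, delta=0.5))

In practice scores are kept in log space, so the beam becomes an additive offset from the best log score rather than a multiplicative factor.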

19. Static and dynamic networks
• What we've seen so far: a static decoding graph
  • H ○ C ○ L ○ G
  • Determinize/minimize to make this graph more compact
• Another approach: dynamic graph expansion
  • Dynamically build the graph with active states on the fly
  • Do on-the-fly composition with the language model G: (H ○ C ○ L) ○ G
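
A rough sketch of the idea behind on-the-fly composition: the decoder tracks pairs of component states and asks each component for its outgoing arcs only when a pair becomes active. Everything below (state numbering, arc format, toy components) is illustrative; real implementations also need composition filters to handle ε-arcs correctly and lookahead to stay efficient:

    # Sketch of on-the-fly composition of a static HCL transducer with G.
    # A composed state is a pair (hcl_state, g_state); successors are built
    # lazily, so only the reachable part of (H o C o L) o G is ever expanded.

    def expand(pair, hcl_arcs, g_arcs):
        """Yield (next_pair, weight) for one composed state."""
        hcl_state, g_state = pair
        for olabel, hcl_next, w1 in hcl_arcs(hcl_state):
            if olabel is None:                 # epsilon output: G does not move
                yield (hcl_next, g_state), w1
            else:                              # word output: advance G on that word
                for g_next, w2 in g_arcs(g_state, olabel):
                    yield (hcl_next, g_next), w1 * w2

    # Toy components (weights as probabilities).
    def hcl_arcs(s):
        return {0: [(None, 1, 1.0)], 1: [("bob", 2, 1.0)]}.get(s, [])

    def g_arcs(s, word):
        return {(0, "bob"): [(1, 0.5)]}.get((s, word), [])

    frontier = {(0, 0): 1.0}
    for _ in range(2):                         # two expansion steps, for illustration
        new_frontier = {}
        for pair, w in frontier.items():
            for nxt, aw in expand(pair, hcl_arcs, g_arcs):
                new_frontier[nxt] = max(new_frontier.get(nxt, 0.0), w * aw)
        frontier = new_frontier
        print(frontier)                        # {(1, 0): 1.0} then {(2, 1): 0.5}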

20. Multi-pass search
• Some models are too expensive to use in first-pass decoding (e.g. RNN-based LMs)
• First-pass decoding: use a simpler model (e.g. Ngram LMs)
  • to find the most probable word sequences
  • and represent them as a word lattice or an N-best list
• Rescore the first-pass hypotheses using the complex model to find the best word sequence
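
A minimal sketch of second-pass rescoring of an N-best list, reusing two acoustic-model scores from the N-best table on slide 22; rescore_lm is a hypothetical stand-in for the expensive model (e.g. an RNN LM), and lm_weight is the usual LM scale:

    def rescore_nbest(nbest, rescore_lm, lm_weight=10.0):
        """Re-rank (words, am_logprob) hypotheses with a second-pass LM.

        rescore_lm(words) -> log-probability under the expensive model
        (hypothetical stand-in); lm_weight scales the LM score when it is
        combined with the acoustic score.
        """
        rescored = []
        for words, am_logprob in nbest:
            total = am_logprob + lm_weight * rescore_lm(words)
            rescored.append((total, words))
        return [w for _, w in sorted(rescored, reverse=True)]

    # Toy usage: two hypotheses, with a dummy LM that prefers the second one.
    nbest = [("it's an area that's naturally sort of mysterious", -7193.53),
             ("that's an area that's naturally sort of mysterious", -7192.28)]
    dummy_lm = lambda words: -2.0 if words.startswith("that's") else -2.5
    print(rescore_nbest(nbest, dummy_lm)[0])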

21. Multi-pass decoding with N-best lists
• Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input
[Figure: two-pass pipeline. Speech input → decoder using a simple knowledge source → N-best list (e.g. "Alice was beginning to get...", "Every happy family...", "In a hole in the ground...", "If music be the food of love...", "If music be the foot of dove...") → rescoring with a smarter knowledge source → 1-best utterance ("If music be the food of love...").]
Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10

22. Multi-pass decoding with N-best lists
• Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input

    Rank  Path                                                   AM logprob  LM logprob
     1.   it's an area that's naturally sort of mysterious         -7193.53      -20.25
     2.   that's an area that's naturally sort of mysterious       -7192.28      -21.11
     3.   it's an area that's not really sort of mysterious        -7221.68      -18.91
     4.   that scenario that's naturally sort of mysterious        -7189.19      -22.08
     5.   there's an area that's naturally sort of mysterious      -7198.35      -21.34
     6.   that's an area that's not really sort of mysterious      -7220.44      -19.77
     7.   the scenario that's naturally sort of mysterious         -7205.42      -21.50
     8.   so it's an area that's naturally sort of mysterious      -7195.92      -21.71
     9.   that scenario that's not really sort of mysterious       -7217.34      -20.70
    10.   there's an area that's not really sort of mysterious     -7226.51      -20.01

    Figure 10.2: An example 10-best list from the Broadcast News corpus

• N-best lists aren't as diverse as we'd like
• There isn't enough information in N-best lists to effectively use other knowledge sources
Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10

23. Multi-pass decoding with lattices
• ASR lattice: a weighted automaton/directed graph representing alternate word hypotheses from an ASR system
[Figure: word lattice for the example utterance, with alternative paths through words such as "so", "it's", "that's", "there's", "an area", "the/that scenario", "naturally"/"not really", "sort of mysterious".]

24. Multi-pass decoding with lattices
• Confusion networks/sausages: lattices transformed so that competing/confusable words line up, and which can be used to compute posterior probabilities at the word level
[Figure: confusion network for the example utterance, with slots such as {it's, that's, there's}, {an, the}, {area, scenario}, {naturally, not}, followed by "sort of mysterious".]
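
As a small illustration of word-level posteriors from a confusion network: within each slot the competing words' scores are normalized to sum to one. The slot contents echo the figure; the raw scores are invented:

    # Word posteriors from a confusion network: normalize competing words'
    # scores within each slot. Slot contents follow the figure; scores are
    # made up for illustration.

    confusion_network = [
        {"it's": 3.0, "that's": 1.5, "there's": 0.5},   # slot 1
        {"an": 4.0, "the": 1.0},                        # slot 2
        {"area": 3.5, "scenario": 1.5},                 # slot 3
    ]

    def normalize(slot):
        total = sum(slot.values())
        return {word: score / total for word, score in slot.items()}

    posteriors = [normalize(slot) for slot in confusion_network]

    # The consensus (best) hypothesis picks the highest-posterior word per slot.
    best = [max(slot, key=slot.get) for slot in posteriors]
    print(best)           # ["it's", 'an', 'area']
    print(posteriors[0])  # {"it's": 0.6, "that's": 0.3, "there's": 0.1}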
