Automatic Speech Recognition (CS753)
Lecture 18: Search & Decoding (Part I)
Instructor: Preethi Jyothi
Mar 23, 2017
Recall ASR Decoding

$$W^* = \arg\max_{W} \Pr(O \mid W)\,\Pr(W)$$

With an m-gram language model and HMM acoustic models, this becomes

$$W^* = \arg\max_{w_1^N,\,N} \sum_{q_1^T} \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \left[ \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right]$$

$$\approx \arg\max_{w_1^N,\,N} \max_{q_1^T} \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \left[ \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right] \quad \text{(Viterbi approximation)}$$

• The Viterbi approximation divides the above optimisation problem into sub-problems that allow the efficient application of dynamic programming
• An exact search using Viterbi is infeasible for large vocabulary tasks!
Recall Viterbi search

• Viterbi search finds the most probable path through a trellis with time on the X-axis and states on the Y-axis

[Figure: Viterbi trellis for a toy two-state HMM (states H and C, observations 3 1 3), showing Viterbi scores such as v_1(2) = .32 and v_2(2) = max(.32 × .12, .02 × .08) = .038, with transition/emission products like P(H|H) · P(1|H) on each arc]

• Viterbi algorithm: only needs to maintain information about the most probable path at each state

Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
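As a concrete illustration of the recursion, here is a minimal Viterbi decoder for a toy HMM (a Python sketch; the two-state HMM, its probability tables and the observation sequence are made up, loosely in the spirit of the J&M figure):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable state path for an observation sequence.

    obs: list of observation indices, length T
    pi:  initial state probabilities, shape (S,)
    A:   transition probabilities, A[i, j] = P(state j | state i)
    B:   emission probabilities, B[i, o] = P(observation o | state i)
    """
    S, T = len(pi), len(obs)
    v = np.full((S, T), -np.inf)                 # v[s, t]: best log score ending in s at t
    back = np.zeros((S, T), dtype=int)           # backpointers for path recovery
    v[:, 0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for s in range(S):
            scores = v[:, t - 1] + np.log(A[:, s]) + np.log(B[s, obs[t]])
            back[s, t] = np.argmax(scores)       # only the best predecessor is kept
            v[s, t] = scores[back[s, t]]
    path = [int(np.argmax(v[:, T - 1]))]         # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(back[path[-1], t])
    return list(reversed(path)), v[:, T - 1].max()

# Toy two-state example (hypothetical numbers)
A = np.array([[0.6, 0.4], [0.5, 0.5]])            # state transitions
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])  # emissions for observations 1..3
pi = np.array([0.8, 0.2])
print(viterbi([2, 0, 2], pi, A, B))               # observation sequence "3 1 3", 0-indexed
```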
ASR Search Network

[Figure: the decoding search network at three levels — a network of words (e.g. "the birds are walking", "is", "boy"), each word expanded into a network of phones (e.g. b oy), and each phone expanded into a network of HMM states]
Time-state trellis

[Figure: time-state trellis with the states of word 1, word 2 and word 3 stacked on the vertical axis and time t on the horizontal axis]
Viterbi search over the large trellis

• Exact search is infeasible for large vocabulary tasks
  • Unknown word boundaries
  • Ngram language models greatly increase the search space
• Solutions
  • Compactly represent the search space using WFST-based optimisations
  • Beam search: prune away parts of the search space that aren't promising
Two main WFST Optimizations

• Use determinization to reduce/eliminate redundancy
  • Recall not all weighted transducers are determinizable
  • To ensure determinizability of L ∘ G, introduce disambiguation symbols in L to deal with homophones in the lexicon (a small sketch follows below):
      read : r eh d #0
      red  : r eh d #1
  • Propagate the disambiguation symbols as self-loops back to C and H. The resulting machines are L̃, C̃ and H̃
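A minimal sketch of how disambiguation symbols could be attached to a pronunciation lexicon; the lexicon dictionary and the helper name below are hypothetical, and the #0, #1, ... naming follows the slide (real toolkits also add symbols for pronunciations that are prefixes of other pronunciations, which this sketch ignores):

```python
from collections import defaultdict

def add_disambiguation_symbols(lexicon):
    """lexicon: dict word -> phone sequence (tuple of phone strings).
    Homophones (identical phone sequences) get distinct #k suffixes so that
    the resulting lexicon transducer L~ becomes determinizable."""
    by_pron = defaultdict(list)
    for word, phones in lexicon.items():
        by_pron[tuple(phones)].append(word)
    out = {}
    for phones, words in by_pron.items():
        if len(words) == 1:
            out[words[0]] = list(phones)              # unambiguous: leave as-is
        else:
            for k, word in enumerate(sorted(words)):
                out[word] = list(phones) + [f"#{k}"]  # e.g. read -> r eh d #0
    return out

lex = {"read": ("r", "eh", "d"), "red": ("r", "eh", "d"), "bob": ("b", "aa", "b")}
print(add_disambiguation_symbols(lex))
```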
Two main WFST Optimizations

• Use determinization to reduce/eliminate redundancy
• Use minimization to reduce space requirements
  • Minimization ensures that the final composed machine has the minimum number of states

Final optimization cascade:

N = π_ε( min( det( H̃ ∘ det( C̃ ∘ det( L̃ ∘ G ) ) ) ) )

π_ε replaces the disambiguation symbols in the input alphabet of H̃ with ε
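Expressed with OpenFst's Python extension (pywrapfst — an assumption; any FST library with compose/determinize/minimize would do), the cascade might look roughly like the sketch below. The file names are placeholders, the exact API can differ across OpenFst versions, and the final π_ε relabelling of disambiguation symbols to ε is only indicated by a comment:

```python
import pywrapfst as fst

# H~, C~, L~ and G are assumed to have been built beforehand; file names are placeholders
H = fst.Fst.read("H_tilde.fst")
C = fst.Fst.read("C_tilde.fst")
L = fst.Fst.read("L_tilde.fst")
G = fst.Fst.read("G.fst")

def det_compose(a, b):
    """det(a o b): arc-sort, compose, then determinize."""
    a.arcsort(sort_type="olabel")
    b.arcsort(sort_type="ilabel")
    return fst.determinize(fst.compose(a, b))

LG = det_compose(L, G)        # det(L~ o G)
CLG = det_compose(C, LG)      # det(C~ o det(L~ o G))
HCLG = det_compose(H, CLG)    # det(H~ o det(C~ o det(L~ o G)))
HCLG.minimize()               # min(...), in place
# pi_eps: finally, relabel the disambiguation symbols on the input side to epsilon
HCLG.write("HCLG.fst")
```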
Example G

[Figure: a small grammar acceptor G with states 0 → 1 → 2; arcs bob:bob, bond:bond, rob:rob from state 0 to state 1, and arcs slept:slept, read:read, ate:ate from state 1 to state 2]
Compact language models (G)

• Use backoff Ngram language models for G

[Figure: a backoff trigram LM as a WFST — history states (a,b), (b,c), b, c and ε; word arcs such as c / Pr(c|a,b) from (a,b) to (b,c) and c / Pr(c|b) from b to c, plus backoff arcs ε / α(a,b), ε / α(b,c), ε / α(b) that fall back to lower-order histories]
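The backoff arcs implement the usual recursion: use Pr(c | a, b) if the trigram was seen, otherwise apply the backoff weight α(a, b) and fall back to Pr(c | b), and so on. A minimal sketch with made-up probability tables:

```python
def backoff_prob(word, history, ngram_probs, backoff_weights):
    """P(word | history) for a backoff Ngram LM.
    ngram_probs:      dict mapping (history..., word) tuples to probabilities
    backoff_weights:  dict mapping history tuples to backoff weights alpha(history)
    """
    key = tuple(history) + (word,)
    if key in ngram_probs:
        return ngram_probs[key]                   # seen n-gram: use it directly
    if not history:
        return 0.0                                # unseen unigram
    alpha = backoff_weights.get(tuple(history), 1.0)
    return alpha * backoff_prob(word, history[1:], ngram_probs, backoff_weights)

# Hypothetical tables: trigram (a, b, c) is unseen, so we back off to Pr(c | b)
probs = {("b", "c"): 0.1, ("c",): 0.05}
alphas = {("a", "b"): 0.4, ("b",): 0.6}
print(backoff_prob("c", ["a", "b"], probs, alphas))   # 0.4 * 0.1 = 0.04
```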
Example L̃: Lexicon with disambiguation symbols

[Figure: the lexicon transducer L̃, mapping phone sequences to words with the word label on the first arc — b aa b #0 : bob, b aa n d : bond, r aa b : rob, s l eh p t : slept, r eh d : read, ey t : ate — with shared start/end states and ε word-boundary arcs]
L̃ ∘ G and det(L̃ ∘ G)

[Figure (top): L̃ ∘ G — lexicon paths for bob, bond and rob followed by paths for slept, read and ate, before determinization]
[Figure (bottom): det(L̃ ∘ G) — determinization merges common prefixes, e.g. bob and bond now share their initial b arc, with the word output delayed until the paths diverge]
det(L̃ ∘ G) and min(det(L̃ ∘ G))

[Figure (top): det(L̃ ∘ G), as on the previous slide]
[Figure (bottom): min(det(L̃ ∘ G)) — minimization additionally merges equivalent states (shared suffixes), reducing the total number of states in the composed machine]
Viterbi search over the large trellis

• Exact search is infeasible for large vocabulary tasks
  • Unknown word boundaries
  • Ngram language models greatly increase the search space
• Solutions
  • Compactly represent the search space using WFST-based optimisations
  • Beam search: prune away parts of the search space that aren't promising
Beam pruning

• At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path
• Given active nodes from the last time-step:
  • Examine nodes in the current time-step that are reachable from active nodes in the previous time-step
  • Get active nodes for the current time-step by only retaining nodes with hypotheses that score close to the score of the best hypothesis
Beam search

• Beam search at each node keeps only hypotheses with scores that fall within a threshold of the current best hypothesis
• Hypotheses with Q(t, s) < δ · max_{s'} Q(t, s') are pruned
  • Here, δ controls the beam width
• Search errors can occur if the most probable hypothesis gets pruned
• Trade-off between avoiding search errors and speeding up decoding (a sketch of the pruning step follows)
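A sketch of how beam pruning could be grafted onto a time-synchronous Viterbi pass. Scores are kept in the log domain, so the multiplicative threshold δ becomes an additive log-beam; the `expand` function and all data structures here are simplified placeholders:

```python
import math

def beam_viterbi(obs, start_states, expand, log_beam=10.0):
    """Time-synchronous Viterbi with beam pruning.

    obs:          list of observation frames
    start_states: dict state -> initial log score
    expand:       function (state, frame) -> list of (next_state, log score increment)
    log_beam:     hypotheses more than this many log units below the best are dropped
    """
    active = dict(start_states)                 # state -> best log score so far
    for frame in obs:
        new_scores = {}
        # Only expand states that survived pruning at the previous time-step
        for state, score in active.items():
            for next_state, delta in expand(state, frame):
                cand = score + delta
                if cand > new_scores.get(next_state, -math.inf):
                    new_scores[next_state] = cand   # keep best path into next_state
        best = max(new_scores.values())
        # Beam pruning: retain only hypotheses that score close to the current best
        active = {s: v for s, v in new_scores.items() if v > best - log_beam}
    return active
```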
Static and dynamic networks

• What we've seen so far: a static decoding graph
  • H ∘ C ∘ L ∘ G
  • Determinize/minimize to make this graph more compact
• Another approach: dynamic graph expansion
  • Dynamically build the graph with active states on the fly
  • Do on-the-fly composition with the language model G: (H ∘ C ∘ L) ∘ G
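A rough, schematic sketch of the idea behind on-the-fly composition: the decoder keeps pairs (search-network state, LM state) and only instantiates successor pairs as they are reached, instead of precomposing the full graph. The `hcl` and `lm` interfaces below are hypothetical:

```python
def expand_composed_state(state, frame, hcl, lm):
    """Lazily expand one state of (H o C o L) o G.

    state: pair (hcl_state, lm_state); the pair is created only when reached
    hcl:   object with .arcs(state, frame) -> [(next_state, word_or_None, log score)]
    lm:    object with .score(lm_state, word) -> (next_lm_state, log prob)
    """
    successors = []
    for next_hcl, word, acoustic_score in hcl.arcs(state[0], frame):
        if word is None:
            # Word-internal arc: the LM state is unchanged
            successors.append(((next_hcl, state[1]), acoustic_score))
        else:
            # Word boundary: consult G on the fly instead of a precomposed graph
            next_lm, lm_score = lm.score(state[1], word)
            successors.append(((next_hcl, next_lm), acoustic_score + lm_score))
    return successors
```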
Multi-pass search

• Some models are too expensive to implement in first-pass decoding (e.g. RNN-based LMs)
• First-pass decoding: use a simpler model (e.g. Ngram LMs)
  • to find the most probable word sequences
  • and represent them as a word lattice or an N-best list
• Rescore the first-pass hypotheses using the complex model to find the best word sequence
Multi-pass decoding with N-best lists

• Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input

[Figure: two-stage decoding — a simple knowledge source (N-best decoder) turns the speech input into an N-best list (e.g. "?Alice was beginning to get...", "?Every happy family...", "?In a hole in the ground...", "?If music be the food of love...", "?If music be the foot of dove..."), which a smarter knowledge source rescores to output the 1-best utterance "If music be the food of love..."]

Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10
Multi-pass decoding with N-best lists

• Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input

Figure 10.2: An example 10-best list from the Broadcast News corpus (AM and LM log probabilities per path):

 1. it's an area that's naturally sort of mysterious          AM -7193.53   LM -20.25
 2. that's an area that's naturally sort of mysterious        AM -7192.28   LM -21.11
 3. it's an area that's not really sort of mysterious         AM -7221.68   LM -18.91
 4. that scenario that's naturally sort of mysterious         AM -7189.19   LM -22.08
 5. there's an area that's naturally sort of mysterious       AM -7198.35   LM -21.34
 6. that's an area that's not really sort of mysterious       AM -7220.44   LM -19.77
 7. the scenario that's naturally sort of mysterious          AM -7205.42   LM -21.50
 8. so it's an area that's naturally sort of mysterious       AM -7195.92   LM -21.71
 9. that scenario that's not really sort of mysterious        AM -7217.34   LM -20.70
10. there's an area that's not really sort of mysterious      AM -7226.51   LM -20.01

• N-best lists aren't as diverse as we'd like
• There isn't enough information in N-best lists to effectively use other knowledge sources

Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10
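Rescoring then amounts to re-ranking the N-best list with the more expensive LM, typically combining acoustic and language-model log probabilities with an LM scale factor. A minimal sketch; the scale factor, the placeholder second-pass LM and the two-entry list are illustrative only:

```python
def rescore_nbest(nbest, new_lm_logprob, lm_scale=15.0):
    """nbest: list of (hypothesis, am_logprob, old_lm_logprob) from the first pass.
    new_lm_logprob: function hypothesis -> log prob under the second-pass LM.
    Returns hypotheses re-ranked by am + lm_scale * new LM score."""
    rescored = []
    for hyp, am, _old_lm in nbest:
        total = am + lm_scale * new_lm_logprob(hyp)   # replace the first-pass LM score
        rescored.append((total, hyp))
    return [hyp for total, hyp in sorted(rescored, reverse=True)]

# Hypothetical 2-best list in the format of the figure (AM and LM log probs)
nbest = [("it's an area that's naturally sort of mysterious", -7193.53, -20.25),
         ("that's an area that's naturally sort of mysterious", -7192.28, -21.11)]
better_lm = lambda hyp: -19.0 if hyp.startswith("it's") else -22.5   # placeholder LM
print(rescore_nbest(nbest, better_lm))
```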
Multi-pass decoding with lattices

• ASR lattice: weighted automaton/directed graph representing alternate word hypotheses from an ASR system

[Figure: a word lattice for the example utterance, with competing paths through "so", "it's", "there's", "that's", "an", "the", "that", "area", "scenario", "that's", "naturally", "not really", "sort of", "mysterious"]
Multi-pass decoding with lattices

• Confusion networks/sausages: lattices that show competing/confusable words and can be used to compute posterior probabilities at the word level

[Figure: a confusion network for the same utterance — successive slots of competing words such as {it's, there's, that's}, {an, the}, {area, scenario}, {that's, that}, {naturally, not}, followed by "sort of mysterious"]
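Since each slot of a confusion network holds competing words whose posteriors sum to one, consensus decoding simply picks the highest-posterior word per slot. A minimal sketch over a made-up network (the `<eps>` entry stands for a possible deletion):

```python
def consensus_decode(confusion_network):
    """confusion_network: list of slots; each slot is a dict word -> posterior.
    Returns the highest-posterior word in each slot (skipping epsilon/deletions)."""
    best = []
    for slot in confusion_network:
        word, posterior = max(slot.items(), key=lambda kv: kv[1])
        if word != "<eps>":                # an epsilon entry means "no word here"
            best.append(word)
    return best

# Hypothetical confusion network for the example in the figure
cn = [{"it's": 0.55, "there's": 0.25, "that's": 0.20},
      {"an": 0.9, "<eps>": 0.1},
      {"area": 0.6, "scenario": 0.4},
      {"that's": 1.0},
      {"naturally": 0.7, "not": 0.3}]
print(consensus_decode(cn))   # ["it's", 'an', 'area', "that's", 'naturally']
```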