  1. Search and Decoding Lecture 16 CS 753 Instructor: Preethi Jyothi

  2. Recall Viterbi search
  • Viterbi search finds the most probable path through a trellis of time on the X-axis and states on the Y-axis
  [Figure: trellis for a two-state (H/C) HMM over the observation sequence 3 1 3, showing the Viterbi values v1(1) = .02, v1(2) = .8*.4 = .32, v2(1) = max(.32*.15, .02*.25) = .048, v2(2) = max(.32*.12, .02*.08) = .038]
  • Viterbi algorithm: only needs to maintain information about the most probable path at each state
  Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
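The recursion on this slide translates almost line for line into code. Below is a minimal Python sketch of Viterbi over a time-state trellis; the two-state H/C model, its probabilities, and the observation sequence 3 1 3 are chosen to match the toy example in the figure (from Jurafsky & Martin), so treat the exact numbers as illustrative rather than as part of the lecture.

    def viterbi(obs, states, start_p, trans_p, emit_p):
        """Find the most probable state path through the time-state trellis.
        v[t][j] = best probability of any path ending in state j after t+1 observations."""
        v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
        backptr = [{}]
        for t in range(1, len(obs)):
            v.append({})
            backptr.append({})
            for j in states:
                # Viterbi only keeps the single best predecessor of each state.
                best_i = max(states, key=lambda i: v[t - 1][i] * trans_p[i][j])
                v[t][j] = v[t - 1][best_i] * trans_p[best_i][j] * emit_p[j][obs[t]]
                backptr[t][j] = best_i
        # Trace back from the best final state.
        last = max(states, key=lambda s: v[-1][s])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(backptr[t][path[-1]])
        return list(reversed(path)), v[-1][last]

    # Toy two-state model matching the figure (H = hot, C = cold in the J&M example);
    # only the emission probabilities needed for the observations 3 1 3 are listed.
    states = ["H", "C"]
    start_p = {"H": 0.8, "C": 0.2}
    trans_p = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
    emit_p = {"H": {3: 0.4, 1: 0.2}, "C": {3: 0.1, 1: 0.5}}

    print(viterbi([3, 1, 3], states, start_p, trans_p, emit_p))
    # (['H', 'H', 'H'], 0.009216) for this toy model; v_1(H) = .32 and v_2(C) = .048
    # appear as intermediate values, matching the trellis on the slide.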

  3. ASR Search Network
  [Figure: the search network at three levels: a network of HMM states, a network of phones (e.g. d ax, b oy), and a network of words (the, boy, is, walking, are, birds, ...)]

  4. Time-state trellis
  [Figure: time-state trellis with time t on the horizontal axis and the states of word 1, word 2 and word 3 stacked on the vertical axis]

  5. Viterbi search over the large trellis
  • Exact search is infeasible for large vocabulary tasks
    • Unknown word boundaries
    • Ngram language models greatly increase the search space
  • Solutions
    • Compactly represent the search space using WFST-based optimisations
    • Beam search: Prune away parts of the search space that aren't promising

  6. Viterbi search over the large trellis
  • Exact search is infeasible for large vocabulary tasks
    • Unknown word boundaries
    • Ngram language models greatly increase the search space
  • Solutions
    • Compactly represent the search space using WFST-based optimisations
    • Beam search: Prune away parts of the search space that aren't promising

  7. Two main WFST Optimizations
  • Use determinization to reduce/eliminate redundancy
    • Recall not all weighted transducers are determinizable
    • To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon (see the sketch below):
        read : r eh d #1
        red : r eh d #2
    • Propagate the disambiguation symbols as self-loops back to C and H. Resulting machines are H̃, C̃, L̃
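As a concrete illustration of the disambiguation step, here is a small Python sketch that appends #1, #2, ... to pronunciations shared by more than one word. The toy lexicon and the helper name add_disambig_symbols are made up for illustration; real toolkits also add symbols when one pronunciation is a prefix of another, which this sketch ignores.

    from collections import defaultdict

    def add_disambig_symbols(lexicon):
        """Append #1, #2, ... to pronunciations that would otherwise make L o G
        non-determinizable (homophones sharing the same phone string).
        `lexicon` maps word -> phone tuple; returns word -> augmented phone tuple."""
        by_pron = defaultdict(list)
        for word, phones in lexicon.items():
            by_pron[tuple(phones)].append(word)

        augmented = {}
        for phones, words in by_pron.items():
            if len(words) == 1:
                augmented[words[0]] = tuple(phones)            # unambiguous: leave as is
            else:
                for k, word in enumerate(sorted(words), 1):    # homophones: make paths distinct
                    augmented[word] = tuple(phones) + (f"#{k}",)
        return augmented

    lexicon = {"read": ("r", "eh", "d"), "red": ("r", "eh", "d"), "rob": ("r", "aa", "b")}
    print(add_disambig_symbols(lexicon))
    # {'read': ('r', 'eh', 'd', '#1'), 'red': ('r', 'eh', 'd', '#2'), 'rob': ('r', 'aa', 'b')}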

  8. Two main WFST Optimizations
  • Use determinization to reduce/eliminate redundancy
  • Use minimization to reduce space requirements
    • Minimization ensures that the final composed machine has the minimum number of states
  • Final optimization cascade: N = π_ε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))
    • π_ε replaces the disambiguation symbols in the input alphabet of H̃ with ε
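Read the cascade inside-out: compose with G first, determinize after every composition, then minimize once and strip the auxiliary symbols. A sketch of that recipe is below, assuming the pynini (OpenFst) Python bindings for compose/determinize/minimize; replace_disambig_with_epsilon is a hypothetical caller-supplied helper standing in for the relabelling step π_ε, which is toolkit-specific.

    import pynini  # assumed available: Python bindings over OpenFst

    def build_decoding_graph(H_t, C_t, L_t, G, replace_disambig_with_epsilon):
        """N = pi_eps(min(det(H~ o det(C~ o det(L~ o G))))), as on the slide.
        H_t, C_t, L_t are H, C, L with disambiguation symbols added;
        replace_disambig_with_epsilon implements pi_eps (relabel #k -> epsilon)."""
        LG = pynini.determinize(pynini.compose(L_t, G))
        CLG = pynini.determinize(pynini.compose(C_t, LG))
        HCLG = pynini.determinize(pynini.compose(H_t, CLG))
        HCLG.minimize()                                   # in-place minimization
        return replace_disambig_with_epsilon(HCLG)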

  9. Example G
  [Figure: a small grammar acceptor G with states 0, 1, 2 over two-word sentences: bob, bond or rob followed by slept, read or ate]

  10. Example L̃: Lexicon with disambig symbols
  [Figure: lexicon transducer L̃ with one phone-sequence path per word (b:bob aa:- b:-, b:bond aa:- n:- d:-, r:rob aa:- b:-, r:read eh:- d:-, s:slept l:- eh:- p:- t:-, ey:ate t:-) and a #0 disambiguation symbol]

  11. L̃ ○ G and det(L̃ ○ G)
  [Figure: the composition L̃ ○ G, with one path per sentence, and its determinization det(L̃ ○ G), in which paths that share a prefix (e.g. the b: of bob and bond, the r: of rob and read) are merged]

  12. det(L̃ ○ G) and min(det(L̃ ○ G))
  [Figure: the determinized machine det(L̃ ○ G) and its minimization min(det(L̃ ○ G)), which merges equivalent suffixes and so has fewer states]

  13. 1st-pass recognition networks (40K vocab)

  transducer                 × real-time
  C ○ L ○ G                  12.5
  C ○ det(L ○ G)             1.2
  det(H ○ C ○ L ○ G)         1.0
  push(min(F))               0.7

  Recognition speeds for systems with an accuracy of 83%

  14. Static and dynamic networks
  • What we've seen so far: static decoding graph
    • H ○ C ○ L ○ G
    • Determinize/minimize to make this graph more compact
  • Another approach: dynamic graph expansion
    • Dynamically build the graph with active states on the fly
    • Do on-the-fly composition with the language model G: (H ○ C ○ L) ○ G (sketched below)
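To make "on-the-fly composition" concrete, the sketch below expands a composed state (one state of the static H ○ C ○ L graph paired with one language-model state of G) only when the decoder reaches it. The Arc tuple and the .arcs() accessor are assumed interfaces, and the epsilon-filter machinery of full composition is left out.

    from collections import namedtuple

    # Assumed arc format: input label, output label, weight (e.g. negative log prob), next state.
    Arc = namedtuple("Arc", ["ilabel", "olabel", "weight", "nextstate"])

    def expand_state(hcl, g, state):
        """Lazily generate the outgoing arcs of the composed state (s_hcl, s_g).
        Only states the search actually reaches are ever expanded, so the full
        (H o C o L) o G graph is never built."""
        s1, s2 = state
        for a in hcl.arcs(s1):               # arcs of the static acoustic-side graph
            if a.olabel == 0:                # no word emitted (epsilon output): G stays put
                yield Arc(a.ilabel, 0, a.weight, (a.nextstate, s2))
            else:
                for b in g.arcs(s2):         # a word is emitted: advance the language model
                    if b.ilabel == a.olabel:
                        yield Arc(a.ilabel, b.olabel, a.weight + b.weight,
                                  (a.nextstate, b.nextstate))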

  15. Viterbi search over the large trellis
  • Exact search is infeasible for large vocabulary tasks
    • Unknown word boundaries
    • Ngram language models greatly increase the search space
  • Solutions
    • Compactly represent the search space using WFST-based optimisations
    • Beam search: Prune away parts of the search space that aren't promising

  16. Beam pruning
  • At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path
  • Given active nodes from the last time-step:
    • Examine nodes in the current time-step that are reachable from active nodes in the previous time-step
    • Get active nodes for the current time-step by only retaining nodes with hypotheses that score close to the score of the best hypothesis

  17. Viterbi beam search decoder
  • Time-synchronous search algorithm:
    • For time t, each state is updated by the best score from all states in time t-1
    • Beam search prunes unpromising states at every time step
  • At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the score of the best hypothesis

  18. Beam search algorithm
  Initialization: current states := initial state
  while (current states do not contain the goal state) do:
      successor states := NEXT(current states)        (NEXT is the next-state function)
      score the successor states
      current states := pruned set of successor states using beam width δ
          (only retain those successor states that are within δ times the best path weight)
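A minimal Python version of this loop for time-synchronous decoding follows. Costs are negative log probabilities, so "within δ of the best" is an additive threshold here (the multiplicative "δ times the best path weight" phrasing above corresponds to probabilities). The graph.arcs() interface and the acoustic_cost callback are assumptions for illustration.

    def beam_search_decode(graph, start_state, observations, acoustic_cost, delta):
        """Time-synchronous Viterbi beam search.
        graph.arcs(state) yields (ilabel, olabel, graph_cost, next_state) tuples and
        acoustic_cost(obs, ilabel) returns -log P(obs | ilabel); both are assumed
        interfaces. Costs are negative logs, so "in the beam" means cost <= best + delta."""
        active = {start_state: (0.0, [])}              # state -> (cost, word sequence so far)
        for obs in observations:
            successors = {}
            for state, (cost, words) in active.items():
                for ilabel, olabel, graph_cost, next_state in graph.arcs(state):
                    new_cost = cost + graph_cost + acoustic_cost(obs, ilabel)
                    new_words = words + [olabel] if olabel is not None else words
                    # Viterbi update: keep only the best-scoring hypothesis per state.
                    if next_state not in successors or new_cost < successors[next_state][0]:
                        successors[next_state] = (new_cost, new_words)
            best = min(cost for cost, _ in successors.values())
            # Beam pruning: drop states whose best hypothesis falls outside the beam.
            active = {s: hyp for s, hyp in successors.items() if hyp[0] <= best + delta}
        best_state = min(active, key=lambda s: active[s][0])
        return active[best_state]                      # (total cost, best word sequence)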

  19. Beam search over the decoding graph
  [Figure: beam search with δ = 2 over the decoding graph unrolled against observations O1 O2 O3 ... OT; word arcs such as x1:the, x200:the, x2:a leave the start state]
  Score of an arc: -log P(O1 | x1) + graph cost

  20. Beam search in a seq2seq model
  [Figure: attention-based encoder-decoder (encoder states h1 ... hM, attention weights αij, context ci = Σj αij hj, decoder state si); with δ = 3, beam search extends the prefixes "a", "e", "u" using P(ŷ2 | x, "a"), P(ŷ2 | x, "e"), P(ŷ2 | x, "u")]
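The same idea applies to a seq2seq decoder, except the hypotheses are partial output strings scored by the model rather than trellis states: keep the δ best prefixes (here δ acts as a beam size), extend each by every next token, rescore, and keep the δ best again. In the sketch below, step_log_probs (the decoder's next-token distribution given the encoded input and the prefix) and the token inventory are hypothetical stand-ins for a trained model.

    def seq2seq_beam_search(step_log_probs, vocab, eos, beam_size, max_len):
        """Beam search over output prefixes of a seq2seq model.
        step_log_probs(prefix) returns {token: log P(token | x, prefix)} for an
        already-encoded input x; it is a hypothetical stand-in for a trained decoder."""
        beams = [((), 0.0)]                            # (prefix, total log probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                log_probs = step_log_probs(prefix)
                for token in vocab:
                    candidates.append((prefix + (token,), score + log_probs[token]))
            # Prune: keep only the beam_size highest-scoring extensions.
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for prefix, score in candidates[:beam_size]:
                (finished if prefix[-1] == eos else beams).append((prefix, score))
            if not beams:                              # every surviving hypothesis has ended
                break
        return max(finished + beams, key=lambda c: c[1])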

  21. Lattices
  • "Lattices" are useful when more than one hypothesis is desired from a recognition pass
  • A lattice is a weighted, directed acyclic graph which encodes a large number of ASR hypotheses, weighted by acoustic model + language model scores specific to a given utterance

  22. Lattice Generation
  • Say we want to decode an utterance, U, of T frames
  • Construct a sausage acceptor for this utterance, X, with T+1 states and arcs for each context-dependent HMM state at each time-step (see the sketch below)
  • Search the following composed machine for the best word sequence corresponding to U:
        D = X ○ HCLG
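A sketch of the utterance acceptor X described above, assuming per-frame log-likelihoods for each context-dependent HMM state are already available from the acoustic model; the list-of-arcs representation is just for illustration.

    def make_utterance_acceptor(frame_log_likelihoods):
        """Build the 'sausage' acceptor X for an utterance of T frames: states 0..T,
        and between t and t+1 one arc per context-dependent HMM state, weighted by
        its frame log-likelihood. Returned as a list of (src, dst, hmm_state, cost)."""
        arcs = []
        for t, log_liks in enumerate(frame_log_likelihoods):   # log_liks: {hmm_state: log p(o_t | s)}
            for hmm_state, ll in log_liks.items():
                arcs.append((t, t + 1, hmm_state, -ll))         # cost = negative log likelihood
        return arcs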

  23. Lattice Generation
  • For all practical applications, we have to use beam pruning over D such that only a subset of states/arcs in D are visited. Call this resulting pruned machine B
  • A word lattice, say L, is a further pruned version of B defined by a lattice beam, β. L satisfies the following requirements (see the pruning sketch below):
    • L should have a path for every word sequence within β of the best-scoring path in B
    • All scores and alignments in L correspond to actual paths through B
    • L does not contain duplicate paths with the same word sequence
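The β requirement can be enforced with one forward and one backward pass over B's costs: an arc survives only if the best complete path through it is within β of the overall best path. Below is a sketch under those assumptions; it uses a plain list-of-arcs lattice with negative-log costs and topologically numbered states, and it does not handle the duplicate-path requirement, which needs determinization over word sequences.

    import math
    from collections import defaultdict

    def prune_lattice(arcs, start, final, beta):
        """Keep only arcs on some path whose total cost is within `beta` of the best path.
        `arcs` is a list of (src, dst, word, cost) with costs as negative log
        (acoustic + LM) scores; states are integers numbered in topological order,
        so src < dst for every arc (always possible since a lattice is acyclic)."""
        fwd = defaultdict(lambda: math.inf)   # best cost from start to each state
        bwd = defaultdict(lambda: math.inf)   # best cost from each state to final
        fwd[start], bwd[final] = 0.0, 0.0
        for src, dst, _, cost in sorted(arcs, key=lambda a: a[0]):                # forward pass
            fwd[dst] = min(fwd[dst], fwd[src] + cost)
        for src, dst, _, cost in sorted(arcs, key=lambda a: a[1], reverse=True):  # backward pass
            bwd[src] = min(bwd[src], cost + bwd[dst])
        best = fwd[final]
        # An arc survives iff the best complete path through it is within beta of the best path.
        return [(s, d, w, c) for s, d, w, c in arcs if fwd[s] + c + bwd[d] <= best + beta]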
