Lecture 8: LVCSR Decoding
Bhuvana Ramabhadran, Michael Picheny, Stanley F. Chen
IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
{bhuvana,picheny,stanchen}@us.ibm.com
27 October 2009

Administrivia
- Main feedback from last lecture:
  - Mud: k-means clustering.
- Lab 2 handed back today.
  - Answers: /user1/faculty/stanchen/e6870/lab2_ans/ .
- Lab 3 due Thursday, 11:59pm.
- Next week: Election Day.
  - Lab 4 out by then?

The Big Picture
- Weeks 1–4: Small vocabulary ASR.
- Weeks 5–8: Large vocabulary ASR.
  - Week 5: Language modeling.
  - Week 6: Pronunciation modeling ⇔ acoustic modeling for large vocabularies.
  - Week 7: Training for large vocabularies.
  - Week 8: Decoding for large vocabularies.
- Weeks 9–13: Advanced topics.

Outline
- Part I: Introduction to LVCSR decoding, i.e., search.
- Part II: Finite-state transducers.
- Part III: Making decoding efficient.
- Part IV: Other decoding paradigms.
Part I: Introduction to LVCSR Decoding

Decoding for LVCSR

    class(x) = argmax_ω P(ω | x)
             = argmax_ω P(ω) P(x | ω) / P(x)
             = argmax_ω P(ω) P(x | ω)

- Now that we know how to build models for LVCSR ...
  - n-gram models via counting and smoothing.
  - CD acoustic models via complex recipes.
- How can we use them for decoding?

Decoding: Small Vocabulary
- Take graph/WFSA representing language model ...
  - i.e., all allowable word sequences.
  (figure: a WFSA over the two-word vocabulary UH, LIKE)
- Expand to underlying HMM.
- Run the Viterbi algorithm! (a minimal sketch follows below)

Issue: Are N-Gram Models WFSA's?
- Yup.
- One state for each (n−1)-gram history ω.
  - All paths ending in state ω ... are labeled with word sequence ending in ω.
- State ω has outgoing arc for each word w ... with arc probability P(w | ω).
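To make "run the Viterbi algorithm" concrete, here is a minimal sketch of a frame-synchronous Viterbi pass over a fully expanded decoding graph. The function and argument names are illustrative (not from the course labs), the graph representation is an assumption, and non-emitting (ε/skip) arcs are ignored: every arc is assumed to consume exactly one frame.

    def viterbi_decode(frames, states, arcs, start_state, final_states, acoustic_logprob):
        """Minimal Viterbi over a decoding graph.

        frames           -- acoustic feature vectors, one per frame
        states           -- iterable of HMM state ids in the graph
        arcs             -- list of (src, dst, word_or_None, transition_logprob)
        start_state      -- the single initial state
        final_states     -- states where a path may legally end
        acoustic_logprob -- function (state, frame) -> output log-likelihood
        """
        NEG_INF = float("-inf")
        score = {s: NEG_INF for s in states}
        score[start_state] = 0.0
        backptrs = []                      # per frame: state -> (previous state, word label)

        for frame in frames:
            new_score = {s: NEG_INF for s in states}
            best = {}
            for src, dst, word, trans_lp in arcs:
                cand = score[src] + trans_lp + acoustic_logprob(dst, frame)
                if cand > new_score[dst]:
                    new_score[dst] = cand
                    best[dst] = (src, word)
            score = new_score
            backptrs.append(best)

        # pick the best final state, then trace back the word sequence
        end_state = max(final_states, key=lambda s: score[s])
        total_logprob = score[end_state]
        words, state = [], end_state
        for best in reversed(backptrs):
            state, word = best[state]
            if word is not None:
                words.append(word)
        words.reverse()
        return words, total_logprob

Note the inner loop touches every arc in every frame; restricting it to promising states is exactly the efficiency problem Part III takes up.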
Bigram, Trigram LM's Over Two-Word Vocab
(figures: a bigram acceptor with history states h=w1, h=w2 and arcs w1/P(w1|w1), w2/P(w2|w1), w1/P(w1|w2), w2/P(w2|w2); a trigram acceptor with history states h=w1,w1, h=w1,w2, h=w2,w1, h=w2,w2 and arcs such as w1/P(w1|w1,w2), w2/P(w2|w2,w2))

Pop Quiz
- How many states in FSA representing n-gram model with vocabulary size |V|?
- How many arcs?

Issue: Graph Expansion
- Word models: Replace each word with its HMM.
- CI phone models:
  - Replace each word with its phone sequence(s).
  - Replace each phone with its HMM.
- How can we do context-dependent expansion?

Context-Dependent Graph Expansion
(figures: a phone graph for THE DOG — DH AH, D AO G — and a bigram word graph over UH and LIKE with arcs LIKE/P(LIKE|UH), UH/P(UH|UH), UH/P(UH|LIKE), LIKE/P(LIKE|LIKE))
- Handling branch points is tricky.
- Other tricky cases:
  - Words consisting of a single phone.
  - Quinphone models.
(the easy, branch-free case is sketched below)
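To see why a straight-line phone sequence is the easy case, here is a rough sketch of expanding a linear pronunciation into triphone units. The left_center_right naming and the "?" padding phone are illustrative, and none of the hard cases above (branch points, one-phone words, quinphones) are handled.

    def expand_to_triphones(phones, unk="?"):
        """Expand a linear phone sequence (no branch points) into triphone
        units named left_center_right; contexts off the ends are padded
        with a dummy 'unknown' phone, as in word-internal models."""
        out = []
        for i, p in enumerate(phones):
            left = phones[i - 1] if i > 0 else unk
            right = phones[i + 1] if i + 1 < len(phones) else unk
            out.append(f"{left}_{p}_{right}")
        return out

    # THE DOG as a single linear pronunciation:
    print(expand_to_triphones("DH AH D AO G".split()))
    # ['?_DH_AH', 'DH_AH_D', 'AH_D_AO', 'D_AO_G', 'AO_G_?']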
Triphone Graph Expansion Example
(figure: the phone graph for THE DOG expanded into triphone arcs such as G_DH_AH, DH_AH_D, DH_AH_DH, AH_DH_AH, AH_D_AO, D_AO_G, AO_G_D, AO_G_DH, G_D_AO)

Aside: Word-Internal Acoustic Models
- Simplify acoustic model to simplify graph expansion.
- Word-internal models:
  - Don't let decision trees ask questions across word boundaries.
  - Pad contexts with the unknown phone.
- Hurts performance (e.g., coarticulation across words).
- As with word models, just replace each word with its HMM.

Issue: How Big The Graph?
- Trigram model (e.g., vocabulary size |V| = 2; see the trigram acceptor figure above).
- |V|^3 word arcs in FSA representation.
- Say words are ~4 phones = 12 states on average.
- If |V| = 50000, 50000^3 × 12 ≈ 10^15 states in graph.
- PC's have ~10^9 bytes of memory.

Issue: How Slow Decoding?
- In each frame, loop through every state in graph.
- If 100 frames/sec, 10^15 states ...
- How many cells to compute per second?
- PC's can do ~10^10 floating-point ops per second.
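The arithmetic behind the two slides above, spelled out; the figures are the slides' own rough estimates, not measurements.

    V = 50_000                       # vocabulary size
    word_arcs = V ** 3               # word arcs in the naive trigram FSA
    states_per_word = 12             # ~4 phones per word, ~3 HMM states per phone
    graph_states = word_arcs * states_per_word
    print(f"{graph_states:.1e} states in the graph")        # ~1.5e+15

    frames_per_second = 100
    cells_per_second = graph_states * frames_per_second
    print(f"{cells_per_second:.1e} DP cells per second")    # ~1.5e+17
    # ... versus ~1e10 floating-point ops/sec and ~1e9 bytes of memory on a 2009-era PC.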
Recap: Small vs. Large Vocabulary Decoding
- In theory, can use the same exact techniques.
- In practice, three big problems:
  - (Context-dependent) graph expansion is complicated.
  - The decoding graph would be way too big.
  - Decoding would be way too slow.

Part II: Finite-State Transducers

A View of Graph Expansion
- Step 1: Take word graph as input. Convert into phone graph.
- Step 2: Take phone graph as input. Convert into context-dependent phone graph.
- Step 3: Take context-dependent phone graph. Convert into HMM.

A Framework for Rewriting Graphs
- A general way of representing graph transformations? Finite-state transducers (FST's).
- A general operation for applying transformations to graphs? Composition.
Where Are We?
1. What Is an FST?
2. Composition
3. FST's, Composition, and ASR
4. Weights

Review: What is a Finite-State Acceptor?
- It has states.
  - Exactly one initial state; one or more final states.
- It has arcs.
  - Each arc has a label, which may be empty (ε).
- Ignore probabilities for now.
(figure: a small acceptor over states 1, 2, 3 with arcs labeled a, b, c, and <epsilon>)

What Does an FSA Mean?
- The (possibly infinite) list of strings it accepts.
  - We need this in order to define composition.
- Things that don't affect meaning:
  - How labels are distributed along a path.
  - Invalid paths.
- Are these equivalent? (figure: two acceptors accepting the same strings, with the labels a, b distributed differently along their paths and an extra <epsilon> label in one)

What is a Finite-State Transducer?
- It's like a finite-state acceptor, except ...
- Each arc has two labels instead of one:
  - An input label (possibly empty).
  - An output label (possibly empty).
(figure: a small transducer with arcs labeled a:a, b:a, c:c, a:<epsilon>, <epsilon>:b)
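As a concrete reference point for the rest of Part II, here is one minimal way to hold an unweighted FSA or FST in code; the Arc and FSM names are made up for this sketch, and real FSM toolkits are organized quite differently.

    from collections import namedtuple

    # One arc of a finite-state machine.  For an acceptor the input and output
    # labels are identical; for a transducer they may differ.  None stands for
    # the empty label (epsilon).
    Arc = namedtuple("Arc", ["src", "dst", "ilabel", "olabel"])

    class FSM:
        """A bare-bones unweighted finite-state machine."""
        def __init__(self, start, finals, arcs):
            self.start = start           # exactly one initial state
            self.finals = set(finals)    # one or more final states
            self.arcs = list(arcs)

    # A toy one-state transducer that swaps two symbols (not the machine in the figure):
    swap = FSM(start=0, finals={0},
               arcs=[Arc(0, 0, "a", "b"), Arc(0, 0, "b", "a")])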
What Does an FST Mean?
- A (possibly infinite) list of pairs of strings ...
  - An input string and an output string.
- The gist of composition:
  - If string i_1 ... i_N occurs in input graph ...
  - And (i_1 ... i_N, o_1 ... o_M) occurs in transducer, ...
  - Then string o_1 ... o_M occurs in output graph.

Terminology
- Finite-state acceptor (FSA): one label on each arc.
- Finite-state transducer (FST): input and output label on each arc.
- Finite-state machine (FSM): FSA or FST.
  - Also, finite-state automaton.

Where Are We?
1. What Is an FST?
2. Composition
3. FST's, Composition, and ASR
4. Weights

The Composition Operation
- A simple and efficient algorithm for computing ...
  - Result of applying a transducer to an acceptor.
  - Composing FSA A with FST T to get FSA A ◦ T.
- If string i_1 ... i_N ∈ A and ...
  - Input/output string pair (i_1 ... i_N, o_1 ... o_M) ∈ T, ...
  - Then string o_1 ... o_M ∈ A ◦ T.
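A sketch of the idea on the slide, building on the Arc/FSM classes above. It handles only the epsilon-free, unweighted case and does not prune unreachable state pairs; epsilon labels and weights (item 4 of the outline) need more care.

    def compose(A, T):
        """Compose acceptor A with transducer T (FSM objects from the sketch
        above).  A state of the result is a pair (state of A, state of T);
        an arc exists whenever an arc of A and an arc of T agree on A's
        label, and the new arc carries T's output label."""
        arcs = []
        for a in A.arcs:
            for t in T.arcs:
                if a.ilabel == t.ilabel:
                    arcs.append(Arc((a.src, t.src), (a.dst, t.dst),
                                    t.olabel, t.olabel))
        start = (A.start, T.start)
        finals = {(fa, ft) for fa in A.finals for ft in T.finals}
        return FSM(start, finals, arcs)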
Rewriting a Single String a Single Way
(figures, two variants: A is the linear acceptor a b d over states 1–4; T is a transducer mapping a:A, b:B, c:C, d:D, shown once as a linear machine and once as a single state with self-loops; in both cases A ◦ T is the linear acceptor A B D)

Transforming a Single String
- Let's say you have a string, e.g., THE DOG.
- Let's say we want to apply a one-to-one transformation.
  - e.g., map words to their (single) baseforms: DH AH D AO G.
- This is easy, e.g., use sed or perl or ...

The Magic of FST's and Composition
- Let's say you have a (possibly infinite) list of strings ...
  - Expressed as an FSA, as this is compact.
- How to transform all strings in FSA in one go?
- How to do one-to-many or one-to-zero transformations?
- Can we have the (possibly infinite) list of output strings ...
  - Expressed as an FSA, as this is compact?
- Fast?
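Continuing the toy sketch: applying a word-to-baseform transducer to the string THE DOG via compose() from above. Writing each baseform as a single token is only to keep the example epsilon-free; a real lexicon FST emits one phone per arc and needs epsilon labels, which is exactly the one-to-many case the slide asks about.

    # "THE DOG" as a two-arc linear acceptor:
    A = FSM(start=0, finals={2},
            arcs=[Arc(0, 1, "THE", "THE"), Arc(1, 2, "DOG", "DOG")])

    # A one-state transducer mapping each word to a token naming its baseform:
    T = FSM(start=0, finals={0},
            arcs=[Arc(0, 0, "THE", "DH AH"), Arc(0, 0, "DOG", "D AO G")])

    result = compose(A, T)
    # result is again a linear acceptor; its single path reads "DH AH", "D AO G".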