Administrivia
  Lab 4 due Thursday, 11:59pm.
  Lab 3 handed back next week. Answers: /user1/faculty/stanchen/e6870/lab3_ans/
  Main feedback from last lecture: pace a little fast; derivations were "heavy".

Lecture 10: Advanced Language Modeling
  Bhuvana Ramabhadran, Michael Picheny, Stanley F. Chen
  IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
  {bhuvana,picheny,stanchen}@us.ibm.com
  EECS 6870: Speech Recognition, 17 November 2009

Where Are We?
  1 Introduction
  2 Techniques for Restricted Domains
  3 Techniques for Unrestricted Domains
  4 Maximum Entropy Models
  5 Other Directions in Language Modeling
  6 An Apology

Review: Language Modeling
  The fundamental equation of speech recognition:
    class(x) = arg max_ω P(ω | x) = arg max_ω P(ω) P(x | ω)
  P(ω = w_1 ⋯ w_l) models frequencies of word sequences w_1 ⋯ w_l.
  Helps disambiguate acoustically ambiguous utterances,
  e.g., THIS IS HOUR ROOM FOUR A FOR OUR . PERIOD
  (a word-for-word homophone scrambling of THIS IS OUR ROOM FOR A FOUR HOUR PERIOD .)
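To make the fundamental equation concrete, here is a minimal sketch (not from the slides) of rescoring an N-best list with separate acoustic and language model scores; the hypothesis list, the score values, and the LM scale factor are illustrative assumptions.

```python
# Hypothetical N-best rescoring: pick the word sequence w maximizing
# log P(x | w) + LM_WEIGHT * log P(w), i.e. acoustic log-likelihood plus a
# scaled LM log-probability. All numbers below are made up for illustration.
hypotheses = [
    # (word sequence, acoustic log-likelihood log P(x|w), LM log-prob log P(w))
    ("THIS IS OUR ROOM FOR A FOUR HOUR PERIOD .", -1250.0, -35.2),
    ("THIS IS HOUR ROOM FOUR A FOR OUR . PERIOD", -1249.5, -52.8),
]

LM_WEIGHT = 12.0  # systems typically scale the LM score; exact value is a tuning knob


def total_score(acoustic_logprob, lm_logprob):
    """Combined decoding score corresponding to log[ P(x|w) * P(w)^LM_WEIGHT ]."""
    return acoustic_logprob + LM_WEIGHT * lm_logprob


best_words, _, _ = max(hypotheses, key=lambda h: total_score(h[1], h[2]))
print("best hypothesis:", best_words)
```

The two hypotheses are acoustically nearly identical, so the language model score is what picks the sensible word sequence.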
Review: Language Modeling
  Small vocabulary, restricted domains:
    Write grammar; convert to finite-state acceptor.
    Or possibly n-gram models.
  Large vocabulary, unrestricted domains:
    N-gram models all the way.

Review: N-Gram Models
  P(ω = w_1 ⋯ w_l)
    = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ⋯ P(w_l | w_1 ⋯ w_{l-1})
    = ∏_{i=1}^{l} P(w_i | w_1 ⋯ w_{i-1})
  Markov assumption: identity of next word depends only on last n − 1 words, say n = 3:
    P(w_i | w_1 ⋯ w_{i-1}) ≈ P(w_i | w_{i-2} w_{i-1})

Review: N-Gram Models
  Maximum likelihood estimation:
    P_MLE(w_i | w_{i-2} w_{i-1}) = count(w_{i-2} w_{i-1} w_i) / Σ_{w_i} count(w_{i-2} w_{i-1} w_i)
                                 = count(w_{i-2} w_{i-1} w_i) / count(w_{i-2} w_{i-1})
  Smoothing: better estimation in sparse data situations.

Spam, Spam, Spam, Spam, and Spam
  N-gram models are robust:
    Assigns nonzero probs to all word sequences.
    Handles unrestricted domains.
  N-gram models are easy to build:
    Can train on plain unannotated text.
    No iteration required over training corpus.
  N-gram models are scalable:
    Can build models on billions of words of text, fast.
    Can use larger n with more data.
  N-gram models are great! Or are they?
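A minimal sketch (not from the slides) of the maximum likelihood trigram estimate above; the toy two-sentence corpus and the <s>/</s> padding tokens are illustrative assumptions.

```python
from collections import defaultdict

# Toy corpus; real models are trained on millions to billions of words.
corpus = [
    "LET 'S EAT STEAK ON TUESDAY".split(),
    "LET 'S EAT FISH ON FRIDAY".split(),
]

trigram_counts = defaultdict(int)  # count(w_{i-2} w_{i-1} w_i)
bigram_counts = defaultdict(int)   # count(w_{i-2} w_{i-1})

for sentence in corpus:
    # Pad with begin/end markers so every word has a two-word history.
    words = ["<s>", "<s>"] + sentence + ["</s>"]
    for i in range(2, len(words)):
        history = (words[i - 2], words[i - 1])
        trigram_counts[history + (words[i],)] += 1
        bigram_counts[history] += 1

def p_mle(word, w2, w1):
    """P_MLE(word | w2 w1) = count(w2 w1 word) / count(w2 w1); zero if unseen."""
    denom = bigram_counts[(w2, w1)]
    return trigram_counts[(w2, w1, word)] / denom if denom else 0.0

print(p_mle("ON", "EAT", "STEAK"))         # 1.0 in this toy corpus
print(p_mle("THURSDAY", "SIRLOIN", "ON"))  # 0.0: unseen history, needs smoothing
```

The zero estimate for the unseen history is exactly the sparsity problem that smoothing, and the generalization problem discussed next, are about.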
The Dark Side of N-Gram Models
  In fact, n-gram models are deeply flawed.
  Let us count the ways.

What About Short-Distance Dependencies?
  Poor generalization.
    Training data contains sentence: LET 'S EAT STEAK ON TUESDAY
    Test data contains sentence: LET 'S EAT SIRLOIN ON THURSDAY
    Occurrence of STEAK ON TUESDAY ... doesn't affect estimate of P(THURSDAY | SIRLOIN ON).
  Collecting more data won't fix this (Brown et al., 1992):
    350MW training ⇒ 15% trigrams unseen.

Medium-Distance Dependencies?
  "Medium-distance" ⇔ within sentence.
  Fabio example:
    FABIO, WHO WAS NEXT IN LINE, ASKED IF THE TELLER SPOKE ...
  Trigram model: P(ASKED | IN LINE)

Medium-Distance Dependencies?
  Random generation of sentences with P(ω = w_1 ⋯ w_l):
    Roll a K-sided die where ...
    Each side s_ω corresponds to a word sequence ω ...
    And probability of landing on side s_ω is P(ω).
  Reveals what word sequences the model thinks are likely (a sampling sketch follows below).
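A minimal sketch (not from the slides) of the die-rolling picture: sampling whole sentences with probability P(ω) is the same as generating one word at a time from the trigram conditionals. The tiny hand-specified distributions below are illustrative assumptions; the WSJ samples on the next slide come from a model estimated on 20M words.

```python
import random

# Hypothetical trigram conditionals P(w_i | w_{i-2} w_{i-1}); a real model is
# estimated (and smoothed) from a large corpus, not written by hand.
trigram = {
    ("<s>", "<s>"): {"THE": 0.6, "FABIO": 0.4},
    ("<s>", "THE"): {"STOCK": 0.5, "STUDIO": 0.5},
    ("THE", "STOCK"): {"MARKET": 1.0},
    ("STOCK", "MARKET"): {"</s>": 1.0},
    ("THE", "STUDIO"): {"EXECUTIVES": 1.0},
    ("STUDIO", "EXECUTIVES"): {"</s>": 1.0},
    ("<s>", "FABIO"): {"ASKED": 1.0},
    ("FABIO", "ASKED"): {"</s>": 1.0},
}

def sample_sentence(max_len=20):
    """Generate one sentence by repeatedly rolling the P(w | last two words) die."""
    history = ("<s>", "<s>")
    sentence = []
    for _ in range(max_len):
        dist = trigram[history]
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":
            break
        sentence.append(word)
        history = (history[1], word)
    return sentence

print(" ".join(sample_sentence()))
```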
Trigram Model, 20M Words of WSJ
  (Sentences generated at random from the trigram model; tokens like HYPHEN,
  PERCENT, COMMA, and DOUBLE-QUOTE are spoken punctuation.)
  AND WITH WHOM IT MATTERS AND IN THE SHORT - HYPHEN TERM
  AT THE UNIVERSITY OF MICHIGAN IN A GENERALLY QUIET SESSION
  THE STUDIO EXECUTIVES LAW REVIEW WILL FOCUS ON INTERNATIONAL UNION OF THE STOCK MARKET
  HOW FEDERAL LEGISLATION " DOUBLE - QUOTE SPENDING
  THE LOS ANGELES THE TRADE PUBLICATION
  SOME FORTY % PERCENT OF CASES ALLEGING GREEN PREPARING FORMS
  NORTH AMERICAN FREE TRADE AGREEMENT ( LEFT - PAREN NAFTA ) RIGHT - PAREN , COMMA WOULD MAKE STOCKS
  A MORGAN STANLEY CAPITAL INTERNATIONAL PERSPECTIVE , COMMA GENEVA
  " DOUBLE - QUOTE THEY WILL STANDARD ENFORCEMENT
  THE NEW YORK MISSILE FILINGS OF BUYERS

Medium-Distance Dependencies?
  Real sentences tend to "make sense" and be coherent.
    Don't end/start abruptly.
    Have matching quotes.
    Are about a single subject.
    Some are even grammatical.
  Why can't n-gram models model this stuff?

Long-Distance Dependencies?
  "Long-distance" ⇔ between sentences.
  See previous examples.
  In real life, adjacent sentences tend to be on same topic.
    Referring to same entities, e.g., Clinton.
    In a similar style, e.g., formal vs. conversational.
  Why can't n-gram models model this stuff?
  P(ω = w_1 ⋯ w_l) = frequency of w_1 ⋯ w_l as sentence?

Recap: Shortcomings of N-Gram Models
  Not great at modeling short-distance dependencies.
  Not great at modeling medium-distance dependencies.
  Not great at modeling long-distance dependencies.
  Basically, n-gram models are just a dumb idea.
    They are an insult to language modeling researchers.
    Are great for me to poop on.
  N-gram models, ... you're fired!
Where Are We?
  1 Introduction
  2 Techniques for Restricted Domains
  3 Techniques for Unrestricted Domains
  4 Maximum Entropy Models
  5 Other Directions in Language Modeling
  6 An Apology

Where Are We?
  2 Techniques for Restricted Domains
      Embedded Grammars
      Using Dialogue State
      Confidence and Rejection

Improving Short-Distance Modeling
  Issue: data sparsity / lack of generalization.
    I WANT TO FLY FROM BOSTON TO ALBUQUERQUE
    I WANT TO FLY FROM AUSTIN TO JUNEAU
  Point: (handcrafted) grammars are good for this:
    [sentence] → I WANT TO FLY FROM [city] TO [city]
    [city] → AUSTIN | BOSTON | JUNEAU | ...
  Can we combine robustness of n-gram models ...
  with generalization ability of grammars?

Combining N-Gram Models with Grammars
  Replace cities and dates, say, in training set with class tokens:
    I WANT TO FLY TO [CITY] ON [DATE]
  Build n-gram model on new data, e.g., P([DATE] | [CITY] ON).
  Instead of n-gram model on words ...
  we have n-gram model over words and classes (a preprocessing sketch follows below).
  To model probability of class expanding to particular token, use a WFSM:
    [Figure: weighted FST expanding class tokens, with arcs such as
     [CITY]:AUSTIN/0.1, [CITY]:BOSTON/0.3, and [CITY]:NEW/1 followed by
     <epsilon>:YORK/0.4 or <epsilon>:JERSEY/0.2.]
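A minimal sketch (not from the slides) of the preprocessing step just described: rewriting class members in the training text as class tokens before the word/class n-gram model is counted. The word lists and token names here are illustrative assumptions; a real system would use the handcrafted grammars to decide class membership.

```python
# Hypothetical class vocabularies standing in for the [city] and [date] grammars.
CITIES = {"BOSTON", "ALBUQUERQUE", "AUSTIN", "JUNEAU"}
DATES = {"TUESDAY", "THURSDAY"}

def to_class_tokens(sentence):
    """Replace member words with class tokens, e.g. BOSTON -> [CITY]."""
    out = []
    for word in sentence.split():
        if word in CITIES:
            out.append("[CITY]")
        elif word in DATES:
            out.append("[DATE]")
        else:
            out.append(word)
    return out

print(to_class_tokens("I WANT TO FLY FROM BOSTON TO ALBUQUERQUE"))
# ['I', 'WANT', 'TO', 'FLY', 'FROM', '[CITY]', 'TO', '[CITY]']
# The class/word n-gram model, e.g. P([DATE] | [CITY] ON), is then counted on
# these rewritten sentences instead of the raw word sequences.
```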
The Model
  Given word sequence w_1 ⋯ w_l.
  Substitute in classes to get class/word sequence C = c_1 ⋯ c_l′, e.g.,
    I WANT TO FLY TO [CITY] ON [DATE]
  P(w_1 ⋯ w_l) = Σ_C ∏_{i=1}^{l′+1} P(c_i | c_{i-2} c_{i-1}) × P(words(c_i) | c_i)
  Sum over all possible ways to substitute in classes?
    e.g., treat MAY as verb or date?
  Viterbi approximation (a scoring sketch is given at the end of this subsection).

Implementing Embedded Grammars
  Need final LM as WFSA.
  Convert word/class n-gram model to WFSM.
  Compose with transducer expanding each class ...
  to its corresponding WFSM.
  Static or on-the-fly composition?
    What if city grammar contains 100,000 cities?

Recap: Embedded Grammars
  Improves modeling of short-distance dependencies.
  Improves modeling of medium-distance dependencies, e.g.,
    I WANT TO FLY TO WHITE PLAINS AIRPORT IN FIRST CLASS
    I WANT TO FLY TO [CITY] IN FIRST CLASS
  More robust than grammars alone.

Where Are We?
  2 Techniques for Restricted Domains
      Embedded Grammars
      Using Dialogue State
      Confidence and Rejection
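As promised above, a minimal sketch (not from the slides) of scoring a word sequence under the class model with the Viterbi approximation: instead of summing over every way to substitute in classes, keep only the best-scoring substitution via dynamic programming over the last two class/word tokens. The toy class grammar, the assumption that plain words expand to themselves with probability 1, and the externally supplied class_trigram_logprob function are all illustrative; a real system would carry this out by WFSM composition as described above.

```python
import math

# Toy class membership probabilities P(word | class); note MAY can be either a
# [DATE] member or an ordinary word (a verb), so substitutions are ambiguous.
CLASS_MEMBERS = {
    "[CITY]": {"BOSTON": 0.5, "AUSTIN": 0.5},
    "[DATE]": {"TUESDAY": 0.6, "MAY": 0.4},
}

def candidate_tokens(word):
    """All class/word tokens a surface word can come from, with P(words(c) | c)."""
    cands = [(word, 1.0)]  # the word as itself; expansion probability 1
    for cls, members in CLASS_MEMBERS.items():
        if word in members:
            cands.append((cls, members[word]))
    return cands

def viterbi_logprob(sentence, class_trigram_logprob):
    """max over substitutions C of sum_i [log P(c_i | c_{i-2} c_{i-1}) + log P(w_i | c_i)].

    class_trigram_logprob(c, c2, c1) is assumed to return log P(c | c2 c1) from
    the word/class n-gram model (smoothed, so finite for any argument).
    """
    # Each DP state is the last two class/word tokens; value is the best log-prob.
    states = {("<s>", "<s>"): 0.0}
    for word in sentence.split():
        new_states = {}
        for (c2, c1), score in states.items():
            for token, member_prob in candidate_tokens(word):
                s = score + class_trigram_logprob(token, c2, c1) + math.log(member_prob)
                key = (c1, token)
                if s > new_states.get(key, float("-inf")):
                    new_states[key] = s
        states = new_states
    return max(states.values())
```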