 
Speech Processing 11-492/18-492
Speech Recognition: Grammars and Other ASR Techniques
But not just acoustics
• But not all phones are equi-probable
• Find the word sequence W that maximizes P(W | O)
• Using Bayes' Law: P(W | O) = P(O | W) P(W) / P(O)
• Combine models
– Use HMMs to provide P(O | W), the acoustic model
– Use a language model to provide P(W)
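As a minimal sketch of that argmax (the candidate hypotheses and all scores below are invented for illustration), combining the two models in log space might look like:

```python
# Hypothetical candidate transcriptions with invented acoustic
# log P(O|W) and language-model log P(W) scores.
candidates = {
    "good morning":  {"log_acoustic": -42.1, "log_lm": -3.2},
    "good mourning": {"log_acoustic": -41.8, "log_lm": -9.7},
    "could morning": {"log_acoustic": -43.5, "log_lm": -8.1},
}

# argmax_W P(O|W) P(W)  ==  argmax_W [log P(O|W) + log P(W)]
best = max(candidates,
           key=lambda w: candidates[w]["log_acoustic"] + candidates[w]["log_lm"])
print(best)  # "good morning": slightly worse acoustics, far better LM score
```

Note that the acoustically best hypothesis loses once the language model weighs in, which is exactly why ASR is "not just acoustics".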
Beyond n-grams
• Tri-gram language models
– Good for general ASR
• More targeted models for dialog systems
– Look for more structure
Formal Language Theory
• Chomsky Hierarchy
– Finite State Machines
– Context Free Grammars
– Context Sensitive Grammars
– Generalized Rewrite Rules / Turing machines
• As LM or as understanding mechanism
• Folded into the ASR or only run on its output
Finite State Machines
• A trigram is a word^2 FSM
• FSM for greeting:
[FSM diagram: accepts "Hello", "Good Morning", "Good Afternoon"]
Finite State Grammar
• Sentences -> Start Greeting End
• Greeting -> "Hello"
• Greeting -> "Good" TOD
• TOD -> Morning
• TOD -> Afternoon
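A minimal Python sketch of this grammar as a finite state machine (the state names are my own, not from the slides):

```python
# Transition table for the greeting FSM: (state, word) -> next state.
# Accepts "hello", "good morning", "good afternoon".
TRANSITIONS = {
    ("start", "hello"): "end",
    ("start", "good"): "tod",
    ("tod", "morning"): "end",
    ("tod", "afternoon"): "end",
}

def accepts(words):
    state = "start"
    for w in words:
        state = TRANSITIONS.get((state, w.lower()))
        if state is None:
            return False          # no arc for this word: reject
    return state == "end"         # must finish in the accepting state

print(accepts("good morning".split()))   # True
print(accepts("good evening".split()))   # False
```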
Context Free Grammar
• X -> Y Z
• Y -> "Terminal"
• Y -> NonTerminal NonTerminal
JSGF
• Simple grammar formalism for ASR
• Standard for writing ASR grammars
• Actually finite state
• http://www.w3.org/TR/jsgf
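As an illustration (the grammar name is my own), the greeting grammar above could be written in JSGF roughly as:

```
#JSGF V1.0;
grammar greeting;

public <greeting> = hello | good <tod>;
<tod> = morning | afternoon;
```

Alternatives and optional items like these expand to a finite state network, which matches the "actually finite state" point above.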
Finite State Machines
• Deterministic
– Each arc leaving a state has a unique label
– There always exists a deterministic machine representing a non-deterministic one
• Minimal
– There exists an FSM with fewer (or equal) states that accepts the same language
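The standard construction behind the first claim is the subset construction; a sketch (assuming no epsilon arcs; all names are my own):

```python
def determinize(alphabet, delta, start, accepting):
    """Subset construction: NFA -> equivalent DFA.
    delta maps (state, symbol) -> set of next states (no epsilon arcs);
    accepting is the set of NFA accepting states."""
    start_set = frozenset([start])
    dfa_delta, seen, todo = {}, {start_set}, [start_set]
    while todo:
        current = todo.pop()
        for sym in alphabet:
            nxt = frozenset(s for q in current
                              for s in delta.get((q, sym), ()))
            if not nxt:
                continue
            dfa_delta[(current, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    dfa_accepting = {S for S in seen if S & accepting}
    return dfa_delta, start_set, dfa_accepting

# NFA that is nondeterministic on 'a' (state 0 can loop or advance)
delta = {(0, "a"): {0, 1}, (1, "b"): {2}}
dfa, start, acc = determinize({"a", "b"}, delta, 0, {2})
```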
Probabilistic FSMs
• Each arc has a label and a probability
• Collect probabilities from data
• Can do smoothing like n-grams
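A sketch of estimating arc probabilities from (invented) traversal data, with add-one smoothing as one might do for n-grams:

```python
from collections import Counter

# Hypothetical traversal data: lists of (state, word) arcs taken
# through the greeting FSM above.
paths = [
    [("start", "hello")],
    [("start", "good"), ("tod", "morning")],
    [("start", "good"), ("tod", "afternoon")],
    [("start", "good"), ("tod", "morning")],
]

arc_counts = Counter(arc for path in paths for arc in path)
state_counts = Counter(state for path in paths for state, _ in path)

def arc_prob(state, word, vocab_size=4):
    # Add-one smoothing, as with n-grams, so unseen arcs keep some mass.
    return (arc_counts[(state, word)] + 1) / (state_counts[state] + vocab_size)

print(arc_prob("tod", "morning"))   # (2+1)/(3+4) = 3/7
```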
Natural Language Processing
• Probably mildly context sensitive
– i.e. you need context sensitive rules
• But if we only accept context free
– Probably OK
• If we only accept finite state
– Probably OK too
Writing Grammars for Speech
• What do people say?
– No, what do people *really* say!
• Write examples
– Please, I'd like a flight to Boston
– I want to fly to Boston
– What do you have going to Boston
– What about Boston
– Boston
• Write rules grouping things together (a sketch follows below)
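A hypothetical JSGF-style grouping of those examples (rule names and the extra cities are my own, purely for illustration):

```
#JSGF V1.0;
grammar flights;

public <request> = [please] <want> to <city>
                 | what about <city>
                 | <city>;
<want> = i want to fly | i would like a flight
       | what do you have going;
<city> = boston | pittsburgh | denver;
```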
Ignore the unimportant things
• I'm terribly sorry but I would greatly appreciate if you might be able to help me find an acceptable flight to Boston.
• I, I wanna want to go to ehm Boston.
What do people really say
• A: see who else will somebody else important all the {mumble} the whole school are out for a week
• B: oh really
• A: {lipsmack} {breath} yeah
• B: okay {breath} well when are you going to come up then
• A: um let's see well I guess I I could come up actually anytime
• B: okay well how about now
• A: now
• B: yeah
• A: have to work tonight {laugh}
Class based language models
• Conflate all words in the same class
– Cities, names, numbers, etc.
• Can be automatic or designed
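A sketch of the usual class-based factoring, P(w_i | w_{i-1}) ≈ P(class(w_i) | class(w_{i-1})) * P(w_i | class(w_i)), with made-up classes and probabilities:

```python
# Words mapped to classes; non-class words act as their own class.
WORD_CLASS = {"boston": "CITY", "denver": "CITY", "to": "to", "fly": "fly"}

P_CLASS = {("to", "CITY"): 0.3}                  # P(CITY | to)
P_WORD_GIVEN_CLASS = {("boston", "CITY"): 0.1,   # P(boston | CITY)
                      ("denver", "CITY"): 0.05}  # P(denver | CITY)

def bigram_prob(word, prev):
    c, prev_c = WORD_CLASS[word], WORD_CLASS[prev]
    # Singleton classes contribute P(word | class) = 1.
    return P_CLASS.get((prev_c, c), 0.0) * P_WORD_GIVEN_CLASS.get((word, c), 1.0)

print(bigram_prob("boston", "to"))   # 0.3 * 0.1 = 0.03
```

The payoff is that adding a new city to the class needs no new bigram data: it inherits the class's distribution.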
Adaptive Language Models
• Update with new news stories
• Update your language model every day
• Update your language model with daily use
– Using user generated data (if the ASR is good)
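A toy sketch of count-based adaptation (class and method names are my own): fold each day's new text into the model's counts:

```python
from collections import Counter

class AdaptiveBigramLM:
    """Toy adaptive bigram LM: fold in new text (e.g. daily news
    stories or recognized user utterances) by updating counts."""

    def __init__(self):
        self.bigrams = Counter()
        self.unigrams = Counter()

    def update(self, tokens):
        self.unigrams.update(tokens)
        self.bigrams.update(zip(tokens, tokens[1:]))

    def prob(self, word, prev):
        # Add-one smoothed conditional probability.
        vocab = len(self.unigrams) or 1
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + vocab)

lm = AdaptiveBigramLM()
lm.update("good morning good afternoon".split())  # day 1
lm.update("good morning everyone".split())        # day 2: adapt
print(lm.prob("morning", "good"))
```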
Combining models
• Use "background" model
– General tri-gram/neural model
• Use specific model
– Grammar based
– Very localized
• Combine
– Interpolated (just a weight factor)
– More elaborate combinations
– Maximum entropy models
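The interpolated combination is just a weighted sum of the two models' probabilities; a minimal sketch (the weight 0.8 is a made-up example; in practice it would be tuned on held-out data):

```python
def interpolated_prob(p_background, p_specific, lam=0.8):
    """Linear interpolation of two LMs with a single weight factor."""
    return lam * p_background + (1.0 - lam) * p_specific

# Background tri-gram gives the word low probability, the in-domain
# grammar-based model gives it high probability; the mix is in between.
print(interpolated_prob(0.001, 0.2))   # 0.8*0.001 + 0.2*0.2 = 0.0408
```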
Vocabulary size
• Command and control: < 100 words, grammar based
• Simple dialog: < 1000 words, grammar/tri-gram
• Complex dialog: < 10K words, tri-gram (some grammar for control)
• Dictation: < 64K words, tri-gram
• Broadcast News: 256K plus, tri-gram/neural (and lots of other possibilities)
Homework 1
• Build a speech recognition system
– An acoustic model
– A pronunciation lexicon
– A language model
• Note it takes time to build
• What is your initial WER? (see the sketch below)
• How did you improve it?
• Two stages:
– Fri 25th Sep 3:30pm: install and run all software
– Fri 2nd Oct 3:30pm: final submission
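Word error rate (WER) counts substitutions, insertions, and deletions against the reference transcript, normalized by reference length; a minimal sketch via word-level edit distance:

```python
def wer(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / len(reference),
    computed with standard Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i-1][j-1] + (ref[i-1] != hyp[j-1])
            dist[i][j] = min(sub, dist[i-1][j] + 1, dist[i][j-1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("good morning everyone", "good mourning"))  # 2/3: 1 sub + 1 del
```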