Speech Processing: Using Speech with Computers - PowerPoint PPT Presentation


  1. Speech Processing: Using Speech with Computers

  2. Overview
     - Speech vs Text
       - Same but different
     - Core Speech Technologies
       - Speech Recognition
       - Speech Synthesis
       - Dialog Systems

  3. Pronunciation Lexicon
     - List of words and their pronunciation
       - (“pencil” n (p eh1 n s ih l))
       - (“table” n (t ey1 b ax l))
     - Need the right phoneme set
     - Need other information
       - Part of speech
       - Lexical stress
       - Other information (tone, lexical accent, …)
       - Syllable boundaries

  4. Homograph Representation
     - Must distinguish different pronunciations
       - (“project” n (p r aa1 jh eh k t))
       - (“project” v (p r ax jh eh1 k t))
       - (“bass” n_music (b ey1 s))
       - (“bass” n_fish (b ae1 s))
     - ASR: multiple pronunciations
       - (“route” n (r uw t))
       - (“route(2)” n (r aw t))
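The lexicon entries on the two slides above can be sketched as a small lookup table. This is only an illustration: the dictionary layout and the `lookup` helper are assumptions, not the format used by any particular synthesizer. Only the entries themselves come from the slides.

```python
# Hypothetical in-memory lexicon keyed by (word, part-of-speech tag).
# The entries are copied from the slides; the dict layout is an assumption.
LEXICON = {
    ("pencil", "n"): ["p", "eh1", "n", "s", "ih", "l"],
    ("table", "n"): ["t", "ey1", "b", "ax", "l"],
    ("project", "n"): ["p", "r", "aa1", "jh", "eh", "k", "t"],
    ("project", "v"): ["p", "r", "ax", "jh", "eh1", "k", "t"],
    ("bass", "n_music"): ["b", "ey1", "s"],
    ("bass", "n_fish"): ["b", "ae1", "s"],
}


def lookup(word, pos=None):
    """Return all pronunciations of word, filtered by pos when one is given."""
    return [
        phones
        for (entry_word, entry_pos), phones in LEXICON.items()
        if entry_word == word and (pos is None or entry_pos == pos)
    ]
```

A synthesizer would normally call `lookup("project", "v")` with a tag from a part-of-speech tagger, while a recognizer can call `lookup("bass")` and accept every returned variant.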

  5. Pronunciation of Unknown Words
     - How do you pronounce new words?
     - 4% of tokens (in news) are new
     - You can’t synthesize them without pronunciations
     - You can’t recognize them without pronunciations
     - Letter-to-Sound (LTS) rules
     - Grapheme-to-Phoneme rules

  6. LTS: Hand written
     - Hand written rules
       - [LeftContext] X [RightContext] -> Y
     - e.g. Pronunciation of letter “c”
       - c [h r] -> k
       - c [h] -> ch
       - c [i] -> s
       - c -> k
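A minimal sketch of how such an ordered rule list might be applied, assuming the first matching rule wins and that contexts are plain letter strings. The tuple layout and the `apply_rules` helper are hypothetical, and the example words in the comments are illustrations rather than part of the slides.

```python
# Ordered rules for the letter "c": (left_context, right_context, phone).
# Rule contents follow the slide; the tuple encoding is an assumption.
C_RULES = [
    ("", "hr", "k"),   # c [h r] -> k    e.g. "chrome"
    ("", "h", "ch"),   # c [h]   -> ch   e.g. "chip"
    ("", "i", "s"),    # c [i]   -> s    e.g. "city"
    ("", "", "k"),     # c       -> k    default case
]


def apply_rules(word, index, rules):
    """Return the phone of the first rule whose contexts match at word[index]."""
    left, right = word[:index], word[index + 1:]
    for left_context, right_context, phone in rules:
        if left.endswith(left_context) and right.startswith(right_context):
            return phone
    return None  # no rule matched
```

Rule order matters here: the more specific `c [h r]` rule has to come before `c [h]`, otherwise it would never fire.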

  7. LTS: Machine Learning Techniques
     - Need an existing lexicon
       - Pronunciations: words and phones
       - But different number of letters and phones
     - Need an alignment
       - Between letters and phones
       - checked -> ch eh k t

  8. LTS: alignment
     - checked -> ch eh k t
         c   h   e   c   k   e   d
         ch  _   eh  k   _   _   t
     - Some letters go to nothing
     - Some letters go to two phones
       - box -> b aa k-s
       - table -> t ey b ax-l _

  9. Find alignment automatically
     - Epsilon scattering
       - Find all possible alignments
       - Estimate p(L,P) on each alignment
       - Find most probable alignment
     - Hand seed
       - Hand specify allowable pairs
       - Estimate p(L,P) on each possible alignment
       - Find most probable alignment
     - Statistical Machine Translation (IBM model 1)
       - Estimate p(L,P) on each possible alignment
       - Find most probable alignment
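The epsilon-scattering steps above can be sketched roughly as follows. This is a simplified, single-pass illustration under stated assumptions: each letter maps to exactly one phone or to epsilon (no dual phones such as k-s), every word has at least as many letters as phones, and p(L,P) is estimated once from raw co-occurrence counts rather than re-estimated iteratively. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import prod

EPS = "_"  # epsilon marker, written "_" as on the slides


def all_alignments(letters, phones):
    """Yield every way to pad phones with epsilons to the length of letters."""
    for slots in combinations(range(len(letters)), len(phones)):
        padded = [EPS] * len(letters)
        for position, phone in zip(slots, phones):
            padded[position] = phone
        yield padded


def estimate(lexicon):
    """Estimate p(letter, phone) from counts over all possible alignments."""
    counts = Counter()
    for letters, phones in lexicon:
        for padded in all_alignments(letters, phones):
            counts.update(zip(letters, padded))
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


def best_alignment(letters, phones, prob):
    """Pick the alignment with the highest product of pair probabilities."""
    return max(
        all_alignments(letters, phones),
        key=lambda padded: prod(prob.get(pair, 0.0) for pair in zip(letters, padded)),
    )
```

With a realistic lexicon, letter-phone pairs that recur across many words accumulate the most probability mass, which is what pulls the best alignment toward sensible pairings. With only a handful of words the result is largely arbitrary.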

  10. Not everything aligns
     - 0, 1, and 2 letter cases
       - e -> epsilon “moved”
       - x -> k-s, g-z “box” “example”
       - e -> y-uw “askew”
     - Some alignments aren’t sensible
       - dept -> d ih p aa r t m ax n t
       - cmu -> s iy eh m y uw

  11. Training LTS models
     - Use CART trees
       - One model for each letter
     - Predict phone (epsilon, phone, dual phone)
       - From letter 3-context (and POS)
     - # # # c h e c -> ch
     - # # c h e c k -> _
     - # c h e c k e -> eh
     - c h e c k e d -> k
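The training examples listed on this slide can be generated from an aligned entry with a sliding window. The sketch below assumes a window of three letters on each side, padded with "#", and an alignment that already has one phone (or "_") per letter. The function name and return format are hypothetical, and the tree training itself is not shown.

```python
def context_windows(letters, aligned_phones, width=3, pad="#"):
    """Build one (letter window, phone) training example per letter.

    aligned_phones must contain exactly one entry per letter, using "_"
    for letters that map to nothing.
    """
    padded = [pad] * width + list(letters) + [pad] * width
    examples = []
    for index, phone in enumerate(aligned_phones):
        # Window covers `width` letters before and after the current letter.
        window = padded[index:index + 2 * width + 1]
        examples.append((window, phone))
    return examples


examples = context_windows("checked", ["ch", "_", "eh", "k", "_", "_", "t"])
```

Grouping the examples by their center letter (`window[3]` with the default width) gives the per-letter training sets that the slide's "one model for each letter" refers to.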

  12. LTS results
     - Split lexicon into train/test 90%/10%
       - i.e. every tenth entry is extracted for testing

     Lexicon     Letter Acc   Word Acc
     OALD        95.80%       75.56%
     CMUDICT     91.99%       57.80%
     BRULEX      99.00%       93.03%
     DE-CELEX    98.79%       89.38%
     Thai        95.60%       68.76%
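A small sketch of the evaluation setup described above, under two assumptions: "every tenth entry" is taken to mean entries 10, 20, 30, ... of the lexicon in its stored order, and predictions are compared letter by letter against the aligned reference phones. The function names are hypothetical.

```python
def split_every_tenth(entries):
    """Hold out every tenth entry for testing and keep the rest for training."""
    test = entries[9::10]
    train = [entry for index, entry in enumerate(entries) if index % 10 != 9]
    return train, test


def accuracies(predicted, reference):
    """Return (letter accuracy, word accuracy) over aligned phone sequences."""
    letters = correct_letters = correct_words = 0
    for pred, ref in zip(predicted, reference):
        hits = sum(p == r for p, r in zip(pred, ref))
        letters += len(ref)
        correct_letters += hits
        # A word counts as correct only if every letter's phone is correct.
        correct_words += int(hits == len(ref) and len(pred) == len(ref))
    return correct_letters / letters, correct_words / len(reference)
```

Because one wrong letter makes the whole word wrong, word accuracy is always at or below letter accuracy, which fits the gap between the two columns in the table.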

  13. Example Tree [figure]

  14. But we need more than phones
     - What about lexical stress?
       - p r aa1 j eh k t -> p r aa j eh1 k t
     - Two possibilities
       - A separate prediction model
       - Joint model: introduce eh/eh1 (BETTER)

                 LTP+S     LTPS
     L no S      96.36%    96.27%
     Letter      ---       95.80%
     W no S      76.92%    74.69%
     Word        63.68%    74.56%

  15. Does it really work?
     - 40K words from Time Magazine
     - 1775 (4.6%) not in OALD
     - LTS gets 70% correct (test set was 74%)

                    Occurs   %
     Names          1360     76.6
     Unknown        351      19.8
     US Spelling    57       3.2
     Typos          7        0.4

  16. Spoken Dialog Systems
     - Information giving
       - Flights, buses, stocks, weather
       - Driving directions
       - News
     - Information navigators
       - Read your mail
       - Search the web
       - Answer questions
     - Provide personalities
       - Game characters (NPC), toys, robots, chatbots
     - Speech-to-speech translation
       - Cross-lingual interaction

  17. Dialog Types
     - System initiative
       - Form-filling paradigm
       - Can switch language models at each turn
       - Can “know” what is likely to be said
     - Mixed initiative
       - Users can go where they like
       - System or user can lead the discussion
     - Classifying:
       - Users can say what they like
       - But really only “N” operations possible
       - E.g. AT&T “How may I help you?”
     - Non-task oriented
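The system-initiative, form-filling paradigm can be sketched as a loop that asks for one empty slot at a time. Everything specific here is hypothetical: the slot names, the prompts, and the `run_dialog` helper are illustrations for a bus-information style task, and recognition is replaced by a list of ready-made answers.

```python
# Hypothetical form for a bus-information task; slot names and prompts
# are illustrative only.
SLOTS = ["departure", "destination", "time"]
PROMPTS = {
    "departure": "Where are you leaving from?",
    "destination": "Where are you going?",
    "time": "When do you want to travel?",
}


def next_prompt(form):
    """Return (slot, prompt) for the first unfilled slot, or (None, closing)."""
    for slot in SLOTS:
        if slot not in form:
            return slot, PROMPTS[slot]
    return None, "Looking up your bus."


def run_dialog(answers):
    """Drive the dialog with pre-recognized answers and return form and prompts."""
    form, transcript = {}, []
    remaining = iter(answers)
    while True:
        slot, prompt = next_prompt(form)
        transcript.append(prompt)
        if slot is None:
            return form, transcript
        form[slot] = next(remaining)  # the system decides which slot this fills
```

Since the system always knows which slot it just asked about, a real system could load a slot-specific language model before each turn, which is the point the slide makes about switching language models.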

  18. System Initiative
     - Let’s Go Bus Information
       - 412 268 3526
       - Provides bus information for Pittsburgh
     - Tell Me
       - Company getting others to build systems
       - Stocks, weather, entertainment
       - 1 800 555 8355

  19. SDS Architecture [figure]
     - Recognition -> Interpretation -> Dialog Manager -> Generation -> Synthesis

  20. SDS Components
     - Interpretation
       - Parsing and Information Extraction
       - (Ignore politeness and find the departure stop)
     - Generation
       - From SQL table output from DB
       - Generate “nice” text to say
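A rough sketch of the two components on this slide, under heavy simplifying assumptions: interpretation is a pair of regular expressions over the recognized text, and generation fills a fixed template from database rows represented as dictionaries. The politeness list, the frame keys, and the row fields are all hypothetical, and a real system would use a proper parser.

```python
import re

# Hypothetical list of politeness phrases to ignore during interpretation.
POLITE = re.compile(r"\b(please|thanks|thank you|hello|hi|could you|i would like)\b")


def interpret(utterance):
    """Extract departure and destination stops, ignoring politeness."""
    text = POLITE.sub(" ", utterance.lower())
    text = re.sub(r"[^a-z ]", " ", text)   # drop punctuation and digits
    text = " ".join(text.split())          # normalize whitespace
    frame = {}
    match = re.search(r"\bfrom (.+?)(?= to |$)", text)
    if match:
        frame["departure"] = match.group(1)
    match = re.search(r"\bto (.+?)(?= from |$)", text)
    if match:
        frame["destination"] = match.group(1)
    return frame


def generate(frame, rows):
    """Turn database rows (dicts with 'route' and 'time') into a sentence."""
    if not rows:
        return f"Sorry, I found no buses from {frame['departure']}."
    first = rows[0]
    return f"The next {first['route']} leaves {frame['departure']} at {first['time']}."
```

The dialog manager would sit between these two functions, deciding whether the frame is complete enough to query the database or whether another question is needed.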

  21. Siri-like Assistants
     - Advantages
       - Hard to type/select things on phone
       - Can use context (location, contacts, calendar)
     - Target common tasks
       - Calling, sending messages, calendar
       - Fall back on Google lookup
