
Speech Processing 15-492/18-492 Multilinguality SPICE: making it easier


  1. Speech Processing 15-492/18-492 Multilinguality SPICE: making it easier

  2. Dealing with *all* Languages
     - Over 6000 languages
       - Maybe not all commercially interesting … now
     - Major languages (economic)
       - Cell phone manufacturers list 46 languages
       - But even those are not all covered

  3. Motivation
     - Computerization: speech is a key technology
       - Mobile devices, ubiquitous information access
     - Globalization: multilinguality
       - More than 6000 languages in the world
       - Multiple official languages
         - Europe has 20+ official languages
         - South Africa has 11 official languages
     - ⇒ Speech processing in multiple languages
       - Cross-cultural human-human interaction
       - Human-machine interface in mother tongue

  4. Challenges
     - Algorithms are language independent but require data
       - Dozens of hours of audio recordings and corresponding transcriptions
       - Pronunciation dictionaries for large vocabularies (>100,000 words)
       - Millions of words of written text corpora in the various domains in question
       - Bilingual aligned text corpora
     - BUT: such data is only available in very few languages
       - Audio data ≤ 40 languages; transcriptions take up to 40x real time
       - Large vocabulary pronunciation dictionaries ≤ 20 languages
       - Small text corpora ≤ 100 languages, large corpora ≤ 30 languages
       - Bilingual corpora in very few language pairs, pivot mostly English
     - Additional complications:
       - Combinatorial explosion (domain, speaking style, accent, dialect, ...)
       - Few native speakers at hand for minority (endangered) languages
       - Languages without writing systems

  5. Solution: Learning Systems
     ⇒ Systems that learn a language from the user
     - Efficient learning algorithms for speech processing
     - Learning:
       - Interactive learning with user in the loop
       - Statistical modeling approaches
     - Efficiency:
       - Reduce amount of data (save time and costs): by a factor of 10
       - Speed up development cycles: days rather than months
     ⇒ Rapid Language Adaptation from universal models
     - Bridge the gap: language and technology experts
       - Technology experts do not speak all languages in question
       - Native users are not in control of the technology

  6. Sharing data between modules
     [Diagram: speech-to-speech translation between L source and L target. One direction runs from Input L s through AM s, Dict s, LM s, Lex st, LM t, Dict t and AM t to the output; the opposite direction starts from Input L t and uses the same AM, Dict and LM modules together with Lex ts. Labels in the diagram: "Word → phone sequence", "N-grams", "Word s ↔ Word t".]

  7. SPICE
     Speech Processing: Interactive Creation and Evaluation toolkit
     - National Science Foundation, Grant 10/2004, 3 years
     - Principal Investigators: Tanja Schultz and Alan Black
     - Bridge the gap between technology experts → language experts
       - Automatic Speech Recognition (ASR)
       - Machine Translation (MT)
       - Text-to-Speech (TTS)
     - Develop web-based intelligent systems
       - Interactive learning with user in the loop
       - Rapid adaptation of universal models to unseen languages
     - SPICE webpage: http://cmuspice.org

  8. Spice Project Page

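  9. Speech Processing Systems
     [Diagram: speech input is processed by AM, Lex and LM, followed by NLP / MT and TTS, producing speech and text output. The resources shown above the components are phone set and speech data, pronunciation rules (examples: hi /h/ /ai/, you /j/ /u/, we /w/ /i/) and text data (examples: "Hello", "hi you", "you are", "I am").]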

  10. Rapid Portability: Data
      [System diagram as on slide 9 (AM, Lex, LM, NLP / MT, TTS), here showing only the "Phone set & Speech data" input.]

  11. Finding "Nice" Prompts
      - From very large text databases
      - Find "nice" sentences:
        - Containing only high frequency words
        - 5-15 words
      - Find grapheme/phoneme balanced set
        - Select sentences with best triphone/graph
      - 500-1000 sentences
      - Collect for ASR and TTS acoustic modeling
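The selection steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the SPICE implementation: the function name `select_prompts`, its parameters, and the greedy coverage strategy are assumptions, and character trigrams stand in for the triphone units because no pronunciation dictionary is assumed to exist yet.

```python
from collections import Counter


def select_prompts(sentences, top_k_words=1000, min_len=5, max_len=15, n_prompts=500):
    """Greedily pick short, high-frequency-word sentences that add the most
    new character trigrams (an assumed stand-in for triphone coverage)."""
    tokenized = [s.lower().split() for s in sentences]

    # Vocabulary of the most frequent words in the whole text collection.
    counts = Counter(w for toks in tokenized for w in toks)
    frequent = {w for w, _ in counts.most_common(top_k_words)}

    # Keep only "nice" sentences: 5-15 words, all of them frequent.
    candidates = [
        " ".join(toks)
        for toks in tokenized
        if min_len <= len(toks) <= max_len and all(w in frequent for w in toks)
    ]
    candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep order

    def trigrams(sentence):
        padded = f"#{sentence}#"
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    units = {c: trigrams(c) for c in candidates}
    covered, selected = set(), []
    while candidates and len(selected) < n_prompts:
        # Pick the sentence that adds the most trigrams not yet covered.
        best = max(candidates, key=lambda c: len(units[c] - covered))
        if not units[best] - covered:
            break  # nothing new left to cover
        selected.append(best)
        covered |= units[best]
        candidates.remove(best)
    return selected
```

The greedy loop stops early once no remaining sentence contributes an unseen unit, so the returned list can be shorter than `n_prompts` on small corpora.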

  12. Prompt Selection Issues
      - Need good text
        - De-htmlify, well-written, no misspelling
      - Need word segmentation
        - Japanese, Chinese, Thai
      - Natural text is often mixed language
        - Hindi newspaper text has lots of English words
      - Automatic selection has errors
        - Need speaker to do further selection
        - E.g. lots of telephone numbers, formatting commands
      - CMU Arctic used similar methods
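A rough sketch of the text clean-up implied by these issues is given below. The helper names `clean_line` and `is_promptable` and the regular expressions are illustrative assumptions, not part of SPICE; they cover only HTML removal and the rejection of lines with telephone-number-like digit runs or leftover formatting characters. Word segmentation and mixed-language detection are not addressed.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# A digit, then at least five digits/separators, then a digit (phone-number-like).
DIGIT_RUN_RE = re.compile(r"\d[\d\s().\-/]{5,}\d")
# Characters that typically survive from markup or formatting commands.
MARKUP_RE = re.compile(r"[{}\\<>|=_]")


def clean_line(raw):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return " ".join(text.split())


def is_promptable(text):
    """Heuristic filter: reject empty lines, number runs, and markup debris."""
    if DIGIT_RUN_RE.search(text):
        return False
    if MARKUP_RE.search(text):
        return False
    return bool(text)
```

Such filters only reduce the amount of manual work; as the slide notes, a speaker still needs to review the automatically selected prompts.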

  13. Recording Prompts

  14. GlobalPhone Multilingual Database
      - Widespread languages
      - Native speakers
      - Uniform data
      - Broad domain
      - Large text resources
        - Internet, newspaper corpus
      - 19 languages … counting
        - Arabic, Ch-Mandarin, Ch-Shanghai, German, French, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Czech, Turkish
        - + Thai, + Creole, + Polish, + Bulgarian, + ... ???
      - ≥ 1800 native speakers
      - ≥ 400 hrs audio data
      - Read speech
      - Filled pauses annotated
      - Now available from ELRA!

  15. Speech Recognition in 17 Languages
      [Bar chart: word error rate [%] (axis 0 to 40) per language. Languages shown: English, Ch-Mandarin, Turkish, German, Thai, French, Portuguese, Croatian, Spanish, Bulgarian, Afrikaans, Chinese, Arabic, Japanese, Korean, Russian, Iraqi. Bar labels in the extracted text range from roughly 10 to 33.5; the pairing of values with languages is not recoverable.]
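For readers unfamiliar with the metric on the chart's axis: word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence; the function name is an assumption and the code is not taken from the slides.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent between two whitespace-tokenized strings."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.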

  16. Rapid Portability: Acoustic Models
      [System diagram as on slide 9 (AM, Lex, LM, NLP / MT, TTS), here showing only the "Phone set & Speech data" input.]

  17. Universal Sound Inventory
      Speech production is independent of language ⇒ IPA
      1) IPA-based universal sound inventory
      2) Each sound class is trained by data sharing
         - Reduction from 485 to 162 sound classes
         - m, n, s, l appear in all 12 languages
         - p, b, t, d, k, g, f and i, u, e, a, o in almost all
      Problem: the context of sounds is language specific. Context dependent models for new languages?
      Solution:
      1) Multilingual decision context trees
      2) Specialize decision tree by adaptation
      [Diagram: decision tree for the phone k with the questions "-1 = plosive?" and "+2 = vowel?", no/yes (N/J) branches, and leaves k (0), k (1), k (2). German example words: Blaukraut, Brautkleid, Brotkorb, Weinkarte, giving the contexts lau-k-ra, ut-k-le, ot-k-or, in-k-ar.]
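The data-sharing idea can be illustrated with a small sketch: phones from different languages that carry the same IPA symbol are pooled into one sound class, which shrinks the inventory. The function names and the toy phone sets in the usage note are invented for illustration; the actual reduction reported on the slide (485 to 162 classes over 12 languages) comes from the real phone sets, which are not reproduced here.

```python
from collections import defaultdict


def build_universal_inventory(phone_sets):
    """Map each IPA symbol to the set of languages whose phone set contains it."""
    sharing = defaultdict(set)
    for language, phones in phone_sets.items():
        for ipa_symbol in phones:
            sharing[ipa_symbol].add(language)
    return dict(sharing)


def inventory_sizes(phone_sets):
    """Return (number of language-dependent phones, number of shared classes)."""
    separate = sum(len(set(phones)) for phones in phone_sets.values())
    shared = len(build_universal_inventory(phone_sets))
    return separate, shared
```

With a toy input such as `{"de": ["m", "n", "k", "a"], "es": ["m", "n", "k", "o"], "tr": ["m", "n", "a", "y"]}`, twelve language-dependent phones collapse into six shared classes, and the class for `m` can be trained on data from all three languages.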

  18. Choosing Phonemes

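  19. Rapid Portability: Acoustic Model
      [Bar chart: word error rate [%] (axis 0 to 100) for the systems Ø Tree, ML-Tree, Po-Tree and PDTS. Bar labels as extracted: 69.1, 57.1, 49.9, 40.6, 32.8, 28.9, 19.6, 19. Horizontal axis labels as extracted: 0, 0:15, 0:15, 0:25, 0:25, 0:25, 1:30, 16:30.]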

  20. Rapid Portability: Pronunciation Dictionary
      Pronunciation rules applied to text data:
      - "adios" → /a/ /d/ /i/ /o/ /s/
      - "Hallo" → /h/ /a/ /l/ /o/
      - "Phydough" → ???
      [System diagram as on slide 9 (AM, Lex, LM, NLP / MT, TTS).]
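The slide's examples suggest rule-based grapheme-to-phoneme conversion that works for regular spellings and fails on irregular ones. A minimal sketch, assuming a hand-written rule table and greedy longest-match lookup (both are illustrative choices, not the method used in SPICE):

```python
def apply_rules(word, rules):
    """Greedy longest-match letter-to-sound conversion.

    Returns a list of phones, or None if some grapheme has no rule."""
    word = word.lower()
    max_len = max((len(grapheme) for grapheme in rules), default=0)
    phones, i = [], 0
    while i < len(word):
        # Try the longest grapheme chunk first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phones.extend(rules[chunk])
                i += size
                break
        else:
            return None  # no rule covers the grapheme at position i
    return phones


# Toy rule table (an assumption) that covers only the slide's two regular words.
TOY_RULES = {
    "a": ["a"], "d": ["d"], "i": ["i"], "o": ["o"], "s": ["s"],
    "h": ["h"], "ll": ["l"], "l": ["l"],
}
```

With this table, `apply_rules("adios", TOY_RULES)` and `apply_rules("Hallo", TOY_RULES)` reproduce the phone sequences on the slide, while `apply_rules("Phydough", TOY_RULES)` returns `None`, mirroring the "???" case where the rules give no answer.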
