Speech Processing 15-492/18-492 Multilinguality SPICE: making it easier
Dealing with *all* Languages
• Over 6000 languages
  • Maybe not all commercially interesting … now
• Major languages (economic)
  • Cell phone manufacturers list 46 languages
  • But even those not all covered
Motivation
• Computerization: speech is a key technology
  • Mobile devices, ubiquitous information access
• Globalization: multilinguality
  • More than 6000 languages in the world
  • Multiple official languages
    • Europe has 20+ official languages
    • South Africa has 11 official languages
⇒ Speech processing in multiple languages
  • Cross-cultural human-human interaction
  • Human-machine interface in mother tongue
Challenges
• Algorithms are language independent but require data
  • Dozens of hours of audio recordings and corresponding transcriptions
  • Pronunciation dictionaries for large vocabularies (>100,000 words)
  • Millions of words of written text corpora in the various domains in question
  • Bilingual aligned text corpora
• BUT: Such data are available in very few languages only
  • Audio data for ≤ 40 languages; transcriptions take up to 40x real time
  • Large vocabulary pronunciation dictionaries for ≤ 20 languages
  • Small text corpora for ≤ 100 languages, large corpora for ≤ 30 languages
  • Bilingual corpora in very few language pairs, pivot mostly English
• Additional complications:
  • Combinatorial explosion (domain, speaking style, accent, dialect, ...)
  • Few native speakers at hand for minority (endangered) languages
  • Languages without writing systems
Solution: Learning Systems
⇒ Systems that learn a language from the user
• Efficient learning algorithms for speech processing
  • Learning:
    • Interactive learning with user in the loop
    • Statistical modeling approaches
  • Efficiency:
    • Reduce amount of data (save time and costs): by a factor of 10
    • Speed up development cycles: days rather than months
⇒ Rapid language adaptation from universal models
• Bridge the gap: language and technology experts
  • Technology experts do not speak all languages in question
  • Native users are not in control of the technology
Sharing data between modules: Speech-to-Speech Translation
[Figure: component chains for both translation directions between L source and L target. Input L_s: AM_s, Dict_s, LM_s, Lex_st, LM_t, Dict_t, AM_t, output. Input L_t: the same AM, Dict and LM modules with Lex_ts. Component labels: word → phone sequence, word N-grams, Word_s ↔ Word_t.]
SPICE
Speech Processing: Interactive Creation and Evaluation toolkit
• National Science Foundation, Grant 10/2004, 3 years
• Principal Investigators: Tanja Schultz and Alan Black
• Bridge the gap between technology experts → language experts
• Automatic Speech Recognition (ASR)
• Machine Translation (MT)
• Text-to-Speech (TTS)
• Develop web-based intelligent systems
• Interactive learning with user in the loop
• Rapid adaptation of universal models to unseen languages
• SPICE webpage: http://cmuspice.org
Spice Project Page
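Speech Processing Systems
[Figure: system overview. Resources: phone set & speech data, pronunciation rules, text data. Components: AM, Lex, LM, NLP / MT, TTS. Input: speech. Output: speech & text. Example lexicon entries: hi /h/ /ai/, you /j/ /u/, we /w/ /i/. Example text: "Hello", "hi you", "you are", "I am".]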
Rapid Portability: Data
[Figure: system overview (AM, Lex, LM, NLP / MT, TTS; input: speech; output: speech & text) with the phone set & speech data resources highlighted.]
Finding “Nice” Prompts
• From very large text databases
• Find “nice” sentences:
  • Containing only high frequency words
  • 5-15 words
• Find grapheme/phoneme balanced set
  • Select sentences with best triphone/graph
  • 500-1000 sentences
• Collect for ASR and TTS acoustic modeling
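The selection steps above can be sketched as a greedy search. This is only an illustration under stated assumptions: letter trigrams stand in for triphones (no pronunciation dictionary is assumed), and the function names and the parameters `top_words`, `min_len`, `max_len` and `n_prompts` are hypothetical, not taken from SPICE.

```python
from collections import Counter


def trigrams(text):
    """Set of letter trigrams in a string (an assumed stand-in for triphones)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def select_prompts(sentences, top_words=1000, min_len=5, max_len=15, n_prompts=500):
    """Greedily pick sentences that add the most not-yet-covered trigrams."""
    tokenized = [s.lower().split() for s in sentences]

    # High-frequency vocabulary estimated from the whole text database.
    freq = Counter(word for tokens in tokenized for word in tokens)
    common = {word for word, _ in freq.most_common(top_words)}

    # "Nice" sentences: 5-15 words, high-frequency words only.
    candidates = [
        (sentence, trigrams(" ".join(tokens)))
        for sentence, tokens in zip(sentences, tokenized)
        if min_len <= len(tokens) <= max_len and all(w in common for w in tokens)
    ]

    selected, covered = [], set()
    while candidates and len(selected) < n_prompts:
        best = max(candidates, key=lambda cand: len(cand[1] - covered))
        if not best[1] - covered:
            break  # no remaining sentence adds new coverage
        selected.append(best[0])
        covered |= best[1]
        candidates.remove(best)
    return selected
```

With a real phoneme-level dictionary, `trigrams` could be swapped for a triphone extractor without changing the greedy loop.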
Prompt Selection Issues
• Need good text
  • De-htmlify, well-written, no misspelling
• Need word segmentation
  • Japanese, Chinese, Thai
• Natural text is often mixed language
  • Hindi newspaper text has lots of English words
• Automatic selection has errors
  • Need speaker to do further selection
  • E.g. lots of telephone numbers, formatting commands
• CMU Arctic used similar methods
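Some of these issues can be caught by simple filters before a speaker reviews the prompts. The sketch below is a rough illustration, not the SPICE implementation: `clean_line` and `is_prompt_candidate` are hypothetical helpers, and the script check is an assumed heuristic for spotting mixed-language text.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
DIGIT_RE = re.compile(r"\d")
PUNCTUATION = ".,;:!?\"'()"


def clean_line(raw):
    """De-htmlify: drop tags, decode entities, collapse whitespace."""
    text = html.unescape(TAG_RE.sub(" ", raw))
    return " ".join(text.split())


def is_prompt_candidate(sentence, script_pattern):
    """Reject sentences with digits or with words outside the expected script.

    script_pattern is a regex that a whole word must match, for example
    r"[\u0900-\u097F]+" for Devanagari (an assumed choice for Hindi text).
    """
    if DIGIT_RE.search(sentence):
        return False  # e.g. telephone numbers
    word_re = re.compile(script_pattern)
    tokens = sentence.split()
    return bool(tokens) and all(
        word_re.fullmatch(token.strip(PUNCTUATION)) for token in tokens
    )
```

Languages without whitespace word boundaries (Japanese, Chinese, Thai) would need a real word segmenter before a token-level check like this one applies.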
Recording Prompts
GlobalPhone Multilingual Database
• Widespread languages
• Native speakers
• Uniform data
• Broad domain
• Large text resources
  • Internet, newspaper corpus
• 19 languages … counting
• ≥ 1800 native speakers
• ≥ 400 hrs audio data
• Read speech
• Filled pauses annotated
Languages: Arabic, Ch-Mandarin, Ch-Shanghai, German, French, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Czech, Turkish, + Thai, + Creole, + Polish, + Bulgarian, + ... ???
Now available from ELRA!!
Speech Recognition in 17 Languages
[Bar chart: word error rate [%] (axis 0 to 40) for English, Ch-Mandarin, Turkish, German, Thai, French, Portuguese, Croatian, Spanish, Bulgarian, Afrikaans, Chinese, Arabic, Japanese, Korean, Russian, Iraqi. Bar values shown: 33.5, 29, 29, 23.4, 21.7, 20, 20, 20, 19, 18, 16.9, 14.5, 14.5, 14, 14, 11.8, 10.]
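For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between the recognizer output and the reference transcript, divided by the number of reference words. A minimal sketch follows; the function name is hypothetical, and whitespace tokenization is an assumption that does not hold for unsegmented scripts.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.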
Rapid Portability: Acoustic Models
[Figure: system overview (AM, Lex, LM, NLP / MT, TTS; input: speech; output: speech & text) with phone set & speech data highlighted.]
Universal Sound Inventory
Speech production is independent of language ⇒ IPA
1) IPA-based universal sound inventory
2) Each sound class is trained by data sharing
• Reduction from 485 to 162 sound classes
• m, n, s, l appear in all 12 languages
• p, b, t, d, k, g, f and i, u, e, a, o in almost all
Problem: The context of sounds is language specific. Context-dependent models for new languages?
Solution:
1) Multilingual decision context trees
2) Specialize decision tree by adaptation
[Figure: decision tree over contexts of k, with the questions "-1=Plosiv?" (preceding sound a plosive?) and "+2=Vokal?" (second following sound a vowel?), N/J (no/yes) branches, and leaves k(0), k(1), k(2). German example words: Blaukraut, Brautkleid, Brotkorb, Weinkarte, with contexts "lau k ra", "ut k le", "ot k or", "in k ar".]
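The data-sharing idea can be illustrated by pooling per-language phone sets under their IPA symbols. This is a toy sketch: the three phone sets are invented and deliberately incomplete, the language names are placeholders, and the function names are hypothetical.

```python
from collections import defaultdict


def build_universal_inventory(phone_sets):
    """Map each IPA symbol to the set of languages that use it."""
    inventory = defaultdict(set)
    for language, phones in phone_sets.items():
        for phone in phones:
            inventory[phone].add(language)
    return dict(inventory)


def inventory_stats(phone_sets):
    """Sizes before and after sharing, plus the phones common to all languages."""
    inventory = build_universal_inventory(phone_sets)
    language_dependent = sum(len(phones) for phones in phone_sets.values())
    in_all = {p for p, langs in inventory.items() if len(langs) == len(phone_sets)}
    return language_dependent, len(inventory), in_all


# Invented toy phone sets (not real inventories).
toy_phone_sets = {
    "lang_a": {"m", "n", "s", "l", "k", "ʃ"},
    "lang_b": {"m", "n", "s", "l", "k", "ɲ"},
    "lang_c": {"m", "n", "s", "l", "ɯ"},
}

separate, shared, common = inventory_stats(toy_phone_sets)
print(f"{separate} language-dependent classes -> {shared} shared classes")
print("in all languages:", sorted(common))
```

Each shared class would then be trained on the pooled audio of every language that contains it, which is the sense in which the slide's count drops from 485 to 162.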
Choosing Phonemes
Rapid Portability: Acoustic Model
[Bar chart: word error rate [%] (axis 0 to 100) for Ø Tree, ML-Tree, Po-Tree and PDTS. Bar values shown: 69.1, 57.1, 49.9, 40.6, 32.8, 28.9, 19.6, 19. X-axis labels: 0, 0:15, 0:15, 0:25, 0:25, 0:25, 1:30, 16:30.]
Rapid Portability: Pronunciation Dictionary
• „adios“ → /a/ /d/ /i/ /o/ /s/
• „Hallo“ → /h/ /a/ /l/ /o/
• „Phydough“ → ???
[Figure: system overview (AM, Lex, LM, NLP / MT, TTS; input: speech; output: speech & text) with pronunciation rules and text data highlighted.]
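The three examples can be reproduced with a naive letter-to-sound table. This sketch is an assumption-laden illustration, not the SPICE rule learner: `letters_to_phones` and the `TOY_RULES` table are hypothetical, and collapsing doubled letters is an assumed rule added only to match the "Hallo" example.

```python
def letters_to_phones(word, rules):
    """Map a word to phones with one-letter rules; None if a letter has no rule."""
    phones = []
    previous = None
    for letter in word.lower():
        if letter == previous:
            continue  # assumed rule: a doubled letter yields a single phone
        if letter not in rules:
            return None  # no rule available: a human has to supply the entry
        phones.append(rules[letter])
        previous = letter
    return phones


# Hypothetical toy rule table covering only the letters of the first two words.
TOY_RULES = {"a": "a", "d": "d", "i": "i", "o": "o", "s": "s", "h": "h", "l": "l"}

for word in ("adios", "Hallo", "Phydough"):
    print(word, letters_to_phones(word, TOY_RULES))
```

Words such as "Phydough" are where one-letter rules break down, since several letters jointly map to one sound; that is the case that calls for learned rules or a user in the loop.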