Speech Processing 15-492/18-492 Multilinguality SPICE: making it easier

Dealing with *all* Languages
- Over 6000 languages
  - Maybe not all commercially interesting … now
- Major languages (economic)
  - Cell phone manufacturers list 46 languages
  - But even those are not all covered

Motivation
- Computerization: speech is a key technology
  - Mobile devices, ubiquitous information access
- Globalization: multilinguality
  - More than 6000 languages in the world
  - Multiple official languages
  - Europe has 20+ official languages
  - South Africa has 11 official languages
⇒ Speech processing in multiple languages
- Cross-cultural human-human interaction
- Human-machine interface in mother tongue

Challenges
- Algorithms are language independent but require data
  - Dozens of hours of audio recordings and corresponding transcriptions
  - Pronunciation dictionaries for large vocabularies (>100,000 words)
  - Written text corpora of millions of words in the various domains in question
  - Bilingual aligned text corpora
- BUT: such data is only available in very few languages
  - Audio data ≤ 40 languages; transcriptions take up to 40x real time
  - Large vocabulary pronunciation dictionaries ≤ 20 languages
  - Small text corpora ≤ 100 languages, large corpora ≤ 30 languages
  - Bilingual corpora in very few language pairs, pivot mostly English
- Additional complications:
  - Combinatorial explosion (domain, speaking style, accent, dialect, ...)
  - Few native speakers at hand for minority (endangered) languages
  - Languages without writing systems

Solution: Learning Systems
⇒ Systems that learn a language from the user
- Efficient learning algorithms for speech processing
- Learning:
  - Interactive learning with user in the loop
  - Statistical modeling approaches
- Efficiency:
  - Reduce amount of data (save time and costs): by a factor of 10
  - Speed up development cycles: days rather than months
⇒ Rapid language adaptation from universal models
- Bridge the gap: language and technology experts
  - Technology experts do not speak all languages in question
  - Native users are not in control of the technology

Sharing data between modules
[Figure: speech-to-speech translation between L source and L target, shown in both directions. Modules: acoustic models (AM s, AM t), pronunciation dictionaries (Dict s, Dict t; word → phone sequence), language models (LM s, LM t; word N-grams), and translation lexicons (Lex st, Lex ts; Word s ↔ Word t).]

SPICE: Speech Processing: Interactive Creation and Evaluation toolkit
- National Science Foundation, Grant 10/2004, 3 years
- Principal Investigators: Tanja Schultz and Alan Black
- Bridge the gap between technology experts → language experts
  - Automatic Speech Recognition (ASR)
  - Machine Translation (MT)
  - Text-to-Speech (TTS)
- Develop web-based intelligent systems
  - Interactive learning with user in the loop
  - Rapid adaptation of universal models to unseen languages
- SPICE webpage: http://cmuspice.org

Spice Project Page

Speech Processing Systems
[Figure: system overview. Resources: phone set & speech data, pronunciation rules, text data. Components: AM, Lex, LM, MT, NLP, TTS. Input: speech. Output: speech & text. Example lexicon entries: hi /h/ /ai/, you /j/ /u/, we /w/ /i/.]

Rapid Portability: Data
[Figure: system overview (AM, Lex, LM, MT, NLP, TTS; input: speech; output: speech & text) with phone set & speech data as the resource in focus.]

Finding “Nice” Prompts
- From very large text databases
- Find “nice” sentences:
  - Containing only high frequency words
  - 5-15 words
- Find grapheme/phoneme balanced set
  - Select sentences with best triphone/graph
  - 500-1000 sentences
- Collect for ASR and TTS acoustic modeling
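The selection steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the SPICE implementation: the function name `select_prompts`, its parameters, and the use of letter trigrams as a stand-in for triphone/grapheme coverage are all assumptions made for the example.

```python
from collections import Counter


def select_prompts(sentences, top_words=1000, min_len=5, max_len=15, n_prompts=500):
    """Greedy prompt selection (hypothetical helper, not part of SPICE)."""
    tokenized = [s.lower().split() for s in sentences]

    # Word frequencies over the whole text database.
    freq = Counter(w for words in tokenized for w in words)
    frequent = {w for w, _ in freq.most_common(top_words)}

    # Keep sentences of min_len..max_len words made only of frequent words.
    candidates = [
        " ".join(words)
        for words in tokenized
        if min_len <= len(words) <= max_len and all(w in frequent for w in words)
    ]

    def trigrams(text):
        # Letter trigrams: an assumed proxy for triphone/grapheme coverage.
        return {text[i:i + 3] for i in range(len(text) - 2)}

    # The dict also removes duplicate sentences.
    remaining = {c: trigrams(c) for c in candidates}
    selected, covered = [], set()
    while remaining and len(selected) < n_prompts:
        # Pick the sentence that adds the most trigrams not yet covered.
        best = max(remaining, key=lambda c: len(remaining[c] - covered))
        if not remaining[best] - covered:
            break  # no remaining sentence adds new coverage
        selected.append(best)
        covered |= remaining.pop(best)
    return selected
```

A greedy pass like this stops early when no remaining sentence adds coverage, so the result may contain fewer than `n_prompts` sentences on a small corpus.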

Prompt Selection Issues
- Need good text
  - De-htmlify, well-written, no misspellings
- Need word segmentation
  - Japanese, Chinese, Thai
- Natural text is often mixed language
  - Hindi newspaper text has lots of English words
- Automatic selection has errors
  - Need speaker to do further selection
  - E.g. lots of telephone numbers, formatting commands
- CMU Arctic used similar methods

Recording Prompts

GlobalPhone Multilingual Database
- Widespread languages
- Native speakers
- Uniform data
- Broad domain
- Large text resources
  - Internet, newspaper corpus
- 19 languages … counting
- ≥ 1800 native speakers
- ≥ 400 hrs audio data
- Read speech
- Filled pauses annotated
Languages: Arabic, Ch-Mandarin, Ch-Shanghai, German, French, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Czech, Turkish, + Thai, + Creole, + Polish, + Bulgarian, + ... ???
Now available from ELRA!

Speech Recognition in 17 Languages
[Figure: bar chart of word error rate [%] per language, with values up to 33.5%. Languages: English, Ch-Mandarin, Turkish, German, Thai, French, Portuguese, Croatian, Spanish, Bulgarian, Afrikaans, Chinese, Arabic, Japanese, Korean, Russian, Iraqi.]
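For reference, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below assumes whitespace tokenization, which does not hold for languages that need word segmentation; the function name `word_error_rate` is chosen for this example only.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, using whitespace-separated words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.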

Rapid Portability: Acoustic Models
[Figure: system overview (AM, Lex, LM, MT, NLP, TTS; input: speech; output: speech & text) with phone set & speech data as the resource in focus.]

Universal Sound Inventory ⇒ IPA
Speech production is independent of language
1) IPA-based universal sound inventory
2) Each sound class is trained by data sharing
- Reduction from 485 to 162 sound classes
- m, n, s, l appear in all 12 languages
- p, b, t, d, k, g, f and i, u, e, a, o in almost all
Problem: the context of sounds is language specific. Context-dependent models for new languages?
Solution:
1) Multilingual context decision trees
2) Specialize decision tree by adaptation
[Figure: context decision tree for /k/ with the questions "-1=Plosive?" and "+2=Vowel?", splitting into k(0), k(1), k(2). German example words: Blaukraut, Brautkleid, Brotkorb, Weinkarte.]
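The data-sharing idea can be illustrated with a small sketch: phones that carry the same IPA symbol in several languages are merged into one sound class, so the universal inventory is smaller than the sum of the language-specific inventories. The three inventories below are invented toy data with placeholder language names, not the 12 languages or the 485 and 162 classes from the slide, and `share_sound_classes` is a hypothetical helper.

```python
from collections import Counter

# Toy phone inventories keyed by placeholder language names (assumed data).
TOY_INVENTORIES = {
    "lang_a": {"m", "n", "s", "l", "k", "ʃ", "x"},
    "lang_b": {"m", "n", "s", "l", "k", "ɲ"},
    "lang_c": {"m", "n", "s", "k", "ɾ"},
}


def share_sound_classes(inventories):
    """Merge per-language phone sets by IPA symbol.

    Returns (number of language-specific phones, universal inventory,
    phones that occur in every language).
    """
    language_specific = sum(len(phones) for phones in inventories.values())
    counts = Counter(p for phones in inventories.values() for p in phones)
    universal = set(counts)
    in_all = {p for p, c in counts.items() if c == len(inventories)}
    return language_specific, universal, in_all


total, universal, in_all = share_sound_classes(TOY_INVENTORIES)
print(f"{total} language-specific phones -> {len(universal)} shared classes")
print("in all languages:", sorted(in_all))
```

Classes that occur in every language receive training data from all of them, which is what makes them usable as a starting point for an unseen language.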

Choosing Phonemes

Rapid Portability: Acoustic Model
[Figure: bar chart of word error rate [%] for the Ø Tree, ML-Tree, Po-Tree, and PDTS systems. Values shown, left to right: 69.1, 57.1, 49.9, 40.6, 32.8, 28.9, 19.6, 19. X-axis labels: 0, 0:15, 0:25, 1:30, 16:30.]

Rapid Portability: Pronunciation Dictionary
- Pronunciation rules applied to text data:
  - "adios" → /a/ /d/ /i/ /o/ /s/
  - "Hallo" → /h/ /a/ /l/ /o/
  - "Phydough" → ???
[Figure: system overview (AM, Lex, LM, MT, NLP, TTS; input: speech; output: speech & text) with pronunciation rules and text data as the resources in focus.]
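The three examples on this slide can be reproduced with a very small rule-based letter-to-sound sketch. The rule table `TOY_RULES` and the function `letter_to_sound` are assumptions for illustration and cover only the graphemes needed here; a real pronunciation dictionary builder would learn or elicit many more rules.

```python
# Hypothetical grapheme-to-phoneme rules (assumed, deliberately incomplete).
TOY_RULES = {
    "a": ["a"], "d": ["d"], "i": ["i"], "o": ["o"], "s": ["s"],
    "h": ["h"], "l": ["l"],
    "ll": ["l"],  # double consonant maps to a single phone
}


def letter_to_sound(word, rules):
    """Longest-match rule application; returns None if no rule applies."""
    word = word.lower()
    max_len = max(len(g) for g in rules)
    phones, i = [], 0
    while i < len(word):
        # Try the longest grapheme chunk first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phones.extend(rules[chunk])
                i += size
                break
        else:
            return None  # no rule covers this position
    return phones


for w in ["adios", "Hallo", "Phydough"]:
    print(w, letter_to_sound(w, TOY_RULES))
```

Words with regular spelling fall out of the rules directly, while an irregular spelling such as "Phydough" has no applicable rule and returns `None`, which is the case where a native speaker has to be asked.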
