Speech Processing 15-492/18-495 Multilinguality
Dealing with *all* Languages
  Over 6000 languages
    Maybe not all commercially interesting … now
  Major languages (economic)
    Cell phone manufacturers list 46 languages
    But even those are not all covered
What you need
  ASR
    Acoustic model (lots of speakers)
    Pronunciation lexicon
    Language model
  TTS
    Acoustic model (one speaker)
    Pronunciation lexicon
    Text analysis
Writing Systems
  Romanized writing systems
    Latin-1 (ISO-8859-1)
    Covers many Western European languages
  Cyrillic
    Covers many Eastern European languages
  Arabic scripts
    Arabic(s), Farsi, Urdu, etc.
  Devanagari
    Covers many Northern Indian languages
  Chinese Hanzi
    Covers some Chinese dialects, but in different versions
  Many other scripts, some non-standard
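A first practical step with text in an unknown language is to find out which of the scripts above it uses. The following is a minimal sketch, assuming that the first word of each character's Unicode name (for example `LATIN`, `CYRILLIC`, `ARABIC`, `DEVANAGARI`, `CJK`) is a good enough script label. The function name `dominant_script` is hypothetical and not part of the lecture.

```python
import unicodedata
from collections import Counter

def dominant_script(text: str) -> str:
    """Return the most common script label among the letters of `text`.

    The label is the first word of the Unicode character name, which is
    a rough heuristic rather than the official Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]

print(dominant_script("Hello world"))
print(dominant_script("Привет мир"))
```

Mixed-script documents would need per-word or per-character labels instead of a single majority vote.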
Writing Systems
  Letter based
    Latin, Cyrillic
  Consonant based
    Arabic, Hebrew
  Mora based
    Half syllable or syllable
    Indian scripts, Japanese native scripts
  Syllable based
    Hangul, Chinese
Standards
  Writing standards
    Taught at schools, newspapers, computer support
    Typically standardized spelling
  May be mostly spoken
    Occasionally written
Language Specific Issues
  No explicit markings
    Stress, accent, tones
  No word boundaries
    Chinese, Thai
  No (short) vowels
    Arabic, Hebrew
  Rich morphology
    Many different words in the languages
    Finnish, Turkish, Greenlandic
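For languages written without word boundaries, text has to be segmented before a lexicon or language model can be applied. One simple baseline is greedy longest-match ("maximum matching") against a word list. The sketch below assumes a tiny illustrative lexicon; the function name `max_match` and the lexicon contents are hypothetical, and real systems use far larger lexicons or statistical segmenters.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation by longest lexicon match.

    At each position, try the longest candidate first; fall back to a
    single character when nothing in the lexicon matches.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Illustrative lexicon only.
lexicon = {"我", "喜欢", "北京"}
print(max_match("我喜欢北京", lexicon))
```

Greedy matching can pick the wrong split when a long lexicon entry overlaps the intended words, which is one reason segmentation is listed as a language specific issue.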
Genre Specific Issues
  No capitals or punctuation
    Unpunctuated
  Plain vs polite form
  Speech vs text form
  Many foreign phrases
    (technology directed genres)
  Many new abbreviations
    E.g. SMS messages
Character Encoding
  Unicode vs UTF-8 vs Latin
    Documents mix them
  Sometimes accents are omitted
    For ease of typing
  Lots of standards
    Unicode, EUC, BIG5, TIS-620, …
  Everyone has their own standard
    Some create their own standards
  Mixed character sets
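Two of the problems above, mixed encodings and omitted accents, can be handled in a rough way with the Python standard library. This is a sketch under stated assumptions: the candidate encoding list is an illustrative guess, and the helper names `decode_with_fallback` and `strip_accents` are hypothetical.

```python
import unicodedata

def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each candidate encoding in order; return the first clean decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Reached only if every candidate failed (latin-1 itself never fails).
    return raw.decode("utf-8", errors="replace")

def strip_accents(text: str) -> str:
    """Drop combining marks so accented and unaccented spellings match."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(decode_with_fallback("café".encode("latin-1")))
print(strip_accents("café") == strip_accents("cafe"))
```

Accent stripping is only suitable for matching and lookup; it discards information that a TTS or ASR lexicon may need.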
Phoneme Sets
  Hard to find consensus for new languages
    Typically lots of different dialects
  What level of distinction?
    Some good for speech but not really phonetic
    /t/ vs /dx/ in “water”
  Often doesn’t include foreign phones
    /w/ in German is common for younger people
Words
  May be hard to define
    No word boundaries
  Rich morphology
    Words have many variations of compounds
    Yomenakatta -> could not read
    Yomemasendeshita -> could not read (polite)
  Gender specific speech
    Boku vs atashi
  Language mixtures
Pronunciation lexicons
  “proper” speech vs “actual” speech
  Hard to generalize
    Chinese
  Cross lingual pronunciations
    “Human” (English/German)
“Industry” way
  Collect at least 300 hours of spoken speech
    At least 20 different speakers
    Mixture of gender, age, etc.
    Through desired channel (phone/desktop)
  Collect at least 5 hours from one speaker
    High quality recording studio
    Data should be targeted to application
  Build pronunciation lexicon
    Expert phonologist
Industry way
  Probably 3-6 months
    Lead developer
    Local language expert
    Lots of human transcribers
  Costs?
    Many hundreds of thousands
Or cheaper (?) …
  Find existing data
    Linguistic Data Consortium (UPenn)
    ELRA (European equivalent)
    Appen, Australia
  Find local people who have collected data
  Found data might be in the wrong format
    Data cleaning is often the most expensive part
Standardized Datasets
  Global Phone
    – 20+ languages, for ASR/TTS
  LDC/DARPA/IARPA sets
    – Mostly English, Arabic and Chinese
  BABEL dataset
    – 35 low resource languages (telephone conversations)
  Librivox
    – Audio books
  Voxforge
    – Open source collected languages
  Mozilla
    – Open source multilingual sets
CMU Wilderness Dataset
  500+ Languages
    – 20 hours aligned for each language
    – Single speaker
    – Mined from read audio books (Bible)
    – 20+ languages, for ASR/TTS
Actual way
  Often a mixture
    Found data for initial model
    Collect data with actual/initial application
Multilingual Systems
  Support lots of different languages
    Press 1 for Spanish
    Press 2 for Gujarati …
  Automatically detect language
  Mixed language
Multilingual (Menu)
  Speak in your language
    Eki-mai no tsugi no bus no ha?
    When is the next bus to the station?
  Need multiple recognizers
    Run in parallel and take best result
  Or shared acoustic models
    Recognizing both languages at once (mix)
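The "run in parallel and take best result" option can be sketched as follows. This is an illustration only: each recognizer is assumed to be a callable that returns a (text, score) pair, the two stub recognizers are hypothetical stand-ins for real per-language ASR systems, and the scores are assumed to be comparable across recognizers, which in practice requires calibration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

# Hypothetical interface: audio bytes in, (hypothesis text, score) out.
Recognizer = Callable[[bytes], Tuple[str, float]]

def recognize_best(audio: bytes,
                   recognizers: Dict[str, Recognizer]) -> Tuple[str, str, float]:
    """Run every recognizer on the same audio; keep the top-scoring result."""
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        futures = {lang: pool.submit(rec, audio)
                   for lang, rec in recognizers.items()}
        results = {lang: fut.result() for lang, fut in futures.items()}
    best_lang = max(results, key=lambda lang: results[lang][1])
    text, score = results[best_lang]
    return best_lang, text, score

# Stub recognizers with fixed outputs, for illustration only.
def fake_japanese(audio: bytes) -> Tuple[str, float]:
    return "<japanese hypothesis>", 0.9

def fake_english(audio: bytes) -> Tuple[str, float]:
    return "<english hypothesis>", 0.4

print(recognize_best(b"\x00\x01", {"ja": fake_japanese, "en": fake_english}))
```

The cost grows with the number of languages, which is one motivation for the shared acoustic model alternative on the slide.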
Multilingual (in line)
  Code switching
    European, India, bilingual areas
    Hinglish, Spanglish
  Borrowed words and phrases
    Dad, time kyu hua hai
    One lakh
    Computer walla
    numbers
  Can be inflected
    Was updated -> up gedaten
Speech Processing 11-492/18-492 Multilinguality SPICE: making it easier