Speech Processing 15-492/18-492
Multilinguality

Dealing with *all* Languages
- Over 6000 languages
  - Maybe not all commercially interesting … now
- Major languages (economic)
  - Cell phone manufacturers list 46 languages
  - But even those are not all covered

What you need
- ASR
  - Acoustic model (lots of speakers)
  - Pronunciation lexicon
  - Language model
- TTS
  - Acoustic model (one speaker)
  - Pronunciation lexicon
  - Text analysis

Writing Systems
- Romanized writing systems
  - Latin-1 (ISO-8859-1)
  - Covers many Western European languages
- Cyrillic
  - Covers many Eastern European languages
- Arabic scripts
  - Arabic(s), Farsi, Urdu, etc.
- Devanagari
  - Covers many Northern Indian languages
- Chinese Hanzi
  - Covers some Chinese dialects, but in different versions
- Many other scripts, some non-standard

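As a rough illustration of telling these scripts apart in text, the sketch below counts letters per script using the first word of each character's Unicode name. This is a heuristic, not a proper script lookup: Python's standard library does not expose the Unicode Script property, and the function name `script_counts` is made up for this example.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters per script, using the first word of the Unicode
    character name (e.g. LATIN, CYRILLIC, ARABIC, DEVANAGARI, CJK)
    as an approximate script label."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


if __name__ == "__main__":
    # Mixed Cyrillic and Latin input
    print(script_counts("Привет world"))
```

A document whose counts are split across several labels is a candidate for the mixed-script handling discussed on later slides.
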
Writing Systems
- Letter based
  - Latin, Cyrillic
- Consonant based
  - Arabic, Hebrew
- Mora based
  - Half syllable or syllable
  - Indian scripts, Japanese native scripts
- Syllable based
  - Hangul, Chinese

Standards
- Writing standards
  - Taught at schools, newspapers, computer support
  - Typically standardized spelling
- May be mostly spoken
  - Occasionally written

Language Specific Issues
- No explicit markings
  - Stress, accent, tones
- No word boundaries
  - Chinese, Thai
- No (short) vowels
  - Arabic, Hebrew
- Rich morphology
  - Many different words in the language
  - Finnish, Turkish, Greenlandic

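For languages written without word boundaries, one simple baseline is greedy longest-match segmentation against a word list. The sketch below is a minimal version of that idea; the function name `max_match` and the four-word lexicon are assumptions for illustration, and real systems typically use statistical segmenters instead.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry that matches, falling back to one character."""
    if max_len is None:
        max_len = max(map(len, lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    toy_lexicon = {"我", "喜欢", "语音", "处理"}  # hypothetical word list
    print(max_match("我喜欢语音处理", toy_lexicon))
```

Greedy matching can pick a wrong split when a long entry overlaps the start of the correct next word, which is one reason segmentation is listed as a language-specific issue.
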
Genre Specific Issues
- No capitals, punctuation
- Unpunctuated
- Plain vs polite form
- Speech vs text form
- Many foreign phrases
  - (technology-directed genres)
- Many new abbreviations
  - E.g. SMS messages

Character Encoding
- Unicode vs UTF-8 vs Latin
  - Documents mix them
- Sometimes accents are omitted
  - For ease of typing
- Lots of standards
  - Unicode, EUC, BIG5, TIS-620, …
  - Everyone has their own standard
  - Some create their own standards
- Mixed character sets

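The two problems on this slide, documents that mix UTF-8 and Latin-1 and text with accents dropped for ease of typing, can be sketched with the standard library alone. The helper names `decode_mixed` and `strip_accents` are made up for this example, and the UTF-8-then-Latin-1 fallback is an assumption that only fits data known to be in one of those two encodings.

```python
import unicodedata


def decode_mixed(raw: bytes) -> str:
    """Decode bytes that may be UTF-8 or Latin-1.

    Strict UTF-8 is tried first; Latin-1 is the fallback because it
    maps every byte to a character and therefore never fails."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")
    # Normalize so composed and decomposed accents compare equal.
    return unicodedata.normalize("NFC", text)


def strip_accents(text: str) -> str:
    """Remove combining marks so 'café' and 'cafe' can be matched."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for raw in ("café".encode("utf-8"), "café".encode("latin-1")):
        text = decode_mixed(raw)
        print(text, strip_accents(text))
```

Stripping accents is lossy, so it is better suited to building a matching key than to rewriting the stored text.
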
Phoneme Sets
- Hard to find consensus for new languages
  - Typically lots of different dialects
- What level of distinction?
  - Some good for speech but not really phonetic
  - /t/ vs /dx/ in "water"
- Often doesn't include foreign phones
  - /w/ in German is common for younger people

Words
- May be hard to define
  - No word boundaries
- Rich morphology
  - Words have many variations or compounds
  - Yomenakatta -> could not read
  - Yomemasendeshita -> could not read (polite)
- Gender specific speech
  - Boku vs atashi
- Language mixtures

Pronunciation lexicons
- "Proper" speech vs "actual" speech
- Hard to generalize
  - Chinese
- Cross-lingual pronunciations
  - "Human" (English/German)

"Industry" way
- Collect at least 100 hours of speech
  - At least 20 different speakers
  - Mixture of gender, age, etc.
  - Through desired channel (phone/desktop)
- Collect at least 5 hours from one speaker
  - High quality recording studio
  - Data should be targeted to application
- Build pronunciation lexicon
  - Expert phonologist

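The multi-speaker targets on this slide (at least 100 hours, at least 20 speakers) can be checked mechanically against a recording manifest. The sketch below assumes a hypothetical manifest of `(speaker_id, hours)` pairs; the function name and the returned keys are illustrative only.

```python
from collections import defaultdict

MIN_TOTAL_HOURS = 100.0  # from the slide: at least 100 hours of speech
MIN_SPEAKERS = 20        # from the slide: at least 20 different speakers


def check_asr_corpus(recordings):
    """Summarize a manifest of (speaker_id, hours) pairs against the
    minimum corpus size and speaker count."""
    hours_by_speaker = defaultdict(float)
    for speaker, hours in recordings:
        hours_by_speaker[speaker] += hours
    total = sum(hours_by_speaker.values())
    return {
        "total_hours": total,
        "speakers": len(hours_by_speaker),
        "enough_hours": total >= MIN_TOTAL_HOURS,
        "enough_speakers": len(hours_by_speaker) >= MIN_SPEAKERS,
    }


if __name__ == "__main__":
    manifest = [(f"spk{i:02d}", 5.0) for i in range(20)]  # made-up data
    print(check_asr_corpus(manifest))
```

The check says nothing about the gender and age mixture or the recording channel, which still need to be tracked separately.
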
"Industry" way
- Probably 3-6 months
  - Lead developer
  - Local language expert
  - Lots of human transcribers
- Costs?
  - Many hundreds of thousands

Or cheaper (?) …
- Find existing data
  - Linguistic Data Consortium (UPenn)
  - ELRA (European equivalent)
  - Appen, Australia
  - Find local people who have collected data
- Found data might be in the wrong format
  - Data cleaning is often the most expensive part

Actual way
- Often a mixture
  - Found data for initial model
  - Collect data with actual/initial application

Multilingual Systems
- Support lots of different languages
  - Press 1 for Spanish
  - Press 2 for Gujarati …
- Automatically detect language
- Mixed language

Multilingual (Menu)
- Speak in your language
  - Eki-mai no tsugi no bus no ha?
  - When is the next bus to the station
- Need multiple recognizers
  - Run in parallel and take best result
- Or shared acoustic models
  - Recognizing both languages at once (mix)

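The "run in parallel and take best result" option can be sketched as follows. The recognizers here are stand-in functions that return a made-up transcript and score; a real system would wrap actual ASR engines, and scores from different engines usually need normalization before they can be compared, which this sketch does not attempt.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_best(audio, recognizers):
    """Run every recognizer on the same audio in parallel and return
    (language, text, score) for the highest-scoring hypothesis.

    `recognizers` maps a language code to a callable that takes audio
    bytes and returns (text, score); it must not be empty."""
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        futures = {lang: pool.submit(rec, audio)
                   for lang, rec in recognizers.items()}
        results = {lang: fut.result() for lang, fut in futures.items()}
    best_lang = max(results, key=lambda lang: results[lang][1])
    text, score = results[best_lang]
    return best_lang, text, score


# Hypothetical stand-ins for real recognizers (scores are invented).
def fake_english(audio):
    return "when is the next bus", -120.0


def fake_japanese(audio):
    return "eki-mai no tsugi no bus no ha", -95.0


if __name__ == "__main__":
    engines = {"en": fake_english, "ja": fake_japanese}
    print(recognize_best(b"", engines))
```

The shared-acoustic-model alternative avoids the score comparison problem by decoding both languages inside a single recognizer.
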
Multilingual (in line)
- Code switching
  - European, India, bilingual areas
  - Hinglish, Spanglish
- Borrowed words and phrases
  - Dad, time kyu hua hai
  - One lakh
  - Computer walla
  - numbers
- Can be inflected
  - Was updated -> up gedaten

HW2: TTS
- Due 3:30pm Monday October 20th
- Install Festival and Festvox
- Find 10 errors in each of two different synthesizers
- Build a voice
  - A Talking Clock
  - A general voice
  - (or both)