Discovering the Vocabulary of a Language through Cross-Lingual Alignment
Speaker: Felix Stahlberg
Supervisors: Prof. Dr. rer. nat. Stephan Vogel, Prof. Dr.-Ing. Tanja Schultz, Dipl.-Inf. Tim Schlippe
Cognitive Systems Lab, Karlsruhe Institute of Technology
KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
SCENARIO 2 05.09.2011 Discovering the Vocabulary of a Language through Cross-Lingual Alignment
Scenario
• Language barriers in international relief operations
• Languages and dialects without written form
• Rapid language adaptation in crisis areas
• Dialects
• How to get training material for MT and ASR systems in such situations?
Scenario
Objective: Discover vocabulary using only spoken translations of a well-studied source language
Prompt 1: Say “I am sick.” in your mother tongue.
  /b/ /o/ /l/ /e/ /s/ /t/ /a/ /n/ /s/ /a/ /m/
Prompt 2: Say “I am healthy.” in your mother tongue.
  /z/ /d/ /r/ /a/ /v/ /s/ /a/ /m/
• /b/ /o/ /l/ /e/ /s/ /t/ /a/ /n/ seems to be a word (meaning sick)
• /z/ /d/ /r/ /a/ /v/ seems to be a word (meaning healthy)
• /s/ /a/ /m/ seems to be a word (meaning I am)
Basic Idea
• English sentence (source language): I am sick
• Croatian phoneme sequence (target language): /b/ /o/ /l/ /e/ /s/ /t/ /a/ /n/ /s/ /a/ /m/
• Croatian audio (target language): [audio waveform figure]
• How to find such an alignment?
How to find an alignment?
• Adapt methods of Statistical Machine Translation (SMT)
• Word alignments identify word pairs in bitexts that are translations of one another
• Expectation-Maximization approach, alternating with the translation model (TM):
  • Choose the best alignment in terms of the TM
  • Adjust the TM parameters regarding the alignment
• GIZA++: freely available implementation of the IBM model hierarchy and the HMM model
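The Expectation-Maximization loop can be illustrated with IBM Model 1, the simplest model of the IBM hierarchy. This is a minimal sketch and not GIZA++ itself: the function names `train_ibm1` and `best_alignment` are hypothetical, there is no NULL word, and the two training pairs are taken from the scenario slide.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """EM training of IBM Model 1 probabilities t(phoneme | word)."""
    words = {w for ws, _ in pairs for w in ws}
    phones = {p for _, ps in pairs for p in ps}
    # Uniform initialisation of the translation model (TM).
    t = {(p, w): 1.0 / len(phones) for w in words for p in phones}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (phoneme, word) links
        total = defaultdict(float)  # expected number of links per word
        # E-step: spread each phoneme over the words of its sentence,
        # proportionally to the current TM.
        for ws, ps in pairs:
            for p in ps:
                norm = sum(t[(p, w)] for w in ws)
                for w in ws:
                    frac = t[(p, w)] / norm
                    count[(p, w)] += frac
                    total[w] += frac
        # M-step: re-estimate the TM from the expected counts.
        t = {(p, w): count[(p, w)] / total[w] for (p, w) in t}
    return t

def best_alignment(ws, ps, t):
    """For each phoneme, choose the word with the highest t(phoneme | word)."""
    return [max(ws, key=lambda w: t[(p, w)]) for p in ps]

# Training pairs from the scenario slide (English words, Croatian phonemes).
pairs = [
    ("I am sick".split(), "b o l e s t a n s a m".split()),
    ("I am healthy".split(), "z d r a v s a m".split()),
]
t = train_ibm1(pairs)
alignment_sick = best_alignment(*pairs[0], t)
alignment_healthy = best_alignment(*pairs[1], t)
```

With only two sentences the model can already separate phonemes that occur with "sick" from those that occur with "healthy"; phonemes shared by both sentences stay ambiguous, which is one reason richer models and more data are needed.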
Challenges
• Spoken translations are the only available resources of the target language
  • No target language phoneme recognizer is available
  • Apply a related-language recognizer with a uniformly distributed language model
  • Result: high phoneme error rates (PERs)
• Word alignment models from SMT are not designed for word-phoneme alignment
  • Result: inaccurate alignments
EXPERIMENTAL SETUP
BMED
Basic Medical Expression Database
• 200 sentences in English, German, Slovene and Croatian
• 4 h of speech data recorded from 5 Slovene and 8 Croatian native speakers

English | German | Croatian | Slovene
I have bad headache. | Ich habe starke Kopfschmerzen. | Imam jaku glavobolju. | Imam močen glavobol.
I have a temperature. | Ich habe Fieber. | Imam temperaturu. | Imam vročino.
I have bad pain here. | Ich habe hier starke Schmerzen. | Ovdje imam jake bolove. | Imam zelo močne bolečine.
I am wounded. | Ich bin verletzt. | Ozlijeđen sam. | Poškodovan sem.
I am wounded on my right leg. | Ich bin an meinem rechten Bein verletzt. | Imam ozlijedu na desnoj nozi. | Desno nogo imam poškodovano.
… | … | … | …
CorpusGong
Recognizer Performance
Here: Phoneme recognizer with Croatian acoustic models and phoneme set, trained on 20 h of Croatian speech from the GlobalPhone corpus

Test Data | PER
Croatian GlobalPhone test set | 33%
Croatian BMED | 43.4%
Slovene BMED | 55.2%
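For reference, a phoneme error rate is conventionally the Levenshtein distance between the recognized and the reference phoneme sequence, divided by the reference length. The sketch below assumes that convention with unit edit costs; the slides do not state the exact scoring setup, and the reference pronunciation in the example is an assumption for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = "p a ts i e n t".split()  # assumed reference pronunciation
hyp = "p a ts i n t".split()    # recognizer output for the same word
per = phoneme_error_rate(ref, hyp)
print(f"PER = {per:.1%}")
```

Here one phoneme of seven is missing from the hypothesis, so the PER for this single word is 1/7.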
ALIGNMENT AND VOCABULARY EXTRACTION
Example Corpus

English Sentence | Croatian Sentence | Croatian Phonemes
This is the wounded patient. | To je ozlijeden pacient. | t o j e o z l i j e dp e n p a ts i n t
The patient is outside. | Vani je pacient. | v a n i j e p a ts i e n d
The patient waits for operation. | Pacient ceka na operaciju. | p a ts i e m t tS e k a n a o p e r a ts i j u
Alignment Step
Pipeline: Alignment → Ranking → Interpolation → Extraction
• Run GIZA++ to generate alignments.
• Sentence pairs to align:
  • This is the wounded patient: t o j e o z l i j e dp e n p a ts i n t
  • The patient is outside: v a n i j e p a ts i e n d
  • The patient waits for operation: p a ts i e m t tS e k a n a o p e r a ts i j u
Alignment Step (2)
Output of the alignment step:

Source language word | Aligned phoneme sequences
patient | j e p a ts i n t, v e p a ts i n d, p a ts i e m t
this | t o
is | j e
wounded | o z l i j e dp e n
outside | a n i
waits | tS e k a
for | n a
operation | o p e r a ts i j u
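The table above can be derived from per-phoneme alignment links by collecting, for each source word, the phonemes linked to it. The following is a minimal sketch; `group_by_word` and the `links` list are hypothetical and do not reproduce the GIZA++ output format. The links are chosen so that the result matches the slide's entry for "patient".

```python
from collections import defaultdict

def group_by_word(words, phonemes, links):
    """Concatenate, per source word, the phonemes linked to it.

    links[i] is the index of the word aligned to phonemes[i],
    or None if the phoneme is left unaligned.
    """
    grouped = defaultdict(list)
    for phoneme, word_idx in zip(phonemes, links):
        if word_idx is not None:
            grouped[words[word_idx]].append(phoneme)
    return {word: " ".join(seq) for word, seq in grouped.items()}

words = "The patient is outside".split()
phonemes = "v a n i j e p a ts i e n d".split()
# Hypothetical links (word indices): 1 = patient, 3 = outside.
links = [1, 3, 3, 3, None, 1, 1, 1, 1, 1, None, 1, 1]
aligned = group_by_word(words, phonemes, links)
print(aligned)
```

Note that the phonemes collected for "patient" are not contiguous in the sentence, which is what the interpolation step has to deal with.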
Ranking Step
• Idea: Rank all source language words, select n-best.
• Combine multiple aspects in the ranking:
  • Frequency-Levenshtein Product
  • Alignment Error
  • Length weight
  • Levenshtein Distance
  • Alignment Score
  • Frequency
  • Phoneme Confidence
Interpolation Step – 1
Suppose we want to get the Croatian phoneme sequence corresponding to "patient" given the following alignment:
  The patient is outside
  v a n i j e p a ts i e n d
• Possibility 1: Concatenate all aligned phonemes: v e p a ts i n d
• Possibility 2: Select the longest contiguous subsequence: e p a ts i
• Possibility 3: Use the interval with minimum align error: e p a ts i e n d (align error: 1 + 1 = 2)
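The three possibilities can be sketched in a few lines. The functions below are hypothetical, and the definition of the align error is an assumption read off the slide's "1 + 1 = 2": the number of aligned phonemes left outside the interval plus the number of unaligned phonemes inside it.

```python
def concatenate_all(phonemes, aligned):
    """Possibility 1: keep every phoneme aligned to the word."""
    return [p for p, a in zip(phonemes, aligned) if a]

def longest_run(phonemes, aligned):
    """Possibility 2: longest contiguous run of aligned phonemes."""
    best, start = (0, 0), None
    for i, a in enumerate(aligned + [False]):  # sentinel closes the last run
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return phonemes[best[0]:best[1]]

def min_align_error_interval(phonemes, aligned):
    """Possibility 3: interval [lo, hi) with the smallest align error."""
    n = len(phonemes)

    def error(lo, hi):
        outside = sum(aligned[:lo]) + sum(aligned[hi:])   # aligned but excluded
        inside = sum(not a for a in aligned[lo:hi])       # unaligned but included
        return outside + inside

    lo, hi = min(((l, h) for l in range(n) for h in range(l + 1, n + 1)),
                 key=lambda iv: error(*iv))
    return phonemes[lo:hi], error(lo, hi)

phonemes = "v a n i j e p a ts i e n d".split()
# True where the phoneme is aligned to "patient" (taken from Possibility 1).
aligned = [True, False, False, False, False, True, True,
           True, True, True, False, True, True]

p1 = concatenate_all(phonemes, aligned)
p2 = longest_run(phonemes, aligned)
p3, p3_error = min_align_error_interval(phonemes, aligned)
```

The exhaustive search over all intervals is quadratic in the sentence length, which is acceptable for short utterances such as the ones in this corpus.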
Interpolation Step – Levenshtein Graph
Candidate phoneme sequences for "patient", one per sentence:
• This is the wounded patient: t o j e o z l i j e dp e n p a ts i n t → paTint
• The patient waits for operation: p a ts i e m t tS e k a n a o p e r a ts i j u → paTiemt
• The patient is outside: v a n i j e p a ts i e n d → epaTiend
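One way to picture such a graph is to treat the candidate sequences as nodes and the pairwise Levenshtein distances as edge weights. The sketch below does only that; it assumes that T abbreviates the phoneme ts, and it does not model any additional nodes the original graph may contain, since the slides do not describe how the graph is built.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two phoneme lists (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

# Candidate sequences for "patient" (T assumed to stand for ts).
candidates = {
    "paTint": "p a ts i n t".split(),
    "paTiemt": "p a ts i e m t".split(),
    "epaTiend": "e p a ts i e n d".split(),
}

# Weighted edges of the complete graph over the candidates.
edges = {(u, v): edit_distance(candidates[u], candidates[v])
         for u, v in combinations(candidates, 2)}

# Sum of distances to all other nodes: small values mean "central".
total_distance = {u: sum(w for pair, w in edges.items() if u in pair)
                  for u in candidates}
```

On these three candidates alone the two shorter sequences tie, which suggests why a richer graph and several centrality measures are useful for picking a representative.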
Interpolation Step – Centralities
Centrality values shown for two highlighted nodes of the Levenshtein graph:

Centrality | First node | Second node
Degree | 3 | 4
PageRank | .06 | .08
Closeness | 7 | 4
Max. Dist. | 4 | 2
Betweenness | 1 | 3
Interpolation Step – Representative
"paTient" has high centrality rankings → chosen as representative
Extraction Step
• We extracted the new word 'p a ts i e n t'.
• Feed this information back into the database:
  • Before: The patient is outside / v a n i j e p a ts i e n d
  • After: The is outside / v a n i j e (realign in next iteration)
  • Alignment: patient ↔ p a ts i e n t
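The bookkeeping of this step can be sketched as follows. The function `extract_word` and the `vocabulary` dictionary are hypothetical; the phoneme interval is assumed to be the one that leaves "v a n i j e" behind, as shown on the slide.

```python
def extract_word(words, phonemes, word, start, end):
    """Remove `word` and the phoneme interval [start, end) assigned to it."""
    remaining_words = [w for w in words if w != word]
    remaining_phonemes = phonemes[:start] + phonemes[end:]
    return remaining_words, remaining_phonemes

words = "The patient is outside".split()
phonemes = "v a n i j e p a ts i e n d".split()

# Store the representative found in the interpolation step.
vocabulary = {"patient": "p a ts i e n t"}

# Drop the word and its phonemes; the rest is realigned in the next iteration.
rest_words, rest_phonemes = extract_word(words, phonemes, "patient", 6, 13)
print(rest_words, rest_phonemes)
```

Shrinking the sentence pair this way leaves fewer words and phonemes to align in the next iteration, which should make the remaining alignment problem easier.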
Evaluation Metrics
• Find a mapping between entries in hypothesis and reference vocabulary
• For each hypothesis vocabulary entry:
  • Vocabulary Phoneme Error Rate (VPER): Avg. phoneme error rate to the reference entry
• For each sentence:
  • Abstract from phoneme recognition errors
  • Word Error Rate (WER): Avg. word error rate between hypothesis and reference string
• hypoInRef: How many words are in the hypothesis, but not in the reference
• refInHypo: Other way round
RESULTS