Semi-supervised Transliteration Mining from Parallel and Comparable Corpora
Walid Aransa, Holger Schwenk, Loic Barrault
LIUM - University of Le Mans, France
firstname.lastname@lium.univ-lemans.fr
Dec 7th 2012
IWSLT 2012, December 6-7, 2012, Hong Kong

Outline
1 Introduction
    Transliteration
    Transliteration challenges
    Transliteration mining
2 Related work
3 Transliteration mining using parallel corpora - semi-supervised
4 Transliteration mining using comparable corpora - semi-supervised
5 Conclusion

Introduction
  Transliteration is the process of writing a word (mainly a proper noun) from one language in the alphabet of another language.
  It requires mapping the pronunciation of the word in the original language to the closest possible pronunciation in the target language.
  The word and its transliteration are called a Transliteration Pair (TP).

Transliteration applications
  Machine translation: improve word alignments and handle out-of-vocabulary (OOV) words
  Machine transliteration: train a statistical transliteration system
  Cross-language Information Retrieval (IR): enrich the search results with orthographic variations
  Named Entity Recognition (NER)

Transliteration challenges
Examples: transliteration from Arabic into English
  Some Arabic letters have no phonetically equivalent letters in English.
  Some English letters have no phonetically equivalent letters in Arabic (e.g. v).
  Short vowels (i.e. diacritics) are missing from Arabic text.
  Some Arabic letters can be mapped to any letter from a group of phonetically close English letters (e.g. ب to "p" or "b").
  Some Arabic letters can be mapped to a sequence of English letters (e.g. خ to "kh").

Transliteration challenges - Cont
Tokenization challenges: the Arabic name is concatenated to clitics such as:
  A preposition
  A conjunction
  Both together
Transliteration types:
  Forward: a name is transliterated from its original language into another language
    Example: Arabic-origin name "محمد" -> "Mohamed"
  Backward: a transliterated name is transliterated back to the original name in its original language
    Example: "بوش" -> "Bush"

Transliteration mining (TM)
  The automatic extraction of TPs from parallel or comparable corpora is called Transliteration Mining (TM).
  There are several methods to perform TM:
    Supervised
    Unsupervised
    Semi-supervised
  TM research focuses on either:
    Parallel corpora
    Comparable corpora

Related work
  (Holmes et al., 2004) use a variant of the SOUNDEX method together with n-grams. It improves the precision and recall of name matching in the context of transliterated Arabic name search.
  (Darwish, 2010) presents two methods for improving TM: phonetic conflation of letters and iterative training of a transliteration model.
    The first method is a SOUNDEX-like conflation scheme that improves recall and F-measure.
    The second method, iterative training, improves recall but decreases precision.

TM using parallel corpora - semi-supervised
Figure: Extracting TPs from parallel corpora
  [Diagram: parallel text (Ar, En) -> POS tagging -> word alignment -> preprocessing (Ar and En) -> statistical or rule-based Ar/En transliteration system (Ar side) -> normalization (transliterated Ar and En) -> similarity scoring -> transliteration table (TT) -> thresholds -> Ar-En TPs]

TM algorithm for parallel corpora
  (1) Tag the parallel corpus using a part-of-speech (POS) tagger. We used the Stanford POS tagger for English and Mada/Tokan for Arabic.
  (2) Align the tagged bitext using Giza++. Using the source/target alignment file, remove all aligned word pairs with POS tags other than noun (NN) or proper noun (PNN), and remove all English words starting with a lower-case letter. The word pairs with the lowest alignment scores are also removed (about 5% of the total number of aligned word pairs).
  (3) Remove the POS tags from the Arabic and English words.
  (4) Transliterate the Arabic word A into English using a rule-based transliteration system (or a previously trained statistical transliteration system).

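The filtering in steps (2) and (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `AlignedPair` record, the `filter_pairs` function, and the romanized Arabic placeholders are hypothetical, the tag test is assumed to apply to both sides of a pair, and the 5% cut is assumed to be taken after the tag and case filters. The slides do not say how the alignment scores are read from the Giza++ output, so the scores are simply given as numbers here.

```python
from typing import List, NamedTuple, Tuple


class AlignedPair(NamedTuple):
    """Hypothetical record for one aligned, POS-tagged word pair."""
    ar_word: str
    ar_pos: str
    en_word: str
    en_pos: str
    score: float  # alignment score; higher is assumed to be better


KEEP_TAGS = {"NN", "PNN"}  # noun and proper noun tags as named on the slide


def filter_pairs(pairs: List[AlignedPair],
                 drop_fraction: float = 0.05) -> List[Tuple[str, str]]:
    # Step (2a): keep only noun / proper-noun pairs (assumed: both sides)
    # whose English word does not start with a lower-case letter.
    kept = [p for p in pairs
            if p.ar_pos in KEEP_TAGS and p.en_pos in KEEP_TAGS
            and not p.en_word[:1].islower()]
    # Step (2b): drop the lowest-scoring fraction of the remaining pairs.
    n_drop = int(len(kept) * drop_fraction)
    by_score = sorted(range(len(kept)), key=lambda i: kept[i].score)
    dropped = set(by_score[:n_drop])
    # Step (3): return the bare words, without POS tags, in original order.
    return [(p.ar_word, p.en_word)
            for i, p in enumerate(kept) if i not in dropped]


pairs = [
    AlignedPair("mHmd", "PNN", "Mohamed", "PNN", 0.9),
    AlignedPair("ktAb", "NN", "book", "NN", 0.8),   # lower-case English word
    AlignedPair("qAl", "VB", "Said", "VB", 0.7),    # not a noun
    AlignedPair("bw$", "PNN", "Bush", "PNN", 0.1),
]
print(filter_pairs(pairs))
```

With only four pairs the default 5% cut removes nothing; passing a larger `drop_fraction` shows the score-based pruning.
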
TM algorithm for parallel corpora - Cont
  (5) Normalize the transliteration A_t of the Arabic word, as well as the English word, to the forms Norm1, Norm2 and Norm3, as will be explained. The objective of the normalization is to fold English letters with similar pronunciation into the same letter or symbol.
  (6) For each aligned pair of transliterated Arabic word A_t and English word E, use their normalized forms to calculate the three levels of similarity scores, which we store in a transliteration table (TT).
  (7) Extract TPs from the TT by applying thresholds on the three levels of similarity scores. We selected the thresholds using an empirical method shown later.

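Steps (6) and (7) can be sketched as below. The slides do not state which similarity measure is used, so this sketch assumes a length-normalized Levenshtein similarity; it also assumes that a pair is accepted only when all three scores reach their thresholds. The function names and the threshold values are hypothetical, and the normalized forms are passed in as ready-made strings.

```python
from typing import Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1 - edit distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def score_levels(at_forms: Sequence[str],
                 en_forms: Sequence[str]) -> Tuple[float, ...]:
    """One score per normalization level (Norm1, Norm2, Norm3)."""
    return tuple(similarity(x, y) for x, y in zip(at_forms, en_forms))


def is_tp(scores: Sequence[float], thresholds: Sequence[float]) -> bool:
    """Assumed decision rule: every level must reach its threshold."""
    return all(s >= t for s, t in zip(scores, thresholds))


THRESHOLDS = (0.8, 0.9, 0.9)  # illustrative values only, not from the slides

# Normalized forms of a transliterated Arabic word and its aligned English word.
scores = score_levels(("mohamad", "mhmd", "mhmd"), ("mohamed", "mhmd", "mhmd"))
print(scores, is_tp(scores, THRESHOLDS))
```

Each row of the transliteration table would then hold the word pair together with its three scores, and the thresholds decide which rows are extracted as TPs.
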
Calculating the three levels of similarity scores
  [Diagram: the Ar word is passed through the statistical or rule-based Ar/En transliteration system; the transliterated Ar word and the En word are each normalized to Norm Form1, Norm Form2 and Norm Form3; the paired forms give the 1st, 2nd and 3rd level similarity scores, which are stored in the transliteration table (TT); thresholds on the TT yield the Ar-En TPs]

Calculating the three levels of similarity scores - Cont
  (1) Norm1 normalization function: folds English letters with similar pronunciation into one letter or symbol.
    The word is lower-cased.
    Phonetically equivalent consonants and vowels are folded to one letter, e.g. p and b are normalized to b, v and f are normalized to f, i and e are normalized to e.
    Double consonants are replaced by one letter.
    A hyphen "-" is inserted after the initial two letters "al", which is the transliteration of the Arabic article "ال".

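A minimal Python sketch of a Norm1-style function is shown below. Only the three foldings named on the slide (p to b, v to f, i to e) are included; the full folding table is not given, so `FOLD` is an assumed, incomplete stand-in, and the name `norm1` and the order of the steps are also assumptions.

```python
import re

# Assumed, incomplete folding table: only the examples given on the slide.
FOLD = str.maketrans({"p": "b", "v": "f", "i": "e"})


def norm1(word: str) -> str:
    # Lower-case, then fold phonetically close letters to one letter.
    w = word.lower().translate(FOLD)
    # Collapse a run of the same consonant into a single letter.
    w = re.sub(r"([^aeiou\W\d_])\1+", r"\1", w)
    # Insert a hyphen after an initial "al" (the Arabic definite article),
    # unless the hyphen is already there.
    if w.startswith("al") and len(w) > 2 and not w.startswith("al-"):
        w = "al-" + w[2:]
    return w


for name in ("Mohammed", "Pepsi", "Alqahira", "Al-Qahira"):
    print(name, "->", norm1(name))
```

As on the slide, the hyphen rule is purely orthographic: any word starting with "al" receives the hyphen, whether or not the "al" is really the article.
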
Calculating the three levels of similarity scores - Cont
  (2) Norm2 normalization function, using the Norm1 output:
    Double vowels are replaced by one similar upper-case letter (e.g. ee is normalized to E).
    Non-initial and non-final vowels are removed, but only if they are not followed by a vowel or not preceded by a vowel.
  (3) Norm3 normalization function, using the Norm2 output:
    The hyphen "-" and the vowels are removed.

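The two rules above can be sketched in Python as follows, taking Norm1-style strings as input. The vowel-removal condition on the slide is ambiguous, so this sketch assumes one reading: an interior lower-case vowel is dropped only when neither neighbour is a vowel, which keeps vowel sequences and the upper-case long vowels. It is also assumed that Norm3 removes upper-case vowels as well. The names `norm2` and `norm3` are hypothetical.

```python
import re

LOWER_VOWELS = "aeiou"
ANY_VOWEL = "aeiouAEIOU"


def norm2(n1: str) -> str:
    # Doubled vowels become one upper-case letter, e.g. "ee" -> "E".
    w = re.sub(r"([aeiou])\1", lambda m: m.group(1).upper(), n1)
    out = []
    for i, ch in enumerate(w):
        interior = 0 < i < len(w) - 1
        # Assumed reading: drop an interior vowel with no vowel neighbour.
        if (ch in LOWER_VOWELS and interior
                and w[i - 1] not in ANY_VOWEL
                and w[i + 1] not in ANY_VOWEL):
            continue
        out.append(ch)
    return "".join(out)


def norm3(n2: str) -> str:
    # Remove the hyphen and all vowels (assumed: upper-case ones too).
    return re.sub(r"[aeiouAEIOU-]", "", n2)


for n1 in ("mohamed", "saeed", "al-qahera"):
    n2 = norm2(n1)
    print(n1, "->", n2, "->", norm3(n2))
```

The three forms get progressively coarser, so two spellings that differ at the first level may still match at the second or third level.
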