MDL-Based Models for Transliteration Generation
Javad Nouri, Lidia Pivovarova, and Roman Yangarber
Department of Computer Science, University of Helsinki
SLSP 2013, July 30, 2013
Outline
1 Introduction: Transliteration, Applications, Challenges
2 Data: Data-sets
3 Methods: Review, Motivation, Etymon Project, Models, Prediction
4 Evaluation: Pre-processing, Baselines, Results
5 Future Work
Transliteration
Predicting word representation in another language, based on spelling or pronunciation
Tarragona, Taragona
Таррагона, Таррагон, Тарраґона, تاراگونا, טרגונה, طراغونة, टैरागोना, タラゴナ, 塔拉戈纳, 타라고나, ታራጎና, தாராகோணம்
Where can it be used?
- Machine Translation: proper names, terms
- Information Retrieval, Information Extraction, Named Entity Recognition
- Several scripts for a language
Challenges
Transliteration can be based on:
- pronunciation
- spelling
- translation
Example: Санкт-Петербург /Sankt-Peterburg/: Saint Petersburg, Pietari
Different transliterations for the same name
Example: Лев Толстой: Leo Tolstoy, Lev Tolstoy
Scripts are based on different principles:
- Alphabetic: English, Russian, ...
- Consonantal: Arabic, Persian, Hebrew, ...
- Syllabic: Katakana, ...
Data
- Titles of Wikipedia articles devoted to the same entity in different languages are treated as transliterations of each other.
- Language links can be used to find such pairs.
- Categories are used to distinguish different types of data (locations, people, etc.)
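The pairing idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' extraction pipeline: the record layout, the `ARTICLES` sample data, and the `collect_pairs` helper are all hypothetical, and the language links are assumed to be already available in memory.

```python
from collections import defaultdict

# Hypothetical records (sample data only): one per entity, with its article
# titles keyed by language code and the categories assigned to it.
ARTICLES = [
    {"titles": {"en": "Tarragona", "ru": "Таррагона"}, "categories": {"locations"}},
    {"titles": {"en": "Leo Tolstoy", "ru": "Лев Толстой"}, "categories": {"people"}},
    {"titles": {"en": "Helsinki"}, "categories": {"locations"}},
]


def collect_pairs(articles, source, target):
    """Group (source title, target title) pairs by category.

    An entity contributes a pair only if it has a title in both languages,
    i.e. a language link exists between the two articles.
    """
    pairs = defaultdict(list)
    for article in articles:
        titles = article["titles"]
        if source in titles and target in titles:
            for category in sorted(article["categories"]):
                pairs[category].append((titles[source], titles[target]))
    return dict(pairs)


pairs_by_category = collect_pairs(ARTICLES, "en", "ru")
print(pairs_by_category)
```

Grouping by category keeps locations and person names in separate data-sets. The entity without a Russian title is skipped because it has no language link to pair with.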
Data-sets

#  Language             Script Type
1  English              Alphabetic
2  Farsi                Consonantal
3  French               Alphabetic
4  Greek                Alphabetic
5  Hebrew               Consonantal
6  Japanese (Katakana)  Syllabic
7  Russian              Alphabetic

Data-set           Language pair  Size (# of pairs)
Russian Cities     Ru–En          1471
Russian Cities     Ru–Fa          1245
Russian Cities     Ru–Fr          840
Russian Cities     Ru–Jp          407
Russian Writers    Ru–En          870
American Actors    En–Fa          317
American Actors    En–Gr          828
American Actors    En–He          1136
American Actors    En–Ru          1462
French Cities      Fr–Ru          1893
Iranian Locations  Fa–Ru          439
Iranian Cities     Fa–En          828
Iranian Cities     Fa–Ru          469

- Using Wikipedia categories increases homogeneity in data
- Russian writers contains mainly Russian names, etc.
- Locations are more homogeneous than person names.
Methods
Transliteration
  Generation
    Spelling-based
      Rule-based
      Noisy Channel
      MDL
    Phonetics-based
    Hybrid
    Combined
  Extraction
Karimi, S., Scholer, F., Turpin, A.: Machine transliteration survey. ACM Computing Surveys 43(3) (2011)