
FOUND IN TRANSLATION: Reconstructing Phylogenetic Language Trees - PowerPoint PPT Presentation



  1. FOUND IN TRANSLATION: Reconstructing Phylogenetic Language Trees from Translations
     Ella Rabinovich (1,2), Noam Ordan (3), Shuly Wintner (2)
     (1) IBM Research – Haifa, Israel
     (2) Department of Computer Science, University of Haifa, Israel
     (3) The Arab College for Education, Haifa, Israel
     ACL 2017, Vancouver

  2. STARTING FROM THE END (spoiler)
     [Figure: two trees shown for comparison: the Indo-European phylogenetic tree (the “ground truth”) and the phylogenetic tree reconstructed from monolingual English texts translated from 17 IE languages]
     Leaf labels in both trees: English, Italian, Swedish, French, Danish, Spanish, German, Dutch, Romanian, Portuguese, Lithuanian, Latvian, Czech, Polish, Slovak, Bulgarian, Slovenian
     FOUND IN TRANSLATION: RECONSTRUCTING PHYLOGENETIC LANGUAGE TREES FROM TRANSLATIONS, AUG 2017

  3. BACKGROUND – THE FEATURES OF TRANSLATIONESE
     • Translators (almost) always tried to remain invisible
     • Translations have unique characteristics that set them apart from originals
       • Universals (simplification, standardization, explicitation)
       • Interference (the “fingerprints” of a source language on the translation product)
     • Languages closer to each other are likely to share more features in the target language of translation
     HYPOTHESIS
     The distance between languages is retained and can be recovered when assessed through these features in translated texts

  4. DATASET
     Europarl (the proceedings of the European Parliament)
     • Members are allowed to speak in any of the EU languages
     • All parliament speeches were translated from the original language into other EU languages using English as a pivot
       • Direct translations into English, indirect translations into all other languages
     • We explore indirect translations into French in this work
     • We focus on 17 source languages, grouped into 3 language families
       • Germanic, Romance, and Balto-Slavic

  5. RECONSTRUCTION OF LANGUAGE TREES
     FEATURES USED
     • POS-trigrams, reflecting shallow syntactic structures (strongly associated with interference)
     • Function words, reflecting grammar (associated with interference)
     • Cohesive markers (associated with translation universals)
     AGGLOMERATIVE (HIERARCHICAL) CLUSTERING OF FEATURE VECTORS
     • Using the variance minimization algorithm (Ward, 1963) with Euclidean distance
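The clustering step on slide 5 can be illustrated with a minimal sketch. It assumes each source language is represented by one numeric feature vector. The vectors below are made up for illustration and are not taken from the talk, and `ward_cluster` and `clades` are hypothetical helper names. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would normally be used. The pure-Python version only makes the merge criterion visible.

```python
from itertools import combinations

# Made-up 2-D feature vectors, one per source language (illustrative only;
# real vectors would hold POS-trigram or function-word frequencies).
TOY_VECTORS = {
    "French": [1.0, 0.0],
    "Italian": [1.1, 0.1],
    "German": [0.0, 1.0],
    "Dutch": [0.1, 1.1],
    "Polish": [-1.0, -1.0],
    "Czech": [-1.1, -0.9],
}


def ward_cluster(vectors):
    """Build a binary tree bottom-up, always merging the cheapest pair.

    Ward's cost of merging clusters A and B is the increase in the
    within-cluster sum of squares:
        |A||B| / (|A| + |B|) * ||mean(A) - mean(B)||^2   (Euclidean norm).
    Returns (tree, merges); the tree is a nested tuple of labels.
    """
    # Each cluster is stored as (subtree, centroid, size).
    clusters = {
        i: (label, list(vec), 1)
        for i, (label, vec) in enumerate(vectors.items())
    }
    next_id = len(clusters)
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            _, ca, na = clusters[a]
            _, cb, nb = clusters[b]
            sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
            cost = na * nb / (na + nb) * sq_dist
            if best is None or cost < best[0]:
                best = (cost, a, b)
        cost, a, b = best
        ta, ca, na = clusters.pop(a)
        tb, cb, nb = clusters.pop(b)
        # The centroid of the merged cluster is the size-weighted mean.
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters[next_id] = ((ta, tb), centroid, na + nb)
        merges.append((ta, tb, cost))
        next_id += 1
    tree = next(iter(clusters.values()))[0]
    return tree, merges


def clades(tree):
    """Return (leaves under this node, leaf sets of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    left_leaves, left_clades = clades(tree[0])
    right_leaves, right_clades = clades(tree[1])
    here = left_leaves | right_leaves
    return here, left_clades | right_clades | {here}


tree, merges = ward_cluster(TOY_VECTORS)
print(tree)
```

With these toy vectors the three language pairs merge first, because their centroids are closest. The resulting nested tuple can be read as a tree without branch lengths, and the recorded merge costs could serve as merge heights.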

  6. IDENTIFICATION OF TRANSLATIONESE AND ITS SOURCE LANGUAGE
     ORIGINAL VS. TRANSLATED (binary classification)
       Feature             English translations   French translations
       POS-trigrams        97.60                  98.40
       function words      96.45                  95.15
       cohesive markers    86.50                  85.25
     SOURCE-LANGUAGE CLASSIFICATION (POS-trigrams)
       [Figure: confusion matrices for ENGLISH translations (76.5%) and FRENCH translations (48.9%)]

  7. RECONSTRUCTION OF LANGUAGE TREES
     Phylogenetic language trees generated with translated text (POS-trigrams)
     [Figure: two trees over the 17 source languages, one panel for ENGLISH translations and one for FRENCH translations]

  8. EVALUATION METHODOLOGY
     MEASURE SIMILARITY TO THE GOLD STANDARD
     • UNWEIGHTED EVALUATION (CLADOGRAM): assessing only structural (topological) similarity
     • WEIGHTED EVALUATION (PHYLOGRAM): assessing similarity based on both structure and branching length
     [Figure: an example CLADOGRAM and an example PHYLOGRAM, each over leaves A, B, C, D]
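The difference between the two evaluation modes on slide 8 can be shown with a small sketch. The slide does not spell out the similarity measure, so the code assumes one plausible choice: sum, over all leaf pairs, the squared difference between the leaf-to-leaf path length in the candidate tree and in the gold tree. In the unweighted mode every edge counts as 1, and in the weighted mode the branch lengths are used. The two four-leaf trees, their branch lengths, and the names `tree_distance`, `GOLD`, and `CANDIDATE` are illustrative assumptions.

```python
from itertools import combinations

# Each tree maps a node to (parent, branch length); the root has no entry.
# Both trees and all branch lengths are made up for illustration.
GOLD = {
    "A": ("ab", 1.0), "B": ("ab", 1.0),
    "C": ("cd", 1.0), "D": ("cd", 1.0),
    "ab": ("root", 2.0), "cd": ("root", 2.0),
}
CANDIDATE = {
    "A": ("ab", 1.0), "B": ("ab", 1.0),
    "ab": ("abc", 1.0), "C": ("abc", 2.0),
    "abc": ("root", 1.0), "D": ("root", 3.0),
}
LEAVES = ["A", "B", "C", "D"]


def distances_to_ancestors(tree, node, weighted):
    """Map node and each of its ancestors to their distance from node."""
    dist = {node: 0.0}
    total = 0.0
    while node in tree:
        parent, length = tree[node]
        total += length if weighted else 1.0  # unweighted: count edges
        dist[parent] = total
        node = parent
    return dist


def leaf_distance(tree, a, b, weighted):
    """Path length between two leaves through their lowest common ancestor."""
    up_a = distances_to_ancestors(tree, a, weighted)
    up_b = distances_to_ancestors(tree, b, weighted)
    # Distances grow toward the root, so the minimum is reached at the
    # lowest common ancestor.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def tree_distance(tree, gold, leaves, weighted):
    """Sum of squared differences of pairwise leaf distances (assumed measure)."""
    return sum(
        (leaf_distance(tree, a, b, weighted)
         - leaf_distance(gold, a, b, weighted)) ** 2
        for a, b in combinations(leaves, 2)
    )


print(tree_distance(CANDIDATE, GOLD, LEAVES, weighted=False))
print(tree_distance(CANDIDATE, GOLD, LEAVES, weighted=True))
```

A tree compared with itself scores 0 in both modes. The weighted score also penalizes branch lengths that differ from the gold standard, which the unweighted score ignores.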
