Is automatic cognate detection good enough for phylogenetic inference?
Taraka Rama (1, 2), Johann-Mattis List (3), Johannes Wahle (1) & Gerhard Jäger (1)
(1) Tübingen University, (2) Oslo University & (3) MPI Jena
Jena, CESC 2017, September 13, 2017
Rama, List, Wahle & Jäger: cognate detection & phylogenetic inference (CESC2017)
Introduction: Computational historical linguistics
Massive progress within the past 15 years:
- automated language classification
- inferring time depth and homeland of language families
- automatic reconstruction of proto-languages
- statistical discovery of patterns in language change
- ...
(Bouckaert et al, 2012) (Grollemund et al, 2015)
Introduction: Manual cognate judgements
Most work in computational historical linguistics depends on manually coded cognate judgments on Swadesh lists. These are:
- labor intensive
- subjective
- not fully replicable
- a source of bias in favor of well-studied language families
The goal of automated cognate detection is to do this automatically.
Materials: Datasets

Dataset                          Words  Conc.  Lang.  Families       Cog.  Div.
ABVD (Greenhill et al., 2008)    12414  210    100    Austronesian   3558  0.27
IELex (Dunn, 2012)               11479  208    52     Indo-European  2459  0.20
Sino-Tibetan (Peiros, 2004)      8694   110    81     Sino-Tibetan   1128  0.13
Materials: Expert Trees
Expert trees were obtained from Glottolog (Hammarström et al., 2015).
"Glottolog provides a comprehensive catalogue of the world's languages, language families and dialects. [...] The languoids are organized via a genealogical classification (the Glottolog tree) that is based on available historical-comparative research [...]." (http://glottolog.org/)
Automated Cognate Detection
LexStat:
- LexStat algorithm first proposed in List (2012) and then further enhanced in List (2014), List et al. (2016) and List et al. (2017)
- the algorithm is generally based on the alignment-based workflow for cognate detection
- implemented as part of LingPy (lingpy.org, List and Forkel (2016))
SVM:
- classification-based approach (Jäger and Sofroniev, 2016; Jäger et al., 2017)
- a pair of words is classified as cognate or not based on a feature vector describing this pair
- implementation: http://www.evolaemp.uni-tuebingen.de/svmcc/
Automated Cognate Detection: LexStat
- scoring functions for alignments are computed individually for each language pair, modeling regular sound correspondences in classical linguistics
- scores for both global and local alignment analyses are combined and agglomerated
- the alignment algorithm is sensitive to morpheme boundaries if they are annotated (secondary alignment, List (2014))
- sequences are represented as multi-tiered structures, which makes it possible to handle prosodic context
- the agglomerative clustering procedure has been replaced by a community detection algorithm (Infomap, Rosvall and Bergstrom (2008))
Automated Cognate Detection: LexStat
[Figure: LexStat algorithm (List 2014). Workflow: input, tokenization, preprocessing, then a loop of correspondence detection using phonetic alignment (attested distribution, expected distribution, log-odds), followed by distance calculation, cognate clustering, and output.]
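The log-odds step in the workflow above can be illustrated with a minimal sketch. It compares how often a sound pair is attested in aligned word pairs with how often it would be expected by chance. The counts, the function name `log_odds_scores`, and the additive smoothing are assumptions made for illustration; this is not LingPy's actual scoring formula.

```python
import math
from collections import Counter


def log_odds_scores(attested, expected, smoothing=0.5):
    """Return log2(P_attested / P_expected) for every sound pair.

    Simplified illustration: additive smoothing keeps pairs that are
    missing from one distribution from producing a division by zero.
    """
    pairs = set(attested) | set(expected)
    att_total = sum(attested.values()) + smoothing * len(pairs)
    exp_total = sum(expected.values()) + smoothing * len(pairs)
    scores = {}
    for pair in pairs:
        p_att = (attested.get(pair, 0) + smoothing) / att_total
        p_exp = (expected.get(pair, 0) + smoothing) / exp_total
        scores[pair] = math.log2(p_att / p_exp)
    return scores


# Toy counts, invented for illustration only (not taken from any dataset).
attested = Counter({("t", "d"): 12, ("t", "t"): 3, ("p", "f"): 9, ("p", "k"): 1})
expected = Counter({("t", "d"): 4, ("t", "t"): 6, ("p", "f"): 3, ("p", "k"): 7})

scores = log_odds_scores(attested, expected)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

Pairs that occur more often than chance predicts receive positive scores and act as evidence for a regular sound correspondence in that language pair; pairs that occur less often receive negative scores.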
Automated Cognate Detection: LexStat
LexStat: Cognate Set Partitioning (List et al. (2017))

A. Input words: Greek çeri, German hant, English hænd, Russian ruka, Polish rɛ̃ŋka

B. Pairwise distances:

          GREEK  GERMAN  ENGLISH  RUSSIAN  POLISH
GREEK     0.00   0.72    0.69     0.73     0.77
GERMAN    0.72   0.00    0.03     0.91     0.70
ENGLISH   0.69   0.03    0.00     0.91     0.68
RUSSIAN   0.72   0.91    0.91     0.00     0.20
POLISH    0.77   0.70    0.68     0.20     0.00

[Figure, panels C to E: the words as a weighted network that is partitioned step by step.]

Resulting cognate sets: 1 = çeri; 2 = hant, hænd; 3 = ruka, rɛ̃ŋka
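As a rough illustration of the partitioning step, the sketch below links words whose distance falls below a cutoff and treats each connected component as one cognate set. This is a deliberately simplified stand-in for Infomap, and the cutoff of 0.55 is an assumed value chosen for this example only; the distances are the ones shown on the slide.

```python
words = ["çeri", "hant", "hænd", "ruka", "rɛ̃ŋka"]

# Pairwise distances as given on the slide (rows and columns:
# Greek, German, English, Russian, Polish).
dist = [
    [0.00, 0.72, 0.69, 0.73, 0.77],
    [0.72, 0.00, 0.03, 0.91, 0.70],
    [0.69, 0.03, 0.00, 0.91, 0.68],
    [0.72, 0.91, 0.91, 0.00, 0.20],
    [0.77, 0.70, 0.68, 0.20, 0.00],
]


def partition(dist, threshold):
    """Label connected components of the graph with edges below threshold."""
    n = len(dist)
    labels = [None] * n
    next_id = 1
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                # Use the smaller of the two entries, since the matrix
                # as printed is not perfectly symmetric.
                close = min(dist[i][j], dist[j][i]) < threshold
                if labels[j] is None and close:
                    labels[j] = next_id
                    stack.append(j)
        next_id += 1
    return labels


# 0.55 is a hypothetical cutoff for this sketch.
cognate_ids = partition(dist, threshold=0.55)
for word, cog_id in zip(words, cognate_ids):
    print(cog_id, word)
```

With this cutoff the sketch reproduces the three sets shown on the slide. A community detection method such as Infomap works on the same kind of network but does not reduce to a single hard cutoff.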
Automated Cognate Detection: SVM
- each synonymous word pair is a data point
- cognate (yes/no) as dependent variable
Feature selection:
- seven features from (Jäger and Sofroniev, 2016) + LexStat similarity as candidate features
- feature selection via cross-validation on training data
Model selection
[Figure: SVM illustration, via Wikimedia Commons]
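The data layout described above can be sketched as follows: every pair of words for the same concept becomes one data point, and the label records whether both words belong to the same cognate class. The tuple layout, the function name `synonymous_pairs`, and the restriction to pairs from different doculects are assumptions for illustration; the toy entries reuse the forms and cognate classes from the LexStat partitioning example.

```python
from itertools import combinations

# Toy wordlist: (doculect, concept, form, cognate class).
wordlist = [
    ("Greek", "hand", "çeri", 1),
    ("German", "hand", "hant", 2),
    ("English", "hand", "hænd", 2),
    ("Russian", "hand", "ruka", 3),
    ("Polish", "hand", "rɛ̃ŋka", 3),
]


def synonymous_pairs(wordlist):
    """Build one labelled data point per word pair sharing a concept."""
    by_concept = {}
    for entry in wordlist:
        by_concept.setdefault(entry[1], []).append(entry)

    pairs = []
    for concept, entries in by_concept.items():
        for a, b in combinations(entries, 2):
            if a[0] == b[0]:
                # Assumption: only compare words from different doculects.
                continue
            pairs.append({
                "concept": concept,
                "doculects": (a[0], b[0]),
                "words": (a[2], b[2]),
                "cognate": a[3] == b[3],  # dependent variable (yes/no)
            })
    return pairs


pairs = synonymous_pairs(wordlist)
for p in pairs:
    print(p["words"], "yes" if p["cognate"] else "no")
```

Each pair would then be described by a feature vector, for example the candidate similarity features named on the slide, before being passed to the classifier.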
Automated Cognate Detection: SVM
Model selection:
- five informative features:
  - LexStat similarity
  - PMI similarity
  - doculect similarity
  - mean word length
  - correlation between string similarity and doculect similarity
  (measures of concept stability)
- linear kernel
[Figure: distribution of each feature's value for non-cognate (no) and cognate (yes) word pairs. Panels: LexStat, PMI, doculect similarity, mean word length, correlation.]
Automated Cognate Detection: Comparison
Performance of the ACD Methods

                 Precision        Recall           F-score
dataset          LexStat  SVM     LexStat  SVM     LexStat  SVM
Austronesian     0.928    0.848   0.301    0.409   0.455    0.552
Indo-European    0.896    0.877   0.750    0.770   0.817    0.820
Sino-Tibetan     0.791    0.781   0.801    0.855   0.796    0.817

All scores are B-Cubed scores (Bagga and Baldwin, 1998).
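For readers unfamiliar with the metric, here is a minimal sketch of B-Cubed precision, recall and F-score for a single concept. The function name `b_cubed`, the dictionary layout, and the predicted clustering are invented for illustration; published evaluations may average the scores over concepts differently.

```python
def b_cubed(gold, pred):
    """Return B-Cubed (precision, recall, F-score) for two clusterings.

    gold and pred map each item to a cluster label. For every item,
    precision is the share of its predicted cluster that is also in its
    gold cluster; recall is the share of its gold cluster that is also
    in its predicted cluster. Both are averaged over all items.
    """
    items = list(gold)
    precision = 0.0
    recall = 0.0
    for i in items:
        same_pred = {j for j in items if pred[j] == pred[i]}
        same_gold = {j for j in items if gold[j] == gold[i]}
        overlap = len(same_pred & same_gold)
        precision += overlap / len(same_pred)
        recall += overlap / len(same_gold)
    precision /= len(items)
    recall /= len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Gold classes follow the partitioning example; the prediction is a
# made-up output that wrongly splits the German and English words.
gold = {"çeri": 1, "hant": 2, "hænd": 2, "ruka": 3, "rɛ̃ŋka": 3}
pred = {"çeri": 1, "hant": 2, "hænd": 4, "ruka": 3, "rɛ̃ŋka": 3}

p, r, f = b_cubed(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this toy case every predicted cluster is pure, so precision is 1.0, while the split cognate set lowers recall to 0.8. High precision combined with low recall is the same pattern as in the Austronesian row of the table.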