Language Typology and Areal Linguistics
Yiru
July 13, 2016
Overview
1. Introduction
2. Typologically-based Clusters
3. Areal Linguistics
Language similarity
Why are some languages more alike than others?
- The languages may be related "genetically": they derive from a common ancestor language.
- The similarities may be due to chance, or to linguistic universals.
- The languages may be related areally: the similarities are due to sharing of features between languages in contact.
Language similarity
Differences between the concepts of genetic relatedness and language similarity lead us to the following questions:
- If we cluster languages based only on their typological features, how do the induced clusters compare to phylogenetic groupings?
- How well do induced clusters and genetic families perform in predicting values for typological features?
- What typological features tend to stay the same within language families, and what features are likely to differ?
WALS (World Atlas of Language Structures)
- The WALS project consists of a database that catalogs linguistic features for over 2,556 languages in 208 language families, using 142 features in 11 different categories.
- Data sparsity: only 16% of the cells are filled, which presents serious problems to clustering algorithms.
Pruning Methods
Pruning the data to produce a smaller but denser subset:
- Prune languages by minimum features: require each language to have at least 25 features for the whole-world set, or 10 features for comparisons across subfamilies.
- Prune features by minimum coverage: drop features that do not cover more than 10% of the selected languages in the whole-world set, or 25% in comparisons across subfamilies.
- Use a dense language family.
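The two pruning steps above can be sketched as follows. This is only an illustration under assumptions: the `Table` layout (language mapped to its filled feature cells), the function names, and the toy thresholds are hypothetical and are not taken from WALS or from the original experiments.

```python
from typing import Dict

# Assumed layout: language -> {feature -> value}; a missing key is an empty cell.
Table = Dict[str, Dict[str, str]]

def prune_languages(table: Table, min_features: int) -> Table:
    """Keep only languages with at least `min_features` filled cells."""
    return {lang: feats for lang, feats in table.items()
            if len(feats) >= min_features}

def prune_features(table: Table, min_coverage: float) -> Table:
    """Drop features that do not cover more than `min_coverage` of the languages."""
    if not table:
        return {}
    counts: Dict[str, int] = {}
    for feats in table.values():
        for name in feats:
            counts[name] = counts.get(name, 0) + 1
    keep = {name for name, c in counts.items() if c / len(table) > min_coverage}
    return {lang: {f: v for f, v in feats.items() if f in keep}
            for lang, feats in table.items()}

# Toy data with made-up feature names and values.
toy = {
    "lang_a": {"f1": "SOV", "f2": "yes", "f3": "2"},
    "lang_b": {"f1": "SVO", "f2": "no"},
    "lang_c": {"f1": "SVO"},
}
# Languages are pruned first, so coverage is measured on the selected languages.
dense = prune_features(prune_languages(toy, min_features=2), min_coverage=0.5)
print(dense)
```

Pruning languages before features mirrors the wording "of the selected languages"; the opposite order would generally give a different subset.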
Features and Feature Values
- The raw representation of the feature values cannot be used directly with distance measures.
- Solution: binarization.
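One common way to binarize categorical features is to turn every (feature, value) pair into its own 0/1 column. The sketch below assumes that reading; the function name, the data layout, and the choice to encode an empty cell as all zeros are illustrative assumptions, not details confirmed by the slides.

```python
from typing import Dict, List, Tuple

# Assumed layout: language -> {feature -> value}; a missing key is an empty cell.
Table = Dict[str, Dict[str, str]]

def binarize(table: Table) -> Tuple[List[Tuple[str, str]], Dict[str, List[int]]]:
    """Map each observed (feature, value) pair to one binary column."""
    columns = sorted({(f, v) for feats in table.values() for f, v in feats.items()})
    vectors = {
        # An empty cell yields 0 in every column of that feature (an assumption).
        lang: [1 if feats.get(f) == v else 0 for f, v in columns]
        for lang, feats in table.items()
    }
    return columns, vectors

toy = {
    "lang_a": {"order": "SOV"},
    "lang_b": {"order": "SVO", "tone": "yes"},
}
columns, vectors = binarize(toy)
print(columns)
print(vectors)
```

After this step the categories carry no artificial ordering, so cosine or Euclidean measures on the vectors no longer treat one category as being "closer" to another.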
Experimental Setup
Q1: how do induced clusters compare to phylogenetic groupings?
Clustering methods:
- the k-medoids algorithm
- methods from the CLUTO toolkit: repeated-bisection (rb), a k-means implementation (direct), an agglomerative algorithm (agglo) using UPGMA to produce hierarchical clusters, and bagglo, a variant of agglo
Similarity measures:
- CLUTO's default cosine similarity measure (cos)
- shared overlap:
\[
\text{shared overlap} = \frac{\#\ \text{features with same values}}{\#\ \text{features both filled out in WALS}}
\]
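The shared overlap measure can be computed directly from two languages' filled cells. This is a minimal sketch; the function name, the dictionary layout, and returning `None` when no feature is filled for both languages are assumptions made for illustration.

```python
from typing import Dict, Optional

def shared_overlap(a: Dict[str, str], b: Dict[str, str]) -> Optional[float]:
    """Share of features with the same value, among features filled for both."""
    both = [f for f in a if f in b]
    if not both:
        return None  # undefined when the two languages share no filled feature
    same = sum(1 for f in both if a[f] == b[f])
    return same / len(both)

# Toy languages with made-up features: f1 and f2 are filled for both, f1 agrees.
lang_a = {"f1": "x", "f2": "y", "f3": "z"}
lang_b = {"f1": "x", "f2": "q", "f4": "z"}
print(shared_overlap(lang_a, lang_b))  # 0.5
```

Because the denominator counts only cells filled for both languages, the measure sidesteps empty cells, but it becomes unreliable when that count is very small.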
Clustering Performance Metrics
The genetic families serve as the gold standard.
- Rand Index
- Cluster Precision, Recall, and F-Score
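Both metrics can be computed by comparing every pair of languages under the induced clustering and under the gold families. The sketch below assumes the pairwise formulation of cluster precision and recall; the slides do not spell out the exact definition used, and the function names and toy labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple

def pair_counts(pred: Dict[str, int], gold: Dict[str, str]) -> Tuple[int, int, int, int]:
    """Count language pairs by agreement: (tp, fp, fn, tn)."""
    tp = fp = fn = tn = 0
    for x, y in combinations(sorted(pred), 2):
        same_pred = pred[x] == pred[y]
        same_gold = gold[x] == gold[y]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(pred: Dict[str, int], gold: Dict[str, str]) -> float:
    """Fraction of pairs on which the two groupings agree (needs >= 2 languages)."""
    tp, fp, fn, tn = pair_counts(pred, gold)
    return (tp + tn) / (tp + fp + fn + tn)

def pairwise_prf(pred: Dict[str, int], gold: Dict[str, str]) -> Tuple[float, float, float]:
    """Pairwise precision, recall, and F-score of the induced clusters."""
    tp, fp, fn, _ = pair_counts(pred, gold)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy example: four languages, two gold families, two induced clusters.
gold = {"a": "Fam1", "b": "Fam1", "c": "Fam2", "d": "Fam2"}
pred = {"a": 0, "b": 0, "c": 0, "d": 1}
print(rand_index(pred, gold))
print(pairwise_prf(pred, gold))
```

In the toy example the six pairs split into 1 true positive, 2 false positives, 1 false negative, and 2 true negatives, so the Rand index is 3/6.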
Prediction Accuracy
Q2: how do induced clusters and genetic families compare in predicting the values of features for languages in the same group?
Q3: what typological features tend to stay the same within related families?
Prediction accuracy:
- Use 90% of the filled cells to build clusters.
- Predict the values of the remaining 10% of filled cells.
- Calculate accuracy by comparing these predicted values with the actual values in the gold standard.
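The evaluation loop above can be sketched as below. Several parts are assumptions rather than the original procedure: the groups are passed in ready-made instead of being clustered from the 90% split, a held-out value is predicted by majority vote among same-group languages, and a cell with no available vote counts as an error. All names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Tuple

Table = Dict[str, Dict[str, str]]  # language -> {feature -> value}
Cell = Tuple[str, str]             # (language, feature)

def split_cells(table: Table, held_out: float = 0.1,
                seed: int = 0) -> Tuple[List[Cell], List[Cell]]:
    """Shuffle the filled cells and hold out a fraction of them for testing."""
    cells = sorted((lang, feat) for lang, feats in table.items() for feat in feats)
    random.Random(seed).shuffle(cells)
    n_test = int(len(cells) * held_out)
    return cells[n_test:], cells[:n_test]

def predict(table: Table, groups: Dict[str, str], train: List[Cell],
            lang: str, feat: str) -> Optional[str]:
    """Majority value of `feat` among training cells in the same group as `lang`."""
    votes = Counter(
        table[other][feat]
        for other, f in train
        if f == feat and other != lang and groups[other] == groups[lang]
    )
    return votes.most_common(1)[0][0] if votes else None

def prediction_accuracy(table: Table, groups: Dict[str, str],
                        held_out: float = 0.1, seed: int = 0) -> Optional[float]:
    """Share of held-out cells whose value is predicted correctly."""
    train, test = split_cells(table, held_out, seed)
    if not test:
        return None
    correct = sum(predict(table, groups, train, lang, feat) == table[lang][feat]
                  for lang, feat in test)
    return correct / len(test)

# Toy data: 4 languages x 5 made-up features, two hypothetical groups.
toy = {lang: {f"f{i}": "x" for i in range(5)} for lang in ("a", "b", "c", "d")}
groups = {"a": "G1", "b": "G1", "c": "G2", "d": "G2"}
print(prediction_accuracy(toy, groups))
```

Running the same function once with induced cluster labels and once with genetic family labels as `groups` gives the two accuracies that Q2 compares.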
Results & Analysis
- Cluster Similarity
- Prediction Accuracy
- Feature Selection
Error Analysis
- Language Similarity vs. Genetic
- WALS as the Dataset
- The Feature Set in WALS
- Data Sparsity and Shared Features
Areal Linguistics
The use of areas improves genetic reconstruction of languages according to a variety of metrics.
Basic ideas:
- Develop a Bayesian model of typology that allows for the existence of linguistic areas.
- Model a preference for some features to be shared areally.
- Show that reconstructing language family trees is significantly aided by knowledge of areal features.
Areal Linguistics
Some of the well-known linguistic areas:
- The Balkans: Albanian, Bulgarian, Greek, Macedonian, Rumanian and Serbo-Croatian (sometimes also Romani and Turkish).
- The Baltic: Baltic languages, Baltic German, and Finnic languages (especially Estonian and Livonian).
Linguistic features most easily shared areally:
- Ross (1988): nouns > verbs > adjectives > syntax > non-bound function words > bound morphemes > phonemes
- Curnow (2001): 15 categories of borrowable features, including phonetics (rare), phonology (common), lexical (very common)
A Bayesian Model for Areal Linguistics
- Pitman-Yor process for modeling linguistic areas
- Kingman's coalescent for modeling linguistic phylogeny
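To give a feel for how a Pitman-Yor process induces a partition (here, languages into areas), the sketch below draws one partition with the standard Chinese-restaurant-style seating rule. It is a generic prior sampler only: it is not the model or inference procedure from the talk, and the function name and parameter values are illustrative assumptions.

```python
import random
from typing import List, Tuple

def pitman_yor_partition(n: int, discount: float, concentration: float,
                         seed: int = 0) -> Tuple[List[int], List[int]]:
    """Sample a partition of n items from a Pitman-Yor prior.

    With i items placed in K blocks, the next item joins block k with
    probability (size_k - discount) / (i + concentration) and opens a new
    block with probability (concentration + discount * K) / (i + concentration).
    """
    if not 0 <= discount < 1 or concentration <= -discount:
        raise ValueError("need 0 <= discount < 1 and concentration > -discount")
    rng = random.Random(seed)
    sizes: List[int] = []       # sizes[k] = number of items in block k
    assignment: List[int] = []  # assignment[i] = block index of item i
    for _ in range(n):
        if not sizes:
            block = 0  # the first item always opens the first block
        else:
            weights = [size - discount for size in sizes]
            weights.append(concentration + discount * len(sizes))
            block = rng.choices(range(len(weights)), weights=weights)[0]
        if block == len(sizes):
            sizes.append(1)
        else:
            sizes[block] += 1
        assignment.append(block)
    return assignment, sizes

# Example draw: 20 hypothetical languages assigned to areas.
assignment, sizes = pitman_yor_partition(20, discount=0.5, concentration=1.0)
print(assignment)
print(sizes)
```

A larger discount tends to yield more small blocks, which is one reason the process is a common prior when the number of groups is not known in advance.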
Identifying Language Areas
Identifying Areal Features
Genetic Reconstruction
Conclusion
1. Comparing clusters derived from typological features to genetic groups in the world's languages:
   - The induced clusters look very different from the genetic groupings.
   - Despite the differences, induced clusters show similar, or even greater, levels of typological similarity than genetic groupings.
2. The use of areas improves genetic reconstruction of languages.
References
- Ryan Georgi, Fei Xia, William Lewis (2010). Comparing Language Similarity across Genetic and Typologically-Based Groupings.
- Hal Daumé III (2009). Non-Parametric Bayesian Areal Linguistics.
Thank You!