Language change as a random walk in vector space
Gerhard Jäger
Tübingen University, Department of Linguistics
Cluster Colloquium Machine Learning in Science, Cluster of Excellence Machine Learning, Tübingen, July 23, 2019
Introduction
Language change and evolution
German: Vater Unser im Himmel, geheiligt werde Dein Name
Dutch: Onze Vader in de Hemel, laat Uw Naam geheiligd worden
English: Our Father in heaven, hallowed be your name
Danish: Fader Vor, du som er i himlene! Helliget vorde dit navn
Language change and evolution
Middle High German: Got vater unser, dâ du bist in dem himelrîche gewaltic alles des dir ist, geheiliget sô werde dîn nam
Old High German: Fater unser thû thâr bist in himile, si giheilagôt thîn namo
Gothic: Atta unsar þu in himinam, weihnai namo þein
Convergent evolution
• Old English docga > English dog
• Proto-Paman *gudaga > Mbabaram dog (‘dog’)
Language phylogeny
Comparative method
1. identifying cognates, i.e. obviously related morphemes in different languages, such as new/nowy, two/dwa, or water/voda
2. reconstruction of the common ancestor and of the sound laws that explain the change from reconstructed to observed forms
3. applying this iteratively leads to phylogenetic language trees
Language phylogeny
Scope of the method
• reconstructed vocabulary shrinks with growing time depth
• maximal time horizon seems to be about 8,000 years
• grammatical morphemes and categories are arguably more stable and less prone to borrowing
• problem here: limited number of features, cross-linguistic variation constrained by language universals, frequent convergent evolution
• comparative method is hard to apply in regions with high linguistic diversity and without written documents (Paleo-America, Papua)
• tree structure might be inappropriate if there is a significant effect of language contact (cf. Australia)
Computational Methods
• both cognate detection and tree construction lend themselves to algorithmic implementation
• Advantages:
  • easy to scale up
  • comparability of results
  • affords statistical evaluation
• Disadvantages:
  • cognacy judgments require lots of linguistic insight and experience
  • tree construction should be subject to historical (including archeological) and geographical plausibility
From words to trees
Swadesh lists → training pair-Hidden Markov Model → sound similarities → applying pair-Hidden Markov Model → word alignments → classification/clustering → cognate classes → feature extraction → character matrix → Bayesian phylogenetic inference → phylogenetic tree
From words to trees
[Figure: language tree with family and area labels, including Khoisan, Altaic, Niger-Congo, Afro-Asiatic, NW Eurasia, Australia/Papua, Torricelli, Sepik, Trans-NewGuinea, Otomanguean, Cariban, Tucanoan, Penutian, Austronesian, Algic, Uto-Aztecan, Mayan, Salish, Hmong-Mien, Sino-Tibetan, Tai-Kadai, Timor-Alor-Pantar, Austro-Asiatic]
From word lists to distances
The Automated Similarity Judgment Program
• Project at MPI EVA in Leipzig around Søren Wichmann
• covers more than 6,000 languages and dialects
• basic vocabulary of 40 words for each language, in uniform phonetic transcription
• freely available
Concepts used: I, you, we, one, two, person, fish, dog, louse, tree, leaf, skin, blood, bone, horn, ear, eye, nose, tooth, tongue, knee, hand, breast, liver, drink, see, hear, die, come, sun, star, water, stone, fire, path, mountain, night, full, new, name
Automated Similarity Judgment Program

concept  Latin           English   concept  Latin          English
I        ego             Ei        nose     nasus          nos
you      tu              yu        tooth    dens           tu8
we       nos             wi        tongue   liNgw~E        t3N
one      unus            w3n       knee     genu           ni
two      duo             tu        hand     manus          hEnd
person   persona, homo   pers3n    breast   pektus, mama   brest
fish     piskis          fiS       liver    yekur          liv3r
dog      kanis           dag       drink    bibere         drink
louse    pedikulus       laus      see      widere         si
tree     arbor           tri       hear     audire         hir
leaf     foly~u*         lif       die      mori           dEi
skin     kutis           skin      come     wenire         k3m
blood    saNgw~is        bl3d      sun      sol            s3n
bone     os              bon       star     stela          star
horn     kornu           horn      water    akw~a          wat3r
ear      auris           ir        stone    lapis          ston
eye      okulus          Ei        fire     iNnis          fEir
Word distances
• based on string alignment
• baseline: Levenshtein alignment ⇒ count matches and mismatches
• too crude as it totally ignores sound correspondences
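The Levenshtein baseline can be sketched in a few lines of Python. This is a minimal illustration and not the implementation behind the talk: the function names are hypothetical, and the normalization (dividing by the length of the longer word to obtain LDN) is an assumption about how the normalized variant is defined.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # deleting the first i symbols of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1     # binary match / mismatch, nothing in between
            curr.append(min(prev[j] + 1,            # delete ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + cost))    # match or substitute
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer word's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# ASJP-style transcriptions taken from the slides
d_hand_en = ldn("hant", "hEnd")   # 2 substitutions over 4 symbols
d_hand_ro = ldn("hant", "mano")   # also 2 substitutions over 4 symbols
d_fish = levenshtein("fiS", "piskis")
```

Under these assumptions hant/hEnd and hant/mano receive the same score, because every non-identical sound pair costs the same regardless of whether the sounds regularly correspond.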
How well does normalized Levenshtein distance predict cognacy?
[Figure: empirical probability of cognacy as a function of normalized Levenshtein distance (LDN); distribution of LDN for cognate (yes) and non-cognate (no) word pairs]
Problems
• binary distinction: match vs. non-match
• genuine sound correspondences in cognates are frequently missed:
    c v a i      n a z 3      - - - f i S
    - - t u      n - o s      p i s k i s
• corresponding sounds count as mismatches even if they are aligned correctly:
    h a n t      h a n t
    h E n d      m a n o
• substantial amount of chance similarities
Capturing sound correspondences
• weighted alignment using Pointwise Mutual Information (PMI, a.k.a. log-odds):
  $$s(a,b) = \log \frac{p(a,b)}{q(a)\,q(b)}$$
• $p(a,b)$: probability of sound $a$ being etymologically related to sound $b$ in a pair of cognates
• $q(a)$: relative frequency of sound $a$
• Needleman-Wunsch algorithm: given a matrix of pairwise PMI scores between individual symbols and two strings, it returns the alignment that maximizes the aggregate PMI score
• but first we need to estimate $p(a,b)$ and $q(a)$, $q(b)$ for all sound classes $a$ and $b$
• $q(a)$: relative frequency of occurrence of segment $a$ in all words in ASJP
• $p(a,b)$: that’s a bit more complicated...
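The following Python sketch shows how a PMI score and a Needleman-Wunsch alignment fit together. It is an illustration under stated assumptions, not the code used for the talk: the linear gap penalty, the `toy_pmi` values, and the fallback scores for identical and unlisted sound pairs are all made up for the example, since the slides do not specify them.

```python
import math


def pmi(p_ab: float, q_a: float, q_b: float) -> float:
    """s(a, b) = log( p(a, b) / (q(a) * q(b)) )."""
    return math.log(p_ab / (q_a * q_b))


def needleman_wunsch(x, y, score, gap=-1.0):
    """Return (best total score, aligned x, aligned y) with '-' marking gaps."""
    n, m = len(x), len(y)
    # dp[i][j] = best score for aligning x[:i] with y[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return dp[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical PMI values, chosen only to make the example concrete.
toy_pmi = {("a", "E"): 1.0, ("t", "d"): 1.5}


def toy_score(a: str, b: str) -> float:
    if a == b:
        return 2.0                       # assumed score for identical sounds
    return toy_pmi.get((a, b), toy_pmi.get((b, a), -2.0))  # assumed default penalty


total, top_row, bottom_row = needleman_wunsch("hant", "hEnd", toy_score)
example_pmi = pmi(0.02, 0.1, 0.1)        # a pair co-occurring twice as often as chance
```

With graded scores of this kind, a/E and t/d contribute positively to the alignment of hant and hEnd instead of counting as plain mismatches.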
Substitution matrix for the ASJP data
1. identify a large sample of pairs of closely related languages (using expert information or heuristics based on aggregated Levenshtein distance)
Languages appearing in the sample shown on the slide: An.NORTHERN_PHILIPPINES.CENTRAL_BONTOC, An.SOUTHERN_PHILIPPINES.KAGAYANEN, An.MESO-PHILIPPINE.NORTHERN_SORSOGON, An.NORTHERN_PHILIPPINES.LIMOS_KALINGA, WF.WESTERN_FLY.IAMEGA, An.MESO-PHILIPPINE.CANIPAAN_PALAWAN, WF.WESTERN_FLY.GAMAEWE, An.NORTHWEST_MALAYO-POLYNESIAN.LAHANAN, Pan.PANOAN.KASHIBO_BAJO_AGUAYTIA, NC.BANTOID.LIFONGA, Pan.PANOAN.KASHIBO_SAN_ALEJANDRO, NC.BANTOID.BOMBOMA_2, AA.EASTERN_CUSHITIC.KAMBAATA_2, IE.INDIC.WAD_PAGGA, AA.EASTERN_CUSHITIC.HADIYYA_2, IE.INDIC.TALAGANG_HINDKO, ST.BAI.QILIQIAO_BAI_2, NC.BANTOID.LINGALA, ST.BAI.YUNLONG_BAI, NC.BANTOID.LIFONGA, An.SULAWESI.MANDAR, An.CENTRAL_MALAYO-POLYNESIAN.BALILEDO, An.OCEANIC.RAGA, An.CENTRAL_MALAYO-POLYNESIAN.PALUE, An.SULAWESI.TANETE, AuA.MUNDA.HO, An.SAMA-BAJAW.BOEPINANG_BAJAU, AuA.MUNDA.KORKU
Substitution matrix for the ASJP data
2. pick a concept and a pair of related languages at random
   • languages: Pen.MAIDUAN.MAIDU_KONKAU, Pen.MAIDUAN.NE_MAIDU
   • concept: one
3. find corresponding words from the two languages:
   • nisam, niSem
4. do Levenshtein alignment
     n i s a m
     n i S e m
5. for each sound pair, count number of correspondences
   • nn: 1; ii: 1; sS: 1; ae: 1; mm: 1
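Steps 4 and 5 can be sketched as follows in Python. This is a minimal illustration with hypothetical function names. Skipping gap positions when counting is an assumption, since this part of the slides does not say how gaps are treated or how the counts are turned into the estimate of p(a, b).

```python
from collections import Counter


def levenshtein_alignment(a: str, b: str):
    """Return one optimal Levenshtein alignment as a list of (symbol_a, symbol_b) pairs."""
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match / substitute
                           dp[i - 1][j] + 1,                           # gap in b
                           dp[i][j - 1] + 1)                           # gap in a
    # Trace back, preferring diagonal moves so that sounds are paired where possible.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((a[i - 1], "-")); i -= 1
        else:
            pairs.append(("-", b[j - 1])); j -= 1
    pairs.reverse()
    return pairs


def count_correspondences(word_pairs):
    """Accumulate sound-pair counts over many word pairs, ignoring gaps (assumption)."""
    counts = Counter()
    for w1, w2 in word_pairs:
        for x, y in levenshtein_alignment(w1, w2):
            if x != "-" and y != "-":
                counts[(x, y)] += 1
    return counts


# The Maiduan example from the slide: 'one' in the two sampled languages
alignment = levenshtein_alignment("nisam", "niSem")
counts = count_correspondences([("nisam", "niSem")])
```

For the single word pair from the slide this reproduces the counts nn, ii, sS, ae and mm, each with value 1. In a full run the same counter would be updated over many randomly drawn concept and language-pair combinations.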