Creating Large-Scale Multilingual Cognate Tables
Winston Wu and David Yarowsky
Center for Language and Speech Processing, Johns Hopkins University
http://educationviews.org/wp-content/uploads/2013/06/world-bread-cognates-panis.jpg
Cognates and Cognate Chains
Data • PanLex and Wiktionary
Cognate Table Construction • Initial cluster with unweighted edit distance • Alignment to get lexical translation probabilities • Cluster with weighted distance function
Clustering • tuk: stol • uig: ustel • azj: stol • tur: tablo • uzn: stol • tat: ostal • tuk: tablisa • tat: tablis • uzn: tablista
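The initial clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the length normalization, the single-link merging rule, and the `threshold` value are all assumptions made for the example; only the word list comes from the slide.

```python
def levenshtein(a, b):
    """Unweighted edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer word's length (an assumption)."""
    return levenshtein(a, b) / max(len(a), len(b), 1)


def single_link_clusters(entries, threshold):
    """Merge any two (lang, word) entries whose distance is <= threshold."""
    parent = list(range(len(entries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            if normalized_distance(entries[i][1], entries[j][1]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, entry in enumerate(entries):
        groups.setdefault(find(i), []).append(entry)
    return list(groups.values())


# Words for "table" from the slide, as (language code, word) pairs.
ENTRIES = [("tuk", "stol"), ("uig", "ustel"), ("azj", "stol"),
           ("tur", "tablo"), ("uzn", "stol"), ("tat", "ostal"),
           ("tuk", "tablisa"), ("tat", "tablis"), ("uzn", "tablista")]

# The threshold is a hypothetical value chosen for illustration only.
clusters = single_link_clusters(ENTRIES, threshold=0.45)
for cluster in clusters:
    print(sorted(cluster))
```

A lower threshold splits the words into more, smaller clusters, so the grouping depends heavily on this assumed parameter.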
Bitext from Clusters
eng    azj   tat     tuk      tur    uig    uzn
table  stol  -       stol     -      -      stol
table  -     ostal   -        -      ustel  -
table  -     -       -        tablo  -      -
table  -     tablis  tablisa  -      -      tablista
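One way to turn clusters into bitext is to emit every cross-language word pair within a cluster, written as space-separated characters so that a word aligner treats each character as a token. The sketch below assumes that representation; the function name `cluster_to_bitext` and the output tuple layout are hypothetical.

```python
from itertools import combinations


def cluster_to_bitext(cluster):
    """Return (lang1, lang2, source_chars, target_chars) for each language pair.

    `cluster` is a list of (language code, word) tuples. Words from the same
    language are not paired with each other.
    """
    pairs = []
    for (lang1, word1), (lang2, word2) in combinations(sorted(cluster), 2):
        if lang1 != lang2:
            pairs.append((lang1, lang2, " ".join(word1), " ".join(word2)))
    return pairs


# The fourth row of the table above: tablis / tablisa / tablista.
row = [("tat", "tablis"), ("tuk", "tablisa"), ("uzn", "tablista")]
for pair in cluster_to_bitext(row):
    print(pair)
```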
Alignment
UIG: ü s t e l
TAT: o s t o l

t -> t  0.600    l -> l  0.747    h -> h     0.529
t -> d  0.098    l -> r  0.048    h -> u     0.150
t -> c  0.061    l -> n  0.024    h -> NULL  0.140
t -> r  0.057    l -> t  0.019    h -> l     0.048
t -> p  0.019    l -> o  0.018    h -> a     0.032
t -> s  0.017    l -> d  0.016    h -> j     0.019
t -> l  0.017    l -> c  0.015    h -> o     0.017
t -> n  0.015    l -> a  0.015    h -> k     0.015
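Translation probabilities like these can feed a weighted edit distance. The sketch below assumes one simple mapping from probability to cost (identical characters cost 0, a substitution a -> b costs 1 - p(a -> b), a deletion uses the NULL entry, an insertion costs 1); the slides do not state the actual cost function, so treat this purely as an illustration. `PROBS` copies a few non-identity entries from the table.

```python
# A few entries from the table above: (source char, target char) -> probability.
PROBS = {
    ("t", "d"): 0.098, ("t", "c"): 0.061,
    ("l", "r"): 0.048, ("l", "n"): 0.024,
    ("h", "u"): 0.150, ("h", "NULL"): 0.140,
}


def sub_cost(a, b, probs):
    """Assumed cost: 0 for a match, else 1 - p(a -> b); unseen pairs cost 1."""
    if a == b:
        return 0.0
    return 1.0 - probs.get((a, b), 0.0)


def del_cost(a, probs):
    """Assumed deletion cost from the probability of aligning a to NULL."""
    return 1.0 - probs.get((a, "NULL"), 0.0)


def weighted_edit_distance(src, tgt, probs):
    """Edit distance where likely character correspondences are cheaper."""
    prev = [float(j) for j in range(len(tgt) + 1)]  # insertions cost 1 each
    for a in src:
        curr = [prev[0] + del_cost(a, probs)]
        for j, b in enumerate(tgt, 1):
            curr.append(min(prev[j] + del_cost(a, probs),          # delete a
                            curr[j - 1] + 1.0,                     # insert b
                            prev[j - 1] + sub_cost(a, b, probs)))  # substitute
        prev = curr
    return prev[-1]


print(weighted_edit_distance("tal", "dal", PROBS))
print(weighted_edit_distance("tal", "kal", PROBS))
```

Under these assumptions, a pair that differs only by a frequently aligned substitution scores as closer than a pair that differs by an unseen one.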
Clustering Distance Function • Language-pair-specific edit distance • Intra-family edit distance • Same backtranslation • Same POS • Same MeaningID
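The features listed above could be combined in many ways; the slide does not say how. The sketch below assumes a weighted sum, with the three categorical features encoded as 0 when the two entries agree and 1 when they differ. The `Entry` fields, the weight names, and the `difflib`-based stand-in for both edit distances are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Entry:
    lang: str
    word: str
    backtranslation: str
    pos: str
    meaning_id: str


def combined_distance(a, b, pair_dist, family_dist, weights):
    """Weighted sum of the slide's five features (the combination is assumed)."""
    features = {
        "pair_edit": pair_dist(a, b),      # language-pair-specific edit distance
        "family_edit": family_dist(a, b),  # intra-family edit distance
        "diff_backtranslation": float(a.backtranslation != b.backtranslation),
        "diff_pos": float(a.pos != b.pos),
        "diff_meaning_id": float(a.meaning_id != b.meaning_id),
    }
    return sum(weights[name] * value for name, value in features.items())


def stand_in_distance(a, b):
    """Placeholder for a trained edit distance: 1 minus difflib's similarity."""
    return 1.0 - SequenceMatcher(None, a.word, b.word).ratio()


# Hypothetical weights; in practice they would be tuned or learned.
WEIGHTS = {"pair_edit": 1.0, "family_edit": 0.5, "diff_backtranslation": 0.5,
           "diff_pos": 0.25, "diff_meaning_id": 0.25}

tat = Entry("tat", "ostal", "table", "noun", "m1")
uig = Entry("uig", "ustel", "table", "noun", "m1")
print(combined_distance(tat, uig, stand_in_distance, stand_in_distance, WEIGHTS))
```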
Cognate Tables
Experiments • Hold out words • Use MT to predict • Single language pair and system combination • Evaluate on 1-best, 10-best, MRR
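The three evaluation measures can be computed from ranked prediction lists as in the sketch below. The function name `evaluate` and the toy predictions are made up for illustration; MRR here is the mean of 1/rank of the correct word, counting 0 when it does not appear in the list.

```python
def evaluate(ranked_predictions, gold, k=10):
    """Compute 1-best accuracy, k-best accuracy, and mean reciprocal rank.

    ranked_predictions: one list of candidate words per held-out item, best first.
    gold: the correct word for each held-out item.
    """
    n = len(gold)
    top1 = topk = mrr = 0.0
    for ranked, answer in zip(ranked_predictions, gold):
        if ranked[:1] == [answer]:
            top1 += 1
        if answer in ranked[:k]:
            topk += 1
        if answer in ranked:
            mrr += 1.0 / (ranked.index(answer) + 1)
    return {"1-best": top1 / n, f"{k}-best": topk / n, "MRR": mrr / n}


# Toy example: three held-out words with system outputs ranked best-first.
predictions = [["stol", "stul"], ["tabla", "tablo"], ["masa"]]
gold = ["stol", "tablo", "tablis"]
scores = evaluate(predictions, gold)
print(scores)
```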
Results: Romance
Results: Turkic
Conclusion • Cluster-alignment-cluster process for multilingual cognate table construction • Experiments: 1-best exact match accuracy is hard; close languages tend to do better; data size matters • Code and data at github.com/wswu/coglust