
Creating Large-Scale Multilingual Cognate Tables



  1. Creating Large-Scale Multilingual Cognate Tables Winston Wu and David Yarowsky Center for Language and Speech Processing Johns Hopkins University

  2. http://educationviews.org/wp-content/uploads/2013/06/world-bread-cognates-panis.jpg

  3. Cognates and Cognate Chains

  4. Data • Panlex and Wiktionary

  5. Cognate Table Construction • Initial cluster with unweighted edit distance • Alignment to get lexical translation probabilities • Cluster with weighted distance function
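
The first step of this pipeline, an initial clustering by unweighted edit distance, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the single-link strategy, the length normalization, and the `threshold=0.5` cutoff are all assumptions, and the word list is taken from the Turkic example on the Clustering slide.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Unweighted Levenshtein distance: insert, delete, substitute all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (assumed normalization)."""
    return edit_distance(a, b) / max(len(a), len(b), 1)

def cluster(words, threshold=0.5):
    """Single-link clustering with union-find (assumed strategy and threshold).

    `words` is a list of (language_code, word) pairs; two entries are linked
    when their normalized edit distance is strictly below `threshold`.
    """
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(words)), 2):
        if normalized_distance(words[i][1], words[j][1]) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, entry in enumerate(words):
        groups.setdefault(find(i), []).append(entry)
    return list(groups.values())

# Translations of "table" from the Clustering slide.
words = [("tuk", "stol"), ("uig", "ustel"), ("azj", "stol"),
         ("tur", "tablo"), ("uzn", "stol"), ("tat", "ostal"),
         ("tuk", "tablisa"), ("tat", "tablis"), ("uzn", "tablista")]

for group in cluster(words):
    print(group)
```

With these assumed settings the "stol"-like and "tablo"-like forms fall into separate groups; a different threshold or linkage criterion could split or merge them differently.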

  6. Clustering tuk: stol uig: ustel azj: stol tur: tablo uzn: stol tat: ostal tuk: tablisa tat: tablis uzn: tablista

  7. Bitext from Clusters eng azj tat tuk tur uig uzn table stol stol stol table ostal ustel table tablo table tablis tablisa tablista

  8. Alignment
     UIG: ü s t e l
     TAT: o s t o l
     t -> t 0.600    l -> l 0.747    h -> h 0.529
     t -> d 0.098    l -> r 0.048    h -> u 0.150
     t -> c 0.061    l -> n 0.024    h -> NULL 0.140
     t -> r 0.057    l -> t 0.019    h -> l 0.048
     t -> p 0.019    l -> o 0.018    h -> a 0.032
     t -> s 0.017    l -> d 0.016    h -> j 0.019
     t -> l 0.017    l -> c 0.015    h -> o 0.017
     t -> n 0.015    l -> a 0.015    h -> k 0.015
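
One plausible way to turn character translation probabilities like these into a weighted edit distance is to charge each substitution its negative log probability. The sketch below assumes that cost function; the slides do not state the exact formula. Only the entries in `SUB_PROB` come from the slide. `DEFAULT_MATCH`, `DEFAULT_MISMATCH`, and `GAP_PROB` are hypothetical fallbacks for pairs the slide does not list.

```python
import math

# A few entries copied from the alignment slide; (char, None) is char -> NULL.
SUB_PROB = {
    ("t", "t"): 0.600, ("t", "d"): 0.098, ("t", "c"): 0.061,
    ("l", "l"): 0.747, ("l", "r"): 0.048, ("l", "n"): 0.024,
    ("h", "h"): 0.529, ("h", "u"): 0.150, ("h", None): 0.140,
}

# Hypothetical fallback values, NOT from the slides.
DEFAULT_MATCH = 0.6      # identical characters with no table entry
DEFAULT_MISMATCH = 0.01  # differing characters with no table entry
GAP_PROB = 0.05          # insertion, or deletion with no NULL entry

def sub_cost(a: str, b: str) -> float:
    """Cost of rewriting source char a as target char b: -log P(a -> b)."""
    p = SUB_PROB.get((a, b))
    if p is None:
        p = DEFAULT_MATCH if a == b else DEFAULT_MISMATCH
    return -math.log(p)

def del_cost(a: str) -> float:
    """Cost of deleting source char a, using the a -> NULL entry if present."""
    return -math.log(SUB_PROB.get((a, None), GAP_PROB))

def weighted_edit_distance(src: str, tgt: str) -> float:
    """Levenshtein-style dynamic program with probability-derived costs.

    The result is directional: costs come from P(src char -> tgt char).
    """
    ins = -math.log(GAP_PROB)
    prev = [j * ins for j in range(len(tgt) + 1)]
    for a in src:
        curr = [prev[0] + del_cost(a)]
        for j, b in enumerate(tgt, start=1):
            curr.append(min(prev[j] + del_cost(a),          # delete a
                            curr[j - 1] + ins,              # insert b
                            prev[j - 1] + sub_cost(a, b)))  # a -> b
        prev = curr
    return prev[-1]

# A listed substitution (t -> d) is cheaper than an unlisted one (t -> k).
print(weighted_edit_distance("t", "d"), weighted_edit_distance("t", "k"))
```

Because the probabilities are estimated per language pair, the same character change can be cheap for one pair and expensive for another, which is what lets the second clustering pass tell regular sound correspondences apart from chance similarity.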

  9. Clustering Distance Function • Language-pair-specific edit distance • Intra-family edit distance • Same backtranslation • Same POS • Same MeaningID

  10. Cognate Tables

  11. Experiments • Hold out words • Use MT to predict • Single language pair and system combination • Evaluate on 1-best, 10-best, MRR
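
The three evaluation measures named on this slide have standard definitions, sketched below. The function names and the toy predictions are invented for illustration and are not results from the presentation.

```python
def reciprocal_rank(candidates, gold):
    """1 / rank of the gold word in the ranked candidate list, or 0 if absent."""
    for rank, cand in enumerate(candidates, start=1):
        if cand == gold:
            return 1.0 / rank
    return 0.0

def evaluate(predictions, k=10):
    """Compute 1-best accuracy, k-best accuracy, and mean reciprocal rank.

    `predictions` is a list of (ranked_candidates, gold_word) pairs,
    one per held-out word.
    """
    n = len(predictions)
    if n == 0:
        return {"1-best": 0.0, f"{k}-best": 0.0, "MRR": 0.0}
    one_best = sum(cands[:1] == [gold] for cands, gold in predictions) / n
    k_best = sum(gold in cands[:k] for cands, gold in predictions) / n
    mrr = sum(reciprocal_rank(cands, gold) for cands, gold in predictions) / n
    return {"1-best": one_best, f"{k}-best": k_best, "MRR": mrr}

# Toy example (hypothetical system output for three held-out words).
toy = [
    (["stol", "stal"], "stol"),     # gold at rank 1
    (["ostol", "ostal"], "ostal"),  # gold at rank 2
    (["tabla"], "tablo"),           # gold not predicted
]
print(evaluate(toy))
```

1-best accuracy rewards only an exact match at the top of the list, while 10-best and MRR give credit when the correct form appears lower in the ranking.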

  12. Results: Romance

  13. Results: Romance

  14. Results: Turkic

  15. Results: Turkic

  16. Results: Romance

  17. Results: Turkic

  18. Conclusion • Cluster-alignment-cluster process for multilingual cognate table construction • Experiments • 1-best exact match accuracy is hard! • Close languages tend to do better • Data size matters • Code and data at github.com/wswu/coglust
