
Typology & IGT, Robin Westphal, 13.07.16



  1. HS: Computational Linguistics for Low-Resource Languages
     Typology & IGT: The Online Database of Interlinear Text
     Robin Westphal, 13.07.16
     Institute for Computational Linguistics, University Heidelberg

  2. Papers

  3. Papers
     - Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World’s Languages (2006)
     - Automatically Identifying Computationally Relevant Typological Features (2008)
     by William D. Lewis & Fei Xia

  4. Overview

  5. Overview
     ODIN:
     1. What?
     2. Why?
     3. How?
     4. Practical Use?

  6. 1. What is ODIN?

  7. What is ODIN?
     “ODIN is a database of interlinear text ‘snippets’, harvested mostly from scholarly documents posted to the web”
     Developed by:
     - GOLD Community of Practice (Farrar and Lewis, 2006)
     - Electronic Metastructure for Endangered Languages Data efforts (EMELD)

  8. 2. Why develop ODIN?

  9. Why develop ODIN? - Problem
     The web contains a vast amount of maintained data, but the data is:
     - Spread everywhere
     - Not reachable through a uniform search strategy
     - Not easy to manipulate or use

  10. Why develop ODIN? - Solution
      A database like ODIN provides:
      - A summary of most IGT instances on the web
      - An easy-to-use search engine
      - A standardised presentation for easier access

  11. What is IGT?

  12. Reminder: What is IGT?
      - “Interlinear Glossed Text”
      - Lines: Source, Gloss, Translation (Baylin, 2001)

  13. IGT – Challenging benefits
      Challenges:
      - Unclear structural associations between elements
      - Descriptions of grammatical concepts are inconsistent
      Benefits:
      - Consistent format for mining & enrichment

  14. 3. How to get all the data?

  15. How to get data?
      1.) Find documents that could contain IGT.
      2.) Detect & extract IGT via resembling patterns.
      3.) Store in ODIN database.

  16. 3.1. Crawler

  17. Crawler
      | Query type (top 100) | Avg. no. of docs | Avg. no. of docs with IGT |
      |---|---|---|
      | Gram(s) | 1,184 | 239 |
      | Language name(s) | 1,314 | 259 |
      | Both grams and names | 1,536 | 289 |
      | Language words | 1,159 | 193 |
      Number of findings at the time the article was written: 150,000 / 1.5 million (10%)

  18. Crawler - Method 1
      Regex approach:
      - \t*(\()\d*\).*\n : first line begins with a number in parentheses
      - \t*.*\n : second line can be anything
      - \t*\'.*\n : third line begins with a quote
      - Check the first line against the surrounding language codes
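
The three-line pattern on this slide can be tried out with a short Python sketch. It is an illustration under assumptions, not the ODIN crawler itself: the pattern here requires at least one digit (`\d+`), accepts spaces as well as tabs for indentation, and the German sample text is invented for the demonstration.

```python
import re

# Three-line pattern following the slide: a numbered first line,
# an unconstrained second line, and a third line opening with a quote.
# Assumptions: \d+ instead of \d*, and spaces allowed next to tabs.
IGT_PATTERN = re.compile(
    r"^[ \t]*\(\d+\).*\n"   # line 1: begins with a number in parentheses
    r"[ \t]*.*\n"           # line 2: can be anything
    r"[ \t]*['‘’].*$",      # line 3: begins with a quote character
    re.MULTILINE,
)

def find_igt_candidates(text: str) -> list[str]:
    """Return every three-line block that matches the IGT pattern."""
    return [m.group(0) for m in IGT_PATTERN.finditer(text)]

# Invented sample document with one IGT instance between prose lines.
sample = (
    "Some running prose.\n"
    "(1) Der Hund schläft\n"
    "    the dog sleep-3SG\n"
    "    'The dog is sleeping'\n"
    "More prose.\n"
)
candidates = find_igt_candidates(sample)
```

The sketch shows why the approach is rigid: any IGT instance without a leading example number or an opening quote on the third line is missed.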

  19. Reminder: What is IGT?
      - “Interlinear Glossed Text”
      - Lines: Source, Gloss, Translation (Baylin, 2001)

  20. Crawler - Method 1 - Problems
      - Rigid format assumptions
      - Clusters of IGT with multiple languages are incorrectly identified
      - PDF conversion breaks the formatting

  21. Crawler - Method 2
      Machine learning:
      - Tag each line based on a feature list
      - Convert the best tag sequence into a span sequence “B [ I | BL ]* E”
      - Tags: B = Begin, I = Inside, BL = Blank, E = End, O = Outside
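
The conversion from a tag sequence to spans might look like the following sketch. It covers only the conversion step, not the learned tagger, and the handling of malformed sequences (a B without an E, an O inside a span) is an assumption, since the slide does not specify it.

```python
def tags_to_spans(tags: list[str]) -> list[tuple[int, int]]:
    """Return inclusive (start, end) line indices for each B [I|BL]* E run."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                 # open a new span, dropping any unfinished one
        elif tag == "E" and start is not None:
            spans.append((start, i))  # close the currently open span
            start = None
        elif tag == "O":
            start = None              # assumption: an O inside a span invalidates it
        # "I" and "BL" simply continue the currently open span
    return spans

# One tag per document line, as a tagger might produce them.
tags = ["O", "B", "I", "E", "O", "B", "BL", "I", "E", "O"]
spans = tags_to_spans(tags)
```

For the tag sequence above, the function yields two spans covering lines 1 to 3 and lines 5 to 8 (zero-based).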

  22. Crawler - Method 2 - Features
      - Feature 1 (F1): words of the current line
      - Feature 2 (F2): collection of 16 IGT features (quotes, numbering, tokens)
      - Feature 3 (F3): tags for previous lines
      - Feature 4 (F4): tags for neighboring lines

  23. Crawler - Results
      | Method | Precision | Recall | F-score |
      |---|---|---|---|
      | Regex | 74.95% | 52.19% | 61.54% |
      | F2 | 57.02% | 48.64% | 52.50% |
      | F2+F4 | 75.50% | 76.04% | 75.77% |
      | F1+F2+F3+F4 | 82.29% | 81.02% | 81.65% |
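
As a sanity check on the table, the F-scores can be recomputed from precision and recall. The sketch assumes the reported F-score is the balanced F1 (the harmonic mean of precision and recall); the slide does not state which variant was used.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-score) in percent, copied from the slide.
reported = {
    "Regex": (74.95, 52.19, 61.54),
    "F2": (57.02, 48.64, 52.50),
    "F2+F4": (75.50, 76.04, 75.77),
    "F1+F2+F3+F4": (82.29, 81.02, 81.65),
}

# Absolute difference between the recomputed and the reported value.
deviations = {
    name: abs(f_score(p, r) - f) for name, (p, r, f) in reported.items()
}
```

Under that assumption, the recomputed values agree with the reported ones to within about 0.01 percentage points, which is consistent with rounding of the inputs.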

  24. 3.2. Converting raw data

  25. Language ID
      Problems for classifiers:
      - Far too many languages to distinguish between
      - Not enough training data for “rarer” languages
      - Clusters of IGT with multiple languages

  26. Language ID - Features
      - Feature 1: nearest language code
      - Feature 2: neighboring language codes
      - Feature 3: n-grams in current IGT
      - Feature 4: n-grams in all IGT
      83.08% accuracy for 7,816 language codes and 47,728 (code, name) pairs
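
To illustrate the n-gram features, here is a minimal sketch that compares character trigrams of an IGT source line against per-language profiles. The overlap scorer is a simple stand-in, not the classifier from the paper, and the two tiny profiles (`deu`, `eng`) are built from invented sentences.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def overlap(a: Counter, b: Counter) -> int:
    """Number of n-gram occurrences shared by both profiles."""
    return sum((a & b).values())

# Hypothetical training profiles, one per language code.
profiles = {
    "deu": char_ngrams("der hund schläft und die katze trinkt milch"),
    "eng": char_ngrams("the dog sleeps and the cat drinks milk"),
}

def guess_language(line: str) -> str:
    """Pick the language code whose profile overlaps most with the line."""
    grams = char_ngrams(line)
    return max(profiles, key=lambda code: overlap(grams, profiles[code]))

guess = guess_language("die katze schläft")
```

With thousands of language codes and little data for rare languages, n-gram evidence alone is weak, which is why the slide combines it with language codes found near the IGT.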

  27. The final product

  28. German

  29. German

  30. 5. How is ODIN used?

  31. Usage
      - Searching via
        - Language name / code
        - Language family
        - Concept / gram
      - Data enrichment
        - for English
        - for source language

  32. 5.1 Typology research

  33. Typology research – IGT enrichment
      Typology = the study of classifying languages by organising them into an enumerated list of possible types and identifying them via structural features
      Based on: ODIN data -> enriched source languages

  34. Typology research – IGT enrichment
      - Parse the English translation using an English parser
      - Align the target sentence and the English translation using the gloss line
      - Project the phrase structures onto the target sentence
      Possible flaws: IGT / English bias (unnatural examples based on another language)
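
The align-and-project idea can be sketched in a few lines of Python. This is a simplification under stated assumptions: it projects word-level tags rather than full phrase structures, the English tags are supplied by hand in place of real parser output, the string-matching heuristic in `matches` is hypothetical, and the German example is invented.

```python
def matches(word: str, piece: str) -> bool:
    """Heuristic: exact match, or the English word starts with the gloss piece."""
    return word == piece or (len(piece) >= 3 and word.startswith(piece))

def align_via_gloss(source, gloss, translation):
    """Map each source token index to English token indices via the gloss line."""
    assert len(source) == len(gloss), "source and gloss must align one-to-one"
    english = [w.lower().strip(".,'") for w in translation]
    alignment = {}
    for i, g in enumerate(gloss):
        # A gloss token may bundle several pieces, e.g. "see-3SG".
        pieces = [p.lower() for p in g.replace(".", "-").split("-")]
        alignment[i] = [
            j for j, w in enumerate(english)
            if any(matches(w, p) for p in pieces)
        ]
    return alignment

def project_tags(alignment, english_tags):
    """Give each source token the tag of its first aligned English token."""
    return {
        i: (english_tags[js[0]] if js else None)
        for i, js in alignment.items()
    }

source = ["dass", "der", "Hund", "die", "Katze", "sieht"]
gloss = ["that", "the", "dog", "the", "cat", "see-3SG"]
translation = ["that", "the", "dog", "sees", "the", "cat"]
english_tags = ["SCONJ", "DET", "NOUN", "VERB", "DET", "NOUN"]  # stand-in for parser output

alignment = align_via_gloss(source, gloss, translation)
projected = project_tags(alignment, english_tags)
```

The verb-final source token `sieht` receives the tag of `sees` even though the two sit in different positions, which is the kind of word-order evidence the typology work relies on. The repeated `the` also shows how ambiguous alignments can lead to projection errors.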

  35. Typology research - Features

  36. Typology research – Results & Error analysis

  37. Typology research - Results - Error analysis
      - Insufficient data
      - Skewed or inaccurate data
      - Projection error
      - Free constituent order
