HS: Computational Linguistics for Low-Resource Languages
Typology & IGT
The Online Database of Interlinear Text
Robin Westphal, 13.07.16
Institute for Computational Linguistics, University Heidelberg
Papers
- Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World’s Languages (2006)
- Automatically Identifying Computationally Relevant Typological Features (2008)
by William D. Lewis & Fei Xia
Overview
ODIN:
1. What?
2. Why?
3. How?
4. Practical use?
1. What is ODIN?
“ODIN is a database of interlinear text ‘snippets’, harvested mostly from scholarly documents posted to the web”
Developed by:
- GOLD Community of Practice (Farrar and Lewis, 2006)
- Electronic Metastructure for Endangered Languages Data efforts (EMELD)
2. Why develop ODIN?
Problem:
The web contains a vast amount of maintained data. BUT:
- Spread everywhere
- No uniform search strategy
- Cannot be easily manipulated or used
Why develop ODIN? - Solution
A database like ODIN provides:
- Summary of most IGT instances on the web
- Easy-to-use search engine
- A normative presentation for easier access
Reminder: What is IGT?
- “Interlinear Glossed Text”
- Three lines: Source, Gloss, Translation (Bailyn, 2001)
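The three-line layout can be sketched as a small data structure. This is a minimal illustration, not ODIN's actual storage format: the class name `IGT`, the method `aligned_pairs`, and the German example sentence are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class IGT:
    """One interlinear glossed text instance (hypothetical structure)."""
    source: str       # sentence in the source language
    gloss: str        # word-by-word gloss, one gloss token per source word
    translation: str  # free translation

    def aligned_pairs(self):
        """Pair each source word with its gloss token by position."""
        source_tokens = self.source.split()
        gloss_tokens = self.gloss.split()
        if len(source_tokens) != len(gloss_tokens):
            raise ValueError("source and gloss lines differ in token count")
        return list(zip(source_tokens, gloss_tokens))


# Invented example for illustration only (not taken from ODIN).
example = IGT(
    source="Der Hund schläft",
    gloss="the dog sleep.3SG",
    translation="The dog sleeps.",
)
pairs = example.aligned_pairs()
```

The position-by-position pairing is what makes the gloss line useful as a bridge between the source line and the translation.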
IGT - Challenges and benefits
Challenges:
- Unclear structural associations between elements
- Descriptions of grammatical concepts are inconsistent
Benefits:
- Consistent format for mining & enrichment
3. How to get all the data?
1.) Find documents that could contain IGT.
2.) Detect & extract IGT by matching IGT-like patterns.
3.) Store in ODIN database.
3.1. Crawler
Query type (top 100)    Avg. no. docs   Avg. no. docs w/ IGT
Gram(s)                 1,184           239
Language name(s)        1,314           259
Both grams and names    1,536           289
Language words          1,159           193
# of findings at the time of writing the article: 150,000 / 1.5 million (10%)
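To compare the query types, the share of retrieved documents that contain IGT can be computed from the two columns of the table. The sketch below only does that arithmetic; the names `crawl_stats` and `igt_yield` are made up for this example.

```python
# (avg. no. docs, avg. no. docs with IGT) per query type, copied from the table.
crawl_stats = {
    "Gram(s)": (1184, 239),
    "Language name(s)": (1314, 259),
    "Both grams and names": (1536, 289),
    "Language words": (1159, 193),
}


def igt_yield(docs, docs_with_igt):
    """Fraction of retrieved documents that contain IGT."""
    return docs_with_igt / docs


yields = {query: igt_yield(*counts) for query, counts in crawl_stats.items()}

for query, share in yields.items():
    print(f"{query:<22} {share:.1%}")
```

Combining grams and names returns the most IGT documents in absolute terms, while the per-document yield is of the same order for all four query types.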
Crawler - Method 1
Regex approach:
\t*(\()\d*\).*\n    first line begins with a number in parentheses
\t*.*\n             second line can be anything
\t*\’.*\n           third line begins with a quote
Check the first line with surrounding language codes.
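The three-line pattern can be tried out with Python's `re` module. This is a sketch under assumptions, not the crawler's actual code: it requires at least one digit in the parentheses (`\d+`), allows spaces as well as tabs for indentation, and accepts several quote characters. The function name `find_igt_candidates` and the sample text are invented.

```python
import re

# Assumed set of characters that may open the translation line.
QUOTE_CHARS = "'\"`‘’“”"

IGT_PATTERN = re.compile(
    r"^[ \t]*\(\d+\).*\n"                           # line 1: number in parentheses
    r"[ \t]*.*\n"                                   # line 2: anything
    r"[ \t]*[" + re.escape(QUOTE_CHARS) + r"].*$",  # line 3: starts with a quote
    re.MULTILINE,
)


def find_igt_candidates(text):
    """Return every three-line block that matches the pattern."""
    return [match.group(0) for match in IGT_PATTERN.finditer(text)]


sample = (
    "Some running prose.\n"
    "(1) Der Hund schläft\n"
    "    the dog sleep.3SG\n"
    "    'The dog sleeps.'\n"
    "More running prose.\n"
)
candidates = find_igt_candidates(sample)
```

Because the pattern fixes the number of lines and the opening characters, any instance with an extra line or an unquoted translation is missed.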
Crawler - Method 1 - Problems
- The pattern is too rigid
- Clusters of IGT with multiple languages are incorrectly identified
- PDF-to-text conversion breaks the formatting
Crawler - Method 2
Machine Learning:
- Tag each line based on a feature list
- Convert the best tag sequence into a span sequence “B [ I | BL ]* E”
- B = Begin, I = Inside, BL = Blank, E = End, O = Outside
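The second step, turning a per-line tag sequence into spans of the form “B [ I | BL ]* E”, can be sketched as a small state machine. The function name `tags_to_spans` and the choice to silently drop malformed spans are assumptions for this illustration.

```python
def tags_to_spans(tags):
    """Convert per-line tags into (start, end) line spans, end inclusive.

    A span opens at "B", may continue over "I" and "BL" lines, and closes
    at "E". An "O" before the closing "E" discards the open span.
    """
    spans = []
    start = None
    for index, tag in enumerate(tags):
        if tag == "B":
            start = index
        elif tag == "E" and start is not None:
            spans.append((start, index))
            start = None
        elif tag == "O":
            start = None
        # "I" and "BL" simply extend the current span.
    return spans


line_tags = ["O", "B", "I", "E", "O", "B", "BL", "I", "E"]
spans = tags_to_spans(line_tags)
```

Each resulting pair gives the first and last line index of one IGT instance in the document.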
Crawler - Method 2 - Features
- Feature1: words of the current line
- Feature2: collection of 16 IGT features (quotes, numbering, tokens)
- Feature3: tags for previous lines
- Feature4: tags for neighboring lines
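A feature extractor for a single line might look like the sketch below. The slides do not list the 16 IGT features, so the four line-level checks here are hypothetical stand-ins, as are the function name `line_features` and the `"START"` placeholder.

```python
import re

QUOTE_CHARS = "'\"`‘’“”"  # assumed set of opening quote characters


def line_features(line, previous_tags):
    """Build a feature dictionary for one line (hypothetical feature set)."""
    stripped = line.strip()
    return {
        "words": stripped.lower().split(),
        "is_blank": stripped == "",
        "starts_with_number": bool(re.match(r"\(?\d+[.)]", stripped)),
        "starts_with_quote": stripped[:1] in QUOTE_CHARS and stripped != "",
        "num_tokens": len(stripped.split()),
        "previous_tag": previous_tags[-1] if previous_tags else "START",
    }


first = line_features("(1) Der Hund schläft", [])
third = line_features("    'The dog sleeps.'", ["B", "I"])
```

A sequence tagger would consume one such dictionary per line and predict one of the tags B, I, BL, E or O.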
Crawler - Results
              Precision   Recall    F-score
Regex         74.95%      52.19%    61.54%
F2            57.02%      48.64%    52.50%
F2+F4         75.50%      76.04%    75.77%
F1+F2+F3+F4   82.29%      81.02%    81.65%
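The F-score column is consistent with the harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes it from the table; the names `f_score` and `results` are chosen for this sketch.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) in percent, copied from the table.
results = {
    "Regex": (74.95, 52.19, 61.54),
    "F2": (57.02, 48.64, 52.50),
    "F2+F4": (75.50, 76.04, 75.77),
    "F1+F2+F3+F4": (82.29, 81.02, 81.65),
}

recomputed = {
    system: f_score(precision, recall)
    for system, (precision, recall, _) in results.items()
}
```

The regex baseline has reasonable precision but low recall, which is what pulls its F-score below the systems that use neighboring-line information.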
3.2. Converting raw data
Language ID
Problems for classifiers:
- Far too many languages to distinguish between
- Not enough training data for “rarer” languages
- Clusters of IGT with multiple languages
Language ID - Features
- Feature1: nearest language code
- Feature2: neighboring language codes
- Feature3: n-grams in current IGT
- Feature4: n-grams in all IGT
Result: 83.08% accuracy for 7,816 language codes and 47,728 (code, name) pairs
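The n-gram features can be illustrated with a toy character-trigram comparison. This is not the classifier described on the slides: the overlap score, the function names, and the two tiny language profiles are assumptions made up for this sketch.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def overlap_score(candidate, profile):
    """Share of the candidate's n-grams that also occur in the profile."""
    total = sum(candidate.values())
    if total == 0:
        return 0.0
    return sum((candidate & profile).values()) / total


def guess_language(source_line, profiles):
    """Pick the language code whose profile overlaps most with the line."""
    ngrams = char_ngrams(source_line)
    return max(profiles, key=lambda code: overlap_score(ngrams, profiles[code]))


# Tiny invented profiles; real profiles would be built from all known IGT.
profiles = {
    "deu": char_ngrams("der hund schläft und die katze"),
    "eng": char_ngrams("the dog sleeps and the cat"),
}
guess = guess_language("der hund", profiles)
```

With thousands of language codes and little data for rare languages, such profiles become unreliable, which is why codes found near the IGT instance are used as features as well.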
The final product
German
5. How is ODIN used?
Usage:
- Searching via
  - Language name / code
  - Language family
  - Concept / Gram
- Data enrichment
  - for English
  - for source language
5.1 Typology research
Typology research - IGT enrichment
Typology = the study of classifying languages by organising them into an enumerated list of possible types and identifying them via structural features
Based on: ODIN data -> enriched source languages
Typology research - IGT enrichment
- Parse the English translation using an English parser
- Align the target sentence and the English translation using the gloss line
- Project the phrase structures onto the target sentence
Possible flaws: IGT / English bias (unnatural examples based on another language)
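The align-and-project idea can be sketched in a much simplified form. Instead of phrase structures, the example below projects part-of-speech tags, and the English tags are supplied by hand rather than produced by a parser. The matching heuristic, the function names, and the example sentence are all assumptions for illustration.

```python
import re


def align_via_gloss(gloss_tokens, english_tokens):
    """Map source positions to English positions using the gloss line.

    A gloss token such as "sleep.3SG" is split into parts; a source word is
    aligned to the first English word that equals a part or starts with a
    part of at least three characters.
    """
    english = [word.lower().strip(".,!?'\"") for word in english_tokens]
    alignment = {}
    for source_index, gloss in enumerate(gloss_tokens):
        parts = [p for p in re.split(r"[.\-]", gloss.lower()) if p]
        for english_index, word in enumerate(english):
            if any(word == p or (len(p) >= 3 and word.startswith(p)) for p in parts):
                alignment[source_index] = english_index
                break
    return alignment


def project_tags(source_tokens, gloss_tokens, english_tokens, english_tags):
    """Copy each English tag onto the source word aligned with it."""
    alignment = align_via_gloss(gloss_tokens, english_tokens)
    return [
        (token, english_tags[alignment[i]] if i in alignment else None)
        for i, token in enumerate(source_tokens)
    ]


# Invented example; the tags stand in for the output of an English parser.
projected = project_tags(
    source_tokens=["Heute", "schläft", "der", "Hund"],
    gloss_tokens=["today", "sleep.3SG", "the", "dog"],
    english_tokens=["The", "dog", "sleeps", "today"],
    english_tags=["DET", "NOUN", "VERB", "ADV"],
)
```

The source word order is preserved while the labels come from the English side, which also shows where the English bias named above can enter.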
Typology research - Features
Typology research - Results & Error analysis
Error analysis:
- Insufficient data
- Skewed or inaccurate data
- Projection error
- Free constituent order