Typology & IGT Robin Westphal, 13.07.16 Institute for - PowerPoint PPT Presentation

HS: Computational Linguistics for Low-Resource Languages
Typology & IGT
The Online Database of Interlinear Text
Robin Westphal, 13.07.16
Institute for Computational Linguistics, University Heidelberg

Papers

- Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World’s Languages (2006)
- Automatically Identifying Computationally Relevant Typological Features (2008)

by William D. Lewis & Fei Xia

Overview

ODIN:
1. What?
2. Why?
3. How?
4. Practical Use?

1. What is ODIN?

“ODIN is a database of interlinear text ‘snippets’, harvested mostly from scholarly documents posted to the web”

Developed by:
- GOLD Community of Practice (Farrar and Lewis, 2006)
- Electronic Metastructure for Endangered Languages Data efforts (EMELD)

2. Why develop ODIN?

Why develop ODIN? - Problem
The web contains a vast amount of maintained data. BUT:
- Spread everywhere
- No uniform search strategy
- Cannot be easily manipulated or used

Why develop ODIN? - Solution
A database like ODIN provides:
- Summary of most IGT instances on the web
- Easy-to-use search engine
- A normative presentation for easier access

Reminder: What is IGT?
- “Interlinear Glossed Text”
- Source / Gloss / Translation (Baylin, 2001)
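As a rough illustration of the source, gloss, and translation layout, an IGT instance could be held in a small record. The class name `IGT`, its field names, and the German example sentence are assumptions made for this sketch; they are not taken from the slides or from ODIN's actual schema.

```python
from dataclasses import dataclass


@dataclass
class IGT:
    """One interlinear glossed text instance (hypothetical layout)."""
    source: str        # sentence in the language being described
    gloss: str         # word-by-word gloss, one token per source word
    translation: str   # free translation, usually English

    def aligned_pairs(self):
        """Pair each source word with its gloss token by position."""
        src = self.source.split()
        gls = self.gloss.split()
        if len(src) != len(gls):
            raise ValueError("source and gloss lines differ in length")
        return list(zip(src, gls))


# Illustrative example, not from the slides.
example = IGT(
    source="Der Hund schläft",
    gloss="the dog sleep.3SG",
    translation="The dog sleeps",
)
pairs = example.aligned_pairs()
```

The positional pairing only works when the source and gloss lines have the same number of tokens, which is why the sketch raises an error otherwise.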

IGT – Challenging benefits
Challenges:
- Unclear structural associations between elements
- Descriptions of grammatical concepts are inconsistent
Benefits:
- Consistent format for mining & enrichment

3. How to get all the data?

How to get data?
1.) Find documents that could contain IGT.
2.) Detect & extract IGT by matching patterns that resemble it.
3.) Store in ODIN database.

3.1. Crawler

Query type (top 100)     Avg. no. docs   Avg. no. docs w/ IGT
Gram(s)                  1,184           239
Language name(s)         1,314           259
Both grams and names     1,536           289
Language words           1,159           193

Number of findings at the time of writing the article: 150,000 / 1.5 million (10%)

Crawler - Method 1
Regex approach:
- \t*(\()\d*\).*\n   first line begins with a number in parentheses
- \t*.*\n            second line can be anything
- \t*\'.*\n          third line begins with a quote
Then check the first line against the surrounding language codes.
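The three-line pattern above could be tried out roughly as follows. This is a minimal sketch and not the crawler's actual code: it allows spaces as well as tabs for indentation, requires at least one digit inside the parentheses, and accepts several quote characters. The names `IGT_PATTERN` and `find_igt_candidates` and the sample text are assumptions made for the illustration.

```python
import re

# Three consecutive lines: "(1) ...", anything, then a line opening with a quote.
# Indentation is matched with [ \t] rather than \s so that a match cannot
# swallow line breaks.
IGT_PATTERN = re.compile(
    r"^[ \t]*\(\d+\).*\n"      # first line begins with a number in parentheses
    r"[ \t]*.*\n"              # second line can be anything
    r"[ \t]*['\"‘’“”`].*$",    # third line begins with a quote
    re.MULTILINE,
)


def find_igt_candidates(text):
    """Return every matched three-line block as a string."""
    return [m.group(0) for m in IGT_PATTERN.finditer(text)]


sample = (
    "Consider the following German sentence.\n"
    "(1) Der Hund schläft\n"
    "    the dog sleep.3SG\n"
    "    'The dog sleeps'\n"
    "Some further discussion.\n"
)
candidates = find_igt_candidates(sample)
```

A pattern this strict misses any example that deviates from the layout, for instance one with an extra morpheme line, which is the rigidity the approach is criticised for.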


Crawler - Method 1 - Problems
- Rigid format requirements
- Clusters of IGT with multiple languages are incorrectly identified
- PDF conversion breaks the formatting

Crawler - Method 2
Machine Learning:
- Tag each line based on a feature list
- Convert the best tag sequence into a span sequence “B [ I | BL ]* E”
- Tags: B = Begin, I = Inside, BL = Blank, E = End, O = Outside
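The conversion from a tag sequence to spans can be sketched with a regular expression over the tags. The code assumes the per-line tags have already been predicted by some classifier; the one-character encoding and the names `TAG_TO_CHAR` and `tags_to_spans` are choices made for this illustration only.

```python
import re

# One character per tag so that a regular expression can scan the sequence.
TAG_TO_CHAR = {"B": "b", "I": "i", "BL": "l", "E": "e", "O": "o"}
SPAN_PATTERN = re.compile(r"b[il]*e")  # B [ I | BL ]* E


def tags_to_spans(tags):
    """Return (start, end) line-index pairs, end inclusive, for each IGT span."""
    encoded = "".join(TAG_TO_CHAR[t] for t in tags)
    return [(m.start(), m.end() - 1) for m in SPAN_PATTERN.finditer(encoded)]


# Hypothetical tagger output for an eleven-line document.
line_tags = ["O", "B", "I", "E", "O", "BL", "B", "BL", "I", "E", "O"]
spans = tags_to_spans(line_tags)
```

Tag sequences that do not fit the pattern, such as a B that is never closed by an E, simply produce no span in this sketch.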

Crawler - Method 2 - Features
- Feature1: words of current line
- Feature2: collection of 16 IGT features (quotes, numbering, tokens)
- Feature3: tags for previous lines
- Feature4: tags for neighboring lines

Crawler - Results

                 Precision   Recall    F-score
Regex            74.95%      52.19%    61.54%
F2               57.02%      48.64%    52.50%
F2+F4            75.50%      76.04%    75.77%
F1+F2+F3+F4      82.29%      81.02%    81.65%
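The F-score column is consistent with the balanced F-score, the harmonic mean of precision and recall. The short check below recomputes it from the reported precision and recall values; the function name `f_score` is an assumption, and small differences in the last digit can arise because the inputs are already rounded.

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) in percent, as listed in the results.
reported = {
    "Regex": (74.95, 52.19, 61.54),
    "F2": (57.02, 48.64, 52.50),
    "F2+F4": (75.50, 76.04, 75.77),
    "F1+F2+F3+F4": (82.29, 81.02, 81.65),
}
recomputed = {name: f_score(p, r) for name, (p, r, _) in reported.items()}
```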

3.2. Converting raw data

Language ID
Problems for classifiers:
- Far too many languages to distinguish between
- Not enough training data for “rarer” languages
- Clusters of IGT with multiple languages

Language ID - Features
- Feature1: nearest language code
- Feature2: neighboring language codes
- Feature3: n-grams in current IGT
- Feature4: n-grams in all IGT

83.08% accuracy for 7,816 language codes and 47,728 (code, name) pairs
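The first feature, the nearest language code, could be approximated by looking for the language name mentioned closest to an IGT instance. This is only a sketch of the idea, not the classifier described in the papers: the tiny `LANGUAGE_CODES` table, the function name, and the sample document are all hypothetical.

```python
import re

# Hypothetical name -> code table; a real system would use a much larger one.
LANGUAGE_CODES = {"German": "deu", "Welsh": "cym", "Russian": "rus"}


def nearest_language_code(text, igt_offset):
    """Return the code of the language name mentioned closest to igt_offset."""
    best_code, best_distance = None, None
    for name, code in LANGUAGE_CODES.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            distance = abs(match.start() - igt_offset)
            if best_distance is None or distance < best_distance:
                best_code, best_distance = code, distance
    return best_code


document = (
    "Russian allows similar orders. In German, the verb comes second:\n"
    "(1) Der Hund schläft\n"
)
code = nearest_language_code(document, document.index("(1)"))
```

On its own this heuristic fails when a document discusses several languages close together, which is why it is combined with neighboring codes and n-gram features.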

The final product

German

5. How is ODIN used?

Usage
- Searching via
  - Language name / code
  - Language family
  - Concept / Gram
- Data enrichment
  - for English
  - for source language

5.1 Typology research

Typology research – IGT enrichment
Typology = the study of classifying languages by organising them into an enumerated list of possible types and identifying them via structural features
Based on: ODIN data -> enriched source languages

Typology research – IGT enrichment
- Parse the English translation using an English parser
- Align the target sentence and the English translation using the gloss line
- Project the phrase structures onto the target sentence
Possible flaws: IGT / English bias (unnatural examples based on another language)
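The alignment and projection steps can be sketched in a much simplified form. The code below is an assumption-laden illustration: it projects part-of-speech tags rather than full phrase structures, uses a hard-coded tag list in place of an English parser, and matches gloss tokens to translation words by a crude prefix test. All function names and the example sentence are hypothetical.

```python
def align_via_gloss(source, gloss, translation):
    """Map each source word index to a translation word index, or None."""
    src = source.split()
    gls = gloss.lower().split()
    trn = [w.strip(".,!?'\"").lower() for w in translation.split()]
    if len(src) != len(gls):
        raise ValueError("source and gloss lines differ in length")
    alignment = {}
    for i, gloss_token in enumerate(gls):
        stem = gloss_token.split(".")[0]  # drop grammatical markers: sleep.3sg -> sleep
        # Crude match: the first translation word that starts with the gloss stem.
        alignment[i] = next(
            (j for j, word in enumerate(trn) if word.startswith(stem)), None
        )
    return alignment


def project_tags(source, alignment, english_tags):
    """Copy the tag of each aligned English word onto the source word."""
    return [
        (word, english_tags[alignment[i]] if alignment[i] is not None else None)
        for i, word in enumerate(source.split())
    ]


source = "Der Hund schläft"
gloss = "the dog sleep.3SG"
translation = "The dog sleeps"
english_tags = ["DET", "NOUN", "VERB"]  # stand-in for a parser's output

alignment = align_via_gloss(source, gloss, translation)
projected = project_tags(source, alignment, english_tags)
```

Because every gloss token is matched to the first fitting translation word, repeated words and loose translations produce wrong or missing alignments, one source of the projection errors such an approach has to deal with.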

Typology research - Features

Typology research – Results & Error analysis
- Insufficient data
- Skewed or inaccurate data
- Projection error
- Free constituent order
