Something from nothing
Arne Skjærholt, LTG seminar
The project · Using Czech to parse Latin

The goal
◮ Integrating linguistic and data-driven methods
  ◮ Use linguistic knowledge to guide data-driven methods
  ◮ Leverage data-driven approaches to inform linguistic and rule-driven methods?

What to do?
◮ Focus on syntax
◮ Focus on languages with few resources up front
  ◮ Norwegian
    ◮ Decent resources at word level
    ◮ No syntactic resources
  ◮ Latin
    ◮ Long tradition of linguistic inquiry
    ◮ Quality and quantity of annotated data extremely variable

Plans
◮ Dependency corpus adaptation
◮ Constrained CRF models
◮ Annotation studies

Current project
1. Take a large corpus
2. Remove 90% of the information in it
3. ???
4. Profit!

The general idea
1. Delexicalise source language corpus
2. Train language model over target language PoS sequences
3. Filter source corpus with LM
4. Train model, parse target
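Step 1 can be illustrated with a minimal sketch. It assumes the corpus is in the 10-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and that delexicalising means blanking the FORM and LEMMA columns; the function name and the toy sentence are illustrative assumptions, not taken from the talk.

```python
def delexicalise(conll_sentence: str) -> str:
    """Blank the word form and lemma of every token in one CoNLL-X sentence.

    PoS tags, features, heads and dependency labels are left untouched,
    so the tree structure survives while the vocabulary is removed.
    """
    out = []
    for line in conll_sentence.strip().splitlines():
        cols = line.split("\t")
        cols[1] = "_"  # FORM
        cols[2] = "_"  # LEMMA
        out.append("\t".join(cols))
    return "\n".join(out)


# Toy two-token sentence (invented for illustration only).
TOY_SENTENCE = (
    "1\tpuella\tpuella\tn\tn\t_\t2\tSBJ\t_\t_\n"
    "2\tcantat\tcanto\tv\tv\t_\t0\tPRED\t_\t_"
)

if __name__ == "__main__":
    print(delexicalise(TOY_SENTENCE))
```

After this step a parser trained on the source corpus sees only tags and structure, which is what lets a model trained on one language be applied to another.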

Corpora
◮ Prague Dependency Treebank (PDT)
  ◮ 1.5M tokens
  ◮ Dependency syntax and complex morphological annotation
◮ Latin Dependency Treebank (LDT)
  ◮ 53,143 tokens
  ◮ Annotation scheme based on PDT

Parsing Latin
◮ Previous baseline: MSTParser, 65% unlabelled, 53% labelled accuracy (Bamman & Crane 2008)
◮ New baseline: MSTParser, 64% unlabelled, 54% labelled

Prose/poetry distribution (tokens)
  Prose   40,884
  Poetry  12,259
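For reference, unlabelled and labelled accuracy are commonly computed as per-token attachment scores. The sketch below assumes that definition (a token counts as correct when its predicted head matches, and for the labelled score its dependency label must match too); whether punctuation was excluded in the reported figures is not stated, so no such filtering is done here. The function name and the toy data are illustrative.

```python
def attachment_scores(gold, predicted):
    """Return (unlabelled, labelled) accuracy over aligned tokens.

    Both arguments are lists of (head_index, dependency_label) pairs,
    one pair per token, in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty token list")
    total = len(gold)
    # Unlabelled: only the head has to be right.
    unlabelled = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labelled: head and dependency label both have to be right.
    labelled = sum(g == p for g, p in zip(gold, predicted))
    return unlabelled / total, labelled / total


if __name__ == "__main__":
    gold = [(2, "SBJ"), (0, "PRED"), (2, "OBJ"), (2, "AuxK")]
    pred = [(2, "SBJ"), (0, "PRED"), (2, "ADV"), (3, "AuxK")]
    print(attachment_scores(gold, pred))
```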

Workflow
[Diagram: two parallel pipelines]
PDT → reformat → CoNLL → tagset map → Common tagset(s) → delexicalise → Delexicalised → filter → train → Parser → Parse Latin
LDT → reformat → CoNLL → tagset map → Common tagset(s) → delexicalise → Delexicalised → train → LM (used by the filter step)

Tagsets
◮ LDT annotation guidelines derived from PDT
◮ PoS mappings:
  ◮ LDT has a participle tag
  ◮ Czech has particles, Latin doesn’t
◮ Deprel mappings:
  ◮ Reflexive tantum
  ◮ Reflexive passive
  ◮ Emotional dative
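One way to picture a tagset map is a per-treebank lookup table that folds tags without a counterpart in the other treebank into a shared category. The sketch below is purely hypothetical: the talk does not say how the participle or particle tags were resolved, and the tag names and target categories here are placeholders rather than the real PDT or LDT codes.

```python
# Hypothetical lookup tables: the actual mappings used in the project are
# not given in the slides. Tags not listed are passed through unchanged.
LDT_TO_COMMON = {"participle": "verb"}   # assumption: treat participles as verbs
PDT_TO_COMMON = {"particle": "other"}    # assumption: fold particles into a catch-all


def to_common(tag: str, table: dict) -> str:
    """Map a treebank-specific tag to the common tagset."""
    return table.get(tag, tag)


def map_sequence(tags, table):
    """Map a whole PoS sequence to the common tagset."""
    return [to_common(t, table) for t in tags]


if __name__ == "__main__":
    print(map_sequence(["noun", "participle", "verb"], LDT_TO_COMMON))
    print(map_sequence(["particle", "noun", "verb"], PDT_TO_COMMON))
```

The same table-driven shape would apply to dependency relations, where labels present in only one scheme need a target in the other.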

Data splits
◮ PDT:
  ◮ 8 training folds
  ◮ Development fold
  ◮ Evaluation fold
◮ LDT:
  ◮ Distributed as one file per author
  ◮ Round-robin split into 10 folds
  ◮ Fold 10 held out for evaluation
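A round-robin split can be sketched as dealing items into folds in turn. The sketch assumes the unit being dealt is the sentence and that the per-author files are processed in a fixed order; neither detail is stated in the slides.

```python
def round_robin_folds(sentences, n_folds=10):
    """Deal sentences into n_folds folds, one at a time, in order.

    Sentence 0 goes to fold 1, sentence 1 to fold 2, and so on,
    wrapping around after the last fold.
    """
    folds = [[] for _ in range(n_folds)]
    for i, sentence in enumerate(sentences):
        folds[i % n_folds].append(sentence)
    return folds


if __name__ == "__main__":
    folds = round_robin_folds(list(range(25)))
    training = folds[:-1]   # folds 1 to 9
    held_out = folds[-1]    # fold 10, kept aside for evaluation
    print(len(training), held_out)
```

Dealing sentences out in turn spreads each author's text across all folds, so the held-out fold is not dominated by a single author.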

Language modelling
◮ LM over LDT PoS sequences
◮ Best order: trigrams
◮ Best smoothing: constant discounting (D = 0.1)
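To make the smoothing concrete, here is a small sketch of an n-gram model over PoS sequences with a constant discount subtracted from every observed count. It uses the interpolated form of absolute discounting, which is an assumption: the slides do not name the toolkit or say whether a back-off or interpolated variant was used. The class and method names are illustrative.

```python
import math
from collections import Counter, defaultdict


class DiscountedNgramLM:
    """Interpolated absolute-discounting n-gram model over tag sequences."""

    def __init__(self, order=3, discount=0.1):
        self.order = order
        self.discount = discount
        self.counts = Counter()            # (context..., tag) -> count
        self.context_totals = Counter()    # context -> number of continuations seen
        self.followers = defaultdict(set)  # context -> distinct tags seen after it
        self.vocab = set()

    def _pad(self, tags):
        return ["<s>"] * (self.order - 1) + list(tags) + ["</s>"]

    def train(self, sequences):
        for tags in sequences:
            padded = self._pad(tags)
            self.vocab.update(padded[self.order - 1:])
            for i in range(self.order - 1, len(padded)):
                # Collect counts for every order from unigram up to self.order.
                for n in range(1, self.order + 1):
                    context = tuple(padded[i - n + 1:i])
                    self.counts[context + (padded[i],)] += 1
                    self.context_totals[context] += 1
                    self.followers[context].add(padded[i])

    def prob(self, tag, context):
        context = tuple(context)
        if context:
            lower = self.prob(tag, context[1:])
        else:
            # Uniform floor, with one extra slot reserved for unseen tags.
            lower = 1.0 / (len(self.vocab) + 1)
        total = self.context_totals[context]
        if total == 0:
            return lower  # unseen context: rely on the shorter history
        d = self.discount
        discounted = max(self.counts[context + (tag,)] - d, 0.0) / total
        backoff_mass = d * len(self.followers[context]) / total
        return discounted + backoff_mass * lower

    def perplexity(self, tags):
        padded = self._pad(tags)
        log_prob = 0.0
        n_events = 0
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            log_prob += math.log(self.prob(padded[i], context))
            n_events += 1
        return math.exp(-log_prob / n_events)
```

Trained on target-language tag sequences, `perplexity` gives a per-sentence score: source sentences whose tag sequence looks like the target language score low, and unusual ones score high.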

PDT perplexity
[Figure: histogram of perplexity (x-axis, 0 to 50) against frequency (y-axis, 0 to 10,000)]
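The filtering step can be sketched as binning per-sentence perplexities and keeping the sentences under a cutoff. The bin width and the cutoff value below are placeholders, since the slides do not give the threshold that was used, and the perplexities are assumed to have been computed beforehand by a language model.

```python
from collections import Counter


def perplexity_histogram(perplexities, bin_width=5.0):
    """Count sentences per perplexity bin; keys are the lower bin edges."""
    bins = Counter(int(p // bin_width) * bin_width for p in perplexities)
    return dict(sorted(bins.items()))


def keep_below_cutoff(sentences, perplexities, cutoff):
    """Keep the sentences whose perplexity does not exceed the cutoff."""
    return [s for s, p in zip(sentences, perplexities) if p <= cutoff]


if __name__ == "__main__":
    sentences = ["s1", "s2", "s3", "s4"]
    scores = [3.0, 7.2, 8.9, 41.0]   # invented values for illustration
    print(perplexity_histogram(scores))
    print(keep_below_cutoff(sentences, scores, cutoff=10.0))
```

Varying the cutoff trades corpus size against similarity to the target language, which is the quantity a performance-versus-cutoff learning curve would track.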

Parser optimisation
◮ Do parameter tuning on the Czech development set
◮ Numbers forthcoming...

Future work
◮ Further analysis of Latin baseline
  ◮ Per author/genre performance
  ◮ Why is MaltParser so bad?
◮ Feature engineering
◮ Learning curve: performance vs. perplexity cutoff

Further forward
◮ Extend workflow to Talbanken/Norwegian Dependency Treebank
◮ Evaluate impact of preprocessing data for annotation
  ◮ Annotation speed?
  ◮ Annotator agreement?
  ◮ Annotator error?
