1. Something from nothing
Arne Skjærholt, LTG seminar
The project: using Czech to parse Latin

2. The goal
- Integrating linguistic and data-driven methods
- Use linguistic knowledge to guide data-driven methods
- Leverage data-driven approaches to inform linguistic and rule-driven methods?

3-5. What to do?
- Focus on syntax
- Focus on languages with few resources up front
  - Norwegian: decent resources at the word level, but no syntactic resources
  - Latin: a long tradition of linguistic inquiry, but the quality and quantity of annotated data are extremely variable

6. Plans
- Dependency corpus adaptation
- Constrained CRF models
- Annotation studies

7-9. Current project
1. Take a large corpus
2. Remove 90% of the information in it
3. ???
4. Profit!

10-12. The general idea
1. Delexicalise the source-language (Czech) corpus
2. Train a language model over target-language (Latin) PoS sequences
3. Filter the source corpus with the LM
4. Train a parser on what remains and parse the target language
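As a first illustration, here is a minimal, self-contained Python sketch of steps 1 and 2 on toy data; the filtering and parsing steps are sketched further down. The data format, function names and example sentence are assumptions for illustration, not the project's code.

```python
from collections import Counter

# Step 1: delexicalise, i.e. keep only the PoS tags of each sentence.
def delexicalise(sentence):
    # sentence: list of (word, pos) pairs
    return [pos for _, pos in sentence]

# Step 2: estimate an n-gram model over target-language (Latin) PoS sequences.
# A plain bigram count table stands in for the real LM in this toy version.
def train_pos_lm(tag_sequences):
    counts = Counter()
    for tags in tag_sequences:
        padded = ["<s>"] + tags + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts

latin = [[("arma", "n"), ("virumque", "n"), ("cano", "v")]]   # toy LDT sentence
lm = train_pos_lm(delexicalise(s) for s in latin)
```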

13-14. Corpora
- Prague Dependency Treebank (PDT)
  - 1.5M tokens
  - Dependency syntax and complex morphological annotation
- Latin Dependency Treebank (LDT)
  - 53,143 tokens
  - Annotation scheme based on the PDT

15-16. Parsing Latin
- Previous baseline: MSTParser, 65% unlabelled / 53% labelled accuracy (Bamman & Crane 2008)
- New baseline: MSTParser, 64% unlabelled / 54% labelled
- Prose/poetry distribution (tokens): prose 40,884; poetry 12,259

17. Workflow
- PDT: reformat to CoNLL → map to common tagset(s) → delexicalise → filter with the Latin LM → train → parser
- LDT: reformat to CoNLL → map to common tagset(s) → delexicalise → train → LM
- The resulting parser is used to parse Latin
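The "delexicalise" step in the workflow could look roughly like this on CoNLL-formatted data. This sketch assumes the 10-column CoNLL-X layout and that lexical material is simply blanked out; whether FORM/LEMMA are replaced by "_" or by something else is a detail the slides do not specify.

```python
import sys

# Blank out the lexical columns (FORM, LEMMA) of a CoNLL-X file, keeping
# the PoS, morphology and dependency columns intact.
def delexicalise_conll(lines):
    for line in lines:
        line = line.rstrip("\n")
        if not line:              # empty line = sentence boundary
            yield line
            continue
        cols = line.split("\t")   # ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        cols[1] = "_"             # FORM
        cols[2] = "_"             # LEMMA
        yield "\t".join(cols)

if __name__ == "__main__":
    for out_line in delexicalise_conll(sys.stdin):
        print(out_line)
```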

18-19. Tagsets
- LDT annotation guidelines derived from the PDT
- PoS mappings:
  - The LDT has a participle tag
  - Czech has particles, Latin doesn't
- Deprel mappings:
  - Reflexive tantum
  - Reflexive passive
  - Emotional dative
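A tagset mapping of the kind described can be expressed as simple lookup tables. The entries below are purely illustrative; in particular, which common tag LDT participles and Czech particles end up with is not stated on the slides.

```python
# Illustrative mapping tables to a common tagset; tag names and target
# categories are placeholders, not the project's actual tables.
LDT_TO_COMMON = {
    "noun": "NOUN",
    "verb": "VERB",
    "participle": "VERB",   # assumption: LDT participles folded into verbs
}
PDT_TO_COMMON = {
    "noun": "NOUN",
    "verb": "VERB",
    "particle": "PART",     # Czech particles; Latin has no corresponding tag
}

def map_tag(tag, table):
    return table.get(tag, "X")   # catch-all for tags not covered by the table
```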

20-21. Data splits
- PDT:
  - 8 training folds
  - a development fold
  - an evaluation fold
- LDT:
  - Distributed as one file per author
  - Round-robin split into 10 folds
  - Fold 10 held out for evaluation
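The LDT split could be produced roughly as follows; this sketch assumes the round-robin is over sentences, so that every fold samples all authors rather than whole files.

```python
# Deal sentences into 10 folds in round-robin fashion; the last fold is held
# out for evaluation, the rest are available for training and development.
def round_robin_folds(sentences, n_folds=10):
    folds = [[] for _ in range(n_folds)]
    for i, sentence in enumerate(sentences):
        folds[i % n_folds].append(sentence)
    return folds

# folds = round_robin_folds(ldt_sentences)
# train_folds, heldout = folds[:-1], folds[-1]
```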

22. Language modelling
- LM over LDT PoS sequences
- Best order: trigrams
- Best smoothing: constant discounting (D = 0.1)
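Constant (absolute) discounting subtracts a fixed D from every observed trigram count and hands the freed mass to a lower-order estimate. A rough sketch of the trigram case, assuming an interpolated formulation; the slide only gives the discount value, not the exact formulation used:

```python
from collections import Counter

D = 0.1  # discount constant from the slide

def train_trigram_counts(tag_sequences):
    tri, bi = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + tags + ["</s>"]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    return tri, bi

def trigram_prob(a, b, c, tri, bi, backoff):
    # backoff(c, b): lower-order probability of tag c given tag b.
    history = bi[(a, b)]
    if history == 0:
        return backoff(c, b)
    # Distinct continuations of this history (slow scan; fine for a sketch).
    continuations = sum(1 for (x, y, _) in tri if x == a and y == b)
    discounted = max(tri[(a, b, c)] - D, 0.0) / history
    reserved_mass = D * continuations / history
    return discounted + reserved_mass * backoff(c, b)
```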

23. PDT perplexity
[Histogram: frequency of PDT sentences (up to ~10,000) by per-sentence perplexity under the Latin PoS LM (x-axis 0-50)]
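Filtering then amounts to keeping the PDT sentences whose PoS sequences fall below some perplexity cutoff. A sketch reusing trigram_prob from the previous block, assuming the backoff distribution is smoothed so no probability is zero; the cutoff value here is illustrative, not the one used in the project.

```python
import math

def sentence_perplexity(tags, tri, bi, backoff):
    padded = ["<s>", "<s>"] + tags + ["</s>"]
    log_prob, n = 0.0, 0
    for a, b, c in zip(padded, padded[1:], padded[2:]):
        log_prob += math.log(trigram_prob(a, b, c, tri, bi, backoff))
        n += 1
    return math.exp(-log_prob / n)

def filter_by_perplexity(pdt_tag_seqs, tri, bi, backoff, cutoff=20.0):
    # Keep only the Czech sentences whose PoS sequences "look Latin" enough.
    return [tags for tags in pdt_tag_seqs
            if sentence_perplexity(tags, tri, bi, backoff) <= cutoff]
```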

24-25. Parser optimisation
- Parameter tuning on the Czech development set
- Numbers forthcoming...

26. Future work
- Further analysis of the Latin baseline
  - Per-author and per-genre performance
  - Why is MaltParser so bad?
- Feature engineering
- Learning curve: performance vs. perplexity cutoff

27. Further forward
- Extend the workflow to Talbanken / the Norwegian Dependency Treebank
- Evaluate the impact of preprocessing data for annotation
  - Annotation speed?
  - Annotator agreement?
  - Annotator error?
