Computational Lexicography: Some proposals, David Lindemann

  1. Computational Lexicography: Some proposals
     David Lindemann, UPV/EHU University of the Basque Country
     david.lindemann@ehu.eus
     June 2016

  2. Overview
     (1) Intro: Computational Lexicography
     (2) EuDeLex – a German-Basque electronic dictionary for Basque-L1 German learners (PhD project)
     (3) Bilingual Dictionary Drafting: Connecting Basque word senses to multilingual equivalents (Postdoc project)

  3. Computational Lexicography: Storing and editing of lexical data
     Dictionary Writing Systems: interface for data import and editing, storage in a relational database, representation (export) as XML.
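
The storage-and-export idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the two-table schema (`entry`, `equivalent`) and the XML element names are hypothetical and do not reproduce the schema of any particular Dictionary Writing System.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE entry (id INTEGER PRIMARY KEY, lemma TEXT, pos TEXT);
        CREATE TABLE equivalent (entry_id INTEGER REFERENCES entry(id), te TEXT);
        """
    )
    # Sample data: one German headword with two Basque equivalents.
    con.execute("INSERT INTO entry VALUES (1, 'Begriff', 'noun')")
    con.executemany(
        "INSERT INTO equivalent VALUES (?, ?)",
        [(1, "kontzeptu"), (1, "ideia")],
    )
    return con


def export_xml(con):
    """Serialize all entries and their equivalents as one XML string."""
    root = ET.Element("dictionary")
    entries = con.execute("SELECT id, lemma, pos FROM entry ORDER BY id").fetchall()
    for entry_id, lemma, pos in entries:
        entry = ET.SubElement(root, "entry", pos=pos)
        ET.SubElement(entry, "lemma").text = lemma
        rows = con.execute(
            "SELECT te FROM equivalent WHERE entry_id = ? ORDER BY rowid",
            (entry_id,),
        ).fetchall()
        for (te,) in rows:
            ET.SubElement(entry, "equivalent").text = te
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(export_xml(build_db()))
```

Keeping the relational store as the editing format and generating XML only on export means the same data can be re-exported whenever the target representation changes.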

  4. Computational Lexicography: Dictionary Publishing, parsing, OCR

  5. Computational Lexicography: Corpus Linguistics + Lexicography
     Corpus data provides
       Frequencies of lemma, word form, multiword expression, collocation, syntactic pattern... (“word sketches”)
       Evidence for the definition of meaning
       Example sentences
       Parallel corpora: Translated example sentences
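
As a rough illustration of the first point, frequency lists can be derived from a lemmatized and POS-tagged corpus with simple counting. The sketch below assumes the corpus is available as `(word form, lemma, POS)` triples; that input format, the tag labels and the sample tokens are assumptions made for the example, not the format used in the project.

```python
from collections import Counter


def frequency_lists(tokens):
    """Return (word-form frequencies, lemma+POS frequencies) for tagged tokens."""
    form_freq = Counter(form.lower() for form, _, _ in tokens)
    lemma_freq = Counter((lemma, pos) for _, lemma, pos in tokens)
    return form_freq, lemma_freq


# Toy sample: (word form, lemma, POS) triples, invented for illustration.
SAMPLE_TOKENS = [
    ("Der", "der", "ART"),
    ("Begriff", "Begriff", "NN"),
    ("der", "der", "ART"),
    ("Begriffe", "Begriff", "NN"),
]

if __name__ == "__main__":
    forms, lemmas = frequency_lists(SAMPLE_TOKENS)
    print(forms.most_common())
    print(lemmas.most_common())
```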

  6. Computational Lexicography: Natural Language Processing (NLP) + Lexicography
     EDBL
     NLP Tools
       Corpus building
       Bilingual document alignment
       Sentence alignment (build PC)
       Word alignment (extract TE)
       etc.
     NLP Resources
       Extraction of monolingual and multilingual lexical data
       Linking to dictionary entries

  7. Teamwork with computational linguists
     I. San Vicente, I. Manterola, X. Saralegi: Elhuyar Foundation
     R. Nazar: UPF (Barcelona), U. Valparaíso (Chile)
     ► Lindemann & Nazar (2013)
     ► Lindemann, Manterola, Nazar, San Vicente & Saralegi (2014)
     ► Lindemann & San Vicente (2015a; 2015b; in prep. 2016)

  8. Creation of EuDeLex: working steps
     ✔ Definition of a corpus-based DE macrostructure
     ✔ Definition of a microstructure, suited for the needs of Basque-L1 German learners
     ✔ Compiling of DE-EU parallel corpus (SkE)
     ✔ 10% of DE Lemmalist (Letters A, B): Edition of Dictionary entries DE>EU
     ✔ Investigation: Bilingual Dictionary Drafting
       ✔ Definition of Methods
       ✔ Application and evaluation on German Letter A
     ✗ Edition and publication of all dictionary entries DE>EU
     ✗ Edition and publication of dictionary entries EU>DE
       ✗ Drafting by inverting DE>EU articles
     http://www.ehu.es/eudelex/

  9. Parallel and comparable DE-EU corpora
     Literary Corpus (parallel)
       Created at UPV-EHU (Sanz, Uribarri & Zubillaga 2011-13)
       About 2 million tokens per language
       146,457 sentence pairs
       Sentence alignment hand revised
     Bible Corpus (parallel)
       Created by X. Saralegi
       640,000 tokens per language
       30,440 sentence pairs
     Comparable Corpora
       Created at Elhuyar
       DE and EU Wikipedia
       “Die Zeit” and “Berria” newspaper corpus
     Corpora lemmatized and POS-tagged
       DE: TreeTagger (SkE)
       EU: Eustagger (Gorka Labaka)

  10. Working with the Literary Parallel Corpus (1)
      Parallel KWIC from lemmatized and POS-tagged DE-EU corpus using SkE. Example: “Begriff” (noun)
      LU and TE candidates in Parallel Corpus (SkE). Example: “Begriff” (noun)
      „Begriff“: kontzeptu 18, ideia 11, hitza 2, adigai 1, burutapena 1, ezagutza 1, gai 1, hitz 1, ikusmolde 1, pentsakera 1, termino 1, ulerbide 1, ulerkera 1
      „im Begriff sein“: -tzera joan 5, -tzeko zorian 4, -tzear egon 3, -tzeko asmoa izan 3, -tzeko asmotan egon 2, -tzea pentsatu 1, -tzear izan 1, -tzeko duda egin 1, -tzekotan izan 1, -tzera 1, apoderatzen izan, ekarri 1, gutxi falta + subjkt. 1, hasia izan 1, inf. nahi izan 1, inf. nahian 1, egin behar izan 1
      LU | TE | Hits
      kein Begriff sein | ez ezagutu | 1
      kein Begriff sein | horren entzuterik ere ez izan | 1
      keinen andern Begriff haben als ... | ... baino ez pentsatu | 1
      nicht allzu schnell von Begriff sein | ez oso azkarra izan | 1
      schwer von Begriff sein | burugogorra izan | 1
      sich einen stillen Begriff machen | gutxi gora behera irudika ahal izan | 1
      sich kaum einen Begriff machen | ozta-ozta ideia bat izan | 1
      sich keinen Begriff machen | ez jakin | 1
      sich keinen Begriff machen | ezin imajinatu ere egin | 1
      über alle Begriffe | ezin esan bezain | 1
      Der Begriff X | X | 1
      einen Begriff geben | aditzera eman | 1

  11. Working with the Literary Parallel Corpus (2)
      Gemütlichkeit: 4 hits
        goxotasun
        patxada
        lasaitasun
        konfortea
      Schadenfreude: 10 hits
        (voll Schadenfreude sein) zoritxarraz poztu
        (Schadenfreude empfinden) maltzur sentitu
        bozkario
        gozatze modu bat
        poz txiki bat
        alaitasun maltzur
        besteak umiliatzeko poza
        poz gaizto
        (aus Schadenfreude) besteren gaitzak ninduen pozten
        kalte poz
      ● 2 terms “very hard to translate”
      ● all hapax TE
      ● cf. Teubert 2002

  12. Automatic TE pairing for Bilingual Dictionary Drafting
      Corpus based
        1. DE-EU parallel corpora: GIZA++ (Elhuyar)
        2. DE-EU parallel corpus: Bifid (R. Nazar)
      Mixed
        3. Elhuyar Pibolex mixed approach
      Lexical Knowledge based
        4. Wikipedia IL-links
        5. de.wiktionary: Basque links
        6. GermaNet / EusWN (PWN as pivot)
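
The pivot idea in method 6 can be sketched as a join on shared synset identifiers: two wordnets that are both aligned to Princeton WordNet yield TE candidates wherever a German and a Basque lemma point to the same synset. The dictionary-of-sets input format, the synset IDs (`s1`, `s2`, ...) and the toy entries below are invented for illustration; this is not the project's actual pipeline.

```python
from collections import defaultdict


def pivot_pairs(source_wn, target_wn):
    """Pair source and target lemmas that share at least one pivot synset ID.

    Both arguments map a lemma to the set of pivot synset IDs it belongs to.
    Returns a dict: source lemma -> sorted list of target-language candidates.
    """
    # Index the target wordnet by synset ID for direct lookup.
    target_by_synset = defaultdict(set)
    for lemma, synsets in target_wn.items():
        for synset_id in synsets:
            target_by_synset[synset_id].add(lemma)

    pairs = defaultdict(set)
    for lemma, synsets in source_wn.items():
        for synset_id in synsets:
            pairs[lemma] |= target_by_synset.get(synset_id, set())

    # Drop source lemmas for which no candidate was found.
    return {lemma: sorted(tes) for lemma, tes in pairs.items() if tes}


# Toy data with made-up synset IDs.
DE_WN = {"Begriff": {"s1", "s2"}, "Haus": {"s3"}, "Gemütlichkeit": {"s9"}}
EU_WN = {"kontzeptu": {"s1"}, "ideia": {"s1", "s2"}, "etxe": {"s3"}}

if __name__ == "__main__":
    print(pivot_pairs(DE_WN, EU_WN))
```

A lemma such as "Gemütlichkeit" that shares no synset with any target lemma simply yields no candidate, which is one reason recall stays limited for this kind of method.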

  13. BDD methods: evaluation results overview
      Criterion | Giza++ | Bifid | Pibolex | Wikipedia | Wiktionary | WordNet
      BG: Recall on GS lemmalist | 19% | 5% | 16% | 7% | 4% | 21%
      BG: lemma with 1+ good TE | 63% | 95% | 79% | 89% | 100% | 90%
      BG: lemma with all TE „good“ | 42% | 95% | 57% | 89% | 97% | 78%
      Lemma with 1+ good TE: Recall on GS lemmalist | 13% | 5% | 13% | 7% | 4% | 21%
      Lemma with all TE „good“: Recall on GS lemmalist | 9% | 5% | 9% | 7% | 4% | 18%
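
One plausible reading of these criteria can be written down as a small evaluation routine. The definitions below are assumptions made for the sketch, not the evaluation protocol of the study: "recall" is taken as the share of Gold Standard lemmas for which a method proposes any TE, and a TE counts as "good" if it appears in a set of accepted equivalents. All sample data is invented.

```python
def evaluate(candidates, gold):
    """Compute lemma-level scores for one drafting method.

    candidates: lemma -> list of proposed TE
    gold:       lemma -> set of TE accepted as good (assumed gold standard)
    """
    covered = [lemma for lemma in gold if candidates.get(lemma)]
    one_good = [
        lemma for lemma in covered
        if any(te in gold[lemma] for te in candidates[lemma])
    ]
    all_good = [
        lemma for lemma in covered
        if all(te in gold[lemma] for te in candidates[lemma])
    ]

    def share(part, whole):
        return len(part) / len(whole) if whole else 0.0

    return {
        "recall_on_gs": share(covered, gold),
        "one_good_of_covered": share(one_good, covered),
        "all_good_of_covered": share(all_good, covered),
        "one_good_recall_on_gs": share(one_good, gold),
        "all_good_recall_on_gs": share(all_good, gold),
    }


# Toy data, invented for illustration.
GOLD = {
    "Begriff": {"kontzeptu", "ideia"},
    "Haus": {"etxe"},
    "Buch": {"liburu"},
    "Wasser": {"ur"},
}
CANDIDATES = {
    "Begriff": ["kontzeptu", "mahai"],  # one good TE, one noise item
    "Haus": ["etxe"],                   # all TE good
    "Abend": ["arrats"],                # lemma not in the gold list, ignored
}

if __name__ == "__main__":
    for name, value in evaluate(CANDIDATES, GOLD).items():
        print(f"{name}: {value:.0%}")
```

Under this reading, the last two scores are the product of coverage and quality, which is why a very precise method with low coverage still contributes few usable lemmas.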

  14. BDD methods: result interpretation
      Overall Outcome (mixed methods)
        80% recall
        Up to 60% of lemmata in Gold Standard lemmalist provided with 1+ good TE
        Half of these (30%) are noise-free
        73% of these are nouns (54% of GS lemmas are nouns; Wikipedia: only nouns)
      Three groups of Dictionary Draft Data
        (1) Wiktionary, Wikipedia, Bifid (Lit Corpus): GS: 40.6% 1+ good TE
            Nearly 100% precision
            Direct pasting into Dictionary Database, post-editing
        (2) Bifid (Bible Corpus), WordNet: GS: 20% 1+ good TE
            High precision
            Revision by hand, then paste into Dictionary Database
        (3) Giza, Elhuyar Pivot: GS: 18.7% 1+ good TE
            Lower precision
            More strategies for noise reduction needed
            Data displayed as support in Dictionary entry editing

  15. EuDeLex Online-Publishing

  16. From EuDeLex (predoc) to EuMultiLex (postdoc)
      Application of BDD methods to more language pairs
        Wikipedia? 292 languages
        Wiktionary? 170 languages
        Parallel Corpora? With Basque, difficult
        WordNet? 34 with free licence, around 30 non-free
      Starting from Basque
        Definition of a basic lemma list we want to cover
        Semi-automatically obtained draft for manual edition
      Preparation of draft dictionary set
        Manual edition of German-Basque-German: David
        Other combinations: Lexicographers skilled in both languages

  17. Definition of a basic lemma list
      Corpus-based frequency lemma list for Basque
      Lemmata extracted from ETC (Sarasola, Salaburu & Landa 2013) and Elh200 (Leturia 2014)
      Comparison to 6 reference resources
      ► Lindemann & San Vicente (2015a; 2015b)

  18. Basque Dictionary Draft: (1) Homograph Level
      Basic list of lemma-signs, frequency data from Elh200 corpus

  19. (1) Homograph, (2) Syntactical Entity
      Syntactical Entities (LemPos-entities) from Elh200 corpus
      Corpus tagged with EusTagger, based on EDBL data
      Frequency data for each entity

  20. (1) Homograph, (2) Syntactical Entity, (3) Sense
      Word senses from EusWN
      Linking of senses to syntactical entities (as child elements)
      ► Lindemann & San Vicente (in press, 2016)
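
The three levels described on slides 18 to 20 form a simple tree: a homograph contains syntactical entities, and each entity contains its senses as child elements. A minimal sketch of such a structure follows. The class names, field names, frequencies and synset IDs are hypothetical, and deriving the homograph frequency as the sum over its entities is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sense:
    synset_id: str  # hypothetical identifier format


@dataclass
class SyntacticalEntity:
    lemma: str
    pos: str
    frequency: int  # corpus frequency of this lemma+POS pair
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Homograph:
    lemma_sign: str
    entities: List[SyntacticalEntity] = field(default_factory=list)

    @property
    def frequency(self) -> int:
        """Total frequency, assumed here to be the sum over all entities."""
        return sum(entity.frequency for entity in self.entities)


def build_example() -> Homograph:
    """Toy entry with invented frequencies and synset IDs."""
    noun = SyntacticalEntity(
        "gai", "noun", 1200, [Sense("syn-n-001"), Sense("syn-n-002")]
    )
    adjective = SyntacticalEntity("gai", "adjective", 800)  # no sense linked yet
    return Homograph("gai", [noun, adjective])


if __name__ == "__main__":
    entry = build_example()
    print(entry.lemma_sign, entry.frequency)
    for entity in entry.entities:
        print(" ", entity.pos, entity.frequency, len(entity.senses))
```

An entity with an empty `senses` list corresponds to a gap in the draft, which makes such gaps easy to list later.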

  21. Drafted Basque dictionary content
      Part of speech | Corpus-based SE | SE with one or more EusWN word senses | Total EusWN word senses | Polysemy ratio | SE present in corpus but not in EusWN | SE present in EusWN but not found in corpus
      Verbs | 4,151 | 1,636 | 6,567 | 4.01 | 2,515 | 279
      Common Nouns | 23,921 | 15,193 | 30,613 | 2.01 | 8,728 | 3,479
      Proper Nouns | 2,443 | 132 | 153 | 1.16 | 2,311 | 60
      Adjectives | 6,147 | 50 | 141 | 2.82 | 6,097 | 8
      Adverbs | 1,556 | 0 | 0 | 0.00 | 1,556 | 0
      Total | 38,218 | 17,011 | 37,474 | 2.20 | 21,207 | 3,826
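
The counts on this slide allow a quick derived check: how large the gap is per part of speech, and what share of corpus-based SE already carries at least one EusWN sense. The "coverage" percentage is a measure introduced here for illustration and is not a figure reported on the slide; the input numbers are copied from the first two count columns.

```python
# (corpus-based SE, SE with one or more EusWN word senses) per part of speech
COUNTS = {
    "Verbs": (4151, 1636),
    "Common Nouns": (23921, 15193),
    "Proper Nouns": (2443, 132),
    "Adjectives": (6147, 50),
    "Adverbs": (1556, 0),
}


def summarize(counts):
    """Return gap size and coverage percentage per row, plus a Total row."""
    def row(corpus_se, linked_se):
        coverage = 100 * linked_se / corpus_se if corpus_se else 0.0
        return {"gap": corpus_se - linked_se, "coverage_pct": round(coverage, 1)}

    rows = {pos: row(corpus, linked) for pos, (corpus, linked) in counts.items()}
    total_corpus = sum(corpus for corpus, _ in counts.values())
    total_linked = sum(linked for _, linked in counts.values())
    rows["Total"] = row(total_corpus, total_linked)
    return rows


if __name__ == "__main__":
    for pos, values in summarize(COUNTS).items():
        print(f"{pos}: gap {values['gap']}, coverage {values['coverage_pct']}%")
```

The gap sizes computed this way match the column "SE present in corpus but not in EusWN", so that column can be regenerated from the first two whenever the draft is updated.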

  22. Dictionary Draft SE Gap Detection: semi-automatic
      Blank SE (present in EDBL, not in EusWN): Find corresponding synset in Princeton WordNet, copy ID
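
The detection half of this step amounts to a set difference; the proposal half needs some bridge from Basque to Princeton WordNet. The sketch below assumes, purely for illustration, that a Basque-to-English lookup table and an index from English lemma and POS to PWN synset IDs are available. Both resources, the ID format and the sample entries are hypothetical, and the proposed IDs are meant for a lexicographer to confirm, in line with the semi-automatic setup.

```python
def find_blank_entities(edbl_entities, euswn_entities):
    """Return (lemma, pos) pairs present in EDBL but missing from EusWN."""
    return sorted(set(edbl_entities) - set(euswn_entities))


def suggest_synsets(blank_entities, eu_to_en, pwn_index):
    """Propose PWN synset IDs for each blank entity via English equivalents.

    eu_to_en:  Basque lemma -> list of English equivalents (assumed resource)
    pwn_index: (English lemma, pos) -> list of synset IDs (assumed resource)
    """
    suggestions = {}
    for lemma, pos in blank_entities:
        ids = []
        for english in eu_to_en.get(lemma, []):
            for synset_id in pwn_index.get((english, pos), []):
                if synset_id not in ids:  # keep order, avoid duplicates
                    ids.append(synset_id)
        suggestions[(lemma, pos)] = ids
    return suggestions


# Toy data with made-up synset IDs.
EDBL = [("etxe", "n"), ("azkar", "adv"), ("liburu", "n")]
EUSWN = [("etxe", "n"), ("liburu", "n")]
EU_TO_EN = {"azkar": ["quickly", "fast"]}
PWN_INDEX = {("quickly", "adv"): ["pwn-adv-0001"], ("fast", "adv"): ["pwn-adv-0002"]}

if __name__ == "__main__":
    blanks = find_blank_entities(EDBL, EUSWN)
    print(suggest_synsets(blanks, EU_TO_EN, PWN_INDEX))
```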
