Dictionaries Christian Chiarcos Applied Computational Linguistics - PowerPoint PPT Presentation

Digital Humanities Workshop, Sep 9 – 11, 2014, Batumi, Georgia. Linking Machine-Readable Dictionaries. Christian Chiarcos, Applied Computational Linguistics Lab, chiarcos@informatik.uni-frankfurt.de


  1. Digital Humanities Workshop, Sep 9 – 11, 2014, Batumi, Georgia
     Linking Machine-Readable Dictionaries
     Christian Chiarcos, Applied Computational Linguistics Lab
     chiarcos@informatik.uni-frankfurt.de

  2. Linking Machine-Readable Dictionaries
     • Motivation: Aggregating information
       – from different dictionaries
       – from dictionaries and automatically analyzed text
     • State of the art on machine-readable dictionaries
       – XML (TEI, LMF)
       – RDF (lemon)
     • Example
       – Converting, linking and querying multilingual Wiktionaries

  4. The future of the dictionary …
     "The three things no young person owns or uses and often don't realise exist: an alarm clock, an address book and a dictionary … At university I didn't meet a single person who owned any of them"
     http://guardian.co.uk/books/booksblog/2012/sep/13/dictionaries-democratic-crowdsourcing/
     "[D]ictionaries are not dead, they just smell funny"
     Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1, paraphrasing Frank Zappa's quote on jazz (1974)

  5. The future of the dictionary …
     "[D]ictionaries … lose their autonomous identity and disappear in language technology. Machine translation, word processors, … and the like incorporate dictionary content and apply it in new forms"
     Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1
     "[T]he message is clear and unambiguous: the future of the dictionary is digital."
     Stephen Bullon, Macmillan Education, upon announcing that Macmillan will no longer publish print dictionaries, Nov 2012

  7. The future of the dictionary …
     … is digital
     – no space limitations
       • adding context information, e.g., from corpora
     – dynamic ordering & search
       • no index optimization for manual lookup
     – information aggregation
       • integrating information from different sources
     two use cases:
       • cross-lingual dictionary lookup
       • text mining for archaeologists

  9. Information Aggregation I Cross-lingual search
     • Assume you're a speaker of language X, say, German, and are interested in working with text in language Y, say, Georgian
       – Statistical machine translation may give you an idea, but you certainly want to counter-check with a dictionary ...
     ... unfortunately, you don't have one

  10. Information Aggregation I Cross-lingual search
      • Assume you're a speaker of language X, say, German, and are interested in working with text in language Y, say, Georgian
      • We do have a Georgian-English dictionary, though, and (luckily) an English-German one
      • Given a proper representation and suitable storage and query formalisms, it is possible to perform a transitive query using English as a pivot language
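
The transitive query described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy dictionaries hold a handful of entries, the assignment of German words to "foot" versus "leg" is assumed, and `pivot_translate` is a hypothetical helper, not part of any system mentioned in the talk.

```python
# Toy dictionaries (illustrative subset; which German word belongs to
# "foot" and which to "leg" is an assumption made for this sketch).
KA_EN = {"ფეხი": ["foot", "leg"]}
EN_DE = {"foot": ["Fuß", "Sockel"], "leg": ["Bein", "Etappe"]}


def pivot_translate(word, source_to_pivot, pivot_to_target):
    """Follow source -> pivot -> target and collect all target words."""
    targets = []
    for pivot_word in source_to_pivot.get(word, []):
        for target_word in pivot_to_target.get(pivot_word, []):
            if target_word not in targets:  # keep first occurrence only
                targets.append(target_word)
    return targets


german = pivot_translate("ფეხი", KA_EN, EN_DE)
print(german)  # ['Fuß', 'Sockel', 'Bein', 'Etappe']
```

Every sense of every pivot word is followed, which is why the number of candidates grows quickly with the size of the real dictionaries.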

  11. Information Aggregation I Cross-lingual search
      [Figure: translation paths from Georgian into German with English as pivot]
      ფეხი → foot, leg (http://www.georgianweb.com/pdf/lexicon.pdf)
      foot, leg → Abschnitt, Ader, Basis, Bein, Etappe, Fuß, Fußbreit, Fußende, Fußlinie, Fußmauer, Fußpunkt, Hachse, Kathete, Mastfuß, Programmzweig, Strecke, Schaft, Strang, Stollen, Standfuß, Sockel, Schenkel, Tritt, Standvorrichtung, Schlägel, Sohle, Segelunterliek (dict.leo.org)

  12. Information Aggregation I Cross-lingual search
      • Unfortunately, using English introduces a lot of noise
        – 2 English translations, 27 (!) German translations
      • But we can combine multiple paths, e.g., one using English as a pivot, one using Russian
        – elements in the intersection should be more reliable

  13. Information Aggregation I Cross-lingual search
      [Figure: translation paths from Georgian into German with English and Russian as pivots]
      ფეხი → foot, leg (http://www.georgianweb.com/pdf/lexicon.pdf)
      ფეხი → нога (http://meskhi.net/lexicon)
      foot, leg, нога → German (dict.leo.org)
      German candidates shown: Abschnitt, Ader, Basis, Bein, Etappe, Fuß, Fußbreit, Fußende, Fußlinie, Fußmauer, Fußpunkt, Hachse, Kathete, Mastfuß, Programmzweig, Strecke, Schaft, Spielbein, Strang, Stollen, Standfuß, Sockel, Schenkel, Tritt, Standvorrichtung, Schlägel, Sohle, Segelunterliek

  14. Information Aggregation I Cross-lingual search
      • Unfortunately, using English introduces a lot of noise
        – 2 English translations, 27 (!) German translations
      • But we can combine multiple paths, e.g., one using English as a pivot, one using Russian
        – elements in the intersection should be more reliable
      27 English-based translations + 3 Russian-based translations = 2 shared translations

  15. Information Aggregation I Cross-lingual search
      • In a similar way, words missing from the Russian (or the English) path may be taken from the other one
        – more noise, but better coverage
      27 English-based translations + 3 Russian-based translations = 28 possible translations
        – e.g., German Spielbein "free leg"
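
The two ways of combining paths (intersection for reliability, union for coverage) map directly onto set operations. The sketch below uses the 27 German candidates shown for the English path. The slides do not say which three words come from the Russian path, apart from Spielbein being absent from the English one, so the choice of Bein and Fuß as the shared pair is an assumption.

```python
# German candidates reached via English (as listed on the slides).
VIA_ENGLISH = {
    "Abschnitt", "Ader", "Basis", "Bein", "Etappe", "Fuß", "Fußbreit",
    "Fußende", "Fußlinie", "Fußmauer", "Fußpunkt", "Hachse", "Kathete",
    "Mastfuß", "Programmzweig", "Strecke", "Schaft", "Strang", "Stollen",
    "Standfuß", "Sockel", "Schenkel", "Tritt", "Standvorrichtung",
    "Schlägel", "Sohle", "Segelunterliek",
}

# ASSUMPTION: only Spielbein is known to be Russian-only; Bein and Fuß
# are picked here as the two shared words for illustration.
VIA_RUSSIAN = {"Bein", "Fuß", "Spielbein"}

shared = VIA_ENGLISH & VIA_RUSSIAN    # more reliable, lower coverage
combined = VIA_ENGLISH | VIA_RUSSIAN  # noisier, better coverage

print(len(VIA_ENGLISH), len(shared), len(combined))  # 27 2 28
```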

  16. Information Aggregation I Jargon: A Prototype
      • student project @ GU Frankfurt
      • enter a word (in any language) and a target language
      • consult different machine-readable dictionaries to find a path into the target language
      • visualize results together with their "path"
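
Finding "a path into the target language" can be treated as a graph search over the available dictionaries. The following is a minimal sketch, not a description of how Jargon is implemented: the inventory of language pairs and the function `find_language_path` are hypothetical, and breadth-first search is simply one reasonable choice that returns a shortest chain.

```python
from collections import deque

# Hypothetical inventory of available dictionaries as (source, target)
# language pairs; not taken from the Jargon prototype.
DICTIONARIES = {("ka", "en"), ("ka", "ru"), ("en", "de"), ("ru", "de")}


def find_language_path(source, target, dictionaries):
    """Breadth-first search for a shortest chain of dictionaries."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # sorted() makes the traversal order deterministic
        for src, tgt in sorted(dictionaries):
            if src == path[-1] and tgt not in seen:
                seen.add(tgt)
                queue.append(path + [tgt])
    return None  # no chain of dictionaries reaches the target


print(find_language_path("ka", "de", DICTIONARIES))  # ['ka', 'en', 'de']
```

To combine several pivots as on the earlier slides, the search would have to return all paths rather than only the first one found.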

  17. Information Aggregation I Jargon: A Prototype

  18. Information Aggregation I Jargon: A Prototype
      • Jargon uses lexical resources provided by different groups
        – using a shared vocabulary
          • lemon, more in 10 minutes
        => joint queries
      • still under development
        – prototype on restricted data set

  19. Information Aggregation II Multilingual Semantic Web
      • a system for text mining (open information extraction) from archaeological reports
      • extract machine-readable information from plain text
        – currently, English only
          • in the longer term, German and Dutch
        – http://corpora.acoli.informatik.uni-frankfurt.de/text-mining-webservice

  20. Information Aggregation II Multilingual Semantic Web Given a PDF document

  21. Information Aggregation II Multilingual Semantic Web Upload to server

  22. Information Aggregation II Multilingual Semantic Web Perform NLP analysis

  23. Information Aggregation II Multilingual Semantic Web Visualize data

  24. Information Aggregation II Multilingual Semantic Web e.g. archaeological periods

  25. Information Aggregation II Multilingual Semantic Web or query in the results

  26. Information Aggregation II Multilingual Semantic Web or query in the results
      TEXT: Dr Irakli Iashvili spent a month at the Heberden Coin Room at the Ashmolean Museum, also with the support of the British Academy, working on the coinage of the Black Sea in general, and the coins found at Pichvnari in particular.
      [Screenshot panels: QUERY, TRIPLES, Result]

  27. Information Aggregation II Multilingual Semantic Web or query in the results
      In this query, the only information-bearing element is ":work".
      If we define that ":work" entails ":bearbeitet" (the German translation), we can formulate the same query in German, i.e. ?a :bearbeitet ?c
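
The effect of declaring that ":work" entails ":bearbeitet" can be illustrated with a tiny triple matcher. This is a sketch under assumptions: the subject and object identifiers are invented for the example, and `query` is a hypothetical helper. In RDF Schema a comparable effect is commonly obtained with rdfs:subPropertyOf; the slide does not say which mechanism the system uses.

```python
# Extracted triples as (subject, predicate, object). The subject and
# object identifiers are hypothetical stand-ins for the real output.
TRIPLES = [(":Iashvili", ":work", ":coinage_of_the_Black_Sea")]

# Entailment between predicates: a ":work" triple also counts as a
# ":bearbeitet" triple.
ENTAILS = {":work": {":bearbeitet"}}


def query(predicate, triples, entails):
    """Return (subject, object) pairs for the pattern ?a <predicate> ?c."""
    results = []
    for subj, pred, obj in triples:
        if pred == predicate or predicate in entails.get(pred, set()):
            results.append((subj, obj))
    return results


english = query(":work", TRIPLES, ENTAILS)
german = query(":bearbeitet", TRIPLES, ENTAILS)
print(english == german)  # True
```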

  28. Linking Machine-Readable Dictionaries
      • Motivation: Aggregating information
        – from different dictionaries
        – from dictionaries and automatically analyzed text
      • State of the art on machine-readable dictionaries
        – XML
        – RDF
      • Example
        – Converting, linking and querying multilingual Wiktionaries

  29. Machine Readable Dictionaries XML
      • Text Encoding Initiative (TEI)
        – specifications for markup of born-digital documents
        – originally closely oriented towards digital editions of printed books
        – rich metadata (TEI header)
        – semantic markup (div, seg, verse, …)
        – limited interoperability
          • many different ways to represent the same information
        => information aggregation ???

  31. Machine Readable Dictionaries XML
      • Lexical Markup Framework (LMF)
        – ISO standard for representing machine-readable dictionaries
        – an abstract model with XML specifications (DTD)
        – concrete application requires an instantiation
          => extending the DTD
          => violating the original DTD
          => in order to use this standard, you need to break it
          => suggestions for alternative representations of LMF, e.g., RDF (Francopoulo 2006)

  32. Resource Description Framework (RDF)
      • W3C standard (1999)
        – generic data model: directed labeled graph
          • nodes, edges, labels
        – originally developed to provide metadata about resources
          • e.g., journals in a bookstore and eBooks in an online shop
        – resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)
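
The "directed labeled graph" model can be pictured as a set of (subject, predicate, object) triples, where each triple is one labeled edge. The sketch below uses plain Python tuples rather than an RDF library. All URIs are hypothetical and use the reserved example.org domain, and `outgoing` is an illustrative helper.

```python
# Hypothetical namespace; example.org is reserved for documentation.
EX = "http://example.org/"

# Each triple is one labeled edge: subject --predicate--> object.
GRAPH = {
    (EX + "ebook1", EX + "title", "A Georgian-English Dictionary"),
    (EX + "ebook1", EX + "soldBy", EX + "onlineShop"),
    (EX + "journal7", EX + "soldBy", EX + "bookstore"),
}


def outgoing(node, graph):
    """Return the sorted (predicate, object) edges leaving a node."""
    return sorted((p, o) for s, p, o in graph if s == node)


edges = outgoing(EX + "ebook1", GRAPH)
for predicate, obj in edges:
    print(predicate, "->", obj)
```

Because nodes are identified by URIs, triples published by different groups about the same resource can be merged by taking the union of their triple sets.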
