Digital Humanities Workshop, Sep 9 – 11, 2014, Batumi, Georgia
Linking Machine-Readable Dictionaries
Christian Chiarcos, Applied Computational Linguistics Lab
chiarcos@informatik.uni-frankfurt.de
Linking Machine-Readable Dictionaries
• Motivation: Aggregating information
  – from different dictionaries
  – from dictionaries and automatically analyzed text
• State of the art on machine-readable dictionaries
  – XML (TEI, LMF)
  – RDF (lemon)
• Example
  – Converting, linking and querying multilingual Wiktionaries
The future of the dictionary …
"The three things no young person owns or uses and often don't realise exist: an alarm clock, an address book and a dictionary … At university I didn't meet a single person who owned any of them"
http://guardian.co.uk/books/booksblog/2012/sep/13/dictionaries-democratic-crowdsourcing/
The future of the dictionary …
"[D]ictionaries are not dead, they just smell funny"
Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1, paraphrasing Frank Zappa's quote on jazz (1974)
The future of the dictionary …
"[D]ictionaries … lose their autonomous identity and disappear in language technology. Machine translation, word processors, … and the like incorporate dictionary content and apply it in new forms"
Ilan Kernerman, CEO KDictionaries, Kernerman Dictionary News 21 (July 2013): 1
"[T]he message is clear and unambiguous: the future of the dictionary is digital."
Stephen Bullon, Macmillan Education, upon announcing that Macmillan will no longer publish print dictionaries, Nov 2012
The future of the dictionary …
… is digital
  – no space limitations
    • adding context information, e.g., from corpora
  – dynamic ordering & search
    • no index optimization for manual lookup
  – information aggregation
    • integrating information from different sources
The future of the dictionary …
Two use cases for information aggregation:
• cross-lingual dictionary lookup
• text mining for archaeologists
Information Aggregation I: Cross-lingual search
• Assume you're a speaker of language X, say, German, and are interested in working with text in language Y, say, Georgian
  – Statistical machine translation may give you an idea, but you certainly want to counter-check with a dictionary ...
Information Aggregation I: Cross-lingual search
  ... unfortunately, you don't have one
Information Aggregation I: Cross-lingual search
• We do have a Georgian-English dictionary, though, and (luckily) an English-German one
• Given proper representation, storage and query formalisms, it is possible to perform a transitive query using English as a pivot language
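A transitive query over a pivot language can be sketched as the composition of two lookups. The following is a minimal illustration, not the representation used in the talk: the dictionaries are plain Python dicts, the function name `pivot_translate` is hypothetical, and the assignment of German words to "foot" or "leg" is an assumption made for the toy data.

```python
# Toy bilingual dictionaries. The Georgian-English entry follows the slides;
# which German word belongs to which English word is an assumption.
KA_EN = {"ფეხი": ["foot", "leg"]}
EN_DE = {
    "foot": ["Fuß", "Sockel"],
    "leg": ["Bein", "Etappe"],
}


def pivot_translate(word, source_to_pivot, pivot_to_target):
    """Return {target word: sorted list of pivot words that lead to it}."""
    paths = {}
    for pivot in source_to_pivot.get(word, []):
        for target in pivot_to_target.get(pivot, []):
            paths.setdefault(target, set()).add(pivot)
    return {target: sorted(pivots) for target, pivots in paths.items()}


result = pivot_translate("ფეხი", KA_EN, EN_DE)
for german, pivots in sorted(result.items()):
    print(german, "via", ", ".join(pivots))
```

Keeping the pivot words alongside each candidate makes it possible to show the "path" of a translation later on.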
Information Aggregation I: Cross-lingual search
[Figure: Georgian ფეხი is linked (http://www.georgianweb.com/pdf/lexicon.pdf) to English foot and leg, which are linked (dict.leo.org) to 27 German words: Abschnitt, Ader, Basis, Bein, Etappe, Fuß, Fußbreit, Fußende, Fußlinie, Fußmauer, Fußpunkt, Hachse, Kathete, Mastfuß, Programmzweig, Schaft, Schenkel, Schlägel, Segelunterliek, Sockel, Sohle, Standfuß, Standvorrichtung, Stollen, Strang, Strecke, Tritt]
Information Aggregation I: Cross-lingual search
• Unfortunately, using English as a pivot introduces a lot of noise
  – 2 English translations, 27 (!) German translations
• But we can combine multiple paths, e.g., one using English as a pivot, one using Russian
  – elements in the intersection should be more reliable
Information Aggregation I: Cross-lingual search
[Figure: two paths from Georgian ფეხი into German. English path: foot, leg (http://www.georgianweb.com/pdf/lexicon.pdf), then dict.leo.org, yielding the 27 German candidates from Abschnitt to Tritt. Russian path: нога (http://meskhi.net/lexicon), then dict.leo.org, adding Spielbein to the German candidates]
Information Aggregation I: Cross-lingual search
• Intersecting both paths: 27 English-based translations and 3 Russian-based translations have 2 translations in common
Information Aggregation I: Cross-lingual search
• In a similar way, words missing from the Russian (or the English) path may be taken from the other one
  – more noise, but better coverage
  – 27 English-based translations + 3 Russian-based translations = 28 possible translations
  – e.g., German Spielbein "free leg"
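The two ways of combining pivot paths correspond to set intersection (more reliable) and set union (better coverage). The sketch below reproduces the counts from the slides. The English-based set is the list shown on the slide; the Russian-based set is an assumption, since the slides only state that there are 3 such translations, that 2 are shared, and that Spielbein is the additional one.

```python
# German candidates reached via the English pivot (as listed on the slide).
ENGLISH_PATH = {
    "Abschnitt", "Ader", "Basis", "Bein", "Etappe", "Fuß", "Fußbreit",
    "Fußende", "Fußlinie", "Fußmauer", "Fußpunkt", "Hachse", "Kathete",
    "Mastfuß", "Programmzweig", "Schaft", "Schenkel", "Schlägel",
    "Segelunterliek", "Sockel", "Sohle", "Standfuß", "Standvorrichtung",
    "Stollen", "Strang", "Strecke", "Tritt",
}

# German candidates reached via the Russian pivot.
# ASSUMPTION: the slides do not name the two shared words; Bein and Fuß
# are chosen here for illustration only.
RUSSIAN_PATH = {"Bein", "Fuß", "Spielbein"}

shared = ENGLISH_PATH & RUSSIAN_PATH      # supported by both paths
combined = ENGLISH_PATH | RUSSIAN_PATH    # supported by at least one path
only_russian = RUSSIAN_PATH - ENGLISH_PATH

print(len(ENGLISH_PATH), len(RUSSIAN_PATH), len(shared), len(combined))
print(sorted(only_russian))
```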
Information Aggregation I: Jargon, a prototype
• student project @ GU Frankfurt
• enter a word (in any language) and a target language
• consult different machine-readable dictionaries to find a path into the target language
• visualize results together with their "path"
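Finding a path into the target language can be pictured as a search over a graph whose nodes are languages and whose edges are available dictionaries. The following breadth-first search is only a sketch of that idea and makes no claim about how Jargon is implemented; the dictionary inventory and the function name `find_language_path` are hypothetical.

```python
from collections import deque

# Hypothetical inventory of dictionaries as (source, target) language pairs.
DICTIONARIES = [("ka", "en"), ("ka", "ru"), ("en", "de"), ("ru", "de"), ("en", "fr")]


def find_language_path(source, target, dictionaries):
    """Breadth-first search for a shortest chain of dictionaries from source to target."""
    neighbours = {}
    for src, tgt in dictionaries:
        neighbours.setdefault(src, []).append(tgt)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of dictionaries connects the two languages


path = find_language_path("ka", "de", DICTIONARIES)
print(path)
```

Dictionaries are treated as directed here, so a Georgian-English resource does not automatically provide English-Georgian lookup.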
Information Aggregation I: Jargon, a prototype
• Jargon uses lexical resources provided by different groups
  – using a shared vocabulary
    • lemon, more in 10 minutes
  => joint queries
• still under development
  – prototype on restricted data set
Information Aggregation II: Multilingual Semantic Web
• a system for text mining (open information extraction) from archaeological reports
• extract machine-readable information from plain text
  – currently, English only
    • in the longer term, German and Dutch
  – http://corpora.acoli.informatik.uni-frankfurt.de/text-mining-webservice
Information Aggregation II: Multilingual Semantic Web
• Given a PDF document
• Upload to server
• Perform NLP analysis
• Visualize data, e.g., archaeological periods
• or query in the results
Information Aggregation II: Multilingual Semantic Web
TEXT: Dr Irakli Iashvili spent a month at the Heberden Coin Room at the Ashmolean Museum, also with the support of the British Academy, working on the coinage of the Black Sea in general, and the coins found at Pichvnari in particular.
[Screenshot: QUERY, TRIPLES and Result panels]
Information Aggregation II: Multilingual Semantic Web
• In this query, the only information-bearing element is ":work"
• If we define that ":work" entails ":bearbeitet" (the German translation), we can formulate the same query in German, i.e., ?a :bearbeitet ?c
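The effect of declaring that ":work" entails ":bearbeitet" can be illustrated with a toy triple store. This is a minimal sketch under stated assumptions: the triples below are invented for illustration (the actual extracted triples are not reproduced here), the English query is assumed to have the form ?a :work ?c, and `expand` and `query` are hypothetical helper names rather than part of any real system.

```python
# Invented example triples; identifiers are hypothetical.
TRIPLES = [
    (":Iashvili", ":work", ":coinage"),
    (":Iashvili", ":spend", ":month"),
]

# Declared entailments: predicate -> set of predicates it entails.
ENTAILS = {":work": {":bearbeitet"}}


def expand(triples, entails):
    """Add one entailed triple per declared predicate entailment (single step, no chaining)."""
    inferred = set(triples)
    for s, p, o in triples:
        for q in entails.get(p, ()):
            inferred.add((s, q, o))
    return inferred


def query(triples, predicate):
    """Match the pattern ?a <predicate> ?c and return sorted (?a, ?c) bindings."""
    return sorted((s, o) for s, p, o in triples if p == predicate)


store = expand(TRIPLES, ENTAILS)
english = query(store, ":work")
german = query(store, ":bearbeitet")
print(english == german, german)
```

In an RDF setting, such an entailment would typically be stated in the data itself, so that a reasoner or a rewritten query yields the same bindings for both predicates.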
Linking Machine-Readable Dictionaries
• Motivation: Aggregating information
  – from different dictionaries
  – from dictionaries and automatically analyzed text
• State of the art on machine-readable dictionaries
  – XML
  – RDF
• Example
  – Converting, linking and querying multilingual Wiktionaries
Machine Readable Dictionaries: XML
• Text Encoding Initiative (TEI)
  – specifications for markup of digital-born documents
  – originally closely oriented towards digital editions of printed books
  – rich metadata (TEI header)
  – semantic markup (div, seg, verse, …)
  – limited interoperability
    • many different ways to represent the same information
  => information aggregation ???
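Why "many different ways to represent the same information" hampers aggregation can be seen from a small example. The two entries below are hypothetical, simplified TEI-style encodings of the same fact (namespace declarations are omitted, and neither is taken from a real dictionary): one marks the translation as a cit/quote pair, the other places it in a def element. A consumer has to know both conventions to extract the same translation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-style entries (TEI namespace omitted).
ENTRY_A = """<entry>
  <form><orth>leg</orth></form>
  <sense><cit type="translation"><quote>Bein</quote></cit></sense>
</entry>"""

ENTRY_B = """<entry>
  <form><orth>leg</orth></form>
  <sense><def>Bein</def></sense>
</entry>"""


def translations(entry_xml):
    """Collect translations from either of the two encodings shown above."""
    root = ET.fromstring(entry_xml)
    found = [q.text for q in root.findall(".//cit[@type='translation']/quote")]
    found += [d.text for d in root.findall(".//sense/def")]
    return found


print(translations(ENTRY_A), translations(ENTRY_B))
```

Every additional encoding convention requires another extraction rule, which is the interoperability problem named on the slide.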
Machine Readable Dictionaries: XML
• Lexical Markup Framework (LMF)
  – ISO standard for representing machine-readable dictionaries
  – an abstract model with XML specifications (DTD)
  – concrete application requires an instantiation
    => extending the DTD
    => violating the original DTD
  => in order to use this standard, you need to break it
Machine Readable Dictionaries: XML
• Lexical Markup Framework (LMF)
  => suggestions for alternative representations of LMF, e.g., RDF (Francopoulo 2006)
Resource Description Framework (RDF)
• W3C standard (1999)
  – generic data model: directed labeled graph
    • nodes, edges, labels
  – originally developed to provide metadata about resources
    • e.g., journals in a bookstore and eBooks in an online shop
  – resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)
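The directed labeled graph can be written down as a list of (subject, predicate, object) triples, where each triple is one labeled edge. The sketch below uses plain Python tuples rather than an RDF library; the example.org URIs and the titles are hypothetical, while the two predicates are Dublin Core terms.

```python
from collections import defaultdict

EX = "http://example.org/"          # hypothetical namespace for illustration
DCT = "http://purl.org/dc/terms/"   # Dublin Core metadata terms

# Each triple is one directed edge: subject --predicate--> object.
TRIPLES = [
    (EX + "journal/1", DCT + "title", "An Example Journal"),
    (EX + "journal/1", DCT + "publisher", EX + "org/1"),
    (EX + "ebook/7", DCT + "title", "An Example eBook"),
]


def build_graph(triples):
    """Index outgoing edges by subject node."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


graph = build_graph(TRIPLES)
titles = [o for p, o in graph[EX + "journal/1"] if p == DCT + "title"]
print(titles)
```

Because nodes are identified by URIs, triples from different sources that mention the same URI attach to the same node, which is what makes merging data sets straightforward.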