
From monolithic XML for print/web to lean XML for data: realising linked data for dictionaries



  1. From monolithic XML for print/web to lean XML for data: realising linked data for dictionaries
     Matt Kohl & Sandro Cirulli, Language Technologists, Oxford University Press (OUP)
     7 June 2014

  2. Introduction: Oxford University Press
     ◮ World-renowned dictionary publisher
     ◮ Licensing partner for lexical data

  3. Introduction: Shifts in Publishing
     ◮ New trends & demands
     ◮ Emerging technologies & markets
     ◮ Importance of well-structured, semantically-rich data
     ◮ Speed!

  4. Data Modelling: Our Current Dictionary Data Models
     ◮ Print-oriented: designed to capture dictionary layout
     ◮ Monolithic: one enormous document
     ◮ Permissive: continually loosened to accommodate new texts
     Can’t give us the flexibility we need

  5. Data Modelling: Requirements
     A new approach should:
     ◮ Represent language concepts, not layouts
     ◮ Enable data reusability for different products & services
     ◮ Allow only one, clear way to model any given lexical item

  6. Data Modelling: The New Lexical Schema

  7. Data Conversion: Moving Data into the Lexical Schema
     Conversion Framework Requirements
     ◮ Scalability: convert 40+ data-sets
     ◮ Standardization: harmonize variation inside the data-sets
     ◮ Modularity: enable customization, slotting in & out of QA, etc.

  8. Data Conversion: Tools
     ◮ XProc
     ◮ XSpec
     ◮ Schematron & XML Schema
     ◮ Jenkins CI
     ◮ Agile methodology

  9-14. Data Conversion: Simplified XProc pipeline (diagram, built up step by step over slides 9 to 14)
      print-focused XML (+ xml:lang="es")
      -> XSL transformations
      -> Schematron validation
      -> enhanced XML
      -> XSL transformations
      -> XML Schema validation and Schematron validation
      -> Lexical Data
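
The staged pipeline above (transform, validate, transform again, validate again) can be illustrated outside XProc. The following is only a minimal Python sketch of the idea of slotting steps in and out, not OUP's actual pipeline: the step functions `add_language`, `rename_definition` and `check_has_text` are hypothetical stand-ins for the XSL transformations and Schematron rules, whose real content the slides do not show.

```python
import xml.etree.ElementTree as ET

# Attribute key that ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def add_language(root, lang="es"):
    """Hypothetical first step: stamp the source language on the root."""
    root.set(XML_LANG, lang)
    return root


def rename_definition(root):
    """Stand-in for an XSL transformation: rename <df> to <text>."""
    for element in root.iter("df"):
        element.tag = "text"
    return root


def check_has_text(root):
    """Stand-in for a Schematron rule: a definition text must exist."""
    if root.find(".//text") is None:
        raise ValueError("sense has no definition text")
    return root


def run_pipeline(xml_string, steps):
    """Parse the input, then apply each step in order."""
    root = ET.fromstring(xml_string)
    for step in steps:
        root = step(root)
    return root


# Steps can be added, removed or reordered without touching the runner.
result = run_pipeline(
    "<se2><df>Persona que tiene mal carácter.</df></se2>",
    [add_language, rename_definition, check_has_text],
)
```

Because each step takes and returns an element tree, a QA check can be dropped into or out of the list, which is the modularity requirement from slide 7 in miniature.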

  15-18. Data Conversion: Build Workflow (diagram, built up step by step over slides 15 to 18)
      Check code in SVN
      -> Jenkins build (Ant script, XSpec unit tests, XProc pipeline)
      -> Linguistic QA
      -> Passes?
         No: Update code, then check code in SVN again
         Yes: Tag release in SVN, archive artefacts via Jenkins

  19. Results & Discussion: Source data
      A sense of 'malaúva' from a monolingual Spanish dictionary
        <ACEPCIO ACEP="2">
          <AREA-GEO>Esp</AREA-GEO>
          <NIVELL>coloquial</NIVELL>
          <SIGNIFICAT>Persona que tiene mal carácter o mala intención.</SIGNIFICAT>
          <SINONIM>malaleche.</SINONIM>
        </ACEPCIO>

  20-22. Results & Discussion: OUP XML (the same sense under the old and new models)
      Print-focused DTD:
        <se2 num="2">
          <lg>
            <ge>Esp</ge>
            <reg>coloquial</reg>
          </lg>
          <msDict type="core">
            <df>Persona que tiene mal carácter o mala intención.</df>
            <syn>malaleche</syn>
          </msDict>
        </se2>
      New Lexical XSD:
        <sense register="informal" region="ES">
          <definitions>
            <definition>
              <text>Persona que tiene mal carácter o mala intención</text>
            </definition>
          </definitions>
          <synonyms>
            <synonym>malaleche</synonym>
          </synonyms>
        </sense>
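
The mapping between the two samples above can be sketched in a few lines. The real conversion is done with XSL transformations inside XProc; this Python version only illustrates the element-to-attribute mapping visible on the slide. The lookup tables `REGISTER_MAP` and `REGION_MAP` and the function `convert_sense` are assumptions made for the example, populated with just the one value pair each that the slide shows.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables: only these two pairs appear on the slide.
REGISTER_MAP = {"coloquial": "informal"}
REGION_MAP = {"Esp": "ES"}


def convert_sense(se2):
    """Convert a print-focused <se2> element into a lexical <sense>."""
    sense = ET.Element("sense")

    # Layout-oriented label elements become attributes with normalised values.
    register = se2.findtext("lg/reg")
    region = se2.findtext("lg/ge")
    if register is not None:
        sense.set("register", REGISTER_MAP.get(register, register))
    if region is not None:
        sense.set("region", REGION_MAP.get(region, region))

    definitions = ET.SubElement(sense, "definitions")
    for df in se2.iterfind("msDict/df"):
        definition = ET.SubElement(definitions, "definition")
        text = ET.SubElement(definition, "text")
        # The slide's target sample drops the print-style final full stop.
        text.text = (df.text or "").strip().rstrip(".")

    synonyms = ET.SubElement(sense, "synonyms")
    for syn in se2.iterfind("msDict/syn"):
        ET.SubElement(synonyms, "synonym").text = (syn.text or "").strip()

    return sense


source = ET.fromstring(
    '<se2 num="2"><lg><ge>Esp</ge><reg>coloquial</reg></lg>'
    '<msDict type="core">'
    "<df>Persona que tiene mal carácter o mala intención.</df>"
    "<syn>malaleche</syn></msDict></se2>"
)
sense = convert_sense(source)
```

Unknown register or region labels fall through unchanged here; a production conversion would more likely reject them at the validation stage.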

  23. Next Steps: Scale It Up
      ◮ Consolidate data in an XML database
      ◮ Build an RDF layer on top of the XML database
      ◮ Leverage Semantic Web to enhance our data

  24. Next Steps: Prototype RDF/XML
        <Sense rdf:about="sense:es_noun_malauva_se_2">
          <isDescribedBy rdf:resource="definition:es_noun_malauva_se_2_def_1"/>
          <hasRegister rdf:resource="register:informal"/>
          <hasRegion rdf:resource="region:ES"/>
          <hasSynonym rdf:resource="lemma:a5e644"/>
        </Sense>
        <StandardDefinition rdf:about="definition:es_noun_malauva_se_2_def_1">
          <rdfs:label xml:lang="es">Persona que tiene mal carácter o mala intención</rdfs:label>
        </StandardDefinition>
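
To show how the lexical XML relates to the RDF prototype above, here is a rough Python sketch that derives subject-predicate-object triples from a `<sense>` element. It follows the identifier patterns visible on the slide (`sense:…`, `definition:…_def_N`, `register:…`, `region:…`), but the function `sense_to_triples` and the `lemma_ids` lookup are assumptions: the slides do not say how lemma identifiers such as `a5e644` are assigned.

```python
import xml.etree.ElementTree as ET


def sense_to_triples(sense_id, sense, lemma_ids):
    """Return (subject, predicate, object) tuples for one <sense>.

    lemma_ids is a hypothetical mapping from synonym text to lemma ID.
    """
    subject = f"sense:{sense_id}"
    triples = []

    for n, definition in enumerate(sense.iterfind("definitions/definition"), start=1):
        definition_id = f"definition:{sense_id}_def_{n}"
        triples.append((subject, "isDescribedBy", definition_id))
        triples.append((definition_id, "rdfs:label", definition.findtext("text")))

    if sense.get("register"):
        triples.append((subject, "hasRegister", f"register:{sense.get('register')}"))
    if sense.get("region"):
        triples.append((subject, "hasRegion", f"region:{sense.get('region')}"))

    for synonym in sense.iterfind("synonyms/synonym"):
        lemma_id = lemma_ids.get(synonym.text)
        if lemma_id is not None:  # skip synonyms with no known lemma
            triples.append((subject, "hasSynonym", f"lemma:{lemma_id}"))

    return triples


sense = ET.fromstring(
    '<sense register="informal" region="ES"><definitions><definition>'
    "<text>Persona que tiene mal carácter o mala intención</text>"
    "</definition></definitions>"
    "<synonyms><synonym>malaleche</synonym></synonyms></sense>"
)
triples = sense_to_triples("es_noun_malauva_se_2", sense, {"malaleche": "a5e644"})
```

Plain tuples keep the sketch dependency-free; serialising them as RDF/XML would be a separate step.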

  25. RDF Data extraction: Musical terms in English & Spanish (diagram)
      tune: melodía
      hook: hook
      chorale: coral
      song: canción
      aria: aria
      chorus: estribillo
      choir: coro
      strain: tono
      chorus: choral
      air: aire
      chorus: coro
      chant: salmodia
      ensemble: conjunto

  26. Inference mechanism (diagram): word sense X, word sense Y and word sense Z, linked by hasSynonym and hasAntonym relations
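
The slide's diagram does not survive in this transcript, so the exact rule it depicts is uncertain. One plausible reading, taken here purely as an assumption, is: if X hasSynonym Y and Y hasAntonym Z, then X hasAntonym Z can be inferred. A minimal Python sketch of that single rule over plain triples follows; the function name `infer_antonyms` is hypothetical.

```python
def infer_antonyms(triples):
    """Infer new hasAntonym triples from synonym + antonym pairs.

    Assumed rule (not confirmed by the slides):
        X hasSynonym Y  and  Y hasAntonym Z  =>  X hasAntonym Z
    One pass only; inferred triples are not fed back in.
    """
    synonyms = {(s, o) for s, p, o in triples if p == "hasSynonym"}
    antonyms = {(s, o) for s, p, o in triples if p == "hasAntonym"}

    inferred = set()
    for x, y in synonyms:
        for y2, z in antonyms:
            # Join on the shared middle sense; skip trivial or known facts.
            if y == y2 and x != z and (x, z) not in antonyms:
                inferred.add((x, "hasAntonym", z))
    return inferred


known = [
    ("word sense X", "hasSynonym", "word sense Y"),
    ("word sense Y", "hasAntonym", "word sense Z"),
]
new_triples = infer_antonyms(known)
```

In an RDF stack this kind of rule would normally be expressed declaratively and left to a reasoner rather than coded by hand.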

  27. Summary
      ◮ Overall project requirements
          ◮ Moving from products to platforms and services
          ◮ Supporting current business needs while innovating
          ◮ Adapting in nimble ways to fast-changing market requirements
          ◮ Focusing on time and cost efficiency
      ◮ Data model
          ◮ Content driven
          ◮ Machine interpretable
          ◮ Modular
          ◮ Evolvable/adaptable
      ◮ Conversion process
          ◮ Highly automated
          ◮ Modular
          ◮ Scalable

  28. Thank you for your attention! Any questions?
      Matt Kohl: matt.kohl@oup.com
      Sandro Cirulli: sandro.cirulli@oup.com
