
Semantically Weighted Similarity Analysis for XML-based Content Components



  1. Semantically Weighted Similarity Analysis for XML-based Content Components
     Jan Oevermann (jan.oevermann@dfki.de), Christoph Lüth (christoph.lueth@dfki.de)
     DocEng 2018, Halifax, 29.08.2018

  2. Technical Documentation
     • XML-based content components
     • Self-contained building blocks, e.g. chapter-sized
     • Reuse, translation, aggregation, delivery
     • Semantic XML information models
     • Large databases of content components
     • Product variants -> content variants
     Example content component (excerpt):
     <descriptive nodeid="PI-70006536">
       <heading>Fuel Gas Requirements</heading>
       <descriptive_body>
         <paragraph> This Section defines […]
         <table>
           <row>
             <entry>
               <paragraph>Permissible range</paragraph>
             </entry>
             <entry>
               <paragraph>
                 <inlinedata> <si-value> <number>5</number> <unit>°C</unit> </si-value> </inlinedata>
                 to
                 <inlinedata> <si-value> <number>120</number> <unit>°C</unit> </si-value> </inlinedata>
               </paragraph>
     29.08.2018 Jan Oevermann (DFKI), DocEng 2018, Halifax

  3. Motivation
     • Similar or duplicate content components
       • Document-based migration
       • Uncontrolled reuse / copying
       • Not checking / finding existing content
     • Why is this bad?
       • Information retrieval / content delivery: high recall, low precision
       • Higher translation cost for variants
       • Time spent (re)writing existing content

  4. Requirements & Implications
     • Large amounts of content components
     • Computationally efficient algorithm
     • Simple similarity measure
     • Reliable against semantically similar differences
     • (Non-)Detection of intentional variants
     • Weighting of semantically relevant text properties
     • Quality assurance
     • UI for checking flagged relations

  5. Architecture

  6. Similarity analysis
     • Similarity relations are symmetrical
     • Total number of all relations (C) can grow rapidly
     • Cosine similarity (s) for comparing vectors with extracted features
     • Threshold for similarity measure to reduce total number of relations to check (r)
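The analysis step on this slide can be illustrated with a minimal Python sketch. It is not the authors' JavaScript implementation. It assumes that C counts unordered pairs (which follows from symmetry, giving n(n-1)/2 for n components) and that feature vectors are plain token counts. The names `relation_count`, `cosine` and `similar_pairs` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def relation_count(n):
    """Number of unordered pairs among n components (assumed meaning of C)."""
    return n * (n - 1) // 2


def cosine(a, b):
    """Cosine similarity s of two sparse count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_pairs(vectors, threshold):
    """Return (id_a, id_b, s) for every pair whose similarity reaches the threshold.

    vectors maps a component id to its Counter of features. Each unordered
    pair is visited once, because the relation is symmetrical.
    """
    flagged = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        s = cosine(vec_a, vec_b)
        if s >= threshold:
            flagged.append((id_a, id_b, s))
    return flagged
```

With 1,000 components this already gives 499,500 pairs to score, which is why a threshold that shrinks the set of relations to review (r) matters.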

  7. Semantic similarity
     A: <paragraph nodeid="a">This device is designed to work with a voltage of <inlinedata><si-value><number>110</number><unit>V</unit></si-value></inlinedata> only.</paragraph>
     B: <paragraph nodeid="b">This device is designed to work with a voltage of <inlinedata><si-value><number>220</number><unit>V</unit></si-value></inlinedata> only.</paragraph>
     C: <paragraph nodeid="c">This device works with a voltage of <inlinedata><si-value><number>110</number><unit>V</unit></si-value></inlinedata> only.</paragraph>
     Expected similarity: A and B low, A and C high

  8. Semantic weighting
     • Extracted text from weighted elements is treated separately
     • Weighting artificially increases the feature count by a quantifier (q)
     • Influences similarity in predictable ways
     • Does not add to the complexity of the similarity analysis
     Similarity matrix for paragraphs A, B and C (upper right: semantic similarity, lower left: standard similarity):
            A      B      C
       A    -     0.45   0.98
       B   0.90    -     0.36
       C   0.75   0.63    -
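A small Python sketch may make the weighting idea concrete. It is an illustration under stated assumptions, not the authors' JavaScript code: features are lowercase word tokens, only `<number>` elements are weighted (the set `WEIGHTED_TAGS` is a hypothetical configuration), and a token inside a weighted element simply counts q times. Because the tokenizer and weighting scheme differ from the original tool, the scores will not match the matrix above.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: which elements carry extra weight is a configuration choice;
# <number> is picked here purely for illustration.
WEIGHTED_TAGS = {"number"}


def tokenize(text):
    """Lowercase word tokens; None (no text) yields an empty list."""
    return re.findall(r"\w+", (text or "").lower())


def features(xml_text, q=1):
    """Token counts for an XML fragment; tokens in weighted elements count q times."""
    counts = Counter()

    def walk(elem, inherited):
        # A weighted element passes its weight on to its descendants.
        weight = q if elem.tag in WEIGHTED_TAGS else inherited
        for tok in tokenize(elem.text):
            counts[tok] += weight
        for child in elem:
            walk(child, weight)
            # The tail text after a child belongs to the current element.
            for tok in tokenize(child.tail):
                counts[tok] += weight

    walk(ET.fromstring(xml_text), 1)
    return counts


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def paragraph(node_id, sentence, value):
    """Build a paragraph shaped like the A, B and C examples."""
    return (f'<paragraph nodeid="{node_id}">{sentence} <inlinedata><si-value>'
            f'<number>{value}</number><unit>V</unit></si-value></inlinedata>'
            f' only.</paragraph>')


A = paragraph("a", "This device is designed to work with a voltage of", 110)
B = paragraph("b", "This device is designed to work with a voltage of", 220)
C = paragraph("c", "This device works with a voltage of", 110)

standard_ab = cosine(features(A), features(B))
weighted_ab = cosine(features(A, q=5), features(B, q=5))
standard_ac = cosine(features(A), features(C))
weighted_ac = cosine(features(A, q=5), features(C, q=5))
```

In this sketch, raising q lowers the score for A and B (same wording, different value) and raises it for A and C (different wording, same value). The comparison itself is unchanged: only the counts in the vectors differ, so the cost of the cosine computation stays the same.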

  9. Implementation
     • Implemented in JavaScript
     • All processing is done client-side (in the browser); heavy calculations run in separate threads (Web Workers)
     • Efficiency tested on standard hardware

  10. Workbench-like user interface (items 11 and 12 of this transcript contain no text)

  13. Outlook & Conclusion
     • RegEx or NER in preprocessing to add XML tags
     • Alternative similarity measures
     • Integration with CCMS, give recommendations
     • Research the dependency on the semanticity of the information model
     • Simple method which can improve similarity results
     • Real-world relevance through customer project with Siemens Energy (TecDoc Department)
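The first outlook item (adding XML tags in preprocessing) could look roughly like the following Python sketch. It is purely illustrative: the unit list, the pattern and the function name `tag_si_values` are assumptions, the element names are borrowed from the examples earlier in the deck, and the sketch does not escape XML special characters in the input.

```python
import re

# Assumption: a tiny, hand-picked unit list; a real setup would need a fuller one.
SI_VALUE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s*(°C|V|A|W|Hz|bar)(?!\w)")

# Replacement template using the element names seen in the example paragraphs.
TEMPLATE = (r"<inlinedata><si-value><number>\1</number>"
            r"<unit>\2</unit></si-value></inlinedata>")


def tag_si_values(text):
    """Wrap plain 'number unit' occurrences in semantic XML elements."""
    return SI_VALUE.sub(TEMPLATE, text)


tagged = tag_si_values("This device works with a voltage of 110 V only.")
```

Once untagged text has been enriched this way, the same weighting of semantic elements can be applied to content that originally lacked them.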

  14. Contact
     • Jan Oevermann: jan.oevermann@dfki.de, www.janoevermann.de
     • Code & Demo: github.com/j-oe/semsim, semsim.fastclass.de
