Semantically Weighted Similarity Analysis for XML-based Content Components
Jan Oevermann (jan.oevermann@dfki.de), Christoph Lüth (christoph.lueth@dfki.de)
DocEng 2018, Halifax, 29.08.2018
Technical Documentation
• XML-based content components
• Self-contained building blocks, e.g. chapter-sized
• Reuse, translation, aggregation, delivery
• Semantic XML information models
• Large databases of content components
• Product variants -> content variants

Example content component (excerpt, cut off on the slide):
<descriptive nodeid="PI-70006536">
  <heading>Fuel Gas Requirements</heading>
  <descriptive_body>
    <paragraph> This Section defines […]
    <table>
      <row>
        <entry>
          <paragraph>Permissible range</paragra
        </entry>
        <entry>
          <paragraph>
            <inlinedata>
              <si-value> <number>5</number> <unit>°C</unit> </si-value>
            </inlinedata>to <inlinedata>
              <si-value> <number>120</number> <unit>°C</unit> </si-value>
            </inlinedata>
          </paragraph>

29.08.2018 Jan Oevermann (DFKI), DocEng 2018, Halifax
Motivation
• Similar or duplicate content components
  • Document-based migration
  • Uncontrolled reuse / copying
  • Not checking / finding existing content
• Why is this bad?
  • Information retrieval / content delivery
    • high recall, low precision
  • Higher translation cost for variants
  • Time spent (re)writing existing content
Requirements & Implications
• Large amounts of content components
  • Computationally efficient algorithm
  • Simple similarity measure
• Reliable against semantically similar differences
  • (Non-)Detection of intentional variants
  • Weighting of semantically relevant text properties
• Quality assurance
  • UI for checking flagged relations
Architecture
Similarity analysis
• Similarity relations are symmetrical
• Total number of all relations (C) can grow rapidly
• Cosine similarity (s) for comparing vectors with extracted features
• Threshold for similarity measure to reduce total number of relations to check (r)
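The bullets above can be sketched in JavaScript as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: features are assumed to be raw token counts, n components are assumed to give C = n(n-1)/2 unordered pairs because relations are symmetrical, and the names `totalRelations`, `termFrequencies`, `cosineSimilarity` and `relationsToCheck` are hypothetical.

```javascript
// Number of unordered pairs among n components (relations are symmetrical).
function totalRelations(n) {
  return (n * (n - 1)) / 2;
}

// Feature vector as a Map from token to count (assumption: raw counts).
function termFrequencies(tokens) {
  const tf = new Map();
  for (const t of tokens) tf.set(t, (tf.get(t) || 0) + 1);
  return tf;
}

// Cosine similarity s between two sparse feature vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [token, count] of a) dot += count * (b.get(token) || 0);
  const norm = (v) =>
    Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Compare every unordered pair once and keep only the relations whose
// similarity reaches the threshold (the relations left to check, r).
function relationsToCheck(components, threshold) {
  const vectors = components.map((c) => termFrequencies(c.tokens));
  const flagged = [];
  for (let i = 0; i < components.length; i++) {
    for (let j = i + 1; j < components.length; j++) {
      const s = cosineSimilarity(vectors[i], vectors[j]);
      if (s >= threshold) {
        flagged.push({ a: components[i].id, b: components[j].id, s });
      }
    }
  }
  return flagged;
}
```

Under this assumption, 10,000 components already give 49,995,000 pairs, which is why a threshold that discards most pairs matters for the manual check.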
Semantic similarity
Expected similarity: A and B low, A and C high

A
<paragraph nodeid="a">This device is designed to work with a voltage of <inlinedata><si-value><number>110</number><unit>V</unit></si-value></inlinedata> only.</paragraph>

B
<paragraph nodeid="b">This device is designed to work with a voltage of <inlinedata><si-value><number>220</number><unit>V</unit></si-value></inlinedata> only.</paragraph>

C
<paragraph nodeid="c">This device works with a voltage of <inlinedata><si-value><number>110</number><unit>V</unit></si-value></inlinedata> only.</paragraph>
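To compare such paragraphs, the text inside semantic elements has to be separated from the surrounding prose. The sketch below shows one way to do that. It is an illustration only: it uses a simple regular expression instead of a real XML parser so that it runs outside the browser, the function name `splitFeatures` is hypothetical, and treating `inlinedata` as the semantic element is an assumption taken from the example above.

```javascript
// Split an XML fragment into plain-text tokens and tokens that come from
// a semantic element. Regex-based and therefore only suitable for simple,
// non-nested fragments like the example paragraphs.
function splitFeatures(xml, semanticTag = "inlinedata") {
  const semanticPattern = new RegExp(
    `<${semanticTag}\\b[^>]*>([\\s\\S]*?)</${semanticTag}>`,
    "g"
  );
  const stripTags = (s) => s.replace(/<[^>]+>/g, " ");
  // Lower-case, then split on anything that is not a letter, digit or "°".
  const tokenize = (s) =>
    stripTags(s)
      .toLowerCase()
      .split(/[^\p{L}\p{N}°]+/u)
      .filter(Boolean);

  const semantic = [];
  // Collect semantic text and remove it from the remaining prose.
  const rest = xml.replace(semanticPattern, (_match, inner) => {
    semantic.push(...tokenize(inner));
    return " ";
  });
  return { text: tokenize(rest), semantic };
}

const paragraphA =
  '<paragraph nodeid="a">This device is designed to work with a voltage of ' +
  "<inlinedata><si-value><number>110</number><unit>V</unit></si-value></inlinedata>" +
  " only.</paragraph>";

console.log(splitFeatures(paragraphA));
```

In a browser, a DOM-based extraction would be the more robust choice. The regex version is used here only to keep the example self-contained.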
Semantic weighting
• Extracted text from semantic elements treated separately
• Weighting artificially increases feature count by quantifier (q)
• Influences similarity in predictable ways
• Does not add to the complexity of the similarity analysis

Similarity matrix for A, B, C (upper triangle: weighted similarity, lower triangle: standard similarity)
     A     B     C
A    -     0.45  0.98
B    0.90  -     0.36
C    0.75  0.63  -
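The effect of the quantifier can be illustrated with a small sketch. It assumes that weighting means counting each token from a semantic element q times and keeping those tokens in a separate namespace. The names `weightedFeatures` and `similarity`, the `sem:` prefix, the hand-made tokenization and the chosen values of q are all hypothetical, and the numbers it prints are not meant to reproduce the values on the slide.

```javascript
// Plain-text tokens count once; tokens from semantic elements count q times
// and get a "sem:" prefix so they are treated separately from plain text.
// Only counts change, so comparing two vectors is no more expensive.
function weightedFeatures(textTokens, semanticTokens, q) {
  const features = new Map();
  for (const t of textTokens) features.set(t, (features.get(t) || 0) + 1);
  for (const t of semanticTokens) {
    const key = "sem:" + t;
    features.set(key, (features.get(key) || 0) + q);
  }
  return features;
}

function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [token, count] of a) dot += count * (b.get(token) || 0);
  const norm = (v) =>
    Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Paragraphs A, B and C from the example, tokenized by hand (assumption).
const text = {
  A: "this device is designed to work with a voltage of only".split(" "),
  B: "this device is designed to work with a voltage of only".split(" "),
  C: "this device works with a voltage of only".split(" "),
};
const semantic = { A: ["110", "V"], B: ["220", "V"], C: ["110", "V"] };

function similarity(x, y, q) {
  return cosineSimilarity(
    weightedFeatures(text[x], semantic[x], q),
    weightedFeatures(text[y], semantic[y], q)
  );
}

// q = 1 is the unweighted case; q = 5 is an arbitrary illustrative weight.
for (const q of [1, 5]) {
  console.log(
    `q=${q}`,
    "A-B:", similarity("A", "B", q).toFixed(2),
    "A-C:", similarity("A", "C", q).toFixed(2)
  );
}
```

In this sketch, raising q lowers the A-B similarity (same wording, different voltage) and raises the A-C similarity (different wording, same voltage), which matches the expected direction of the effect.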
Implementation
• Implemented in JavaScript
• All processing is done client-side (browser), heavy calculations in own threads (web worker)
• Tested efficiency on standard hardware
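Moving the heavy calculation into a web worker could look roughly like the sketch below. It is an assumption about how such a setup might be structured, not the project's actual code: the function `similarPairs`, the message shape `{ vectors, threshold }`, the dense number arrays and the file name `similarity-worker.js` are all hypothetical.

```javascript
// Heavy computation kept in a pure function so it can run inside a worker.
// Vectors are dense arrays of equal length (assumption).
function similarPairs({ vectors, threshold }) {
  const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
  const pairs = [];
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      const denom = Math.sqrt(
        dot(vectors[i], vectors[i]) * dot(vectors[j], vectors[j])
      );
      const s = denom === 0 ? 0 : dot(vectors[i], vectors[j]) / denom;
      if (s >= threshold) pairs.push([i, j, s]);
    }
  }
  return pairs;
}

// Worker side: register the handler only when running inside a Web Worker,
// so the UI thread stays responsive while pairs are compared.
if (
  typeof WorkerGlobalScope !== "undefined" &&
  self instanceof WorkerGlobalScope
) {
  self.onmessage = (event) => self.postMessage(similarPairs(event.data));
}

// Main thread (browser), assuming this file is served as "similarity-worker.js":
//   const worker = new Worker("similarity-worker.js");
//   worker.onmessage = (event) => console.log(event.data);
//   worker.postMessage({ vectors: [[1, 0], [1, 0], [0, 1]], threshold: 0.8 });
```

Keeping the computation in a plain function also makes it possible to test it without a browser.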
Workbench-like user interface
Outlook & Conclusion
• RegEx or NER in preprocessing to add XML tags
• Alternative similarity measures
• Integration with CCMS, give recommendations
• Research dependency on information model semanticity
• Simple method which can improve similarity results
• Real-world relevance through customer project with Siemens Energy (TecDoc Department)
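The first outlook item, adding XML tags in preprocessing, could be approached with a regular expression along these lines. This is a sketch of the idea only, covering the RegEx variant and not NER: the unit list, the function name `tagValues` and the reuse of the `inlinedata` / `si-value` structure from the earlier example are assumptions.

```javascript
// Hypothetical unit list; a real information model would define its own.
const UNITS = ["°C", "V"];

// A number, optional whitespace, a known unit, and no letter or digit
// directly after the unit (so "5 Volts" is not tagged as "5 V").
const VALUE_PATTERN = new RegExp(
  `(\\d+(?:\\.\\d+)?)\\s*(${UNITS.join("|")})(?![\\p{L}\\p{N}])`,
  "gu"
);

// Wrap every recognized value in semantic markup.
function tagValues(text) {
  return text.replace(
    VALUE_PATTERN,
    "<inlinedata><si-value><number>$1</number><unit>$2</unit></si-value></inlinedata>"
  );
}

console.log(tagValues("This device is designed to work with a voltage of 110 V only."));
```

Text tagged this way could then be weighted like natively semantic content, at the cost of false positives and negatives that depend on the pattern.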
Contact
Jan Oevermann
jan.oevermann@dfki.de
www.janoevermann.de

Code & Demo
github.com/j-oe/semsim
semsim.fastclass.de