Refinement Correction Strategy for Invalid XML Documents and Regular Tree Grammars
Martin Svoboda and Irena Holubova (Mlynkova)
svoboda@ksi.mff.cuni.cz
DEXA 2014, Munich, Germany
September 2, 2014
XML and Web Engineering Research Group
Charles University in Prague

Outline
• Introduction
• Corrections
• Algorithms
• Experiments
• Conclusion
Refinement Correction Strategy for XML Documents September 2, 2014 DEXA 2014, Munich

Introduction
• Motivation
  Incorrect XML data
  ‒ Well-formedness, schema validity, data consistency
• Input
  One XML document
  ‒ Well-formed but (potentially) invalid
  DTD or XSD schema
• Goal
  Structural corrections of elements

Sample Correction
• Document
  <a>
    <x><c/></x>
    <d><c/></d>
    <d><c/><a/></d>
  </a>
• Grammar
  [a, C.D_A* → A]
  [b, D_B* → B]
  [c, ε → C]
  [d, C* → D_A]
  [d, A|B|C → D_B]

Edit Operations
• Edit operations
  Add leaf node
  Remove leaf node
  Rename node
• Edit sequences
  Insert new subtree
  Delete existing subtree
  Repair existing subtree
  ‒ With an option of node renaming

Edit Operations
• Example
  renameNode(0,c), removeLeaf(0.0), renameNode(2.1,c)
• Cost
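
The example edit sequence can be replayed on the sample document from the Sample Correction slide. The sketch below is only an illustration: the function names mirror the slide's operations, but the addressing scheme (dot-separated, zero-based child indices counted from the root's children) and the unit cost per operation are assumptions, since the slide's cost definition did not survive extraction.

```python
import xml.etree.ElementTree as ET


def locate(root, path):
    """Return (parent, index, node) for a path such as '2.1' (assumed addressing)."""
    parent, index, node = None, None, root
    for step in path.split("."):
        index = int(step)
        parent, node = node, node[index]
    return parent, index, node


def rename_node(root, path, label):
    _, _, node = locate(root, path)
    node.tag = label
    return 1  # assumed unit cost


def remove_leaf(root, path):
    parent, index, node = locate(root, path)
    if len(node) != 0:
        raise ValueError("removeLeaf applies to leaf nodes only")
    del parent[index]
    return 1  # assumed unit cost


def add_leaf(root, path, label):
    """Insert a new leaf so that it ends up at the given path."""
    *prefix, last = path.split(".")
    parent = locate(root, ".".join(prefix))[2] if prefix else root
    parent.insert(int(last), ET.Element(label))
    return 1  # assumed unit cost


doc = ET.fromstring("<a><x><c/></x><d><c/></d><d><c/><a/></d></a>")
cost = rename_node(doc, "0", "c")    # x becomes c
cost += remove_leaf(doc, "0.0")      # drop the c nested in the former x
cost += rename_node(doc, "2.1", "c")  # the nested a becomes c
print(cost, ET.tostring(doc, encoding="unicode"))
```

Under these assumptions the three operations leave the root with the children c, d, d, where the last d contains two c elements.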

Algorithm Idea
• Recursive processing
  From the root node towards leaf nodes…
  … and at each particular data tree node…
  … correct a sequence of its child nodes
• Example
  C.D_A*

Horizontal Correction
• Automaton traversal
  Start
  ‒ Before the entire node sequence
  ‒ At the initial automaton state
  Step
  ‒ Before some particular node (if any)
  ‒ At some particular automaton state
  End
  ‒ After the entire node sequence
  ‒ At one of the accepting states

Correction Multigraphs
• Structure
  Vertices
  Edges

Shortest Paths
• Paths
  Source
  Targets
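
The idea of traversing an automaton alongside a child sequence and then looking for shortest paths can be sketched as a small graph search. This is a simplified illustration, not the authors' implementation: vertices are assumed to be pairs (position in the sequence, automaton state), every insert, delete and rename is assumed to cost 1, children are treated as leaves (no nested repair costs), and the automaton is a hypothetical one for the content model "one c followed by any number of d". The sketch also returns a single cheapest path, whereas the slides speak of a multigraph with multiple targets.

```python
import heapq

# Hypothetical automaton: state -> {label: next state}.
TRANSITIONS = {0: {"c": 1}, 1: {"d": 1}}
INITIAL, ACCEPTING = 0, {1}


def cheapest_correction(labels):
    """Dijkstra over vertices (position, state); returns (cost, operations)."""
    n = len(labels)
    # Queue entries: (cost so far, position, state, operations so far).
    queue = [(0, 0, INITIAL, ())]
    settled = set()
    while queue:
        cost, pos, state, ops = heapq.heappop(queue)
        if (pos, state) in settled:
            continue
        settled.add((pos, state))
        # Target: after the entire sequence, in an accepting state.
        if pos == n and state in ACCEPTING:
            return cost, list(ops)
        if pos < n:
            # Delete edge: skip the current child, stay in the same state.
            heapq.heappush(
                queue, (cost + 1, pos + 1, state, ops + (f"delete {pos}",)))
        for label, nxt in TRANSITIONS.get(state, {}).items():
            # Insert edge: follow a transition without consuming a child.
            heapq.heappush(
                queue,
                (cost + 1, pos, nxt, ops + (f"insert {label} before {pos}",)))
            if pos < n:
                # Repair edge: consume the child, renaming it if needed.
                step = 0 if labels[pos] == label else 1
                op = () if step == 0 else (f"rename {pos} to {label}",)
                heapq.heappush(queue, (cost + step, pos + 1, nxt, ops + op))
    return None


print(cheapest_correction(["x", "d", "d"]))
```

For the child sequence x, d, d the cheapest path renames the first child to c and keeps both d children, for a total cost of 1 under the assumed unit costs.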

Intent Repair
• Structure

Intent Signatures
• Observation
  Different intents may lead to identical repairs
  ‒ We do not need to evaluate them repeatedly
• Solution
  Intent signatures
  Repairs caching
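
Caching repairs under a signature can be sketched as plain memoization. The slides do not say what a signature contains, so the choice below (processed node plus target nonterminal, ignoring the intent's origin) is purely an assumption, and `RepairCache`, `toy_repair` and the dictionary-based intents are hypothetical names used only for illustration.

```python
class RepairCache:
    """Caches repairs so that intents with equal signatures are evaluated once."""

    def __init__(self, compute_repair):
        self.compute_repair = compute_repair  # hypothetical expensive routine
        self.repairs = {}
        self.evaluations = 0

    @staticmethod
    def signature(intent):
        # Assumption: only the processed node and the target nonterminal
        # influence the repair; the origin of the intent does not.
        return (intent["node"], intent["nonterminal"])

    def repair(self, intent):
        key = self.signature(intent)
        if key not in self.repairs:
            self.evaluations += 1
            self.repairs[key] = self.compute_repair(*key)
        return self.repairs[key]


def toy_repair(node, nonterminal):
    # Stand-in for the real repair computation.
    return f"repair of node {node} towards {nonterminal}"


cache = RepairCache(toy_repair)
first = cache.repair({"node": "2.1", "nonterminal": "C", "origin": "rename"})
second = cache.repair({"node": "2.1", "nonterminal": "C", "origin": "insert"})
print(cache.evaluations)
```

The two intents differ only in their origin, so they share a signature and the repair routine runs once.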

Correction Strategies
• Strategies
  Default
  Exploring
  Refinement

Refinement Strategy
• Observation
  Until now we always worked with…
  ‒ … fully evaluated nested intents
  ‒ … and therefore their final costs
• Idea
  Refinement exploration based on estimations

Refinement Strategy
• Exploration loop
  Complete vertex
  ‒ Explore outgoing edges
  ‒ Obtain first cost estimations
  ‒ Update current distances
  Incomplete vertex
  ‒ Request refinement of open perspective ingoing edges
    • Assign a quota to limit the allowed refinement progress
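
The exploration loop resembles a shortest-path search in which edge costs start as estimates and are refined only on demand. The sketch below is a loose illustration under stated assumptions, not the algorithm from the talk: each edge carries a hypothetical list of non-decreasing lower-bound estimates whose last entry is the final cost, and the quota is modeled crudely as "one refinement step per request".

```python
import heapq


def refine_shortest_path(graph, source, targets):
    """graph: {vertex: [(successor, [estimate_1, ..., final_cost]), ...]}.

    Returns (distance to the nearest target, number of refinement steps).
    """
    dist = {}
    refinements = 0
    counter = 1  # unique tie-breaker so heap entries never compare lists
    # Entries: (lower bound, tie-breaker, vertex, base distance, estimates, index).
    queue = [(0, 0, source, 0, [0], 0)]
    while queue:
        bound, _, vertex, base, estimates, i = heapq.heappop(queue)
        if vertex in dist:
            continue
        if i < len(estimates) - 1:
            # Incomplete: the ingoing edge is still an estimate, so refine it
            # by one step (the assumed quota) and put the entry back.
            refinements += 1
            heapq.heappush(
                queue,
                (base + estimates[i + 1], counter, vertex, base, estimates, i + 1))
            counter += 1
            continue
        # Complete: the cost is final and no other bound is smaller.
        dist[vertex] = bound
        if vertex in targets:
            return bound, refinements
        for successor, est in graph.get(vertex, []):
            # Explore outgoing edges using their first estimates only.
            heapq.heappush(queue, (bound + est[0], counter, successor, bound, est, 0))
            counter += 1
    return None, refinements


graph = {
    "s": [("a", [1, 2]), ("b", [1, 5, 9])],
    "a": [("t", [2])],
    "b": [("t", [0])],
}
result = refine_shortest_path(graph, "s", {"t"})
print(result)
```

In this toy graph the edge from s to b is refined once, to the estimate 5, and then abandoned, because the path through a reaches the target at cost 4 before the final cost of that edge is ever needed.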

Execution times
• Refinement strategy
  [Plot: time in seconds (0 to 4) against number of nodes (10k to 100k)]

Conclusion
• Features
  Regular tree grammars
  Compact repair structure
  All minimal corrections
  No parameters required
  Nearly linear algorithms

Thank you for your attention…
Faculty of Mathematics and Physics
Charles University in Prague