An Incremental Correction Algorithm for XML Documents and Single Type Tree Grammars
Martin Svoboda, Irena Mlýnková
XML and Web Engineering Research Group
Charles University in Prague, The Czech Republic
24 April 2012, NDT 2012, Dubai, United Arab Emirates
Outline
• Introduction
    Motivation
    Objectives
• Approach
    Corrections
    Algorithms
    Experiments
• Conclusion
Introduction
• Motivation
    Incorrect XML documents
    ‒ Well-formedness
    ‒ Schema validity
    ‒ Data consistency
    ‒ …
    Strategies
    ‒ Adjusting algorithms
    ‒ Correcting data
Introduction
• Problem
    Input
    ‒ One XML document
        • Well-formed but (potentially) invalid
    ‒ DTD or XML Schema
    Output
    ‒ All minimal repairs
        • Structural corrections of elements
Definitions
• Document
    Trees
    ‒ Nodes for elements and texts
    ‒ Prefix numbering of nodes
    Example
        <a>
          <x><d/></x>
          <d><d/><d/></d>
        </a>

        Position  Label
        ε         a
        0         x
        0.0       d
        1         d
        1.0       d
        1.1       d
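The prefix numbering on this slide can be sketched in a few lines of Python. This is only an illustration: the `(label, children)` tuple representation and the function name `prefix_numbering` are assumptions made here, not part of the presented work.

```python
def prefix_numbering(tree, position=""):
    """Yield (position, label) pairs in document order.

    `tree` is a hypothetical (label, children) pair. The root gets the
    empty position, shown as "ε"; the i-th child of a node at position p
    gets position p.i.
    """
    label, children = tree
    yield (position or "ε", label)
    for index, child in enumerate(children):
        child_position = f"{position}.{index}" if position else str(index)
        yield from prefix_numbering(child, child_position)


# The example document <a><x><d/></x><d><d/><d/></d></a> as nested tuples.
EXAMPLE = ("a", [("x", [("d", [])]), ("d", [("d", []), ("d", [])])])

for position, label in prefix_numbering(EXAMPLE):
    print(position, label)
```

Running it lists the nodes in document order: ε a, 0 x, 0.0 d, 1 d, 1.0 d, 1.1 d.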
Definitions
• Schema
    Grammars
    ‒ Terminal symbols for element names
    ‒ Nonterminal symbols for types
    ‒ Production rules based on regular expressions
    Classes
    ‒ Regular tree grammars
    ‒ Single type tree grammars (XML Schema)
    ‒ Local tree grammars (DTD)
Model
• Edit operations
    ADD leaf, REMOVE leaf, RENAME label
• Update operations
    Sequences of edit operations
    INSERT, DELETE, REPAIR, RENAME
• Cost function
    Unit costs of edit operations
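A minimal sketch of this cost model follows. It assumes that a DELETE update operation on a subtree expands into one REMOVE-leaf edit per node, deepest nodes first; the names `Edit`, `cost` and `delete_subtree` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    kind: str        # "ADD", "REMOVE" or "RENAME"
    position: str    # prefix number of the affected node
    label: str = ""  # new label, used by ADD and RENAME only


def cost(edits):
    """Unit costs: every edit operation counts as 1."""
    return len(edits)


def delete_subtree(tree, position):
    """Expand DELETE into REMOVE-leaf edits (assumed expansion).

    Children are removed last-to-first so that the positions of the
    remaining siblings stay valid, and a node is removed only after
    all of its descendants, when it has become a leaf.
    """
    label, children = tree
    edits = []
    for index in reversed(range(len(children))):
        edits += delete_subtree(children[index], f"{position}.{index}")
    edits.append(Edit("REMOVE", position))
    return edits


# Deleting the subtree <x><d/></x> rooted at position 0.
edits = delete_subtree(("x", [("d", [])]), "0")
print(edits, cost(edits))
```

Under these assumptions the cost of deleting a subtree equals its number of nodes, which is why removing `<x><d/></x>` costs 2.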
Model
    Example document
        <a>
          <x><d/></x>
          <d><d/><d/></d>
        </a>

    Grammar
        Type  Name  Model
        A     a     C.D*
        B     b     D*
        C     c     empty
        D     d     D*

    Document tree (position, label)
        ε a; 0 x; 0.0 d; 1 d; 1.0 d; 1.1 d

    Corrected trees (position, label)
        1. ε a; 0 c; 1 d; 1.0 d; 1.1 d
        2. ε a; 0 c; 1 d; 1.0 d; 2 d; 2.0 d; 2.1 d
        3. ε b; 0 d; 0.0 d; 1 d; 1.0 d; 1.1 d
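The grammar table can be checked against the example trees with a short validity test. The sketch below makes several assumptions that are not stated on the slide: type names are single letters, the concatenation dot of `C.D*` is dropped so the model becomes an ordinary Python regular expression over type names, and both `A` and `B` are allowed as root types (suggested by the corrected tree with root `b`).

```python
import re

# type -> (element name, content model as a regex over type names)
GRAMMAR = {
    "A": ("a", "CD*"),
    "B": ("b", "D*"),
    "C": ("c", ""),      # empty content
    "D": ("d", "D*"),
}
ROOT_TYPES = ("A", "B")  # assumption: either type may label the root


def is_valid(tree, candidates=ROOT_TYPES):
    """Top-down check of a (label, children) tree against GRAMMAR."""
    label, children = tree
    for type_name in candidates:
        name, model = GRAMMAR[type_name]
        if name != label:
            continue
        # Single type property: inside one content model an element
        # name identifies at most one type, so the lookup is unambiguous.
        in_model = [t for t in GRAMMAR if t in model]
        child_types = []
        for child in children:
            match = [t for t in in_model if GRAMMAR[t][0] == child[0]]
            if not match:
                return False
            child_types.append(match[0])
        if re.fullmatch(model, "".join(child_types)) is None:
            return False
        return all(is_valid(c, [t]) for c, t in zip(children, child_types))
    return False


D2 = ("d", [("d", []), ("d", [])])
ORIGINAL = ("a", [("x", [("d", [])]), D2])
REPAIRS = [
    ("a", [("c", []), D2]),
    ("a", [("c", []), ("d", [("d", [])]), D2]),
    ("b", [("d", [("d", [])]), D2]),
]
print(is_valid(ORIGINAL), [is_valid(t) for t in REPAIRS])
```

With these assumptions the original tree is rejected because `x` is not an element name of the grammar, while the three corrected trees are accepted.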
Algorithm
• Naive algorithm
    Task
    ‒ At each level of top-down tree processing, find repairs for a sequence of sibling nodes
    Steps
    ‒ Construct a repairing multigraph
    ‒ Recursively repair subtrees
    ‒ Compose a repairing structure
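The steps above can be sketched as a shortest-path search. This is a simplified, cost-only illustration and not the authors' implementation: it returns the minimal repair cost instead of the structure of all minimal repairs, and it explores multigraph vertices on demand rather than constructing the whole multigraph first. The automata are written by hand for the example grammar, and `MIN_SIZE`, `ROOT_TYPES` and the treatment of the root are assumptions.

```python
import heapq

# type -> (element name, transitions {state: [(child type, next state)]},
#          accepting states); automata hand-built for C.D*, D*, empty, D*
GRAMMAR = {
    "A": ("a", {0: [("C", 1)], 1: [("D", 2)], 2: [("D", 2)]}, {1, 2}),
    "B": ("b", {0: [("D", 0)]}, {0}),
    "C": ("c", {}, {0}),
    "D": ("d", {0: [("D", 0)]}, {0}),
}
ROOT_TYPES = ("A", "B")                      # assumed root types
MIN_SIZE = {"A": 2, "B": 1, "C": 1, "D": 1}  # smallest valid subtree per type


def size(tree):
    return 1 + sum(size(child) for child in tree[1])


def repair_cost(siblings, type_name):
    """Minimal cost of making `siblings` match the model of `type_name`.

    Vertices are (position in the sibling sequence, automaton state).
    """
    _, transitions, accepting = GRAMMAR[type_name]
    n = len(siblings)
    queue = [(0, 0, 0)]  # (cost so far, position, state)
    done = set()
    while queue:
        cost, i, q = heapq.heappop(queue)
        if (i, q) in done:
            continue
        done.add((i, q))
        if i == n and q in accepting:
            return cost
        if i < n:  # DELETE the whole subtree at position i
            heapq.heappush(queue, (cost + size(siblings[i]), i + 1, q))
        for child_type, q2 in transitions.get(q, []):
            # INSERT a minimal subtree of child_type
            heapq.heappush(queue, (cost + MIN_SIZE[child_type], i, q2))
            if i < n:  # REPAIR (label kept) or RENAME (label changed)
                label, children = siblings[i]
                rename = 0 if label == GRAMMAR[child_type][0] else 1
                nested = repair_cost(children, child_type)
                heapq.heappush(queue, (cost + rename + nested, i + 1, q2))
    raise ValueError("no accepting state is reachable")


def document_cost(tree):
    """Assumed root handling: keep or rename the root, repair below it."""
    label, children = tree
    return min(
        (0 if label == GRAMMAR[t][0] else 1) + repair_cost(children, t)
        for t in ROOT_TYPES
    )


EXAMPLE = ("a", [("x", [("d", [])]), ("d", [("d", []), ("d", [])])])
print(document_cost(EXAMPLE))
```

For the example document the sketch reports a minimal cost of 2. Each of the three corrected trees on the Model slide can be reached with two unit-cost edit operations, which is consistent with that value.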
Algorithm
    [Figure: repairing multigraph for the sibling sequence x d of the example tree (ε a; 0 x; 0.0 d; 1 d; 1.0 d; 1.1 d), using the grammar from the Model slide. Columns 0 to 2, vertices 00 to 22; edge kinds RENAME, DELETE, REPAIR, INSERT.]
Algorithm
    [Figure: two paths through the repairing multigraph for the sibling sequence x d, with edge costs annotated, and the trees they produce. A RENAME edge followed by a REPAIR edge gives ε a; 0 c; 1 d; 1.0 d; 1.1 d. An INSERT edge followed by a RENAME edge and a REPAIR edge gives ε a; 0 c; 1 d; 1.0 d; 2 d; 2.0 d; 2.1 d.]
Algorithms
• Naive
• Dynamic
    Follows Dijkstra’s algorithm directly, so only the required parts of the multigraph are explored
• Caching
    Avoids repeated recursive computations by detecting and caching identical repairs
• Incremental
    Evaluates repairing multigraphs step by step
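The caching idea can be illustrated with plain memoisation. The sketch assumes that two repairs are identical when they concern the same sibling sequence and the same type; the slides do not spell out how identical repairs are detected, so this key, the class name `IntentCache` and the stand-in evaluator are all hypothetical.

```python
class IntentCache:
    """Memoise repair results per (sibling sequence, type) pair.

    `evaluate` stands for any function that evaluates one repairing
    multigraph; it is called at most once per distinct key.
    """

    def __init__(self, evaluate):
        self._evaluate = evaluate
        self._results = {}
        self.evaluations = 0  # number of multigraphs actually evaluated

    def repair(self, siblings, type_name):
        key = (siblings, type_name)  # siblings must be hashable (tuples)
        if key not in self._results:
            self.evaluations += 1
            self._results[key] = self._evaluate(siblings, type_name)
        return self._results[key]


# Stand-in evaluator used only for the demonstration.
cache = IntentCache(lambda siblings, type_name: len(siblings))
two_leaves = (("d", ()), ("d", ()))
cache.repair(two_leaves, "D")
cache.repair(two_leaves, "D")  # served from the cache
cache.repair(two_leaves, "B")  # different type, evaluated again
print(cache.evaluations)
```

Trees must be built from tuples rather than lists here so that sibling sequences can serve as dictionary keys.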
Algorithms
• Incremental
    Task
    ‒ Structure encapsulating multigraph evaluation
        • Multigraph structure
        • Dijkstra’s variables
    Scheduler
    ‒ Processing of an activated task:
        • Request further refinement of promising edges
        • Activate corresponding tasks for nested problems
Experiments
• Data
    Single type tree grammar
    ‒ 7 nonterminal symbols
    ‒ 6 terminal symbols
    ‒ Recursion, iteration
    XML data trees
    ‒ Maximal depth 5, fan-out 8
    ‒ Elements from 100 to 1,000
    ‒ 20 files for each particular size
    ‒ Average values from 20 repeats
Experiments
• Execution time in milliseconds
    [Plot: execution time (0 to 40 ms) against number of elements (0 to 1,000); series: Incremental, Caching]
Experiments
• Number of correction intents
    Equal to the number of distinct multigraphs
    [Plot: number of correction intents (0 to 4,000) against number of elements (0 to 1,000); series: Caching, Incremental]
Conclusion
• Contributions
    Single type tree grammars
    Always all minimal repairs
    New incremental algorithm
• Advantages
    Compact repair structure
    Prototype implementation
Thank you for your attention… XML and Web Engineering Research Group Charles University in Prague