Inferring XML Schema Definitions from XML Data


  1. Inferring XML Schema Definitions from XML Data Geert Jan Bex, Frank Neven and Stijn Vansummeren Hasselt University and transnational University of Limburg, Belgium

  2. Overview • Introduction • Complete algorithm iLocal • Heuristic iXSD • Experiments • Conclusions

  3. Motivation for schemas
     • Why schemas?
       – automation & optimization of search
       – integration of XML data sources
       – translation & processing of XML data
       – used by software tools, e.g., JAXB, Castor
       – schema matching & model management
     • Why infer schemas? (findings on real-world XML & XSDs)
       – 50% of XML documents on the web have no schema [Barbosa et al., 2005]
       – 33% of schemas are not valid [Bex et al., 2004, 2005]

  4. Motivation for XSD inference
     • DTD inference
       – XTract [Garofalakis et al., 2003]
       – trang [Clark]
       – iDTD [Bex et al., 2006]
     • XSD inference
       – trang
       – XStruct
       – JAXB, .Net
     • These tools output XSD syntax, but the result is equivalent to a DTD: expressive power is limited to that of DTDs!

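  5. How do DTDs and XSDs differ?
     [Figure: example XML tree. A store contains orders and a stock; each order has a customer and items with id, qty, price; each stock item has id, qty and possibly nested items.]
     • In DTDs, either one rule for item that covers both contexts:
         item → id, qty, (price + item*)
       or renamed elements:
         order_item → id, qty, price
         stock_item → id, qty, stock_item*
     • Giving item a different content model under order and under stock can be done in XSDs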

  6. XSD: abstract syntax
     <xsd:element name="store" type="store"/>
     <xsd:complexType name="store">
       <xsd:sequence>
         <xsd:element name="order" type="order" minOccurs="0" maxOccurs="unbounded"/>
         <xsd:element name="stock" type="stock"/>
       </xsd:sequence>
     </xsd:complexType>
     <xsd:complexType name="order">
       <xsd:sequence>
         <xsd:element name="customer" type="customer"/>
         <xsd:element name="item" type="item1" minOccurs="1" maxOccurs="unbounded"/>
       </xsd:sequence>
     </xsd:complexType>
     Abstract syntax:
       root → store[store]
       store → order[order]*, stock[stock]
       order → customer[customer], item[item1]+

  7. Motivating example for XSD
     [Figure: example store XML tree with orders, stock, and items]
     DTD:
       root → store
       store → order*, stock
       order → customer, item+
       stock → item+
       item → id, qty, (price + item*)
     XSD:
       root → store[store]
       store → order[order]*, stock[stock]
       order → customer[customer], item[item1]+
       stock → item[item2]+
       item1 → id[id], qty[qty], price[price]
       item2 → id[id], qty[qty], item[item2]*

  8. Inference of XSDs (XML → XSD) • Problem: infer an XSD from an XML corpus • Requirement: concise, i.e., humans can interpret/validate the result • But… theorem [Gold, 1967]: regular languages in general cannot be learned from positive data only

  9. XSD property
     • W3C specs: Element Declarations Consistent (EDC): no two elements with the same name but distinct types in the same content model
     • Not allowed: sometype → item[item1]+, item[item2]+
     • The content model of an element depends on its context
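The EDC constraint on this slide can be illustrated with a small check. This is only a sketch: the representation of a content model as a list of (element name, type name) pairs and the function name `edc_violations` are assumptions made for illustration, not part of the talk or of any XSD library.

```python
def edc_violations(content_model):
    """Return element names that are bound to more than one type.

    content_model is a list of (element_name, type_name) pairs, a
    hypothetical flattening of one XSD content model.
    """
    types_by_name = {}
    for name, type_name in content_model:
        types_by_name.setdefault(name, set()).add(type_name)
    return sorted(name for name, types in types_by_name.items() if len(types) > 1)


# The forbidden model from the slide: item appears with two distinct types.
bad_model = [("item", "item1"), ("item", "item2")]
# The order type from the running example: each name has exactly one type.
good_model = [("customer", "customer"), ("item", "item1")]

print(edc_violations(bad_model))
print(edc_violations(good_model))
```

An empty result means the content model satisfies EDC under this simplified representation.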

  10. XML validation for XSD
      [Figure: example store tree with each element annotated by its type, e.g. item[item1] under order and item[item2] under stock and under item]
      XSD:
        root → store[store]
        store → order[order]*, stock[stock]
        order → customer[customer], item[item1]+
        stock → item[item2]+
        item1 → id[id], qty[qty], price[price]
        item2 → id[id], qty[qty], item[item2]*
      If the XML is valid: the type assignment is determined by the path from element to root

  11. XML validation for XSD Theorem [Martens et al., 2006] Content model of an element is uniquely determined by the path from the root to that element
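A minimal sketch of what the theorem means in practice: because each (parent type, child name) pair fixes the child's type, types can be assigned in one top-down pass. The tables `ROOT_TYPES` and `CHILD_TYPES` are a hand-written encoding of the running store example and are assumptions for illustration; the sketch only assigns types and does not check order or multiplicity of children.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of the example XSD: parent type -> {child name: child type}.
ROOT_TYPES = {"store": "store"}
CHILD_TYPES = {
    "store": {"order": "order", "stock": "stock"},
    "order": {"customer": "customer", "item": "item1"},
    "stock": {"item": "item2"},
    "item1": {"id": "id", "qty": "qty", "price": "price"},
    "item2": {"id": "id", "qty": "qty", "item": "item2"},
}


def assign_types(root):
    """Return (path, type) pairs for all elements, typed top-down from the root."""
    typed = []

    def visit(element, type_name, path):
        path = path + "/" + element.tag
        typed.append((path, type_name))
        for child in element:
            # A KeyError here means the child is not allowed under this type.
            visit(child, CHILD_TYPES.get(type_name, {})[child.tag], path)

    visit(root, ROOT_TYPES[root.tag], "")
    return typed


doc = ET.fromstring(
    "<store>"
    "<order><customer/><item><id/><qty/><price/></item></order>"
    "<stock><item><id/><qty/><item><id/><qty/></item></item></stock>"
    "</store>"
)
types = dict(assign_types(doc))
print(types["/store/order/item"], types["/store/stock/item"])
```

The same element name `item` receives type `item1` under an order and `item2` under the stock, which is exactly the distinction a DTD cannot make.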

  12. XSD observations: local context
      • Large, diverse corpus of real-world XSDs [Bex et al., 2004, Martens et al., 2006]
        – 98% of XSDs use only local context: the relevant ancestor path has length at most 3, i.e., up to the "great-grandfather"
      [Figure: path store / order / item with children id, qty, price]

  13. XSD observations: SOREs
      • Large, diverse corpus of real-world XSDs [Bex et al., 2004, Martens et al., 2006]
        – 99% of regular expressions are single occurrence
      • What's a Single Occurrence Regular Expression (SORE)?
        – header, protein, organism, reference*, comment*, genetics*, complex*, function*, classification?, keywords?, feature*, summary, sequence
        – authors, citation, volume?, month?, year, pages?, (title + descr)?, xrefs?
        – title, (author, affiliation?)+, abstract
      • … and what's not
        – title, ((author, affiliation)+ + (editor, affiliation)+), abstract (duplicate element names: affiliation occurs twice)
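The single-occurrence property is purely syntactic, so it can be checked by counting element names. The sketch below assumes expressions are written in the slide's notation, where element names are plain identifiers and everything else is an operator; the helper names `repeated_names` and `is_sore` are hypothetical.

```python
import re
from collections import Counter


def repeated_names(expression):
    """Return the element names that occur more than once in the expression."""
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expression)
    return sorted(name for name, count in Counter(names).items() if count > 1)


def is_sore(expression):
    """True if every element name occurs at most once (single occurrence)."""
    return not repeated_names(expression)


sore = "title, (author, affiliation?)+, abstract"
not_sore = "title, ((author, affiliation)+ + (editor, affiliation)+), abstract"

print(is_sore(sore))
print(is_sore(not_sore), repeated_names(not_sore))
```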

  14. Overview • Introduction • Complete algorithm iLocal • Heuristic iXSD • Experiments • Conclusions

  15. Main result Theorem: XSDs with local context and SORE content models are learnable from positive examples only

  16. Algorithm iLocal
      [Figure: corpus of example store trees]
      Path → child sequences observed in the corpus:
        λ → {store}
        store → {order order stock, stock}
        store/order → {customer item item, customer item}
        store/stock → {item, item item}
        store/order/item → {id qty price}
        store/stock/item → {id qty, id qty item item}
        store/stock/item/item → {id qty item item, id qty}
        store/stock/item/item/item → {id qty}
      Paths are types [Martens et al., 2006]
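The first step shown on this slide, mapping every root-to-element path to the set of child sequences seen below it, can be sketched as follows. The function name `collect_contexts` and the two small documents are assumptions for illustration (the corpus is smaller than the one on the slide); paths are tuples, with the empty tuple playing the role of λ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_contexts(documents):
    """Map each path (tuple of element names) to the child-name tuples seen there."""
    contexts = defaultdict(set)

    def visit(element, path):
        path = path + (element.tag,)
        contexts[path].add(tuple(child.tag for child in element))
        for child in element:
            visit(child, path)

    for text in documents:
        root = ET.fromstring(text)
        contexts[()].add((root.tag,))  # the empty path records the root name
        visit(root, ())
    return dict(contexts)


DOCS = [
    "<store><order><customer/><item><id/><qty/><price/></item></order>"
    "<stock><item><id/><qty/></item></stock></store>",
    "<store><stock><item><id/><qty/><item><id/><qty/></item></item></stock></store>",
]
contexts = collect_contexts(DOCS)
for path in sorted(contexts):
    print("/".join(path) or "λ", "→", sorted(contexts[path]))
```

Unlike the slide, this sketch also records leaf elements such as `id` and `qty`, whose only observed child sequence is the empty one.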

  17. Algorithm iLocal
      Full paths:
        λ → {store}
        store → {order order stock, stock}
        store/order → {customer item item, customer item}
        store/stock → {item, item item}
        store/order/item → {id qty price}
        store/stock/item → {id qty, id qty item item}
        store/stock/item/item → {id qty item item, id qty}
        store/stock/item/item/item → {id qty}
      Locality k = 2 (keep only the last two element names of each path and merge the sets):
        λ → {store}
        store → {order order stock, stock}
        store/order → {customer item item, customer item}
        store/stock → {item, item item}
        order/item → {id qty price}
        stock/item → {id qty, id qty item item}
        item/item → {id qty item item, id qty}
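The locality step on this slide amounts to truncating each path to its last k element names and taking the union of the sets that collide. A minimal sketch, using the slide's own data; the function name `localize` and the tuple/string encoding are assumptions for illustration.

```python
from collections import defaultdict


def localize(contexts, k):
    """Keep only the last k names of every path and merge colliding entries."""
    if k < 1:
        raise ValueError("k must be at least 1")
    merged = defaultdict(set)
    for path, strings in contexts.items():
        merged[path[-k:]].update(strings)
    return dict(merged)


FULL = {
    (): {"store"},
    ("store",): {"order order stock", "stock"},
    ("store", "order"): {"customer item item", "customer item"},
    ("store", "stock"): {"item", "item item"},
    ("store", "order", "item"): {"id qty price"},
    ("store", "stock", "item"): {"id qty", "id qty item item"},
    ("store", "stock", "item", "item"): {"id qty item item", "id qty"},
    ("store", "stock", "item", "item", "item"): {"id qty"},
}
local = localize(FULL, 2)
for path in sorted(local):
    print("/".join(path) or "λ", "→", sorted(local[path]))
```

With k = 2 the two deepest paths collapse into the single context `item/item`, which is what makes the recursive item type finite.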

  18. Algorithm iLocal
      Contexts (k = 2):
        λ → {store}
        store → {order order stock, stock}
        store/order → {customer item item, customer item}
        store/stock → {item, item item}
        order/item → {id qty price}
        stock/item → {id qty, id qty item item}
        item/item → {id qty item item, id qty}
      After iSOA, ToSORE [Bex et al., 2006], the XSD:
        λ → store[store]
        store → order[store/order]*, stock[store/stock]
        store/order → customer[order/customer], item[order/item]+
        store/stock → item[stock/item]+
        order/item → id[item/id], qty[item/qty], price[item/price]
        stock/item → id[item/id], qty[item/qty], item[item/item]*
        item/item → id[item/id], qty[item/qty], item[item/item]*

  19. Algorithm iLocal • Theorem: iLocal is sound: the corpus is valid with respect to the inferred XSD • Theorem: iLocal is k-complete: if the corpus is "sufficiently large", then the inferred XSD is equivalent to the target XSD

  20. Algorithm Minimize
      Before:
        λ → store[store]
        store → order[store/order]*, stock[store/stock]
        store/order → customer[order/customer], item[order/item]+
        store/stock → item[stock/item]+
        order/item → id[item/id], qty[item/qty], price[item/price]
        stock/item → id[item/id], qty[item/qty], item[item/item]*   (duplicate type)
        item/item → id[item/id], qty[item/qty], item[item/item]*    (duplicate type)
      After Minimize:
        λ → store[store]
        store → order[store/order]*, stock[store/stock]
        store/order → customer[order/customer], item[order/item]+
        store/stock → item[item2]+
        order/item → id[item/id], qty[item/qty], price[item/price]
        item2 → id[item/id], qty[item/qty], item[item2]*
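One way to merge duplicate types, as on this slide, is partition refinement in the style of automaton minimization. The sketch below is an assumption about how such a step could look, not the authors' Minimize: content models are restricted to sequences of (element, type, multiplicity) items, type names without a rule of their own (such as `item/id`) are treated as opaque leaves, and the merged type keeps the first name rather than a fresh name like `item2`.

```python
def minimize(rules):
    """Map every type name to a representative of its equivalence class."""
    block = {t: 0 for t in rules}  # start with all types in one block
    while True:
        ids = {}
        refined = {}
        for t, body in rules.items():
            # Content model with type references replaced by current block ids.
            signature = tuple((elem, block.get(ref, ref), mult) for elem, ref, mult in body)
            refined[t] = ids.setdefault((block[t], signature), len(ids))
        if len(ids) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = refined
    representative = {}
    for t in rules:
        representative.setdefault(block[t], t)
    return {t: representative[block[t]] for t in rules}


RULES = {
    "store": (("order", "store/order", "*"), ("stock", "store/stock", "1")),
    "store/order": (("customer", "order/customer", "1"), ("item", "order/item", "+")),
    "store/stock": (("item", "stock/item", "+"),),
    "order/item": (("id", "item/id", "1"), ("qty", "item/qty", "1"), ("price", "item/price", "1")),
    "stock/item": (("id", "item/id", "1"), ("qty", "item/qty", "1"), ("item", "item/item", "*")),
    "item/item": (("id", "item/id", "1"), ("qty", "item/qty", "1"), ("item", "item/item", "*")),
}
merged = minimize(RULES)
print(merged["stock/item"], merged["item/item"])
```

On the slide's XSD this merges `stock/item` and `item/item` and leaves the other four types distinct.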

  21. Overview • Introduction • Complete algorithm iLocal • Heuristic iXSD • Experiments • Conclusions

  22. In practice: incomplete data
      [Figure: corpus with a single store tree whose stock contains nested items]
      iLocal, k = 2:
        stock/item → {id qty, id qty item item}
        item/item → {id qty item, id qty}
      iSOA, ToSORE:
        stock/item → id[item/id], qty[item/qty], item[item/item]*
        item/item → id[item/id], qty[item/qty], item[item/item]?
      Minimize can't minimize!
      Incomplete data ⇒ iLocal derives too many types!

  23. Practical heuristics • Define "distance" between types – details: see paper • Reduce: for two types, if the distance between them is below a threshold ε, unify them • Our practical algorithm iXSD: [formula not recoverable from the source]

  24. Overview • Introduction • Complete algorithm iLocal • Heuristic iXSD • Experiments • Conclusions
