Relational Databases for Querying Outline XML Documents: • Quick Review on XML Limitations and Opportunities • The Problem Definition • General Approach to Querying XML using a Jayavel Shanmugasundaram RDBMS Kristin Tufte Gang He • The focus: XML DTD to Relational Schema Chun Zhang – General idea and problems David DeWitt – 3 inlining techniques Jeffrey Naughton • Basic • Shared Presented by Mingwei JI • Hybrid Discussion led by Atsushi WATASE • Summary What is the big deal about XML? XML in One Slide • eXtensible Markup Language (XML) • Fast emerging as dominant standard for data • Hierarchical document format for information representation on WWW exchange in WWW • Self describing data (tags) • Exciting database opportunity: • Semi-structured – Unlike HTML, tags are not only for presentation • Nested element structure having a root – Can capture semantics • Element data can have – Can query the web if we can query XML!!! – Attributes – Sub-elements XML Example DTDs: Schema for XML Docs <book> <!ELEMENT book (booktitle, author)> <booktitle> The Selfish Gene </booktitle> <author id = “dawkins”> <!ELEMENT booktitle (#PCDATA)> <name> <firstname> Richard </firstname> <!ELEMENT author (name, address)> <lastname> Dawkins </lastname> <!ATTLIST author id ID #REQUIRED> </name> <address> … </address> </author> </book> 1
The Problem Approach • Given: • Process a DTD to generate a relational – DTDs schema ( our focus today ) – Collection of XML documents conforming • Parse XML documents conforming to to DTDs DTDs and load them into tuples of • Query: RDBMS – Based on DTD schemas • Translate semi-structured queries over – Over collection of XML documents, XML documents into SQL queries over performing selections, joins, etc. RDBMS – Producing an XML result • Convert the results back to XML Approach DTD to Relational Schema • General Approach: XML XML-QL XML DTD Documents Query Result – Each Element ==> Relation Automatic Translation Layer – Each Attribute of Element ==> Column of Relation Relational SQL Relational Tuples Schema – Connect elements using foreign keys Query Result Translation • Problem? Information – Fragmentation! Commercial RDBMS Fragmentation: Example DTD to Relational Schema <!ELEMENT author (name, address)> author (authorID: integer, id: string) <!ATTLIST author id ID #REQUIRED> name (nameID: integer, authorID: integer) <!ELEMENT name (firstname?, lastname)> firstname (firstnameID: integer, nameID: integer, value: string) • “inlining” <!ELEMENT firstname (#PCDATA)> lastname (lastnameID: integer, nameID: integer, <!ELEMENT lastname (#PCDATA)> value: string) “Put as many descendants or sub-elements of <!ELEMENT address ANY> address (addressID: integer, authorID: integer, value: an element as possible into a single relation” string) • Results in 5 relations • Just retrieving first and last names of an author requires three joins! 2
book monograph article DTD Graph Basic Inlining Technique • Intuition: ? title booktitle – Inline as many sub-elements as possible * editor contactauthor – Do not inline only if it is a set sub-element. * – Connect relations using foreign keys authorID • Complications: author name – A document can be rooted at any element • Create separate relational schema for each name authorid address root – Recursion ? • Detect cycles in schema firstname lastname Basic : Relational Schema Basic Inlining Technique: Pros book (bookID: integer, book.booktitle : string, book.author.name.firstname: string, • Reduces number of joins for queries like book.author.name.lastname: string, “get the first and last names of a book book.author.address: string, author.authorid: string) author” … • Efficient for queries such as “list all author (authorID: integer, authors of books” author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string) … address (…) Basic Inlining Technique: Cons Experimental Results • Queries like “list all authors with name • Is Basic practical? Jack” – Many DTDs have recursion – Union of 5 queries! – Large strongly connected components • Generate a Large number of relations: – Schema translation program ran out of virtual memory!!! – Separate relational schema for each element as root • Concentrate on Shared vs. Hybrid 3
Shared Inlining Technique Shared Inlining Technique • Intuition: The principal idea behind “ Shared” – Inline as many sub-elements as possible – To identify the element nodes that are – Do not inline only if it is a shared, recursive represented in multiple relations in “ Basic ” or set sub-element. and to share them by creating separate • Technique: relations for these elements. – Necessary and Sufficient Condition for – To ensure that an element node is shared/ recursive element: represented in exactly one relation In-degree >= 2 in DTD graph Shared: Relational Schema Shared Inlining Techniques: Pros book (…) article (…) monograph (…) title (…) + Reduces number of joins for queries like author (authorID: integer, “get the first and last names of an author” author.parentID: integer, + Efficient for queries such as “list all author.parentCODE: integer, authors with name Jack” author.name.isroot: boolean, author.name.firstname.isroot: :boolean, author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string) Shared Inlining Technique: Cons Hybrid Inlining Technique - Sharing whenever possible implies extra • Inlines some elements that are shared in joins for path expressions Shared • “Article with a given title name” – Elements with in-degree >= 2 that are not set sub-elements or recursive • Handles set and recursive sub-elements as in Shared 4
Hybrid: Relational Schema Hybrid Inlining Technique: Pros book (…) + Reduces joins through shared elements article (…) monograph (…) (that are not set or recursive elements) author (authorID: integer, author.parentID: integer, + Shares some strengths of Shared : author.parentCODE: integer, • Reduces joins for queries like “get first and author.name.isroot: boolean, author.name.firstname.isroot: boolean, last names of a book author” author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string) Hybrid Inlining Technique: Cons Group Discussion • Q1. Assume we’ll store XML documents into database. - Requires more SQL sub-queries to Which one would you pick and why? retrieve all authors with first name Jack – Create an XML database from scratch • Tradeoff between reducing number of queries – Use a relational backend and reducing number of joins • Q2. In what condition, would you pick another one? • Shared and Hybrid target query- and join- • Q3. Assume we pick “relational” to the Q1. Which inlining reduction respectively technique would you use? – Basic – Shared – Hybrid Summary Reference • Relational model can be used to query • http://www.cs.wisc.edu/~jai/pubs.html XML documents by Jayavel Shanmugasundaram • 4 steps • Converting XML DTD to relations: – General idea – 3 Inling Techniques • Basic • Shared • Hybrid 5
Thank You! 6
Recommend
More recommend