Enhancing Traditional Databases to Support Broader Data Management Applications
Yi Chen
Computer Science & Engineering
Arizona State University

What Is a Database System?
• Of course, there are traditional relational database management systems (RDBMS)
  • The relational model was introduced in 1970 by Dr. E. F. Codd (of IBM)
  • Commercial relational databases began to appear in the 1980s
  • They have been the focus of most work in the past 30 years
Yi Chen --- January 23, 2006

A Relational Database (RDBMS)

Table (Relation): Climber
  Name    Skill         Age
  James   Beginner      21
  Bob     Experienced   33

Table (Relation): Climbs (Climbs.Name refers to Climber.Name)
  Name   Route        Date       Duration
  Bob    Last Tango   10/10/05   5
  Bob    Last Tango   1/10/06    4.5

Each column is an attribute; each row is a tuple (record).
A predefined data structure (schema) is required.

Querying RDBMS: SQL

selection: σ_{Name = “James”} (Climber)
  Name    Skill      Age
  James   Beginner   21

projection: ∏_{Route} (Climbs) = { “Last Tango” }

join: Climber ⋈_{Climber.name = Climbs.name} Climbs
  Name   Skill         Age   Route        Date       Duration
  Bob    Experienced   33    Last Tango   10/10/05   5
  Bob    Experienced   33    Last Tango   1/10/06    4.5

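The three operators on this slide can be tried out directly in SQL. The sketch below is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database, and the column types and the foreign-key declaration are assumptions that the slides do not state.

```python
import sqlite3

# In-memory database holding the two example tables from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Climber (Name TEXT PRIMARY KEY, Skill TEXT, Age INTEGER);
CREATE TABLE Climbs (
    Name TEXT REFERENCES Climber(Name),  -- assumed foreign key ("refers to")
    Route TEXT, Date TEXT, Duration REAL
);
INSERT INTO Climber VALUES ('James', 'Beginner', 21), ('Bob', 'Experienced', 33);
INSERT INTO Climbs VALUES ('Bob', 'Last Tango', '10/10/05', 5),
                          ('Bob', 'Last Tango', '1/10/06', 4.5);
""")

# Selection: keep only the rows that satisfy a condition.
selection = conn.execute(
    "SELECT * FROM Climber WHERE Name = 'James'").fetchall()

# Projection: keep only some columns (DISTINCT gives set semantics).
projection = conn.execute("SELECT DISTINCT Route FROM Climbs").fetchall()

# Join: combine rows of both tables that agree on Name.
join = conn.execute("""
    SELECT c.Name, c.Skill, c.Age, k.Route, k.Date, k.Duration
    FROM Climber AS c JOIN Climbs AS k ON c.Name = k.Name
    ORDER BY k.rowid
""").fetchall()

print(selection)
print(projection)
print(join)
```

James has no row in Climbs, so the join returns only Bob's two climbs, as in the result table on the slide.
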
The Advantages of RDBMS
• Good data organization
• High efficiency for large datasets via indexing and query optimization
• Concurrency control and reliability

But 80% of the World’s Data Is Not in RDBMS!
• Examples:
  • WWW, emails
  • Personal data, documents in various formats
  • Sensor data
  • A lot of scientific data (experimental data, large images, documentation, etc.)
• Why not?
  • Several assumptions made by relational databases do not fit this data.
  • My research addresses how to enhance RDBMS to manage it.

Challenges for RDBMS (I)
• RDBMS assumption: data conforms to a predefined, fixed schema, which is kept separate from the data itself
• Reality:
  • Data may be collected from different sources on the web and therefore has different schemas
  • The schema of a single source can change over time
• Requirements: we need to handle data with different schemas and keep the schemas tightly associated with the data

XML as a Data Representation Format
• XML has become a standard data format for various applications because of:
  • Flexibility in schemas (semi-structured data)
  • Its self-describing nature
  • Its natural representation of the tree data model

XML: the Standard for Web Data
[Figure: a web service publisher (NCBI: GenBank, PubMed, BLAST, ...) and a web service requester exchange data over the Internet, with XML as the data representation.]

XML: Representing Phylogenetic Trees
[Figure: a phylogenetic tree with the leaves Orangutan, Human, Gorilla and Chimpanzee. From the Tree of Life website, University of Arizona.]

Challenges for RDBMS (II)
• RDBMS assumption: data is clean and consistent.
• Reality: real-world data is dirty
  • Data collected from different sources may have missing and conflicting information
  • Data obtained from data mining is often error-prone
  • Experimental data often contains random errors
• Requirements: we need to measure data quality and handle imprecise and/or incomplete data

Roadmap of This Talk
• Managing XML by leveraging mature RDBMS [Chen et al 04]
  • Introduction to XML
  • A generic and efficient XML-to-RDBMS mapping
    • Data mapping from trees to tables
    • Query translation from tree navigation queries to efficient SQL queries
• Handling imprecise and incomplete data in DBMS [Chen et al 06]

Sample XML Data

XML text:
  <books>
    ...
    <book>
      <title> The lord of the rings... </title>
      <section>
        <title> Locating middle-earth </title>
        ...
      </section>
      …
    </book>
  </books>

Tree view:
  books
    ...
    book
      title: “The lord of the rings …”
      section
        title: “Locating middle-earth”
        section
          title: “A hall fit for a king”
          figure
            description: “King Theoden's golden hall”
        ...
      ...

Sample XML Queries
• XML query languages are based on hierarchical structure navigation (e.g. XPath)
• Sample queries:
  • What are all the section titles: //section/title
    (// is the descendant axis, / is the child axis)
[Figure: the sample XML tree from the "Sample XML Data" slide.]

Sample XML Queries
• XML query languages are based on hierarchical structure navigation (e.g. XPath)
• Sample queries:
  • What are all the section titles: //section/title
  • What are the titles of sections that contain a figure: //section[/figure]/title
    (the bracketed part is a predicate; standard XPath writes this query as //section[figure]/title)
[Figure: the sample XML tree from the "Sample XML Data" slide.]

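Both sample queries can be run against a small document. The sketch below is illustrative only: it uses the limited XPath subset of Python's xml.etree.ElementTree, and the document is an assumed reconstruction of the sample tree (the elided parts of the slide are left out). ElementTree paths are relative to the root element, so //section/title is written as .//section/title.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the sample tree; elided content is omitted.
SAMPLE = """
<books>
  <book>
    <title>The lord of the rings ...</title>
    <section>
      <title>Locating middle-earth</title>
      <section>
        <title>A hall fit for a king</title>
        <figure>
          <description>King Theoden's golden hall</description>
        </figure>
      </section>
    </section>
  </book>
</books>
"""

root = ET.fromstring(SAMPLE)

# //section/title: titles that are children of any section (descendant, then child axis).
all_section_titles = [t.text for t in root.findall(".//section/title")]

# Titles of sections that have a figure child (predicate in square brackets).
titles_with_figure = [t.text for t in root.findall(".//section[figure]/title")]

print(all_section_titles)
print(titles_with_figure)
```

The book title is not returned by either query because its parent is a book, not a section.
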
How to Query XML Data Efficiently?
• RDBMS have achieved high performance in query evaluation.
• Can we leverage RDBMS by encoding XML into tables?

Analogy: Fourier Transforms
Convolution in the original domain (complex): $(g * h)(t) = \int_{-\infty}^{+\infty} g(u)\, h(t - u)\, du$
Product in the frequency domain (efficient): $G(f)\, H(f)$

Mapping XML Data to RDBMS
Challenge: how to build the bridge between hierarchies and tables?
[Figure: query translation maps XPath to SQL, and storage mapping maps XML data to relational databases; the figure also carries the label "XML fragments".]

Data Mapping [Florescu & Kossmann 99]

T
  ID   Tag       Value       Structural Information (Parent ID)
  1    books
  2    book
  3    title     The...
  4    section
  5    title     Locating…
  …    …         …           …

[Figure: the sample XML tree with node IDs: (1) books, (2) book, (3) title, (4) section, (5) title.]

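A parent-ID mapping of this kind can be sketched in a few lines. This is an illustration in the spirit of the table above, not the cited authors' implementation: the function name shred, the preorder numbering, and the reconstructed sample document are all assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed reconstruction of the sample tree; elided content is omitted.
SAMPLE = """<books><book>
  <title>The lord of the rings ...</title>
  <section><title>Locating middle-earth</title>
    <section><title>A hall fit for a king</title>
      <figure><description>King Theoden's golden hall</description></figure>
    </section>
  </section>
</book></books>"""

def shred(root):
    """Return rows (ID, Tag, Value, ParentID), numbering nodes in preorder."""
    rows = []
    def visit(elem, parent_id):
        node_id = len(rows) + 1
        value = (elem.text or "").strip() or None  # text-only nodes carry a value
        rows.append((node_id, elem.tag, value, parent_id))
        for child in elem:
            visit(child, node_id)
    visit(root, None)
    return rows

rows = shred(ET.fromstring(SAMPLE))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (ID INTEGER PRIMARY KEY, Tag TEXT, Value TEXT, ParentID INTEGER)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?)", rows)

# One child step (section/title) costs one self-join on ParentID.
section_titles = [r[0] for r in conn.execute("""
    SELECT c.Value FROM T AS p JOIN T AS c ON c.ParentID = p.ID
    WHERE p.Tag = 'section' AND c.Tag = 'title' ORDER BY c.ID
""")]
print(rows[:5])
print(section_titles)
```

With parent IDs alone, every child step in a path adds another self-join, and a descendant step needs a chain of joins of unknown length.
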
Data Mapping
Design special labels to encode node relationships.

T
  ID   Tag       Value       Structural Information
  1    books
  2    book
  3    title     The...
  4    section
  5    title     Locating…
  …    …         …           …

[Figure: the sample XML tree with node IDs: (1) books, (2) book, (3) title, (4) section, (5) title.]

Query Translator Architecture
[Figure: Query Translator pipeline: XPath → XPath decomposition → sub-query translation → SQL sub-query composition → SQL.]
• How to choose XPath subqueries, such that:
  • they can be easily translated to SQL subqueries
  • the SQL subqueries can be efficiently evaluated
• How to combine SQL subqueries into a complete query?

Query Translator
Q: //book[//figure]/section/title
[Figure: query tree for Q over the nodes book, section, title and figure.]

Query Translator: (I) Decomposition to Suffix Paths
Q: //book[//figure]/section/title
[Figure: the query tree for Q split into suffix paths over the nodes book, section, title and figure.]

Encoding Suffix Paths Using P-labeling

Suffix path //book/section/title: P-label interval (342000, 343000)
Selection on T: σ_{342000 ≤ Plabel ≤ 343000} (T)

T
  id   Plabel
  1    100000
  2    210000
  3    321000
  4    421000
  5    342100
  …    …

Evaluating suffix paths → SQL selections on P-labels

[Figure: the sample XML tree with node IDs: (1) books, (2) book, (3) title, (4) section, (5) title.]

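The numbers in table T are consistent with a simple digit encoding: give each tag one decimal digit, write the node's path from the node up to the root, and pad with zeros. The sketch below reproduces the slide's values under that reading. It is a simplified illustration, not the published P-labeling algorithm: the digit assignment is inferred from the slide, the helper names are hypothetical, and the sketch only works for up to nine tags and six path steps. It also uses a half-open interval, whereas the slide writes ≤ on both ends.

```python
# Assumption: one decimal digit per tag, inferred from the labels on the slide.
TAG_DIGIT = {"books": 1, "book": 2, "title": 3, "section": 4}
WIDTH = 6  # the labels on the slide have six digits

def encode(tags_leaf_first):
    """Concatenate tag digits (leaf first) and pad with zeros to WIDTH digits."""
    digits = "".join(str(TAG_DIGIT[t]) for t in tags_leaf_first)
    return digits, int(digits.ljust(WIDTH, "0"))

def node_plabel(root_to_node):
    """P-label of a node, given its tag path from the root."""
    return encode(list(reversed(root_to_node)))[1]

def suffix_interval(suffix_path_tags):
    """Half-open interval [lo, hi) of labels whose path ends with the given tags."""
    digits, lo = encode(list(reversed(suffix_path_tags)))
    return lo, lo + 10 ** (WIDTH - len(digits))

# Root-to-node tag paths of the five numbered nodes in the sample tree.
NODES = {
    1: ["books"],
    2: ["books", "book"],
    3: ["books", "book", "title"],
    4: ["books", "book", "section"],
    5: ["books", "book", "section", "title"],
}
T = {node_id: node_plabel(path) for node_id, path in NODES.items()}

# //book/section/title becomes a range selection on the Plabel column.
lo, hi = suffix_interval(["book", "section", "title"])
matches = [node_id for node_id, plabel in T.items() if lo <= plabel < hi]
print(T, (lo, hi), matches)
```

Because the leaf tag comes first in the label, all nodes whose paths share a suffix fall into one contiguous range, which is why a suffix path turns into a single range selection.
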
Query Translator: (II) Selection on P-labels
Q: //book[//figure]/section/title
[Figure: the query tree for Q, with its suffix paths evaluated as selections on P-labels.]

D-labeling Scheme
• D-labeling is used to connect suffix paths.
• D-labels (start, end, depth) can be used to detect ancestor-descendant relationships between nodes in a tree.
[Figure: the sample XML tree with D-labels: books (1, 20000, 1), book (6, 1200, 2), title (10, 80, 3), section (81, 250, 3), a depth-4 node labeled (100, 200, 4) and a depth-5 node labeled (120, 160, 5).]

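With (start, end, depth) labels, the ancestor-descendant test is a pair of comparisons: a node is an ancestor of another when its interval strictly contains the other's. The sketch below applies this standard interval-containment test to the labels in the figure. The class and function names are hypothetical, and the parent test (containment plus a depth difference of one) is an assumed use of the depth component.

```python
from typing import NamedTuple

class DLabel(NamedTuple):
    start: int
    end: int
    depth: int

def is_ancestor(a: DLabel, d: DLabel) -> bool:
    """True if a's interval strictly contains d's interval."""
    return a.start < d.start and d.end < a.end

def is_parent(a: DLabel, d: DLabel) -> bool:
    """True if a is an ancestor of d exactly one level above it."""
    return is_ancestor(a, d) and a.depth + 1 == d.depth

# Labels read off the figure.
books = DLabel(1, 20000, 1)
book = DLabel(6, 1200, 2)
title = DLabel(10, 80, 3)
section = DLabel(81, 250, 3)
inner = DLabel(100, 200, 4)  # the depth-4 node in the figure

print(is_ancestor(books, inner))    # books contains every other node
print(is_parent(book, section))     # containment and depth 2 -> 3
print(is_ancestor(title, section))  # siblings: disjoint intervals
```

In a relational setting the same test becomes a join condition on the start and end columns of the two label tables, which is how two suffix-path results can be connected.
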