9 17 2010
play

9/17/2010 Todays lecture Advanced databases and data models: - PDF document

9/17/2010 Todays lecture Advanced databases and data models: Native XML management Shredding Hybrid solutions Theme3: Efficient storage of XML SQL/XML HShreX HShreX Efficency Lena Strmbck June 17, 2009 1 XML as a data model


  1. 9/17/2010 Today´s lecture Advanced databases and data models: Native XML management Shredding Hybrid solutions Theme3: Efficient storage of XML SQL/XML HShreX HShreX Efficency Lena Strömbäck June 17, 2009 1 XML as a data model Storage possibilities for XML XML is richer than the relational model Tree structure, Order Storage Loading Querying design … XML XML XML XQuery Que y Vary from highly structured to unstructured Vary from highly structured to unstructured schema schema Docs Docs results results Database export … Native XML Annotated text documents DBMS Can contain links to other type of entities What does this mean for efficient storage? 1

  2. 9/17/2010 Native XML databases Storage possibilities for XML • Defines a (logical) model for an XML document Storage Loading Querying design • Elements, attributes, PCDATA, document order. XML XML XML XQuery schema Docs results • The XML document is the logical unit of storage Storage Loading Querying design • Can have any physical underlying storage model Mapping layer XML XML XML • Often: grouping documents, collections Often: grouping documents, collections XQuery XQuery schema schema Docs Docs results results Tuples • Query model: Xpath and in most cases Xquery Relational SQL Relational schema results Native XML Relational • Examples: DBMS DBMS • eXist http://exist-db.org/ • MarkLogic http://www.marklogic.com Shredding How to shred XML? Related work <?xml version="1.0" encoding="UTF-8"?> <families Families xmlns:xsi=http://www.w3.org/2001/XMLSchem Id Pid a-instance> Shredding of XML into relational storage <family> 0 - <parent> Family <name>Lena</name> <job>Lektor</job> Bohannon et al. 2002, (ICDE), Id Pid </parent> Florescu and Cossman 1999 (IEEE Data Eng), <child> 1 0 <name>Ludvig</name> Georgiadis and Vassalos 2007 (SIGMOD), g ( ), <school>Skolan</school> Parent Parent </child> Id Pid Name Job </family> </families> 2 1 Lena Lektor Grust et al. (2007) (SIMOD), Child Source Ordinal attrName isValue Value Mlynkova 2009 (DEXA) Id Pid Name School 0 1 Families False 1 3 1 Ludvig Skolan 1 1 Family False 2 2 1 Parent False 3 3 1 Name True Lena 3 2 Job True Docent .. 2

  3. 9/17/2010 Hybrid XML Storage New possibilities… <model id="Tyson1991CellModel_6" name="Tyson1991_CellCycle_6var"> <listOfSpecies> Species: + <species id="C2" name="cdc2k" compartment="cell"> + <species id="M" name="p-cyclin_cdc2" compartment="cell"> XML data Id Name Compartment + <species id="YP" name="p-cyclin" compartment="cell"> … XQuery C2 cdc2k cell </listOfSpecies> M p-cyclin_cdc2 cell XML results <listOfReactions> YP p-cyclin cell <reaction id="Reaction1" name="cyclin_cdc2k dissociation"> …. …. …. <annotation> <rdf:li rdf:resource="http://www.reactome.org/#REACT_6308"/> Reaction: XML data <rdf:li rdf:resource="http://www.geneontology.org/#GO:0000079"/> Id Name Annotation Formula </annotation> SQL/XQuery Mapping app g Reaction1 cyclin_cdc2k <annotation <kinetic_law …..> <listOfReactants> dissociation …..> XML /R l ti XML /Relational l <speciesReference species="M"/> f / Reaction2 cdc2k <annotation <kinetic_law ….> layer </listOfReactants> phosphorylation ….> results …. …. …. …. <listOfProducts> Reactants: Native <speciesReference species="C2"/> <speciesReference species="YP"/> Id Species </listOfProducts> Hybrid Reaction1 M <kineticLaw> Reaction2 …. <math xmlns="http://www.w3.org/1998/Math/MathML"> …. …. <apply> <times/> <ci> k6 </ci> <ci> M </ci> </apply></math> Products: <listOfParameters> <parameter id="k6" value="1“> </listOfParameters> Id Species </kineticLaw> Reaction1 C2 Hybrid Reaction1 YP </reaction> …. …. + <reaction id="Reaction2" name="cdc2k phosphorylation"> DBMS ... more reactions </listOfReactions> </model> </sbml> Hybrid storage and SQL/XML XML type CREATE TABLE: Hybrid XML storage Beyer et al. create table sbml_data (sbml_doc XML); Hua Liu et al. Rys et al. (2005) (SIGMOD) create table reaction ( id varchar(100) not null, Proposed as SQL standard name varchar(250), math XML, Available in: DB2, Microsoft SQL Server, Oracle annotation XML, primary key(id) More limited versions in other databases…. ) 3

  4. 9/17/2010 Querying Querying select select reactome_doc.query( sbml_doc.query('/sbml/model/listOfSpecies/species[@id 'for $react in /model/listOfReactions/reaction return <path> = "C2"]') <from> {data($react/listOfReactants/speciesReference/@species)} from sbml_data; </from> <via> {data($react/@id)} </via> { ($ @ )} <species id="C2" name="cdc2k" compartment="cell"> <to> {data($react/listOfProducts/speciesReference/@species)} </to> <annotation> …</annotation> </path>') from reactome_data; </species> <path><from>M</from><via>Reaction1</via><to>C2</to></path> <path><from>M</from><via>Reaction1</via><to>YP</to></path> … Transforming data to relations Transforming data to XML insert into reaction(id, name, annotation, formula) select reaction.reaction as name, reaction.id, (select select r.value('@id','varchar(100)') as id, (select speciesReference.reactant as species from reactants r.value('@name','varchar(250)') as name, speciesReference where reaction.id = speciesReference.id r.value('annotation','xml') as annotation, for XML auto, type) r value('kinetic formula' 'xml') as formula r.value( kinetic formula , xml ) as formula, from emptyXML as listOfReactants for XML auto, type), from (select (select (select speciesReference.product as species from products sbml_doc speciesReference where reaction.id = speciesReference.id .nodes(sbml/model/listOfReactions/reaction‘) for XML auto, type) from sbml_data) as d(r); from emptyXML as listOfProducts for XML auto, type) , from reaction for XML auto, type 4

  5. 9/17/2010 Syntax differs between DBMS´s (DB2) Efficiency: Increasing query complexity select sbml_doc.query( 'for $y in /model/listOfReactions/reaction/listOfModifiers/modifierSpeciesReference, $z in /model/listOfSpecies/species[@id = $y/@species] 4000 return <modifier> {$y/@species} {$y/../../@id} {$z/@compartment} </modifier>') from reactome_data; Native xquery Designed shredding 2000 for Automatic shredding $y in db2-fn:xmlcolumn('REACTOME_DATA.REACTOME_DOC')/model/ listOfReactions/reaction /listOfModifiers/modifierSpeciesReference, $z in db2-fn:xmlcolumn('REACTOME_DATA.REACTOME_DOC')/model/ listOfSpecies/species[@id = $y/@species] 0 return <modifier> {$y/@species} {$y/../../@id} {$z/@compartment} </modifier> Species Path Path (2 step) Path (3 step) Path (4 step) Efficiency: Efficiency: Combining representations Return the result as XML 500 400 4000 XQuery 300 XQuery (rdbms 1) Que y ( db s ) SQL (designed) SQL (rdbms1) 200 SQL (automatic) 2000 Mixed 100 Species Species Species Reaction Reaction Reaction 0 (1) (100) (1000) (1) (500) (10000) 3.4 5

  6. 9/17/2010 Higher complexity What are we doing? Numerous alternatives for XML storage Efficiency studies How to achieve models that are efficient and easy to use Dependent on application, XML data and query load Guidelines Tool development Guidelines: HShreX – a tool for evaluation Keep together what naturally belong together Queries XML data Do not shred parts of the XML where the schema allows large variation Take variations of the actual data into account XML Relational Shred elements that are critical for performance p Schema schema DBMS + + Prefer the representation that is required for query Annotations Shredding rules results Avoid shredding where future versions of the schema is likely to change. Avoid shredding if parallel versions of data is to be kept Extension of an old tool Shrex to allow hybrid storage 6

Recommend


More recommend