5. Storage and Manipulation Foundations of XML Data of SSD Manipulation Shamelessly “inspired” by Ioana Manolescu tutorial Giorgio Ghelli The problem Classifying stores • Consider the queries • Essential criteria: – $doc // e-mail – Clustering – $doc //.[name = 'ghelli']/e-mail – Encoding of parent/child relationship • We do not want to bring the whole $doc in main memory • Set manipulation rather than tuple manipulation OEM data model Storing OEM Bib • No schema! &o1 complex object • Storing objects in LORE: paper paper book – Store the graph, clustered in depth-first references references &o12 &o24 &o29 order author pages author author title http titlepublisher title author year author – Operator: Scan(document,path), returns a &o43 &25 &96 set of objects 1997 last firstname firstname lastname first lastname atomic object &243 &206 “Abiteboul” “Serge” “Victor” “Vianu” 122 133
Indexing in LORE Access plans • Top down or bottom up navigation? • VIndex( l , o, pred ): all objects o with an select x incoming l-edge, satisfying the predicate from A.B x • LIndex( o , l , p): all parents of o via an l - where x.C = 5 edge A A A D D • BIndex(x, l , y): all edges labeled l D D B B B B B B B B B C C C C C C C C C 5 5 5 4 4 5 4 4 5 Access plans: bottom up Bottom up and top down select x select x from A.B x from A.B x where x.C = 5 where x.C = 5 Return(n2) Return(n2) Name(n3,'A') Name(n3,'A') Return(n1) NLJoin LIndex(n2,'B',n3) Select(n2,'=5') NLJoin LIndex(n2,'B',n3) LIndex(n1,'C',n2) Scan(n1,'C',n2) VIndex('C',n1,'=5') LIndex(n1,'C',n2) VIndex('C',n1,'=5') Scan('A','B',n1) Hybrid access plans Path indexes select x • PIndex( p , o): all objects reachable by from A.B x the path p where x.C = 5 Intersect Project(n2) Project(n2) Name(n1,'A') LIndex(n3,'C',n2) BIndex(n1,'B',n2) VIndex('C',n3,'=5')
Using a path index DataGuides select x • Introduced in Lore: a compact representation of all paths in the graph from A.B x • A DG for s is an OEM object d such that: where x.C = 5 – Every path of s reaches exactly one object in d – Every path in d is a path for s Intersect • DG: a schema a posteriori • Used to: PIndex('A.B',n2) Project(n2) – Inform the user LIndex(n1,'C',n2) – Expand wildcards in paths – Inform the optimizer VIndex('C',n1,'=5') DataGuides for trees Building a DataGuide A A A • Similar to converting a NFA to a DFA • Linear time for trees D D B D B D B • Exponential in time and space for C C C C C C graphs A A A • In practice, works well for regular ? structures D D B D B D B A C C A C C A C Which dataguide is better? Optimization via dataguide • Minimal dataguide • Expanding paths: a//z => a/b/z | a/c/d/z • Strong dataguide: if p1 and p2 both reach the • Deleting paths that are not in the data same node in d , then p1 and p2 have the • Contracting paths: a/c/e/d/z => //d/z same target set in s – Each target-set in the source has its own node – May be more efficient for a bottom-up and in the guide evaluation – Easy to build • Keeping statistic information – Easy to maintain, by keeping track of the many-to- many node correspondence between s and d – However, statistic are needed for every k- length path, not just for rooted paths
Graph Schemas Graph schemas A A A • Each edge in the scheme is labeled by a label predicate (a set of labels) D D B D B D ∨ B • Predicates are deterministic A C C A C C A C • Conformance is defined by a simulation between s and d: A – Root of data in root of schema – For every n1 in d1 with n1-l-n2, we have d1-l-d2 B C ¬ (B ∨ C) with n2 in d2 B • No request for surjectivity, or injectivity D E Graph indexing • Group nodes in sets, possibly disjoint • Store the extent of each set XML Storage in RDBMS • Grouping criteria: – Reachable by exactly the same Forward paths: 1- index – Indistinguishable by any F&B path: FB-index – Indistinguishable by the paths in a set Q: covering indexes – Indistinguishable by any path longer than k: A(k) index Using RDBMS for XML Storing data • Advantages • Data may be schema-less – Transactions • Data may have a schema – Optimization • Data may change its schema over time • Issues – Data storage – Query translation
Approaches Unknown schema: the edge table From Pos Tag To Data employees 1 • Based on schema: - 1 employees 1 1 1 emp 2 – Based only on schema 2 1 ID 3 1 emp 11 emp 21 2 2 FN 4 Nancy emp – Based on schema + cost informations 2 2 3 BD 5 5 1 Day 6 8 • Unknown schema: FstNm BirthD HiringD ID 5 2 Month 7 dec 3 4 5 9 5 3 Year 8 1968 – Derive schema from data Day 6 M 7 1 N Y 8 2 4 HD 9 8 dec 1968 – Generic approach 9 … … … 1 2 emp 11 • User defined 11 … … … 1 3 emp 21 Navigating the edge table Partitioned edge table employees From Pos Tag To Data • //FN/text(): - 1 employees 1 From Pos To Data select e.Data from edge e where e.Tag = 'FN' 1 1 emp 2 - 1 1 2 1 ID 3 1 • //emp[ID=‘1’]/FN/text() 2 2 FN 4 Nancy emp select e3.Data 2 3 BD 5 From Pos To Data srom edge e1, edge e2, edge e3 5 1 Day 6 8 1 1 2 5 2 Month 7 dec where e1.Tag = ‘emp’ and e1.to = e2.from 1 2 11 and e2.Tag = ‘ID’ and e2.Data = 1 5 3 Year 8 1968 1 3 21 and e1.to = e3.from and e3.Tag = ‘FN’ 2 4 HD 9 9 … … … ID • Navigation through multi-way join (XPath to 1 2 emp 11 FO translation) From Pos To Data 11 … … … 2 1 3 1 1 3 emp 21 Navigation Related storage schemes • //emp[ID=‘1’]/FN/text() • The universal relation: select e1.Data – employees � � � emp � � �� ID � � �� FN � � �� … from edge e1, edge e2, edge e3 • Materialized views over edges: where e1.Tag = ‘emp’ and e1.to = e2.from and e2.Tag = ‘ID’ and e2.Data = 1 – emp � � ID � � FN � � HD … and e1.to = e3.from and e3.Tag = ‘FN’ � select FN.Data • The STORED approach: from emp, ID, FN – Materialized views based on pattern where emp.to = ID.from and ID.Data = 1 frequencies in the database and emp.to = FN.from – Overflow tables for the rest • Joining smaller tables
Flat storage Path partitioning in Monet employees • For each root-to-inner-node path: emp emp emp – Path(n1,n2,ord): FstNm BirthD HiringD ID • employees.emp{(1,2,1);(1,11,2);(1,21,3)} • employees.emp.ID{(2,3,1);…} Day 6 M 7 1 N Y 8 8 dec 1968 • For each root-to-leaf path: – Path(n1,val) ID First Name BD-D BD-M BD-Y • employees.emp.ID.text{(3,’1’);…} 1 Nancy 8 dec 1968 2 Andrew 19 feb 1952 • Path summary 3 Janet 30 aug 1963 • No join for linear path expressions 4 Margaret 19 sep 1958 XRel approach: interval coding XParent • Tables: • Tables: – Path(PathID,PathExpr) – LabelPath(ID, lengh, pathExpr) – Element(DocID, PathID, Start, End, Ordinal) – Data(pathID, nID, ord, value) – Text(DocID, PathID, Start, End, Value) – Element(pathID, ord, nID) – Attribute(DocID, PathID, Start, End, Value) – ParentChild(pID, cID) • Ancestor relation: • Use ParentChild instead of interval – N1.start< N2.start and N2.end > N1.end coding: equi-join instead of <-join • Path expression: regexp matching with Path table, join the result with the data tables Schema-driven storage Sharing the address DB Employees Depts DB Employees Depts Employee* Dep* Name Employee* Dep* Address Name Street Number E-mail* Address GN FN Street GN FN Number Phone? Employee(PId,Id,Name,Name.GN,Name.FN) Dep(PId,Id) DB(Id,Employees,Depts) Address(PId,Id,Address,Addr.Street,Addr.Number) Employee(PId,Id,Name,Name.GN,Name.FN,Phone) E-Mail(PId,Id,E-mail) Employee(PId,Id,…,Address,Addr.Street,Addr.Number) Depts(PId,Id,Address,Addr.Street,Addr.Number) Dep(PId,Id,Address,Addr.Street,Addr.Number)
Recommend
More recommend