Distributed Databases: Design and Query Execution - PowerPoint PPT Presentation


  1. Distributed Databases: Design and Query Execution

  2. Data Fragmentation and Placement
  • Fragmentation: how to split the data into smaller fragments?
    • Horizontal
    • Vertical
  • Placement: where to put the fragments?

  3. Horizontal Data Fragmentation
  • Problem statement: given a table R(a1, a2, a3, …) containing tuples t1, t2, …, how do we distribute R across several sites?
  • Horizontal fragmentation: for each tuple ti, place ti on one of the sites
    • Fragmentation is given through selection
    • Range partitioning: σ a1 < 10 (R), σ 10 <= a1 < 20 (R), σ 20 <= a1 (R)
    • List partitioning: σ loc = 'Montreal' (R), σ loc = 'Zurich' (R)
  • Correctness: each tuple (including future tuples) is a member of exactly one fragment (completeness and disjointness)
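
A minimal sketch of the range-partitioning example above in Oracle-style SQL (a later slide notes that Oracle provides range partitioning); the partition names and column types are illustrative assumptions:

    CREATE TABLE R (
      a1 NUMBER,
      a2 VARCHAR2(50),
      a3 VARCHAR2(50)
    )
    PARTITION BY RANGE (a1) (
      PARTITION r1 VALUES LESS THAN (10),        -- σ a1 < 10 (R)
      PARTITION r2 VALUES LESS THAN (20),        -- σ 10 <= a1 < 20 (R)
      PARTITION r3 VALUES LESS THAN (MAXVALUE)   -- σ 20 <= a1 (R)
    );

Every row falls into exactly one partition, including rows inserted later, which gives the completeness and disjointness required for correctness.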

  4. Referencing Partitioning
  • Assume relations Emp(eid, name, dep) and Dep(did, name, location); dep is a foreign key referencing Dep
  • Dep is partitioned into Dep i according to location; the sets π did (Dep i) are disjoint
  • Emp is partitioned according to Dep: Emp i = Emp ⋉ π did (Dep i), i.e.
    SELECT eid, name, dep FROM Emp, Dep i WHERE Emp.dep = Dep i.did
  • The Emp i are therefore also disjoint
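
A hedged sketch of this scheme using Oracle's reference partitioning, where the child table inherits the parent's partitioning through the foreign key; the location values and partition names are assumptions:

    CREATE TABLE Dep (
      did      NUMBER PRIMARY KEY,
      name     VARCHAR2(50),
      location VARCHAR2(50)
    )
    PARTITION BY LIST (location) (
      PARTITION dep_montreal VALUES ('Montreal'),
      PARTITION dep_zurich   VALUES ('Zurich')
    );

    CREATE TABLE Emp (
      eid  NUMBER PRIMARY KEY,
      name VARCHAR2(50),
      dep  NUMBER NOT NULL,
      CONSTRAINT fk_emp_dep FOREIGN KEY (dep) REFERENCES Dep (did)
    )
    PARTITION BY REFERENCE (fk_emp_dep);  -- each Emp i is co-located with its Dep i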

  5. Vertical Data Fragmentation
  • Vertical fragmentation: for each attribute ai, place the value of attribute ai of each tuple ti on one of the sites
  • In order to identify tuples, the key is replicated in each fragment!
  • Determined by projection: π K,A1,A2 (R), π K,A3,A4 (R)
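
A minimal SQL sketch of these two fragments; the fragment names R1, R2 and the reconstruction view are illustrative assumptions (K is the key of R):

    CREATE TABLE R1 AS SELECT K, A1, A2 FROM R;   -- π K,A1,A2 (R), stored at site 1
    CREATE TABLE R2 AS SELECT K, A3, A4 FROM R;   -- π K,A3,A4 (R), stored at site 2

    -- R is reconstructed by joining the fragments on the replicated key:
    CREATE VIEW R_full AS
    SELECT R1.K, R1.A1, R1.A2, R2.A3, R2.A4
    FROM R1 JOIN R2 ON R1.K = R2.K;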

  6. Fragmentation and Query Execution
  • LAN: fragment such that your typical queries are executed fast
    • Full scan: SELECT * FROM R
      • Many result tuples => distribute work
      • Idea: distribute the data evenly among several sites and read it in parallel
      • Scanning time is expected to be long => assembly and communication overhead are negligible
    • Point query: SELECT * FROM R WHERE a1 = 5
      • Few result tuples => localize work
      • Idea: if all relevant data is on one site, only that site has to scan its partition; there is no work on the other sites
      • Search time is expected to be short => assembly and communication overhead would have too big an impact
  • WAN: each query should be executed solely locally
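
Under the range partitioning from slide 3, both behaviors fall out of the same scheme; the parallel-degree hint below is an illustrative Oracle-style assumption:

    -- Point query: partition pruning localizes the work to the single
    -- fragment σ a1 < 10 (R); the other sites do nothing.
    SELECT * FROM R WHERE a1 = 5;

    -- Full scan: all fragments must be read, so the work can be spread
    -- across sites/workers in parallel.
    SELECT /*+ PARALLEL(R, 4) */ * FROM R;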

  7. Range Partitioning
  • Partitioning vector indicating the domain ranges of an attribute, e.g. V = (4; 7)
  • Parallel query execution
    + good for full scans if the distribution is even
    + good for range queries on the partitioning attribute
    + ok for point queries
    - not good for range/point queries that are not on the partitioning attribute
    • A good vector is needed; otherwise data skew and execution skew result
    • Oracle provides range partitioning
  • Local query execution: good if the partitioning is related to geography
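
The vector V = (4; 7) splits the attribute's domain into three ranges; a hedged Oracle-style rendering (table and column names are assumptions):

    CREATE TABLE T (a NUMBER)
    PARTITION BY RANGE (a) (
      PARTITION p1 VALUES LESS THAN (4),         -- a < 4
      PARTITION p2 VALUES LESS THAN (7),         -- 4 <= a < 7
      PARTITION p3 VALUES LESS THAN (MAXVALUE)   -- a >= 7
    );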

  8. Data Placement
  • Which partition is stored on which computer?
  • Which partitions should be replicated, and where?
  • Highly related to data fragmentation
  • Optimization problems:
    • Efficiency
      • Final goal: minimize query response time, maximize throughput
      • Achieved by minimizing "costs"
        • Minimize remote WAN accesses
        • Optimize parallelism
        • Minimize bottlenecks (communication system, local machines)
    • Availability
      • Choose nodes according to their availability
      • Replication

  9. Data Placement
  • Data allocation depends on many parameters:
    • Number of nodes
    • Storage capacity and costs at each site
    • Number of fragments
    • Size of the fragments
    • Number of different queries/updates on specific fragments
    • Frequency of each query/update operation
    • Size of the input query, number of result tuples per query
    • Communication costs per data unit
    • Which query processing strategies exist (types of joins etc.)
  • This is a hard problem!

  10. Query Processing I
  • SELECT A, B FROM R WHERE E = 3
  • Generate the execution tree: π A,B ( σ E = 3 ( R ) )

  11. Query Processing II
  • Localization I: replace relations by their fragments
    π A,B ( σ E = 3 ( [R1: E < 10] ∪ [R2: E >= 10] ) )

  12. Query Processing III
  • Localization II: push ∪ up, push π and σ down
    Before: π A,B ( σ E = 3 ( [R1: E < 10] ∪ [R2: E >= 10] ) )
    After:  π A,B ( σ E = 3 ( [R1: E < 10] ) ) ∪ π A,B ( σ E = 3 ( [R2: E >= 10] ) )

  13. Query Processing IV
  • Localization III: simplify, eliminate unnecessary operations
    σ E = 3 ( [R2: E >= 10] ) = ∅, so the R2 branch can be dropped:
    π A,B ( σ E = 3 ( [R1: E < 10] ) ) ∪ π A,B ( σ E = 3 ( [R2: E >= 10] ) )  becomes  π A,B ( σ E = 3 ( [R1: E < 10] ) )
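
The three localization steps of slides 11-13, rendered as hedged SQL over the fragments R1 (E < 10) and R2 (E >= 10); the fragment table names are assumptions, and UNION ALL stands in for ∪ since the fragments are disjoint and need no duplicate elimination:

    -- Step 1 (localize): replace R by the union of its fragments.
    SELECT A, B
    FROM (SELECT * FROM R1 UNION ALL SELECT * FROM R2)
    WHERE E = 3;

    -- Step 2 (push π and σ down, push ∪ up):
    SELECT A, B FROM R1 WHERE E = 3
    UNION ALL
    SELECT A, B FROM R2 WHERE E = 3;

    -- Step 3 (simplify): E = 3 contradicts R2's predicate E >= 10,
    -- so the R2 branch is empty and can be eliminated.
    SELECT A, B FROM R1 WHERE E = 3;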

  14. Summary
  • Many different simplification rules exist
  • Things get considerably more sophisticated for
    • Joins
    • Aggregation (sum, max, etc.)
  • Localization with vertical fragmentation is slightly different

  15. Heterogeneous Query Processing
  • Architecture (top to bottom): clients → mediator with catalog (middleware) → one wrapper per source → component databases (DB1, DB2, …)

  16. Work Distribution
  • Mediator
    • Schema integration
      • Must know the schemas of the component DBs (external schemas)
      • Find a reasonable global schema for the separate schemas
      • Translate the external local schemas of a component DB into the global schema and vice versa
      • This is a whole research area of its own
      • Very complex if the data sources have different data models (relational, OO, XML, HTML, file system)
    • Own data dictionary (catalog)
      • Global schema, translation mechanisms, information about local schemas, …
      • Some statistics on the component DBs: sizes of tables, query capacity of a component DB, etc.

  17. Work Distribution
  • Mediator
    • Global query execution
      • Query parsing, rewriting, query optimization
      • Executes some of the operations itself (joins, extra operations on the data retrieved from the data sources)
      • Might cache data
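
A hedged illustration of this split, reusing the Emp/Dep schema from slide 4 and assuming, purely for illustration, that Emp is stored in DB1 and Dep in DB2:

    -- Global query posed at the mediator:
    SELECT e.name, d.location
    FROM Emp e JOIN Dep d ON e.dep = d.did;

    -- Sub-query shipped (via its wrapper) to DB1, which holds Emp:
    SELECT eid, name, dep FROM Emp;

    -- Sub-query shipped (via its wrapper) to DB2, which holds Dep:
    SELECT did, location FROM Dep;

    -- The mediator itself joins the two retrieved results on dep = did.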

  18. Work Distribution
  • Wrapper (adaptor)
    • Translates every request of the mediator into a request that is understood by the underlying DB
    • Translates the results into a format understood by the mediator
    • Wrappers exist for relational DBs, file systems, web-based DBs
    • The wrapper is a client of the DB source: it can cache data and execute parts of the query by itself (e.g. if the data source only returns HTML pages)
    • Expect many wrappers for many different data sources
      • New data source -> new wrapper, but the same mediator
      • Key feature: extensibility
  • Query optimization and execution are distributed between mediator, wrapper, and data source
