
Advanced Database Management Systems
Distributed DBMS: Introduction and Architectures
Alvaro A A Fernandes, School of Computer Science, University of Manchester
AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 1 / 121
Outline


  1. Distributed DBMS Architectures Distributed DBMS Architectures (2) Multi-DBMS from Component Schemas
  ◮ In the figure, 'multi-' is used to connote that the component DBMSs are coerced neither with respect to their autonomy nor with respect to how heterogeneous they make themselves.
  ◮ In this case, a GCS does not normally arise by negotiation (e.g., there may be more than one GCS if the component DBMSs have public interfaces).
  ◮ The component DBMSs may still have local external schemas (LES) imposed upon them.
  ◮ The GCS too can have global external schemas (GES) imposed upon it.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 19 / 121

  2. Distributed DBMS Architectures Distributed DBMS Architectures (3) Multi-DBMS without a Global Schema
  ◮ The absence of a global schema means that only partial views arise, i.e., there is no attempt at a unified description of all the component DBMSs.
  ◮ This is more likely in the case of ad-hoc, single-use scenarios, where there is no motivation to invest in creating a global view over the resources.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 20 / 121

  3. Distributed DBMS Architectures Distributed DBMS Architectures (4) Multi-DBMS Execution Model ◮ In a multi-DBMS there is a need to map a global request into local sub-requests and local sub-results into a global result. ◮ The component DBMSs still cater for local requests with local results. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 21 / 121

  4. Distributed DBMS Architectures Distributed DBMS Architectures (5) Time-Shared Access to a Centralized Database
  ◮ In mere time-sharing of a centralized DBMS, all data and all applications run remotely from the point of access.
  ◮ Requests are for batch tasks; a response (and not necessarily results) is sent back.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 22 / 121

  5. Distributed DBMS Architectures Distributed DBMS Architectures (6) Multiple Clients/Single Server ◮ In client-server approaches, clients are applications that interface through client-side services and communications with a server. ◮ The server runs server-side services in response to client requests. ◮ Because of the client-side services that support the application, high-level, fine-grained, interactive requests can be sent that cause results (i.e., filtered answers only) to flow back. ◮ In general, the client-side services offer query language interfaces (perhaps language-embedded, or form-based, or both). AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 23 / 121

  6. Distributed DBMS Architectures Distributed DBMS Architectures (7) Pros/Cons of Client-Server Architectures
  Pros:
  ◮ More efficient division of labor
  ◮ Client-side scale-up and scale-out
  ◮ Better price/performance on client machines
  ◮ Ability to use familiar tools on client machines
  ◮ Client access to remote data
  ◮ Full DBMS functionality provided to many
  ◮ Overall better system price/performance
  Cons (vis-à-vis other distribution strategies):
  ◮ Possible bottleneck and single point of failure in the server
  ◮ Server-side scale-up and scale-out less easy
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 24 / 121

  7. Distributed DBMS Architectures Distributed DBMS Architectures (8) Multiple Clients/Multiple Servers ◮ Distributing server-side load is possible. ◮ Mechanisms become more complex at the lower levels. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 25 / 121

  8. Distributed DBMS Architectures Distributed DBMS Architectures (9) Co-Operating Servers
  Once servers start co-operating, one is coming close to a truly distributed DDBMS. The newest classes of DDBMSs have arisen in the last five years as a result of pressure to maintain
  ◮ extremely large repositories of either structured or unstructured data supporting
  ◮ workloads consisting of
  ◮ either relatively few computationally intensive analyses
  ◮ or an extremely large number of relatively simple retrieval or update requests.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 26 / 121

  9. Distributed DBMS Architectures Summary Distributed DBMS Architectures ◮ DDBMSs have risen in importance due to structural changes in the computing landscape that saw the networking of high-quality PCs become the norm. ◮ Even so, they still retain their original role of emulating the operational decentralization of organizations. ◮ DDBMS architectures capitalize on localization and parallelization to offer a potential for performance gains. ◮ Nonetheless, autonomy and heterogeneity levels can create significant hurdles for full distribution. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 27 / 121

  10. Distributed DBMS Architectures Advanced Database Management Systems Data Distribution Strategies Alvaro A A Fernandes School of Computer Science, University of Manchester AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 28 / 121

  11. Outline Distributed DBMSs: The Design Problem Data Distribution Strategies Fragmentation and Allocation Fragmentation, in More Detail AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 29 / 121

  12. Distributed DBMSs: The Design Problem Distribution Strategies (1) The Design Problem ◮ In the general setting, we need to decide: ◮ the placement of data and programs ◮ across the sites of a computer network ◮ as well as possibly designing the network itself ◮ In DDBMS, the placement of applications entails: ◮ placement of the distributed DBMS software ◮ placement of the applications that run on the database AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 30 / 121

  13. Distributed DBMSs: The Design Problem Distribution Strategies (2) Dimensions of the Problem ◮ Whether only data is partitioned across sites (and programs are replicated everywhere) or whether programs are partitioned too ◮ Whether the access patterns are stable or not ◮ Whether knowledge of such access patterns is complete or not AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 31 / 121

  14. Distributed DBMSs: The Design Problem Distribution Strategies (3) Design Approaches
  Top-Down: only possible, in practice, when the system is being designed from scratch, and only lasts if heterogeneity and autonomy are tightly controlled.
  Bottom-Up: the only practical solution when the component databases already exist at a number of sites, and more likely to last when heterogeneity and autonomy cannot be controlled.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 32 / 121

  15. Data Distribution Strategies Data Distribution Strategies (4) Some Design Issues ◮ Why fragment at all? ◮ How to fragment? ◮ How much to fragment? ◮ How to ensure correctness of fragmentation? ◮ How to allocate fragments? ◮ What information is required? AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 33 / 121

  16. Data Distribution Strategies Data Distribution Strategies (5) Fragmentation (1)
  ◮ Why can't we just distribute relations?
  ◮ Because most relations are designed to be suitable for a great many applications, and different applications may be subject to different locality aspects and offer different parallelization opportunities.
  ◮ What is a reasonable unit of distribution?
  ◮ Roughly, that view on a relation that is needed by one or more applications in one place.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 34 / 121

  17. Fragmentation and Allocation Data Distribution Strategies (6) Fragmentation (2) Consider the case of entire relations as the unit of distribution: ◮ Most relations have subsets whose semantics characterize special affinity (e.g., of location, of timing, etc.). ◮ For example, in a relation Employees, the attribute Department may characterize location affinity if different departments occupy different locations. ◮ If so, then unnecessary communication may be incurred if we distribute entire relations. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 35 / 121

  18. Fragmentation and Allocation Data Distribution Strategies (7) Fragmentation (3)
  Consider the case of sub-relations as the unit of distribution:
  ◮ A sub-relation, referred to as a fragment in the DDBMS context, is what is specified by a view (typically by selection or projection or both).
  ◮ Fragmentation can be derived from knowledge of applications and their affinities, and it allows parallel/distributed execution.
  ◮ For example, if Employee is horizontally fragmented by the attribute Department, and different fragments are held where the corresponding department is located, computing the average salary in each department can be done in parallel.
  ◮ If, after fragmentation, a particular query/view cannot be defined over a single fragment, then extra processing will be needed.
  ◮ Also, semantic checks may be more difficult (e.g., enforcing referential integrity).
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 36 / 121

  19. Fragmentation and Allocation Data Distribution Strategies (8) Fragmentation Alternatives: Horizontal (1)
  ◮ Broadly speaking, defined by a selection.
  ◮ Reconstruction is by union.
  Example
  PROJ1 ← σ BUDGET<200000 (PROJ)
  PROJ2 ← σ BUDGET≥200000 (PROJ)
  PROJ ← PROJ1 ∪ PROJ2
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 37 / 121

  20. Fragmentation and Allocation Data Distribution Strategies (9) Fragmentation Alternatives: Horizontal (2)
  Example
  PROJ1:
  PNO  PNAME              BUDGET  LOC
  P1   Instrumentation    150000  Tokyo
  P2   Database Develop.  135000  Oslo
  PROJ2:
  PNO  PNAME              BUDGET  LOC
  P3   CAD/CAM            250000  Oslo
  P4   Maintenance        310000  Paris
  P5   CAD/CAM            500000  Paris
  PROJ:
  PNO  PNAME              BUDGET  LOC
  P1   Instrumentation    150000  Tokyo
  P2   Database Develop.  135000  Oslo
  P3   CAD/CAM            250000  Oslo
  P4   Maintenance        310000  Paris
  P5   CAD/CAM            500000  Paris
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 38 / 121
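To make the selection/union duality concrete, here is a minimal Python sketch (relations modelled as lists of dicts; the PROJ data is taken from the example above):

```python
# Minimal sketch: horizontal fragmentation by selection, reconstruction by union.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation",   "BUDGET": 150000, "LOC": "Tokyo"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "Oslo"},
    {"PNO": "P3", "PNAME": "CAD/CAM",           "BUDGET": 250000, "LOC": "Oslo"},
    {"PNO": "P4", "PNAME": "Maintenance",       "BUDGET": 310000, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM",           "BUDGET": 500000, "LOC": "Paris"},
]

def select(relation, predicate):
    """sigma: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]

PROJ1 = select(PROJ, lambda t: t["BUDGET"] < 200000)   # sigma BUDGET<200000 (PROJ)
PROJ2 = select(PROJ, lambda t: t["BUDGET"] >= 200000)  # sigma BUDGET>=200000 (PROJ)

# Reconstruction by union: every original tuple is recovered exactly once.
assert sorted(PROJ1 + PROJ2, key=lambda t: t["PNO"]) == PROJ
```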

  21. Fragmentation and Allocation Data Distribution Strategies (10) Fragmentation Alternatives: Vertical (1)
  ◮ Broadly speaking, defined by a projection.
  ◮ Reconstruction is by a natural join on the replicated key.
  Example
  PROJ1 ← π PNO,BUDGET (PROJ)
  PROJ2 ← π PNO,PNAME,LOC (PROJ)
  PROJ ← PROJ1 ⋈ PNO PROJ2
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 39 / 121

  22. Fragmentation and Allocation Data Distribution Strategies (11) Fragmentation Alternatives: Vertical (2)
  Example
  PROJ1:
  PNO  BUDGET
  P1   150000
  P2   135000
  P3   250000
  P4   310000
  P5   500000
  PROJ2:
  PNO  PNAME              LOC
  P1   Instrumentation    Tokyo
  P2   Database Develop.  Oslo
  P3   CAD/CAM            Oslo
  P4   Maintenance        Paris
  P5   CAD/CAM            Paris
  PROJ:
  PNO  PNAME              BUDGET  LOC
  P1   Instrumentation    150000  Tokyo
  P2   Database Develop.  135000  Oslo
  P3   CAD/CAM            250000  Oslo
  P4   Maintenance        310000  Paris
  P5   CAD/CAM            500000  Paris
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 40 / 121
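The vertical case admits the same treatment; a companion sketch, reusing the PROJ extent and conventions from the sketch above, with the natural join on the replicated key PNO written directly:

```python
# Minimal sketch: vertical fragmentation by projection, reconstruction by
# natural join on the replicated key PNO (PROJ as in the previous sketch).
def project(relation, attributes):
    """pi: keep only the named attributes of each tuple."""
    return [{a: t[a] for a in attributes} for t in relation]

VPROJ1 = project(PROJ, ["PNO", "BUDGET"])         # pi PNO,BUDGET (PROJ)
VPROJ2 = project(PROJ, ["PNO", "PNAME", "LOC"])   # pi PNO,PNAME,LOC (PROJ)

def join_on(r, s, key):
    """Natural join restricted to one shared key attribute."""
    return [{**tr, **ts} for tr in r for ts in s if tr[key] == ts[key]]

# Reconstruction: joining the two fragments on PNO recovers every PROJ tuple.
reconstructed = join_on(VPROJ1, VPROJ2, "PNO")
assert all(t in reconstructed for t in PROJ)
```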

  23. Fragmentation and Allocation Data Distribution Strategies (12) Correctness of Fragmentation
  Completeness: The decomposition of a relation R into fragments R1, R2, ..., Rn is complete if and only if each data item in R can also be found in some Ri.
  Reconstructibility: If a relation R is decomposed into fragments R1, R2, ..., Rn, then there should exist some relational operator ∇ such that R = ∇ i=1..n Ri.
  Disjointness: If a relation R is decomposed into fragments R1, R2, ..., Rn, and data item di is in Rj, then di should not be in any other fragment Rk (k ≠ j).
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 41 / 121
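Completeness and disjointness are mechanically checkable for a horizontal decomposition, where tuples play the role of data items (for vertical fragments the data items are attribute values, and disjointness is read modulo the replicated key). A sketch in the same style as above:

```python
# Sketch: checking completeness and disjointness of a horizontal decomposition.
def is_complete(relation, fragments):
    """Every tuple of the relation appears in at least one fragment."""
    return all(any(t in f for f in fragments) for t in relation)

def is_disjoint(fragments):
    """No tuple appears in more than one fragment."""
    seen = []
    for f in fragments:
        for t in f:
            if t in seen:
                return False
            seen.append(t)
    return True

# The horizontal PROJ1/PROJ2 decomposition above passes both checks
# (and is reconstructible by union, as already asserted).
assert is_complete(PROJ, [PROJ1, PROJ2])
assert is_disjoint([PROJ1, PROJ2])
```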

  24. Fragmentation and Allocation Data Distribution Strategies (13) Allocation Alternatives
  Non-replicated: the fragments form a proper partition, and each fragment resides at only one site.
  Replicated: the fragments overlap, either fully (i.e., each fragment exists at every site) or partially (i.e., each fragment exists at some sites only).
  ◮ An often-used rule of thumb is that if the number of proper (i.e., read-only) queries exceeds the number of updating queries, then replication tends to be advantageous in proportion; otherwise the opposite is the case.
  ◮ Especially in the client/server case, caching is also part of the design considerations.
  ◮ Web giants (e.g., Facebook, Amazon) use replication extensively.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 42 / 121

  25. Fragmentation and Allocation Data Distribution Strategies (14) Replication v. Caching: Some Contrasts
                     Replication      Caching
  target             server           client or middle-tier
  granularity        coarse           fine
  storage device     typically disk   typically main memory
  impact on catalog  yes              no
  update protocol    propagation      invalidation
  remove copy        explicit         implicit
  mechanism          separate fetch   fault in and keep copy after use
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 43 / 121

  26. Fragmentation, in More Detail Data Distribution Strategies (15) Information Requirements
  ◮ There are four kinds of information required:
  ◮ about the database
  ◮ about the applications (i.e., the queries, by and large)
  ◮ about the communication network
  ◮ about the computer system
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 44 / 121

  27. Fragmentation, in More Detail Fragmentation Kinds ◮ Horizontal Fragmentation (HF) ◮ Primary Horizontal Fragmentation (PHF) ◮ Derived Horizontal Fragmentation (DHF) ◮ Vertical Fragmentation (VF) ◮ Hybrid Fragmentation (HF) AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 45 / 121

  28. Fragmentation, in More Detail Primary Horizontal Fragmentation (1) Information Requirements: Database
  ◮ We draw a link from a relation R to a relation S if there is an equijoin on the key of R and the corresponding foreign key in S.
  ◮ We call R the owner, and S the member.
  ◮ We need the cardinalities of relations and the (average) length of their tuples.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 46 / 121

  29. Fragmentation, in More Detail Primary Horizontal Fragmentation (2) Information Requirements: Application (1)
  ◮ Given R with schema [A1, ..., An], a simple predicate pj has the form Ai θ c, where θ ∈ {=, ≠, <, >, ≤, ≥} and c ∈ Domain(Ai).
  ◮ For a relation R, we define the set of simple predicates Pr = {p1, ..., pm}.
  ◮ Given R and Pr, we define the set of minterm predicates M = {m1, ..., mr} as M = { mk | mk = ⋀ pj∈Pr pj* }, 1 ≤ j ≤ m, 1 ≤ k ≤ r, where pj* = pj or pj* = ¬pj.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 47 / 121

  30. Fragmentation, in More Detail Primary Horizontal Fragmentation (3) Information Requirements: Application (2)
  Example
  Some (but not all) simple predicates on PROJ are:
  p1: LOC = 'Tokyo'
  p2: LOC = 'Oslo'
  p3: LOC = 'Paris'
  p4: BUDGET ≤ 200000
  Some (but not all) minterm predicates on PROJ are:
  m1: LOC = 'Tokyo' ∧ BUDGET ≤ 200000
  m2: ¬(LOC = 'Tokyo') ∧ BUDGET ≤ 200000
  m3: LOC = 'Tokyo' ∧ ¬(BUDGET ≤ 200000)
  m4: ¬(LOC = 'Tokyo') ∧ ¬(BUDGET ≤ 200000)
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 48 / 121
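Minterm enumeration is just the expansion of all 2^m sign combinations of the simple predicates; a small sketch of that, with each simple predicate assumed to be a (description, function) pair over tuples:

```python
from itertools import product

# Sketch: enumerate the minterm predicates induced by simple predicates.
simple = [
    ("LOC = 'Tokyo'",    lambda t: t["LOC"] == "Tokyo"),
    ("BUDGET <= 200000", lambda t: t["BUDGET"] <= 200000),
]

def minterms(predicates):
    """Yield (description, test) for every conjunction of signed predicates."""
    for signs in product([True, False], repeat=len(predicates)):
        desc = " AND ".join(name if s else "NOT(" + name + ")"
                            for s, (name, _) in zip(signs, predicates))
        def test(t, ss=signs, ps=predicates):
            # bind signs/predicates now, so each closure tests its own minterm
            return all(p(t) == s for s, (_, p) in zip(ss, ps))
        yield desc, test

for desc, _ in minterms(simple):
    print(desc)   # the four minterms m1..m4 of the example above
```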

  31. Fragmentation, in More Detail Primary Horizontal Fragmentation (4) Information Requirements: Application (3)
  ◮ We also need quantitative information about the application:
  ◮ The selectivity of a minterm mi, denoted by sel(mi), is the number of tuples in the corresponding relation R that would be produced by σ mi (R).
  ◮ The access frequency of an application qi, denoted by acc(qi), is the number of times in which qi accesses data in a given period.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 49 / 121

  32. Fragmentation, in More Detail Primary Horizontal Fragmentation (5) Definition
  ◮ A (primary) horizontal fragment Rj of a relation R is defined as Rj ← σ mi (R), where mi is a minterm predicate on R.
  ◮ Given a set of minterm predicates M = {m1, ..., mr} over R, one can define r horizontal fragments in R.
  ◮ [Özsu and Valduriez, 1999] give an algorithm that, given a relation R and a set of simple predicates on R, produces a correct set of fragments from R.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 50 / 121

  33. Fragmentation, in More Detail Primary Horizontal Fragmentation (6) Example (1)
  Example (Information Required)
  ◮ Let the relations PAY and PROJ be candidates for PHF.
  ◮ Let the following be the applications involved:
  ◮ A1: Find the name and budget of projects given their project number.
  ◮ A2: Find projects according to their budget.
  ◮ Let A1 be issued at three sites.
  ◮ Let one site access A2 for budgets below 200000, and the other two access A2 for those above.
  ◮ Let the following be the simple predicates:
  p1: LOC = 'Tokyo'
  p2: LOC = 'Oslo'
  p3: LOC = 'Paris'
  p4: BUDGET ≤ 200000
  p5: BUDGET > 200000
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 51 / 121

  34. Fragmentation, in More Detail Primary Horizontal Fragmentation (7) Example (2)
  Example (Output)
  ◮ Applying the algorithm alluded to, the following minterm predicates result:
  m1: LOC = 'Tokyo' ∧ BUDGET ≤ 200000
  m2: LOC = 'Tokyo' ∧ BUDGET > 200000
  m3: LOC = 'Oslo' ∧ BUDGET ≤ 200000
  m4: LOC = 'Oslo' ∧ BUDGET > 200000
  m5: LOC = 'Paris' ∧ BUDGET ≤ 200000
  m6: LOC = 'Paris' ∧ BUDGET > 200000
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 52 / 121

  35. Fragmentation, in More Detail Primary Horizontal Fragmentation (8) Example (3)
  Example (Fragments Obtained)
  PROJ1:
  PNO  PNAME            BUDGET  LOC
  P1   Instrumentation  150000  Tokyo
  PROJ3:
  PNO  PNAME              BUDGET  LOC
  P2   Database Develop.  135000  Oslo
  PROJ4:
  PNO  PNAME    BUDGET  LOC
  P3   CAD/CAM  250000  Oslo
  PROJ6:
  PNO  PNAME        BUDGET  LOC
  P4   Maintenance  310000  Paris
  P5   CAD/CAM      500000  Paris
  (The minterms m2 and m5 yield empty fragments PROJ2 and PROJ5, which are therefore not shown.)
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 53 / 121

  36. Fragmentation, in More Detail Derived Horizontal Fragmentation (1) Definition
  ◮ A derived horizontal fragment is defined on a member relation according to a selection operation on its owner.
  ◮ Recall that a link from owner to member is defined in terms of an equijoin.
  ◮ A semijoin between R and S is defined as follows: R ⋉ S ≡ π A (R ⋈ S), where A is the list of attributes in the schema of R.
  ◮ Given a link L, where owner(L) = S and member(L) = R, the derived horizontal fragments of R are defined as Ri = R ⋉ Si, 1 ≤ i ≤ w, where w is the maximum number of fragments to be generated and Si = σ mi (S) is the primary horizontal fragment defined by the minterm predicate mi.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 54 / 121

  37. Fragmentation, in More Detail Derived Horizontal Fragmentation (2) Example (1)
  Example (Information Required, Fragments Defined)
  ◮ Let there be a link L1 with owner(L1) = PAY and member(L1) = EMP.
  ◮ Let PAY1 ← σ SAL≤30000 (PAY) and PAY2 ← σ SAL>30000 (PAY).
  ◮ Then two DHFs are defined:
  ◮ EMP1 ← EMP ⋉ PAY1
  ◮ EMP2 ← EMP ⋉ PAY2
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 55 / 121

  38. Fragmentation, in More Detail Derived Horizontal Fragmentation (3) Example (2)
  Example (Fragments Obtained)
  EMP1:
  ENO  ENAME      TITLE
  E3   A. Lee     Mech. Eng.
  E4   J. Miller  Programmer
  E7   R. Davis   Mech. Eng.
  EMP2:
  ENO  ENAME     TITLE
  E1   J. Doe    Elect. Eng.
  E2   M. Smith  Syst. Anal.
  E5   B. Casey  Syst. Anal.
  E6   L. Chu    Elect. Eng.
  E8   J. Jones  Syst. Anal.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 56 / 121
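To see the semijoin at work, here is a sketch that reproduces the two fragments above; the join attribute (TITLE) and the salary figures are assumptions chosen for illustration:

```python
# Sketch: derived horizontal fragments EMPi = EMP semijoin PAYi on TITLE.
# The join attribute and salary figures are assumed, picked so that the
# result matches the fragments shown above.
PAY = [
    {"TITLE": "Elect. Eng.", "SAL": 40000},
    {"TITLE": "Syst. Anal.", "SAL": 34000},
    {"TITLE": "Mech. Eng.",  "SAL": 27000},
    {"TITLE": "Programmer",  "SAL": 24000},
]
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe",    "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith",  "TITLE": "Syst. Anal."},
    {"ENO": "E3", "ENAME": "A. Lee",    "TITLE": "Mech. Eng."},
    {"ENO": "E4", "ENAME": "J. Miller", "TITLE": "Programmer"},
    {"ENO": "E5", "ENAME": "B. Casey",  "TITLE": "Syst. Anal."},
    {"ENO": "E6", "ENAME": "L. Chu",    "TITLE": "Elect. Eng."},
    {"ENO": "E7", "ENAME": "R. Davis",  "TITLE": "Mech. Eng."},
    {"ENO": "E8", "ENAME": "J. Jones",  "TITLE": "Syst. Anal."},
]

def semijoin(r, s, attr):
    """R semijoin S: the tuples of r that have a join partner in s on attr."""
    keys = {t[attr] for t in s}
    return [t for t in r if t[attr] in keys]

PAY1 = [t for t in PAY if t["SAL"] <= 30000]
PAY2 = [t for t in PAY if t["SAL"] > 30000]
EMP1 = semijoin(EMP, PAY1, "TITLE")   # E3, E4, E7
EMP2 = semijoin(EMP, PAY2, "TITLE")   # E1, E2, E5, E6, E8
```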

  39. Fragmentation, in More Detail Vertical Fragmentation (1)
  ◮ Vertical fragmentation has also been studied in the centralized context, since it is important for:
  ◮ normalization of designs
  ◮ physical clustering
  ◮ In terms of physical clustering, there is excitement in the DBMS industry (at the time of writing) about an extreme form of vertical partitioning in which single columns are stored separately.
  ◮ Certain access patterns are made easier by this, and compression ratios an order of magnitude higher can be obtained, which is important when dealing with the massive volumes of data that are typical of analytics workloads.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 57 / 121

  40. Fragmentation, in More Detail Vertical Fragmentation (2)
  ◮ Vertical fragmentation is more difficult than horizontal fragmentation, because more alternatives exist.
  ◮ Heuristic approaches that can be used are:
  grouping: one adds attributes to fragments one by one.
  splitting: one breaks down a relation into fragments based on access patterns.
  ◮ See [Özsu and Valduriez, 1999] for an example (or else recall, from your earlier database studies, the theory of normal forms and how it is justifiably disobeyed).
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 58 / 121

  41. Fragmentation, in More Detail Summary Data Distribution Strategies ◮ Fragmentation, allocation, replication and caching are all mechanisms that DDBMSs make use of to respond to the affinity of locality that data exhibits, particularly in decentralized organizations. ◮ The design decisions required are well-studied and well-founded solutions are available but require a great deal of information. ◮ The benefits can be significant particularly for response time because of the greater degree of natural parallelism that becomes possible. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 59 / 121

  42. Fragmentation, in More Detail Advanced Database Management Systems Distributed Query Processing Alvaro A A Fernandes School of Computer Science, University of Manchester AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 60 / 121

  43. Outline The Distributed Query Processing Problem Two-Phase Distributed Query Optimization Localization and Reduction Cost-Related Issues Join Ordering in DQP AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 61 / 121

  44. The Distributed Query Processing Problem Distributed Query Processing (1) What is the Problem? (1) ◮ Assume the fragments EMP i and ASG j to be stored in the sites shown in the figure. ◮ Assume the double-shafted arrows to denote the transfer of data between sites. ◮ Strategy 1 can be said to aim to do processing locally in order to reduce the amount of data that needs to be shipped to the result site, i.e., Site 5. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 62 / 121

  45. The Distributed Query Processing Problem Distributed Query Processing (2) What is the Problem? (2) ◮ Strategy 2 can be said to aim to ship all the data to, and do all the processing at, the site where results need to be delivered AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 63 / 121

  46. The Distributed Query Processing Problem Distributed Query Processing (3) Cost of Alternatives (1)
  Example (Assumptions)
  ◮ t_a (tuple access cost) = 1 unit
  ◮ t_t (tuple transfer cost) = 10 units
  ◮ |ASG| = 100, length(ASG) = 10, |EMP| = 80, length(EMP) = 5
  ◮ |ASG1| = |σ ENO≤'E3' (ASG)| = 50
  ◮ |EMP1| = |σ ENO≤'E3' (EMP)| = 40
  ◮ V(ASG, RESP) = 5
  ◮ length(ENO) = 2
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 64 / 121

  47. The Distributed Query Processing Problem Distributed Query Processing (4) Cost of Alternatives (2)
  Example (Consequences)
  ◮ size(ASG) = 100 × 10 = 1,000, size(EMP) = 80 × 5 = 400
  ◮ |ASG2| = |ASG| − |ASG1| = 100 − 50 = 50
  ◮ size(ASG1) = size(ASG2) = |ASG1| × 10 = 50 × 10 = 500
  ◮ |EMP2| = |EMP| − |EMP1| = 80 − 40 = 40
  ◮ size(EMP1) = size(EMP2) = |EMP1| × 5 = 40 × 5 = 200
  ◮ |ASG'1| = |ASG'2| = |σ RESP='manager' (ASGi)| = |ASGi| / V(ASG, RESP) = 50 / 5 = 10
  ◮ |ASG'| = |ASG'1| + |ASG'2| = 10 + 10 = 20
  ◮ length(EMPi ⋈ ENO ASG'i) = length(EMP) + length(ASG) − length(ENO) = 10 + 5 − 2 = 13
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 65 / 121

  48. The Distributed Query Processing Problem Distributed Query Processing (5) Cost of Alternatives (3)
  Example (Comparison (1))
  Action                         Cost Formula                                              Cost
  produce ASG'i                  2 × |ASGi| × t_a = 2 × 50 × 1                              100
  transfer ASG'i to sites 3, 4   2 × size(ASG'i) × t_t = 2 × (10 × 10) × 10               2,000
  produce EMP'i                  2 × |EMPi| × |ASG'i| × t_a = 2 × 40 × 10 × 1               800
  transfer EMP'i to site 5       2 × size(EMPi ⋈ ENO ASG'i) × t_t = 2 × (10 × 13) × 10    2,600
  Total Cost of Strategy 1                                                                5,500
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 66 / 121

  49. The Distributed Query Processing Problem Distributed Query Processing (6) Cost of Alternatives (4)
  Example (Comparison (2))
  Action                   Cost Formula                              Cost
  transfer EMP to site 5   size(EMP) × t_t = 400 × 10                4,000
  transfer ASG to site 5   size(ASG) × t_t = 1,000 × 10             10,000
  produce ASG'             |ASG| × t_a = 100 × 1                       100
  join EMP and ASG'        |EMP| × |ASG'| × t_a = 80 × 20 × 1        1,600
  Total Cost of Strategy 2                                          15,700
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 67 / 121
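Both totals can be reproduced mechanically. A sketch of the arithmetic, with the example's assumptions wired in as named constants:

```python
# Sketch: reproducing the cost comparison, using the example's assumptions.
t_a, t_t = 1, 10                  # tuple access / transfer cost (units)
ASG_len, EMP_len, ENO_len = 10, 5, 2
ASG_i, EMP_i = 50, 40             # |ASGi|, |EMPi| per fragment
ASGp_i = ASG_i // 5               # |ASG'i| = |ASGi| / V(ASG, RESP) = 10
join_len = ASG_len + EMP_len - ENO_len    # length of a joined tuple = 13

strategy1 = (2 * ASG_i * t_a                      # produce ASG'i           100
             + 2 * ASGp_i * ASG_len * t_t         # ship ASG'i to 3, 4    2,000
             + 2 * EMP_i * ASGp_i * t_a           # produce EMP'i           800
             + 2 * ASGp_i * join_len * t_t)       # ship results to 5     2,600

ASG_card, EMP_card = 100, 80
strategy2 = (EMP_card * EMP_len * t_t             # ship EMP to 5         4,000
             + ASG_card * ASG_len * t_t           # ship ASG to 5        10,000
             + ASG_card * t_a                     # produce ASG'            100
             + EMP_card * 20 * t_a)               # join EMP with ASG'    1,600

assert (strategy1, strategy2) == (5500, 15700)
```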

  50. The Distributed Query Processing Problem Distributed Query Processing (7) Query Optimization Objectives ◮ Minimize a cost function such as total time or response time. ◮ All components may have different weights in different distributed environments. ◮ One could have different goals, e.g., maximize throughput. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 68 / 121

  51. The Distributed Query Processing Problem Distributed Query Processing (8) Where Can Decisions Be Made?
  Centralized:
  ◮ A single site determines the schedule.
  ◮ This is simpler, but requires knowledge about the entire distributed database.
  Distributed:
  ◮ There is co-operation among sites to determine the schedule.
  ◮ This only requires sharing local information, but co-operation has a cost.
  Hybrid:
  ◮ One site determines the global schedule.
  ◮ Each site optimizes the local subqueries.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 69 / 121

  52. The Distributed Query Processing Problem Distributed Query Processing (9) Issues Regarding the Network
  Wide-Area Network (WAN):
  ◮ WANs have comparatively low bandwidth, low speed and high protocol overhead.
  ◮ As a result, communication cost will dominate, to the extent that it may be possible to ignore all other costs.
  ◮ Thus, the global schedule will aim to minimize communication cost.
  ◮ Local schedules are decided according to centralized query optimization decisions.
  Local-Area Network (LAN):
  ◮ Communication cost is not as dominant as in WANs.
  ◮ Thus, all components in the total cost function must be considered.
  ◮ Broadcasting is an option.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 70 / 121

  53. Two-Phase Distributed Query Optimization Distributed Query Optimization (1) Two-Phase Approach ◮ One way to implement distributed query optimization as a continuum with the centralized case is to structure the decision-making stages in such a way that the optimizer breaks the overall task into two phases. ◮ In the first phase, a single-node QEP is produced (that would run if the DBMS were not a distributed DBMS); in the second phase, this single-node QEP is transformed into a multi-node one. ◮ The second phase partitions a QEP into fragments linked by exchange operators, then schedules each fragment to execute in different component nodes. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 71 / 121

  54. Localization and Reduction Distributed Query Optimization (2) Localizing a Global Query
  ◮ Given an algebraic query on global relations:
  ◮ determine which are distributed;
  ◮ for those, determine which fragments are involved;
  ◮ replace references to global relations with the reconstruction expression (which is referred to as a localization program).
  ◮ Each leaf that refers to a distributed relation is replaced by that relation's localization program over its fragments.
  ◮ The result is sometimes referred to as a generic query and is likely to benefit from optimization by reduction.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 72 / 121

  55. Localization and Reduction Distributed Query Optimization (3) Some Examples (1)
  ◮ Assume EMP is horizontally fragmented into EMP1, EMP2 and EMP3 as follows:
  1. EMP1 ← σ ENO≤'E3' (EMP)
  2. EMP2 ← σ 'E3'<ENO≤'E6' (EMP)
  3. EMP3 ← σ ENO>'E6' (EMP)
  ◮ Assume ASG is horizontally fragmented into ASG1 and ASG2 as follows:
  1. ASG1 ← σ ENO≤'E3' (ASG)
  2. ASG2 ← σ ENO>'E3' (ASG)
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 73 / 121

  56. Localization and Reduction Distributed Query Optimization (4) Some Examples (2)
  ◮ Assume the following query:
  SELECT E.ENAME FROM EMP E WHERE E.ENO = 'E5'
  ◮ The figure shows the corresponding generic query with the leaf replaced by its localization program.
  ◮ Then, the figure shows the query after optimization by reduction; in this case because it follows from the predicates that defined the fragments that only EMP2 can contribute to the specified results.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 74 / 121
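Reduction with selection can be seen as a consistency test: a fragment is kept only if its defining predicate does not contradict the query predicate. A sketch for the ENO range fragments defined above (the interval encoding is an assumption for illustration):

```python
# Sketch: reduction for primary horizontal fragments of EMP on ENO.
# A fragment is kept only if its defining range can contain the query's ENO.
fragments = {                # fragment name -> (low, high] bounds on ENO
    "EMP1": (None, "E3"),    # ENO <= 'E3'
    "EMP2": ("E3", "E6"),    # 'E3' < ENO <= 'E6'
    "EMP3": ("E6", None),    # ENO > 'E6'
}

def relevant(eno):
    """Fragments whose defining predicate does not contradict ENO = eno."""
    keep = []
    for name, (lo, hi) in fragments.items():
        if (lo is None or eno > lo) and (hi is None or eno <= hi):
            keep.append(name)
    return keep

print(relevant("E5"))   # ['EMP2']: only EMP2 can contribute to the result
```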

  57. Localization and Reduction Distributed Query Optimization (5) Some Examples (3) ◮ Assume the following query: SELECT E.ENAME FROM EMP E, ASG A WHERE E.ENO = A.ENO ◮ The figure shows the corresponding generic query with the leaf replaced by its localization program. ◮ We next show the query after reduction. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 75 / 121

  58. Localization and Reduction Distributed Query Optimization (6) Some Examples (4) ◮ The figure shows the reduced join query. ◮ Note that the optimizer has used the commutativity between join and union to push the joins upstream and reduce the amount of work. ◮ This also helps in scheduling the joins to execute in parallel. ◮ Note, finally, that the optimizer has made use of the fact that EMP 3 and ASG 1 do not share tuples (because their predicates lead to a contradiction, and hence would return an empty set) and eliminated the need to join them. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 76 / 121

  59. Localization and Reduction Distributed Query Optimization (7) Some Examples (5)
  ◮ Assume EMP is vertically fragmented into EMP1 and EMP2 as follows:
  1. EMP1 ← π ENO,ENAME (EMP)
  2. EMP2 ← π ENO,TITLE (EMP)
  ◮ Assume the following query:
  SELECT E.ENAME FROM EMP E
  ◮ The figure shows the corresponding generic query with the leaf replaced by its localization program.
  ◮ Then, the figure shows the query after optimization by reduction; in this case because it follows from the projection lists that defined the fragments that only EMP1 can contribute to the specified results.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 77 / 121

  60. Localization and Reduction Distributed Query Optimization (8) A Detailed Example Derivation (1)
  ◮ Assume PROJ is horizontally fragmented into PROJ1, PROJ2 and PROJ3 as follows:
  1. PROJ1 ← σ LOC='Tokyo' (PROJ)
  2. PROJ2 ← σ LOC='Oslo' (PROJ)
  3. PROJ3 ← σ LOC='Paris' (PROJ)
  ◮ Assume the following query:
  SELECT AVG(P.BUDGET) FROM PROJ P WHERE P.LOC = 'Oslo'
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 78 / 121

  61. Localization and Reduction Distributed Query Optimization (9) A Detailed Example Derivation (2)
  (translate) γ AVG(BUDGET) (σ LOC='Oslo' (PROJ))
  (localize) γ AVG(BUDGET) (σ LOC='Oslo' (PROJ1 ∪ PROJ2 ∪ PROJ3))
  (expand) γ AVG(BUDGET) (σ LOC='Oslo' (σ LOC='Tokyo' (PROJ) ∪ σ LOC='Oslo' (PROJ) ∪ σ LOC='Paris' (PROJ)))
  (combine) γ AVG(BUDGET) (σ LOC='Oslo' ∧ LOC='Tokyo' (PROJ) ∪ σ LOC='Oslo' ∧ LOC='Oslo' (PROJ) ∪ σ LOC='Oslo' ∧ LOC='Paris' (PROJ))
  (simplify) γ AVG(BUDGET) (σ ⊥ (PROJ) ∪ σ LOC='Oslo' (PROJ) ∪ σ ⊥ (PROJ))
  (simplify) γ AVG(BUDGET) (∅ ∪ σ LOC='Oslo' (PROJ) ∪ ∅)
  (simplify) γ AVG(BUDGET) (σ LOC='Oslo' (PROJ))
  (simplify) γ AVG(BUDGET) (PROJ2)
  This derivation shows that the query can be executed only over the Oslo horizontal fragment PROJ2, wherever it is stored.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 79 / 121

  62. Cost-Related Issues Distributed Query Optimization (10) Scheduling Query Fragments
  ◮ Given a fragment query, find the best global schedule by minimizing a cost function.
  ◮ Join processing in centralized DBMSs tends to prefer linear (e.g., left-deep) trees because the size of the search space is reduced by the linearity constraint.
  ◮ However, in distributed DBMSs, join processing over bushy trees reveals opportunities for parallelism.
  ◮ Other decisions include:
  ◮ Which relation to ship where?
  ◮ Whether to ship the whole relation or to ship tuples as needed?
  ◮ Whether to use semijoins? (Semijoins save on communication at the expense of more local processing.)
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 80 / 121

  63. Cost-Related Issues Distributed Query Optimization (11) Cost Functions
  Total Time (also referred to as Total Cost): The overall strategy in this case is to
  ◮ reduce the cost (i.e., time) of each component individually
  ◮ do as little of each cost component as possible
  This optimizes the utilization of the resources and tends to increase system throughput.
  Response Time: The overall strategy in this case is to do as many things as possible in parallel. However, this may increase the total time because of overall increased activity.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 81 / 121

  64. Cost-Related Issues Distributed Query Optimization (12) Total Cost
  ◮ The total cost is the summation of all cost factors:
  1. Total cost = CPU cost + I/O cost + communication cost
  2. CPU cost = unit instruction cost × no. of instructions
  3. I/O cost = unit disk I/O cost × no. of disk I/Os
  4. communication cost = (unit message initiation cost × no. of messages) + (unit transmission cost × no. of bytes)
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 82 / 121

  65. Cost-Related Issues Distributed Query Optimization (13) Response Time
  ◮ The response time is the elapsed time between the initiation and the completion of a query.
  ◮ Processing and communication costs that are incurred in sequence in a component count at most once.
  ◮ If several sequential tasks are executed in parallel, the cost that is counted is the maximum cost of all those tasks.
  1. Response time = CPU time + I/O time + communication time
  2. CPU time = unit instruction time × no. of sequential instructions
  3. I/O time = unit I/O time × no. of sequential I/Os
  4. communication time = (unit message initiation time × no. of sequential messages) + (unit transmission time × no. of sequential bytes)
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 83 / 121

  66. Cost-Related Issues Distributed Query Optimization (14) Some Cost Factors
  Wide-area networks:
  ◮ Message initiation and transmission costs are relatively high.
  ◮ Local processing cost is comparatively low (fast mainframes or minicomputers).
  ◮ Ratio of communication to I/O costs is high (2 digits to 1 digit?).
  Local-area networks:
  ◮ Communication and local processing costs are comparable.
  ◮ Ratio of communication to I/O costs is not high (close to 1:1?).
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 84 / 121

  67. Cost-Related Issues Distributed Query Optimization (15) Example: Total Cost v. Response Time
  ◮ Assume that:
  ◮ only the communication cost is considered
  ◮ one message conveys one unit of work (e.g., a tuple)
  ◮ Let UM denote the unit message initialization time and UT the unit transmission time. Let Tsend(r, s, t) denote the time to send r from s to t.
  ◮ Total time = (n + m) UM + (np + mq) UT
  ◮ Response time = max { Tsend(n, 1, 3), Tsend(m, 2, 3) }
  ◮ Tsend(n, 1, 3) = n UM + np UT
  ◮ Tsend(m, 2, 3) = m UM + mq UT
  ◮ If n = 900, m = 1,000, p = 90, and q = 100, then
  ◮ Total time = 1,900 UM + 181,000 UT
  ◮ Response time = 1,000 UM + 100,000 UT
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 85 / 121
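The two figures fall out of a direct transcription of the formulas; a sketch in which times are kept symbolic as (UM, UT) coefficient pairs, taking the max on the transmission term (assuming it dominates):

```python
# Sketch: total time v. response time, costs kept as (UM, UT) coefficients.
n, m, p, q = 900, 1000, 90, 100

def t_send(units, per_unit):
    """Time to send `units` messages carrying `per_unit` transmission units."""
    return (units, units * per_unit)          # (UM coefficient, UT coefficient)

total = (n + m, n * p + m * q)                # (1,900; 181,000)
response = max(t_send(n, p), t_send(m, q),    # max over the two parallel sends
               key=lambda c: c[1])            # assume the UT term dominates

assert total == (1900, 181000) and response == (1000, 100000)
```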

  68. Join Ordering in DQP Distributed Query Optimization (16) Join Ordering in Fragment Queries
  ◮ Given an n-ary relation R with attributes A1, ..., An, let |R| denote the cardinality of R, and let length(Ai) denote the (possibly average) length in bytes of a value from the domain of Ai, in which case the (possibly average) length of a tuple in R is length(R) = Σ i=1..n length(Ai).
  ◮ Let size(R) = |R| × length(R).
  ◮ Given two relations R and S that are not co-located, we ship R to the site of S if size(R) ≤ size(S), and we ship S to the site of R if size(S) < size(R).
  ◮ For many relations, there may be too many alternatives.
  ◮ Also, computing the cost of all alternatives and selecting the best one depends on computing the size of intermediate relations, which is difficult.
  ◮ In practice, heuristics are needed.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 86 / 121
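The ship-the-smaller-operand rule is a one-liner; a sketch, including the size estimate it relies on:

```python
# Sketch: ship the smaller operand of a non-co-located join.
def size(card, tuple_len):
    """size(R) = |R| x length(R), in bytes."""
    return card * tuple_len

def ship_direction(r_card, r_len, s_card, s_len):
    """Return which relation to ship: 'R to S' if size(R) <= size(S)."""
    return "R to S" if size(r_card, r_len) <= size(s_card, s_len) else "S to R"

print(ship_direction(80, 5, 100, 10))   # EMP (400) v. ASG (1,000): 'R to S'
```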

  69. Join Ordering in DQP Distributed Query Optimization (17) Join Ordering: An Example (1)
  ◮ Consider the 2-way join PROJ ⋈ PNO (ASG ⋈ ENO EMP).
  ◮ The join graph shows the sites where each relation is, and there is an edge between two relations if an equijoin on the edge label is required.
  ◮ The many different execution alternatives are shown next, with a double-shafted arrow denoting the shipment of the relation on the left to the site on the right, and the '@' sign denoting that the left-hand side expression is evaluated at the site on the right-hand side.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 87 / 121

  70. Join Ordering in DQP Distributed Query Optimization (18) Join Ordering: An Example (2)
  1. 1.1 EMP ⇒ 2
     1.2 EMP' ← EMP ⋈ ASG @ 2
     1.3 EMP' ⇒ 3
     1.4 EMP' ⋈ PROJ @ 3
  2. 2.1 ASG ⇒ 1
     2.2 EMP' ← EMP ⋈ ASG @ 1
     2.3 EMP' ⇒ 3
     2.4 EMP' ⋈ PROJ @ 3
  3. 3.1 ASG ⇒ 3
     3.2 ASG' ← ASG ⋈ PROJ @ 3
     3.3 ASG' ⇒ 1
     3.4 ASG' ⋈ EMP @ 1
  4. 4.1 PROJ ⇒ 2
     4.2 PROJ' ← PROJ ⋈ ASG @ 2
     4.3 PROJ' ⇒ 1
     4.4 PROJ' ⋈ EMP @ 1
  5. 5.1 EMP ⇒ 2
     5.2 PROJ ⇒ 2
     5.3 PROJ ⋈ (ASG ⋈ EMP) @ 2
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 88 / 121

  71. Join Ordering in DQP Distributed Query Optimization (19) Join Ordering: An Example (3)
  1. An alternative to enumerating all possibilities is to use the heuristic of considering only the sizes of the operands and assuming that the cardinality of the join is the product of the input cardinalities.
  2. In this case, relations are ordered by increasing size, and the joins are executed in that order.
  3. For example, the order (EMP, ASG, PROJ) could use Strategy 1, and the order (PROJ, ASG, EMP) could use Strategy 4.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 89 / 121

  72. Join Ordering in DQP Distributed Query Optimization (20) Approaches Based on Semijoins (1)
  ◮ Consider the join of two relations R[A] (located at site 1) and S[A] (located at site 2).
  ◮ One could evaluate R ⋈ A S.
  ◮ Alternatively, one could evaluate one of the equivalent semijoin programs:
  R ⋈ A S ⇔ (R ⋉ A S) ⋈ A S
          ⇔ R ⋈ A (S ⋉ A R)
          ⇔ (R ⋉ A S) ⋈ A (S ⋉ A R)
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 90 / 121

  73. Join Ordering in DQP Distributed Query Optimization (21) Approaches Based on Semijoins (2)
  1. Using a join:
     1.1 R ⇒ 2
     1.2 R ⋈ A S @ 2
  2. Using a semijoin:
     2.1 S' ← π A (S)
     2.2 S' ⇒ 1
     2.3 R' ← R ⋉ A S' @ 1
     2.4 R' ⇒ 2
     2.5 R' ⋈ A S @ 2
  The semijoin is better if size(π A (S)) + size(R ⋉ A S) < size(R).
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 91 / 121
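The profitability condition compares shipped bytes only; a sketch with hypothetical sizes:

```python
# Sketch: is the semijoin program cheaper than shipping R wholesale?
# All quantities are sizes in bytes; the values below are hypothetical.
def semijoin_pays_off(size_proj_a_of_s, size_r_semijoin_s, size_r):
    """size(pi_A(S)) + size(R semijoin S) < size(R)"""
    return size_proj_a_of_s + size_r_semijoin_s < size_r

# e.g., a 2-byte join key projected from 100 S-tuples (200 bytes), a small
# matching subset of a 1,000-byte R (100 bytes): 200 + 100 < 1,000, so yes.
print(semijoin_pays_off(200, 100, 1000))   # True
```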

  74. Join Ordering in DQP Summary Distributed Query Processing ◮ There is an evolutionary continuum from centralized to distributed query optimization. ◮ Localization and reduction are the main techniques by which a heuristically-efficient distributed QEP can be arrived at. ◮ In wide-area distributed query processing (DQP), communication costs tend to dominate, although in local-area networks this is not the case. ◮ The join ordering problem remains, here too, an important one. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 92 / 121

  75. Join Ordering in DQP Advanced Database Management Systems Data Integration Strategies Alvaro A A Fernandes School of Computer Science, University of Manchester AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 93 / 121

  76. Outline Data Integration: Problem Definition Process Alternatives View-Based Data Integration Schema Matching, Mapping and Integration Dataspaces AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 94 / 121

  77. Data Integration: Problem Definition Data Integration (1) Problem Definition ◮ Data(base) integration is the process as a result of which a set of component DBMSs are conceptually integrated to form a multi-DBMS, i.e., a DDBMS that offers a single, logically coherent schema to users and applications. ◮ Equivalently, given existing databases with their Local Conceptual Schemas (LCSs), data integration is the process by which they are integrated into a Global Conceptual Schema (GCS). ◮ A GCS is also called a mediated schema, or, more simply, a global schema. AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 95 / 121

  78. Data Integration: Problem Definition Data Integration (2) Some Assumptions, Some Issues
  ◮ In general, the problem only arises if the component DBMSs already exist, so data integration is typically a bottom-up process.
  ◮ In some respects, it can be conceived of as the reverse of the data distribution (i.e., fragmentation and allocation) problem.
  ◮ One of the most important concerns in data integration is the level of heterogeneity of the component DBMSs.
  ◮ This, in turn, is strongly linked to the degree of autonomy that each component DBMS enjoys and exercises.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 96 / 121

  79. Process Alternatives Data Integration (3) Some Alternatives
  Physical Integration: in this case, the source databases are integrated and the outcome is materialized. It is the more common practice in data warehousing.
  Logical Integration: in this case, the global schema that emerges from integrating the sources remains virtual. It is the more common practice when the component DBMSs enjoy autonomy (e.g., in scientific contexts, where different research groups maintain different data resources but still allow them to be part of a multi-DBMS of interest to them and to others).
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 97 / 121

  80. Process Alternatives Data Integration (4) A Bottom-Up Process
  ◮ The most widely-used approach involves:
  ◮ translation
  ◮ Each LCS abstracts over a data source.
  ◮ A translator maps to and from concepts in the LCS and concepts in an intermediate schema (IS).
  ◮ integration
  ◮ The ISs are cast in an interlingua, a canonical formalism in which the LCSs of the participating sources can be expressed.
  ◮ The integrator uses the ISs to project out the GCS to users and applications.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 98 / 121

  81. Process Alternatives Data Integration (5) Dealing with Autonomy and Heterogeneity ◮ In contexts where heterogeneity is the norm (e.g., when the multi-DBMS is formed from public resources) the translators are often referred to as wrappers and the integrator is referred to as the mediator . ◮ Wrappers can reconcile different kinds of heterogeneity, e.g.: infrastructural including those stemming from the system software or network level syntactic including those relating to data model and query languages (e.g., generating a relational view of a spreadsheet) semantic which are the hardest to capture and maintain in sync AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 99 / 121

  82. View-Based Data Integration Data Integration (6) Schemas as Views
  ◮ There are two major possibilities to relate a GCS and its LCSs by means of views:
  Global-As-View (GAV): in this case, the LCSs are the extents over which one writes a set of views that, together, comprise the GCS against which global queries are formulated.
  Local-As-View (LAV): in this case, the GCS is assumed to exist and each LCS is treated as if it were a view over this postulated GCS.
  ◮ We will focus on GAV, and a simple example of how it works is coming soon.
  AAAF (School of CS, Manchester) Advanced DBMSs 2011-2012 100 / 121
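As a preview of how GAV works in executable form: a minimal sketch in which the GCS relation is a view (here, a function) over two hypothetical LCS extents, whose names and schemas are invented for illustration; global queries mention only the view:

```python
# Minimal GAV sketch: the global relation is a view over two hypothetical
# local sources (names and schemas invented for illustration).
lcs_site1 = [{"pno": "P1", "city": "Tokyo"}]
lcs_site2 = [{"proj": "P2", "location": "Oslo"}]

def gcs_project():
    """Global-As-View: PROJECT(pno, loc) written over both LCS extents."""
    view = [{"pno": t["pno"], "loc": t["city"]} for t in lcs_site1]
    view += [{"pno": t["proj"], "loc": t["location"]} for t in lcs_site2]
    return view

# A global query is formulated against the view, never against the sources.
oslo = [t for t in gcs_project() if t["loc"] == "Oslo"]
print(oslo)   # [{'pno': 'P2', 'loc': 'Oslo'}]
```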
