An Adaptive Query Execution Engine for Data Integration Zachary Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld University of Washington Presented by Peng Li@CS.UBC 1
Outline • The Background of Data Integration Systems • Tukwila Architecture • Interleaving of planning and execution • Adaptive Query Operators: Collector & Double Pipelined Join • Performance 2
Background (Data Integration Systems) Data Integration System Multiple autonomous (can’t affect behavior of sources) heterogeneous (different models and schemas) data sources 3
The key goal of DISs: • “Free users from having to locate the sources relevant to their query, interact with each source independently, and manually combine the data from the different sources” 4
The main challenges of the design of DISs: • Query Reformulation • The construction of wrapper programs • Query optimizers and efficient query execution engines 5
Motivations for an adaptive engine: • Little information for cost estimates • Unpredictable data transfer rates • Unreliable, overlapping sources • Want initial results quickly • Network bandwidth generally constrains the data sources to be smaller than in traditional db applications. 6
Tukwila Architecture 1. a semantic description of the contents of the data sources; 2. overlap information about pairs of data sources ; 3. key statistics about the data, such as the cost of accessing each source and so on 7
Novel Features of Tukwila • Interleaving of planning and execution – Compensates for lack of information • Handle event-condition-action rules – When and how to modify the implementation of certain operators at runtime if needed. – Detect opportunities for re-optimization. • Manages overlapping data sources ( collectors ) • Tolerant of latency ( double-pipelined join ) – Returns initial results quickly 8
Interleaving of planning and execution The non-traditional characteristics of Tukwila are as follows: – The optimizer can create only a partial plan if essential statistics are missing or uncertain – The optimizer generates not only operator trees but also the appropriate event-condition-action rules. – The optimizer conserves the state of its search space when it calls the execution engine. 9
Overview of the query plan structure • A plan includes a partially-ordered set of fragments and a set of global rules • A fragment consists of a fully pipelined tree of physical operators and a set of local rules. Why does the system need the fragment structure? The fragment structure is the key mechanism for implementing the adaptive property: at the end of each fragment, the rest of the plan can be re-optimized or rescheduled 10
Rules • Re-optimization The optimizer’s cardinality estimate for the fragment’s result is significantly different from the actual size ->reinvoke optimizer • Contingent planning The execution engine checks properties of the result to select the next fragment • Rescheduling Reschedule if a source times out • Adaptive operators 11
Rule format: when event if condition then actions. Example: when closed(frag1) if card(join1) > 2*est_card(join1) then replan. An event triggers a rule, causing it to check its condition. If the condition is true, the rule fires, executing the action(s). 12
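The rule format above can be illustrated with a minimal Python sketch. This is not Tukwila's implementation: the `Rule` class, the `fire` function, and the dictionary of runtime statistics are hypothetical names introduced only to show how an event selects a rule, the condition is checked, and the actions are returned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical event-condition-action rule (illustration only)."""
    event: str                                   # e.g. "closed(frag1)"
    condition: Callable[[Dict[str, int]], bool]  # checked when the event occurs
    actions: List[str]                           # e.g. ["replan"]


def fire(rules: List[Rule], event: str, stats: Dict[str, int]) -> List[str]:
    """Return the actions of every rule triggered by `event` whose condition holds."""
    fired: List[str] = []
    for rule in rules:
        if rule.event == event and rule.condition(stats):
            fired.extend(rule.actions)
    return fired


# The example rule from the slide:
# when closed(frag1) if card(join1) > 2*est_card(join1) then replan
replan_rule = Rule(
    event="closed(frag1)",
    condition=lambda s: s["card(join1)"] > 2 * s["est_card(join1)"],
    actions=["replan"],
)
```

With an estimate of 100 tuples, `fire([replan_rule], "closed(frag1)", ...)` would request a replan only once the observed cardinality exceeds 200.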
Events
• open, closed: fragment/operator starts or completes
• error: operator failure, e.g., unable to contact source
• timeout(n): data source has not responded in n msec.
• out-of-memory: join has insufficient memory
Conditions
• state(operator): the operator’s current state
• card(operator): the number of tuples produced so far
• time(operator): the time waiting since last tuple
• memory(operator): the memory used so far
Actions
• set the overflow method for a double pipelined join
• alter a memory allotment
• deactivate an operator or fragment, which stops its execution and deactivates its associated rules
• reschedule the query operator tree
• re-optimize the plan
• return an error to the user
13
Group Discussion • For one of the following motivating situations of Tukwila – Absence of statistics – Unpredictable data arrival characteristics – Overlap and redundancy among sources – Optimizing the time to initial answers • Q1: Can you give some examples where the chosen topic matters? • Q2: If you were a member of the Tukwila team, what rules or policies would you use to deal with the problem? – To help discussion, more specific situations will be given – But you may assume any problem or situation • Discussion – Form 8 groups (3~4 people per group, two groups per topic) – Discuss Q1 and Q2 for one topic (5 ~ 7 minutes) 14
Examples
Orders
OrderNo  TrackNo
1234     01-23-45
1235     02-90-85
1399     02-90-85
1500     03-99-10
UPS
TrackNo   Status
01-23-45  In Transit
02-90-85  Delivered
03-99-10  Delivered
04-08-30  Undeliverable
Join Orders.TrackNo = UPS.TrackNo (Orders, UPS)
OrderNo  TrackNo   Status
1234     01-23-45  In Transit
1235     02-90-85  Delivered
1399     02-90-85  Delivered
1500     03-99-10  Delivered
15
Query Plan Execution Query plan represented as data-flow tree. Example: “Show which orders have been delivered” = Join Orders.TrackNo = UPS.TrackNo over (Read Orders) and (Select Status = “Delivered” over Read UPS) • Control flow – Iterator (top-down) • Most common database model • Easier to implement – Data-driven (bottom-up) • Threads or external scheduling • Better concurrency 16
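The iterator (top-down) model can be sketched with Python generators, using the Orders and UPS tables from the Examples slide. This is an illustration under assumptions, not the engine's code: the function names `read`, `select`, `hash_join`, and `delivered_orders` are made up for this sketch, and a conventional hash join stands in for the join operator.

```python
# Tables from the Examples slide.
ORDERS = [(1234, "01-23-45"), (1235, "02-90-85"),
          (1399, "02-90-85"), (1500, "03-99-10")]
UPS = [("01-23-45", "In Transit"), ("02-90-85", "Delivered"),
       ("03-99-10", "Delivered"), ("04-08-30", "Undeliverable")]


def read(table):
    """Leaf operator: yields one tuple each time the parent asks for the next one."""
    for row in table:
        yield row


def select(child, predicate):
    """Pulls from its child and passes on only the tuples that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def hash_join(outer, inner):
    """Joins (OrderNo, TrackNo) with (TrackNo, Status) on TrackNo."""
    table = {}
    for track_no, status in inner:        # build phase: consumes the whole inner input
        table.setdefault(track_no, []).append(status)
    for order_no, track_no in outer:      # probe phase: pipelined over the outer input
        for status in table.get(track_no, []):
            yield (order_no, track_no, status)


def delivered_orders():
    """'Show which orders have been delivered' as a tree of iterators."""
    delivered = select(read(UPS), lambda row: row[1] == "Delivered")
    return hash_join(read(ORDERS), delivered)
```

Calling `list(delivered_orders())` drives the whole tree from the root: each operator produces a tuple only when its parent requests one, which is the control flow the slide calls top-down.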
Tukwila Plans & Execution • Multiple fragments ending at materialization points • Rules triggered by events – Re-optimize remainder if necessary – Return statistics Example plan: fragment (1) Read Orders; fragment (2) Select Status = “Delivered” over Read UPS; fragment (3) Join Orders.TrackNo = UPS.TrackNo Example rule: When(closed(1)): if size_of(Orders) > 1000 then reoptimize {2, 3} 17
Performance evaluation: Interleaving Planning and Execution Tukwila’s strategy of interleaving planning and execution reduces the total time spent processing a query, with a total speedup of 1.42 over pipelining and 1.69 over the naïve strategy of materializing. 18
Adaptive Query Operators: Collectors • Overlap issues: a data integration system needs to perform a union over a large number of overlapping sources. However, a standard union operator has no mechanism for handling errors or for deciding to ignore slow mirror data sources once it has obtained the full data set. (Figure: the union QA ∪ QB ∪ QC over sources A, B, C and Mirror_C) 19
Collectors (cont.) • Collectors deal with these problems by using policies. • A collector operator = a set of children (wrapper calls, local data and so on) + a policy for contacting them 20
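A collector as "children + policy" can be sketched as follows. This is a deliberately simplified, hypothetical policy, not Tukwila's policy language: here a policy is just a list of groups, each group lists alternative sources to try in order, and a failing source is modelled as a callable that raises `OSError`. The names `collector`, `children`, and `policy` are assumptions made for this sketch.

```python
def collector(children, policy):
    """Union over overlapping sources, guided by a simple fallback policy.

    children: dict mapping a source name to a zero-argument callable that
              returns an iterable of tuples (or raises OSError on failure).
    policy:   list of groups; each group is a list of alternative source
              names tried in order until one of them succeeds.
    """
    seen = set()
    result = []
    for alternatives in policy:
        for name in alternatives:
            try:
                rows = list(children[name]())
            except OSError:
                continue              # source failed: try the next alternative
            for row in rows:
                if row not in seen:   # overlapping sources may repeat tuples
                    seen.add(row)
                    result.append(row)
            break                     # group satisfied: skip remaining mirrors
    return result
```

For the sources A, B, C, and Mirror_C, a policy such as `[["A"], ["B"], ["C", "Mirror_C"]]` would contact the mirror only if C fails, which is the kind of decision a standard union operator cannot make.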
Collectors (cont.) A complex policy example (figure: Tukwila and sources A and B) 21
Adaptive Query Operators: Double Pipelined Join Conventional joins • Sort-merge joins & indexed joins: cannot be pipelined • Nested loops joins and hash joins: follow an asymmetric execution model – For nested loops joins, we must wait for the entire inner table to be transmitted before pipelining begins – For hash joins, we must load the entire inner relation into a hash table before we can pipeline. 22
Double Pipelined Hash Join • Proposed for parallel main-memory databases (Wilschut 1990) – Hash table per source – As a tuple comes in, add to hash table and probe opposite table • Evaluation: – Results as soon as tuples received – Symmetric – Requires memory for two hash tables • But data-driven! 23
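The insert-then-probe step can be sketched in Python. This is a minimal in-memory illustration of the symmetric technique, not the Tukwila operator: the function name, the `"L"`/`"R"` side labels, and the single interleaved arrival stream are assumptions of this sketch, and overflow handling is left out.

```python
from collections import defaultdict


def double_pipelined_join(arrivals):
    """Symmetric hash join over an interleaved stream of arriving tuples.

    arrivals: iterable of (side, key, payload) with side "L" or "R",
              in the order the tuples arrive from the two sources.
    Yields (key, left_payload, right_payload) as soon as a match exists.
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}  # one hash table per source
    for side, key, payload in arrivals:
        other = "R" if side == "L" else "L"
        tables[side][key].append(payload)          # add to this side's hash table
        for match in tables[other].get(key, []):   # probe the opposite table
            if side == "L":
                yield (key, payload, match)
            else:
                yield (key, match, payload)
```

Because every tuple is both inserted and probed on arrival, output order follows arrival order, and neither input has to be read completely before results appear. The cost is the memory for two hash tables.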
Join Orders.TrackNo = UPS.TrackNo (Orders, UPS), with one hash table per input: Hash Table (Orders) and Hash Table (UPS)
Orders
OrderNo  TrackNo
1234     01-23-45
1235     02-90-85
1399     02-90-85
……       ……
UPS
TrackNo   Status
01-23-45  In Transit
02-90-85  Delivered
03-99-10  Delivered
……        ……
Hash Table (UPS) entry shown: 01-23-45
24
Double-Pipelined Join Adapted to Iterator Model • Use multiple threads with queues – Each child (A or B) reads tuples until its queue (QA or QB) is full, then sleeps & awakens parent – Join sleeps until awakened, then: • Joins tuples from QA or QB, returning all matches as output • Wakes owner of queue (Figure: Join above queues QA and QB, fed by children A and B) 25
Performance Evaluation: Double Pipelined Hash Join 26
Insufficient Memory? • May not be able to fit hash tables in RAM • Strategy for standard hash join – Swap some buckets to overflow files – As new tuples arrive for those buckets, write to files – After current phase, clear memory, repeat join on overflow files 27
Conclusions • General Tukwila architecture • Non-conventional characteristics of Tukwila • Interleaving of optimization and execution • Main idea of the collector operator • Double pipelined hash join 28