Query Processing, Resource Management, and Approximation in a Data Stream Management System
Kevin Hoeschele, Archana Joshi
References
• R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein and R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In Proceedings of the 2003 CIDR Conference.
• B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems. Invited paper in Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), June 2002.
• A. Arasu, S. Babu and J. Widom. An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations. Technical Report, November 2002.
Important features of the STREAM system
• A data stream management system (DSMS)
• A declarative query language, CQL, for continuous queries
• Queries can handle both continuous data streams and relations
• Designed for high and changing data flow rates and query workloads
  - Good resource allocation
  - Approximation and resource management
Next: CQL
CQL (Continuous Query Language)
• Extension of SQL with support for sliding windows and for sampling (used for approximation)
• Supports both data streams and relations
• Supports additional operators such as Istream and Dstream
Relations vs. Data Streams
• Relations
  - Unordered and finite
  - Updates, deletions and insertions
  - Are stored, and also result from subqueries
• Data Streams
  - Have arrival order and are unbounded
  - Append only
  - Result from a continuous source, and also from subqueries
Formal CQL Semantics
• Based on existing, well-understood semantics
• Additional transformations between relations and streams
• Assumes a global, discrete, ordered time domain (discussed later)
• Relations
  - A relation R maps each time T to the set of tuples in R at T
• Streams
  - A stream is a set of (tuple, timestamp) elements
  - The stream at time T is all elements with timestamp <= T
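The two definitions above can be written down directly. The following is a minimal sketch, not STREAM's implementation: the tuple shape `(cust_id, minutes)`, the integer timestamps, and the helper names `stream_at` and `relation_at` are all assumptions made for illustration. The relation is stored as a hypothetical change history, and its value at time T is the latest recorded state at or before T.

```python
from typing import Dict, FrozenSet, List, Tuple

Row = Tuple[str, int]       # hypothetical tuple shape: (cust_id, minutes)
Element = Tuple[Row, int]   # a stream element: (tuple, timestamp)


def stream_at(stream: List[Element], t: int) -> List[Element]:
    """The stream at time t: every element whose timestamp is <= t."""
    return [element for element in stream if element[1] <= t]


def relation_at(history: Dict[int, FrozenSet[Row]], t: int) -> FrozenSet[Row]:
    """The relation at time t: the latest recorded state at or before t.

    `history` maps the times at which the relation changed to its new
    contents; before the first recorded time the relation is empty.
    """
    times = [u for u in history if u <= t]
    return history[max(times)] if times else frozenset()


calls = [(("c1", 4), 1), (("c2", 9), 2), (("c1", 6), 5)]
gold = {0: frozenset(), 3: frozenset({("c1", 4)})}
print(stream_at(calls, 2))
print(relation_at(gold, 7))
```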
Sample query 1
Consider a stream of telephone call records, Calls, with attributes cust_id, type, minutes, timestamp.
Compute the average call length, considering only the last day's long distance calls placed by each customer:
    SELECT AVG(S.minutes)
    FROM Calls S [PARTITION BY S.cust_id
                  RANGE 1 DAY PRECEDING
                  WHERE S.type = 'Long distance']
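To make the window concrete, here is a rough Python sketch of the state such a query needs. It follows the prose description (one average per customer over the last day's long distance calls). The class name `LongDistanceAverages`, the use of seconds as the timestamp unit, and evicting expired calls lazily at read time are all assumptions, not how STREAM evaluates CQL.

```python
from collections import defaultdict, deque

DAY = 24 * 60 * 60  # window length, assuming timestamps are in seconds


class LongDistanceAverages:
    """Per-customer sliding window over long distance calls (illustrative)."""

    def __init__(self, window=DAY):
        self.window = window
        self.calls = defaultdict(deque)  # cust_id -> deque of (timestamp, minutes)

    def insert(self, cust_id, call_type, minutes, timestamp):
        # The WHERE clause inside the window: only long distance calls count.
        if call_type != "Long distance":
            return
        self.calls[cust_id].append((timestamp, minutes))

    def average(self, cust_id, now):
        window = self.calls[cust_id]
        # Evict calls that fell out of the time range (now - window, now].
        while window and window[0][0] <= now - self.window:
            window.popleft()
        if not window:
            return None
        return sum(minutes for _, minutes in window) / len(window)


w = LongDistanceAverages(window=10)
w.insert(1, "Long distance", 4, 0)
w.insert(1, "Local", 100, 1)
w.insert(1, "Long distance", 6, 5)
print(w.average(1, now=5))
```

Calls are assumed to arrive in timestamp order, which is what lets eviction pop only from the front of each deque.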
Sample query 2
Extract a 10% sample of the calls placed by 'Gold' customers, then stream the average call length for calls whose cust_id is in the range 1 to 1000:
    SELECT AVG(V.minutes)
    FROM (SELECT S.cust_id, S.minutes
          FROM Calls S, Customer R
          WHERE S.cust_id = R.cust_id AND R.tier = 'Gold') V SAMPLE(10)
    WHERE V.cust_id BETWEEN 1 AND 1000
Here a stream (Calls) is joined with a relation (Customer). The outer query refers to the subquery through its alias V, so the subquery must also expose cust_id.
Transformation
• Stream mapped to relation
  - A stream with a window specification (Rows, Range, Partition By), taken up to a specific time T, is a finite set and is treated as a relation
• Relation mapped to stream
  - Istream(R) contains a stream element (T, s) whenever tuple s is in R at time T but not in R at time T-1
  - Dstream(R) contains a stream element (T, s) whenever tuple s is in R at time T-1 but not in R at time T
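The Istream and Dstream definitions translate into a few lines of set arithmetic. This is a sketch under two assumptions: the relation is given as a list `states` where `states[t]` is its contents at integer time t, and the relation is empty before time 0 (the slide does not say what "time T-1" means at the first instant). The function names are illustrative.

```python
def istream(states):
    """Emit (T, s) for every tuple s in R at time T but not at time T-1."""
    out, prev = [], set()
    for t, cur in enumerate(states):
        out.extend((t, s) for s in sorted(cur - prev))
        prev = cur
    return out


def dstream(states):
    """Emit (T, s) for every tuple s in R at time T-1 but not at time T."""
    out, prev = [], set()
    for t, cur in enumerate(states):
        out.extend((t, s) for s in sorted(prev - cur))
        prev = cur
    return out


# R holds {a} at time 0, {a, b} at time 1, and {b} at time 2.
states = [{"a"}, {"a", "b"}, {"b"}]
print(istream(states))  # [(0, 'a'), (1, 'b')]
print(dstream(states))  # [(2, 'a')]
```

Sorting within a time step only makes the output deterministic; elements with equal timestamps have no prescribed order here.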
Timestamps
• Stream elements arrive in order and are timestamped according to a global clock
• All relation updates are also timestamped according to the global clock
• Application-defined time can also be handled
Next: query plans
Query Plans
• A query plan runs continuously and is built from three kinds of components
  1. Query operators: read a stream of tuples from input queues, process them, and write the results to an output queue
  2. Inter-operator queues: connect the operators and define the paths along which tuples flow, as in a DBMS
  3. Synopses: maintain the state associated with operators
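A toy plan shows how the three components fit together. This is only a sketch: the operator classes `Select` and `RunningAverage`, the `run(max_tuples)` method, and the use of `collections.deque` as the queue are assumptions for illustration, not STREAM's actual interfaces.

```python
from collections import deque


class Select:
    """Stateless operator: forwards tuples that satisfy a predicate."""

    def __init__(self, predicate, inp, out):
        self.predicate, self.inp, self.out = predicate, inp, out

    def run(self, max_tuples):
        done = 0
        while self.inp and done < max_tuples:
            tup = self.inp.popleft()
            done += 1
            if self.predicate(tup):
                self.out.append(tup)
        return done


class RunningAverage:
    """Stateful operator: its synopsis is just a (total, count) pair."""

    def __init__(self, inp, out):
        self.inp, self.out = inp, out
        self.total, self.count = 0, 0  # the synopsis

    def run(self, max_tuples):
        done = 0
        while self.inp and done < max_tuples:
            self.total += self.inp.popleft()
            self.count += 1
            done += 1
            self.out.append(self.total / self.count)
        return done


# Queues connect the operators: q_in -> Select -> q_mid -> RunningAverage -> q_out
q_in, q_mid, q_out = deque([3, 12, 7, 20]), deque(), deque()
plan = [Select(lambda minutes: minutes >= 5, q_in, q_mid),
        RunningAverage(q_mid, q_out)]
for op in plan:  # a trivial scheduler: run each operator once
    op.run(max_tuples=10)
print(list(q_out))  # [12.0, 9.5, 13.0]
```

The `max_tuples` bound is what gives a scheduler control over how long each operator runs before another one gets a turn.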
Synopsis
• Summarizes the tuples seen so far at some intermediate operator
• Example: a join operator maintains a synopsis for each of its input streams
• Needs some kind of summarization technique to limit its size
• Synopses are tied to operators
• Generic interfaces for both operators and synopses allow any synopsis type to be coupled with any operator type
• Supports generic methods: create, changeMem, insert, delete
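The four generic methods might look roughly like the sketch below. Only the method names create, changeMem, insert and delete come from the slide; measuring the memory budget in tuples and evicting the oldest tuples when over budget are assumptions chosen to keep the example small.

```python
from collections import deque


class SlidingWindowSynopsis:
    """A bounded synopsis that keeps only the most recent tuples."""

    def __init__(self, max_tuples):
        # "create": build the synopsis with an initial memory budget.
        self.max_tuples = max_tuples
        self.tuples = deque()

    def change_mem(self, max_tuples):
        # "changeMem": grow or shrink the budget at run time.
        self.max_tuples = max_tuples
        self._evict()

    def insert(self, tup):
        self.tuples.append(tup)
        self._evict()

    def delete(self, tup):
        # Remove one occurrence if present; ignore tuples already evicted.
        try:
            self.tuples.remove(tup)
        except ValueError:
            pass

    def _evict(self):
        # Assumed policy: drop the oldest tuples once over budget.
        while len(self.tuples) > self.max_tuples:
            self.tuples.popleft()


syn = SlidingWindowSynopsis(max_tuples=3)
for i in range(5):
    syn.insert(i)
print(list(syn.tuples))  # [2, 3, 4]
syn.change_mem(2)
print(list(syn.tuples))  # [3, 4]
```

Because every synopsis exposes the same few methods, a resource manager can shrink any of them with `change_mem` without knowing which operator owns it.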
Resource sharing in Query Plans
• Different queries containing the same operation (same input and same operator) can share it to reduce redundancy
• The inter-operator queue after a shared operation keeps one pointer per query
• A tuple is deleted once every pointer has passed it
• Not useful when the consuming operations have vastly different consumption rates, because the shared queue then grows large
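One way to picture the per-query pointers is the sketch below. The class `SharedQueue` and its methods are hypothetical. It keeps a single buffer plus one read position per consuming query, and trims a tuple only after the slowest reader has passed it.

```python
class SharedQueue:
    """One buffer read by several queries, each with its own pointer."""

    def __init__(self, readers):
        self.buffer = []                    # tuples some reader still needs
        self.base = 0                       # absolute index of buffer[0]
        self.pos = {r: 0 for r in readers}  # absolute read position per reader

    def enqueue(self, tup):
        self.buffer.append(tup)

    def dequeue(self, reader):
        i = self.pos[reader] - self.base
        if i >= len(self.buffer):
            return None                     # this reader has caught up
        tup = self.buffer[i]
        self.pos[reader] += 1
        self._trim()
        return tup

    def _trim(self):
        # Delete tuples that every reader's pointer has already passed.
        slowest = min(self.pos.values())
        drop = slowest - self.base
        if drop > 0:
            del self.buffer[:drop]
            self.base = slowest

    def __len__(self):
        return len(self.buffer)


q = SharedQueue(["query1", "query2"])
for tup in ("a", "b", "c"):
    q.enqueue(tup)
q.dequeue("query1")
q.dequeue("query1")
print(len(q))  # 3: query2 has not read anything yet
q.dequeue("query2")
print(len(q))  # 2: "a" has now been passed by both pointers
```

The example also shows the drawback from the last bullet: a slow reader pins tuples in the buffer no matter how fast the other readers consume them.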
Resource Management
• Several resources are relevant: memory, computation, I/O
• The focus here is on memory management
• Two techniques
  - An algorithm that exploits known constraints on the input data streams to reduce synopsis size
  - An algorithm for query scheduling that minimizes queue size
Constraints
• Derived by collecting statistics on the data; related to punctuations
• Adherence parameter
  - Measures how closely a stream fits a constraint
  - The more closely the stream adheres to the constraint, the smaller the synopsis
• No precision loss
Stream constraint example
• Consider a continuous query that joins streams Orders (O) and Fulfillments (F) on orderID
  - Ordered(k): if we know that the k tuples for a given orderID arrive on O before the matching tuples arrive on F, then no join synopsis on F needs to be kept for those k tuples
[Figure: Stream O (k tuples) and Stream F feed a Join operator that maintains an O synopsis and an F synopsis and emits the joined output]
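The effect of the ordering constraint can be sketched as a join that keeps state for only one side. This is an illustration of the idea, not the k-constraint algorithm itself: it assumes the strict case where every O tuple for an orderID arrives before any F tuple for that orderID, and the event encoding `(source, order_id, payload)` is made up for the example.

```python
from collections import defaultdict


def join_orders_first(events):
    """Join O and F on order_id while keeping a synopsis for O only.

    `events` is an iterable of (source, order_id, payload) in arrival
    order, where source is "O" or "F". Because O tuples are assumed to
    arrive first, an F tuple never has to wait for a later O tuple, so
    F tuples are probed against the O synopsis and then discarded.
    """
    o_synopsis = defaultdict(list)
    joined = []
    for source, order_id, payload in events:
        if source == "O":
            o_synopsis[order_id].append(payload)
        else:
            for o_payload in o_synopsis.get(order_id, []):
                joined.append((order_id, o_payload, payload))
    return joined


events = [
    ("O", 1, "item-a"),
    ("O", 1, "item-b"),
    ("F", 1, "shipped"),
    ("O", 2, "item-c"),
    ("F", 2, "shipped"),
]
for row in join_orders_first(events):
    print(row)
```

If the constraint were violated (an F tuple arriving before its O tuple), this sketch would silently miss that match, which is why the adherence of the stream to the constraint matters.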
Global Scheduler
• Decides which queries are run, and when
• Possible metrics for weighting scheduling decisions
  - Response time
  - Throughput
  - Queue size (the one chosen by STREAM)
• Greedy schedule: the next operator chosen is the one that consumes the most tuples per time unit
• Schedules chains of operators, similar to Aurora's train scheduling
Query Scheduling Example
Plan: Q1 -> O1 -> Q2 -> O2
• Operator 1 (O1): 20% selectivity, takes in N tuples per time segment
• Operator 2 (O2): takes in N/5 tuples per time segment
• Strategy 1: each batch of N tuples goes completely through both operators before the next batch is started
• Strategy 2 (greedy): if Q1 holds at least N tuples, O1 is always run first
Queue size in units of N over time:
Time              1    2    3    4    5    6    7
Strategy 1   Q1   1    1    2    2    3    3    4
             Q2   0    0.2  0    0.2  0    0.2  0
Strategy 2   Q1   1    1    1    1    1    1    1
             Q2   0    0.2  0.4  0.6  0.8  1    1.2
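The queue sizes in this example can be reproduced with a small simulation. It is a sketch under assumptions read off the table rather than stated on the slide: one batch of N tuples arrives on Q1 every time unit, exactly one operator runs per time unit, and sizes are tracked in units of N using exact fractions. The function names are illustrative.

```python
from fractions import Fraction

SELECTIVITY = Fraction(1, 5)  # O1 passes 20% of its input on to Q2
O2_RATE = Fraction(1, 5)      # O2 handles N/5 tuples per time unit


def simulate(pick_o1, steps=7):
    """Return [(q1, q2), ...]: queue sizes in units of N at times 1..steps.

    pick_o1(q1, q2) returns True to run O1 for the next time unit and
    False to run O2.
    """
    q1, q2 = Fraction(1), Fraction(0)  # the first batch has just arrived
    history = [(q1, q2)]
    for _ in range(steps - 1):
        if pick_o1(q1, q2):
            consumed = min(q1, Fraction(1))
            q1 -= consumed
            q2 += consumed * SELECTIVITY
        else:
            q2 -= min(q2, O2_RATE)
        q1 += 1                        # the next batch arrives
        history.append((q1, q2))
    return history


def strategy_1(q1, q2):
    # Push the current batch all the way through before starting another.
    return q2 == 0


def strategy_2(q1, q2):
    # Greedy: run O1 whenever a full batch is waiting in Q1.
    return q1 >= 1


for name, policy in (("Strategy 1", strategy_1), ("Strategy 2", strategy_2)):
    hist = simulate(policy)
    print(name, "Q1:", [float(a) for a, _ in hist])
    print(name, "Q2:", [float(b) for _, b in hist])
```

At time 7 the total queued data is 4N under Strategy 1 and 2.2N under Strategy 2, which is why the greedy choice wins on memory in this example even though Q2 keeps growing.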
Approximations
• Static vs. dynamic approximation
  - Static: a certain query behavior is guaranteed, and the user can participate in the choice
  - Dynamic: the level of approximation changes, adapting to current resource availability
• Approximation techniques
  - Window size reduction: reduces synopsis size and the time needed for window joins
  - Sampling: drops a certain percentage of the output data
  - Load shedding: similar to sampling, but drops whole chunks of tuples at a time to reduce queue size
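The difference between the last two techniques can be sketched in a few lines. Both functions are illustrative assumptions: `sample` keeps each tuple independently with a given probability, while `shed_load` drops the oldest tuples a chunk at a time until the queue fits a capacity. Which tuples a real system sheds is a policy decision the slide does not specify.

```python
import random


def sample(stream, rate, rng=None):
    """Keep each tuple independently with probability `rate` (0.0 to 1.0)."""
    rng = rng or random.Random()
    return [tup for tup in stream if rng.random() < rate]


def shed_load(queue, capacity, chunk):
    """Drop the oldest tuples, `chunk` at a time, until the queue fits."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    queue = list(queue)
    while len(queue) > capacity:
        del queue[:chunk]
    return queue


tuples = list(range(10))
print(len(sample(tuples, rate=0.1)))            # varies from run to run
print(shed_load(tuples, capacity=6, chunk=3))   # [6, 7, 8, 9]
```

Shedding in chunks can overshoot the target (here the queue ends at 4 tuples, not 6), which trades some extra loss for fewer shedding decisions.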
Future Resource Management
• Monitor queue and synopsis sizes, and react when a reduction is needed
• A reallocation algorithm to deal with changes in the rate and distribution of incoming data
• Dynamically add, delete, activate and deactivate queries
Summary
• The STREAM system supports a declarative query language for operations on streams and relations
• It supports high data rates and varying workloads
• A near-optimal scheduling algorithm for reducing inter-operator queue sizes
• A set of techniques for static and dynamic approximation to cope with limited resources
Discussion (CACQ and STREAM Questions)
• What are some of the differences between Aurora and STREAM?
  - CQL vs. Aurora's query model
  - Query plans
  - Windowing techniques
  - Goals: total throughput (Aurora) vs. minimizing queue size (STREAM); which is better?
• Why focus on memory for resource management?
• What are the disadvantages of using eddies?
• What assumptions are made that make the sharing of operators effective?
• How does CACQ's use of eddies differ from their use in Telegraph, and what are the pros and cons of this approach?
PSoup Questions
PSoup Question 1
PSoup is currently implemented in main memory. This gives the system great speed, but limits the amount of data that can be stored for purposes of windowing. What alternatives could increase the amount of storage without significantly hurting performance?
PSoup Question 2
The creators of PSoup posit that their system can be applied to Data Recharging scenarios. Is this really plausible? Consider:
* PSoup's main memory limitations
* PSoup's applicability to the 'Net
* Data-recharging utility functions