
NiagaraCQ: A Scalable Continuous Query System for Internet Databases



  1. NiagaraCQ: A Scalable Continuous Query System for Internet Databases
     Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang
     Computer Sciences Department, University of Wisconsin-Madison

     Outline
     • Motivation
     • What is NiagaraCQ?
     • Details
     • Performance
     • Conclusion

     Motivation
     • Continuous queries are increasingly popular. Why?
       • They allow users to receive new results when they become available without having to issue the same query repeatedly.
       • They are especially useful in an environment like the Internet, which comprises large amounts of frequently changing information.
     • Challenges:
       • Need to be able to support millions of queries due to the scale of the Internet.
       • No existing system has achieved this level of scalability.

     What's NiagaraCQ?
     • The continuous query subsystem of Niagara, which is a distributed database system for querying distributed XML data sets using a query language like XML-QL.
     • Supports scalable continuous query processing over multiple, distributed XML files.

     NiagaraCQ Novelty and Approaches
     • Grouping.
       • Incremental group optimization strategy with dynamic re-grouping.
       • Query-split scheme.
       • Support for both change-based and timer-based queries in a uniform way.
     • To ensure scalability, need to do more:
       • Incremental evaluation of continuous queries.
       • Use of both pull and push models for detecting heterogeneous data source changes.
       • Memory caching.

     NiagaraCQ Command Language
     • CREATE CQ_name
         XML-QL query
         DO action
         { START start_time } { EVERY time_interval } { EXPIRE expiration_time }
     • Delete CQ_name
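     A continuous query installed with the CREATE command above can be modeled as a small record. The sketch below is illustrative only: the class name `ContinuousQuery`, its field names, and the rule that a missing or zero EVERY interval means a change-based query are assumptions, not NiagaraCQ's actual interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ContinuousQuery:
    """Hypothetical in-memory form of a CREATE command (all names are assumptions)."""
    name: str                           # CQ_name
    query: str                          # XML-QL query text, kept opaque here
    action: str                         # DO action
    start: Optional[datetime] = None    # START start_time
    every: Optional[timedelta] = None   # EVERY time_interval
    expire: Optional[datetime] = None   # EXPIRE expiration_time

    def is_timer_based(self) -> bool:
        # Assumption: a positive EVERY interval marks a timer-based query;
        # otherwise the query is treated as change-based.
        return self.every is not None and self.every > timedelta(0)

    def is_expired(self, now: datetime) -> bool:
        # A query without an EXPIRE clause never expires in this sketch.
        return self.expire is not None and now >= self.expire
```

     Keeping both kinds of query in one record mirrors the goal stated on the slide of handling change-based and timer-based queries uniformly.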

  2. Incremental group optimization
     • Groups are created for existing queries according to their "signatures", which represent similar structures.
     • Groups allow the "common parts" of two or more queries to be shared.
     • Each individual query in a query group shares the results from the execution of the "group plan".
     • A new query is merged into the existing groups whose signatures match that of the query.

     Expression Signature
     • Represents the same syntax structure, but possibly different constant values, in different queries.
     • Expression signatures allow queries with the same syntactic structure to be grouped together to share computation among the queries.

     [Figure: XML-QL examples (Fig. 3.1)]

     [Figure: Expression Signature (Fig. 3.2)]

     Group
     • Groups are created for queries based on their expression signatures. A group consists of 3 parts:
       • Group signature: the common expression signature of all queries in the group.
       • Group constant table: contains the signature constants of all queries in the group.

     Group (cont.)
       • Group plan: the query plan shared by all queries in the group. It is derived from the common part of all single query plans in the group.

     The Split operator
     • The result of the shared computation contains results for all the queries in the group. How are the results filtered and sent to the correct destination operator for further processing?
     • A Split operator is combined with a Join operator, based on the constant values stored in the constant table, to perform filtering.
     • The Split operator distributes each result tuple of the Join operator to its correct destination based on the destination buffer name in the tuple (obtained from the constant table).

     Incremental Grouping Algorithm
     • When a new query is submitted:
       1. The group optimizer traverses its query plan bottom up and tries to match its expression signature with the signatures of existing groups.
       2. The group optimizer breaks the new query plan into two parts.
       3. The lower part of the query is removed. The upper part of the query is added onto the group plan.
       4. If the constant table does not have an entry "AOL", it will be added and a new destination buffer allocated.
       • If no match is found, a new group will be generated for this signature and added to the group table.
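     The grouping steps and the Split operator can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the NiagaraCQ implementation: only equality selections on a single attribute are modeled, the join with the constant table is replaced by a dictionary lookup, and the names `Group`, `GroupOptimizer`, `add_query`, and `split` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Group:
    """One query group: signature, constant table, destination buffers."""
    signature: tuple
    # Constant table: constant value -> names of destination buffers.
    constant_table: dict = field(default_factory=lambda: defaultdict(list))
    # Destination buffers: buffer name -> tuples routed there by split().
    buffers: dict = field(default_factory=dict)


class GroupOptimizer:
    def __init__(self):
        self.group_table = {}  # signature -> Group

    @staticmethod
    def signature_for(attribute):
        # The signature keeps the expression structure and drops the constant.
        return ("select", attribute, "=")

    def add_query(self, query_name, attribute, constant):
        """Install the selection `attribute = constant`; return its buffer name."""
        signature = self.signature_for(attribute)
        group = self.group_table.get(signature)
        if group is None:
            # No matching signature: create a new group and register it.
            group = Group(signature)
            self.group_table[signature] = group
        # Add the constant (if new) and allocate a destination buffer.
        buffer_name = f"dest_{query_name}"
        group.constant_table[constant].append(buffer_name)
        group.buffers[buffer_name] = []
        return buffer_name

    def split(self, attribute, tuples):
        """Route each tuple to the buffers registered for its constant."""
        group = self.group_table[self.signature_for(attribute)]
        for t in tuples:
            for buffer_name in group.constant_table.get(t[attribute], ()):
                group.buffers[buffer_name].append(t)
```

     With three queries on a hypothetical `Symbol` attribute, all three land in one group, and a tuple whose `Symbol` is "AOL" reaches only the buffer of the query that asked for "AOL". Tuples whose constant appears in no query are dropped, which is the filtering role the slide assigns to the join with the constant table.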

  3. Discussion (1)
     • 1. Provide arguments justifying that this ……. (NiagaraCQ) is a better application for XML (group 1 and 3) or for relational data (group 2 and 4). Why?
     • 2. To support your answer, provide some examples (applications) where this kind of size/scalability is needed.

     Buffer Design
     • The destination buffer for the split operator is needed:
       • Pipelined scheme
       • Intermediate files
     [Diagram labels: Operator, buffer, Operator, Split]

     Pipeline approach
     • Tuples are pipelined from the output of one operator into the input of the next operator.
     • Doesn't work for grouping timer-based CQs. It is difficult for a split operator to determine which tuples should be stored and how long they should be stored for.
     • The query structure is a directed graph, not a tree, and hence the plan may be too complicated for a general XML-QL query engine to execute.
     • The combined plan may be very large and require resources beyond the limits of the system.
     • A large portion of the query plan may not need to be executed at each query invocation.
     • One query may block many other queries.

     [Figure: Materialized Intermediate Files]

     Materialized Intermediate Files (cont.)
     • Advantages
       • Intermediate files and data sources are monitored uniformly.
       • Each query is scheduled independently.
       • The potential bottleneck problem of the pipelined approach is avoided.
     • Disadvantages
       • Extra disk I/Os.
       • The split operator becomes a blocking operator.

     Timer-based Continuous Queries
     • Grouped in the same way as change-based queries, except that the time information needs to be recorded at installation time.
     • Challenges
       • Hard to monitor the timer events of those queries.
       • Sharing the common computation becomes difficult due to the various time intervals.
     • Timer-based continuous queries fire at specific times, but only if the corresponding input files have been modified.
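     The firing rule for timer-based queries (fire at the scheduled time, but only if the input has been modified) can be illustrated as follows. This is a sketch under assumptions: times are plain numbers, each query reads a single input file, and `TimerQuery` and `due_queries` are hypothetical names rather than part of NiagaraCQ.

```python
from dataclasses import dataclass


@dataclass
class TimerQuery:
    name: str
    interval: float    # time between firings, must be positive
    next_fire: float   # time of the next scheduled firing
    input_file: str    # the single input file this query reads


def due_queries(queries, now, last_modified, last_fired):
    """Return the names of queries that should run at time `now`.

    last_modified: file name -> time of the latest change to that file
    last_fired:    query name -> time the query last ran (absent if never)
    """
    to_run = []
    for q in queries:
        if q.interval <= 0:
            raise ValueError(f"{q.name}: interval must be positive")
        if now < q.next_fire:
            continue  # timer event has not occurred yet
        # Advance the timer past `now`, whether or not the query runs.
        while q.next_fire <= now:
            q.next_fire += q.interval
        changed_at = last_modified.get(q.input_file)
        previous = last_fired.get(q.name, float("-inf"))
        # Run only if the input changed since the previous firing.
        if changed_at is not None and changed_at > previous:
            to_run.append(q.name)
            last_fired[q.name] = now
    return to_run
```

     A query whose timer expires while its input file is unchanged is skipped and simply rescheduled, which is the behavior the last bullet describes.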

  4. Incremental Evaluation
     • Incremental evaluation allows queries to be invoked only on the changed data.
     • For each file on which CQs are defined, NiagaraCQ keeps a "delta file" that contains recent changes.
     • Queries are run over the delta files whenever possible instead of their original files.
     • A time stamp is added to each tuple in the delta file. NiagaraCQ fetches only tuples that were added to the delta file since the query's last firing time.

     Memory Caching
     • Caching is used to obtain good performance with a limited amount of memory.
     • Caches query plans, system data structures, and data files for better performance.

     What should be cached?
     • Grouped query plans, assuming that the number of query groups is relatively small.
     • Recently accessed files.
     • The event list for monitoring the timer-based events. Because this list can be large, only a "time window" of it is kept.

     [Figure: Some performance comparisons]

     Conclusion
     • The incremental grouping methodology makes group optimization more scalable.
     • A query-split scheme requires minimal changes to a general-purpose query engine.
     • In this model, both timer-based and change-based continuous queries can be grouped together for event detection and group execution.
     • Incremental evaluation of continuous queries, use of both pull and push models for detecting heterogeneous data source changes, and a caching mechanism further improve scalability.

     Discussion (2)
     • Q: Another similar (in terms of continuous information retrieval) and very popular technology is RSS, which also uses XML. Identify what types of applications are better off with RSS, and in which scenario would you use a system like NiagaraCQ?
     • Optional (if time permits): This paper has some conceptual / functional similarities with other systems, e.g. use of a time concept and integration of information from various sources. Compare and contrast these, and identify the challenges for this kind of system.
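     The delta-file mechanism from the Incremental Evaluation slide can be sketched as a timestamped log that each query reads from its last firing time onward. The sketch is illustrative only: it keeps the delta in memory, treats a query as a plain predicate, and the names `DeltaFile`, `since`, and `fire` are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DeltaFile:
    """Timestamped log of tuples recently added to one source file."""
    entries: list = field(default_factory=list)  # (timestamp, tuple) pairs

    def append(self, timestamp, tup):
        self.entries.append((timestamp, tup))

    def since(self, last_fired):
        """Return the tuples added strictly after `last_fired`."""
        return [tup for ts, tup in self.entries if ts > last_fired]


def fire(predicate, delta, last_fired, now):
    """Evaluate a query over new tuples only.

    Returns the matching tuples and the new last-firing time.
    """
    new_tuples = delta.since(last_fired)
    results = [t for t in new_tuples if predicate(t)]
    return results, now
```

     Because each firing starts from the query's own last firing time, queries with different schedules can share one delta file without reprocessing tuples they have already seen.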
