Query Evaluation Techniques for Large Databases
GOETZ GRAEFE
Presented by Long Zhang
Dept. of Computer Science, UBC
zhlong73@cs.ubc.ca
Purpose
- To survey efficient algorithms and the software architecture of query execution engines for executing complex queries over large databases.
Restrictions
- Restrictive data definition and manipulation languages can make application development and maintenance unbearably cumbersome.
- Data volumes might be so large or complex that the real or perceived performance advantage of file systems is considered more important than all other criteria, e.g., the higher levels of abstraction and programmer productivity typically achieved with database management systems.
Topics
- algorithms and their costs
- sorting versus hashing
- parallelism
- resource allocation and scheduling
- general performance-enhancing techniques
- …
Which parts concern us?
[Figure: layered DBMS architecture, top to bottom]
- User Interface
- Database Query Language
- Query Optimizer
- Query Execution Engine
- Files and Indices
- I/O Buffer
- Disk
Query Execution Engine
- What is it?
  - A collection of query execution operators and mechanisms for operator communication and synchronization.
- The query execution engine defines the space of possible plans that can be chosen by the query optimizer.
1. The SQL query is translated into a query tree in the logical algebra.
2. The query tree in the logical algebra is translated into a physical plan.
3. The optimizer expands the search space and finds the best plan.
4. The optimal physical plan is copied out of the optimizer's memory structure and sent to the query execution engine.
5. The query execution engine executes the plan using relations in the database as input and produces the query output. (A toy sketch of step 2 follows.)
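A minimal sketch of what mapping a logical join to a physical operator might look like. The operator names are from the paper's examples, but the function, thresholds, and cost heuristics here are illustrative assumptions, not Graefe's actual algorithm.

```python
# Hypothetical sketch: pick a physical join operator for a logical join.
# The threshold constants are made up for illustration.

def choose_join(left_rows, right_rows, memory_pages, sorted_inputs=False):
    """Choose a physical join using toy cost heuristics."""
    if sorted_inputs:
        return "merge-join"           # inputs already sorted on the join key
    if min(left_rows, right_rows) <= memory_pages * 100:
        return "hash join"            # build side fits in memory
    return "nested-loops join"        # fallback

print(choose_join(1_000_000, 5_000, memory_pages=1000))   # -> "hash join"
```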
Discussion
- Fact: On the first page, the author states that DBMSs have not been used in many areas for two reasons:
  - Application development and maintenance is difficult (restrictive data definition and manipulation languages can make development difficult).
  - The data in those areas is so big that speed trumps all, and people would rather hand-code.
Questions are…
- Do you think that in today's world databases are popular for most applications?
  - If no, go to (1); if yes, go to (2).
- (1) Why do you think databases aren't used more? Why don't you use them on your data?
- (2) How do database systems' needs for dealing with large amounts of data differ from those of other applications or systems, such as an OS? Should we depend only on it?
  - If yes, what should be changed in the architecture of databases or in data abstraction?
1. Architecture of Query Execution Engines
- Focus on useful mechanisms for processing sets of items:
  - Records
  - Tuples
  - Entities
  - Objects
Logical Algebra vs. Physical Algebra
- The logical algebra, e.g., relational algebra, is closely related to the data model and defines what queries can be expressed in the data model.
- The physical algebra is system-specific: for example, nested-loops join, merge-join, hash join, etc.
- Systems may implement the same data model and logical algebra with different physical algebras.
Logical Algebra vs. Physical Algebra
- Specific algorithms, and therefore cost functions, are associated only with physical operators, not with logical algebra operators.
- Mapping logical to physical operators is non-trivial:
  - It involves algorithm choices.
  - Logical and physical operators are not directly mapped one-to-one.
  - Some operators in the physical algebra may implement multiple logical operators.
  - etc.
To Sum Up
- Query optimization is the mapping from logical to physical operations.
- The query execution engine is the implementation of operations on physical representation types and of mechanisms for coordination and cooperation among multiple such operations in complex queries.
Implementation Issues
- Open: prepare an operator for producing data.
- Next: produce an item.
- Close: perform final housekeeping.
Iterators
- Two important features of operators:
  - They can be combined into arbitrarily complex evaluation plans.
  - Any number of operators can schedule and execute each other in a single process without assistance from the underlying OS. (A sketch of the iterator protocol follows.)
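A minimal sketch of the open/next/close iterator protocol described above, assuming Python classes; the class names (Scan, Filter) and the in-memory data are illustrative assumptions.

```python
# Sketch of the iterator protocol: each operator implements open/next/close
# and pulls one item at a time from its input (pipelining, no temp files).

class Scan:
    """Leaf operator: iterates over an in-memory list of tuples."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def next(self):
        if self.pos >= len(self.rows):
            return None                      # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        pass

class Filter:
    """Unary operator: demands items from its child one at a time."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row                   # item flows directly upward
        return None
    def close(self):
        self.child.close()

# The whole plan runs in one process; each next() call pulls one item.
plan = Filter(Scan([(1, "a"), (2, "b"), (3, "c")]), lambda r: r[0] > 1)
plan.open()
while (item := plan.next()) is not None:
    print(item)
plan.close()
```

Note how the operators schedule each other: Filter.next() calls Scan.next() directly, with no OS-level coordination, which is exactly the property the slide highlights.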
Observations
- The entire query plan is executed within a single process.
- Operators produce one item at a time, on request.
- Items never wait in a temporary file or buffer (pipelining).
- Efficient in time-space-product memory cost.
- Iterators can schedule any type of tree, including bushy trees.
- No operator is affected by the complexity of the whole plan.
Sorting & Hashing
- The purpose of many query-processing algorithms is to perform some kind of matching,
  - i.e., bringing items that are "alike" together and performing some operation on them.
- There are two basic approaches used for this purpose:
  - sorting
  - hashing
- These are the basis for many join algorithms.
Design Issues
- Sorting should be implemented as an iterator,
  - in order to ensure that the sort module interfaces well with the other operators (e.g., file scan or merge-join).
- The input to the sort module must be an iterator, and sort uses open, next, and close procedures to request its input;
  - therefore, sort input can come from a scan or from a complex query plan, and the sort operator can be inserted into a query plan at any place, or at several places.
Out of Memory
- If the input is larger than main memory, the open-sort procedure creates sorted runs and merges them until only one final merge phase is left.
- Dividing and combining.
More on Sorting
- For sorting large data sets, there are two distinct sub-algorithms:
  - one for sorting within main memory,
  - one for managing subsets of the data set on disk.
- For practical reasons, e.g., ensuring that a run fits into main memory, the disk management algorithm typically uses physical dividing and logical combining (merging).
- A point of practical importance is the fan-in or degree of merging, but this is a parameter rather than a defining algorithm property. (A sketch of external merge sort follows.)
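A minimal external merge sort sketch combining the two sub-algorithms above. Assumptions for illustration: runs are stored as plain text files of integers, and the whole merge is done in one pass; a real system would work with pages and a bounded fan-in.

```python
# External merge sort sketch: divide into memory-sized sorted runs,
# then combine them with a k-way merge.
import heapq, os, tempfile

def create_runs(values, memory_limit):
    """Divide: sort memory-sized chunks and write each as a sorted run."""
    runs = []
    for i in range(0, len(values), memory_limit):
        chunk = sorted(values[i:i + memory_limit])    # in-memory sort
        f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
        f.write("\n".join(map(str, chunk)))
        f.close()
        runs.append(f.name)
    return runs

def merge_runs(runs):
    """Combine: a single k-way merge of all runs (fan-in = len(runs))."""
    files = [open(r) for r in runs]
    streams = [(int(line) for line in f) for f in files]
    result = list(heapq.merge(*streams))              # priority-heap merge
    for f, r in zip(files, runs):
        f.close()
        os.unlink(r)
    return result

print(merge_runs(create_runs([9, 1, 7, 3, 8, 2, 6, 4, 5], memory_limit=3)))
```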
Creating Level-0 Runs
- In-memory sort, e.g., quicksort. With this approach, each run has the size of the allocated memory.
- Replacement selection, using a priority heap that supports insert and remove-smallest operations. The main purpose is to generate the longest possible initial ordered sequence for each run file.
Replacement Selection Sort
- The input buffer, output buffer, and RAM have the same size.
- [Figure: data flows Input File -> Input Buffer -> RAM -> Output Buffer -> Output File]
Replacement Selection Sort
- [Figure: the heap in memory holds 12, 10, 19, 31, 40, 56, 25, 21; the next input is 20]
- There is no need to worry when the run grows larger than the memory. (A sketch follows.)
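A minimal sketch of replacement selection run generation, assuming the whole "memory" is modeled as a priority heap of `memory_size` keys (the buffers from the figure are omitted). Heap entries are (run number, key) pairs so the current run is fully drained before the next run starts.

```python
# Replacement selection: remove-smallest to the output; a new key that is
# >= the last output key can still extend the current run, otherwise it
# is deferred to the next run. Runs grow beyond memory_size naturally.
import heapq

def replacement_selection(records, memory_size):
    """Generate sorted runs, each typically about twice memory_size long."""
    it = iter(records)
    heap = [(0, key) for _, key in zip(range(memory_size), it)]
    heapq.heapify(heap)
    runs, current_run, run = [], 0, []
    while heap:
        run_no, smallest = heapq.heappop(heap)
        if run_no != current_run:            # current run is exhausted
            runs.append(run)
            run, current_run = [], run_no
        run.append(smallest)                 # remove-smallest to output
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (current_run if nxt >= smallest
                                  else current_run + 1, nxt))
    if run:
        runs.append(run)
    return runs

# With memory for 4 keys, the first run holds 7 keys:
print(replacement_selection([12, 10, 19, 31, 40, 56, 25, 21, 20, 3], 4))
# -> [[10, 12, 19, 25, 31, 40, 56], [3, 20, 21]]
```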
Merging
- [Figure: Run 1, Run 2, and Run 3 on disk are merged in memory into an output buffer]
Merging
- Scans are faster if read-ahead and write-behind are used.
- Using large cluster sizes for the run files is very beneficial.
- The number of runs W is typically not a power of the fan-in.
- Merge as many runs as possible in each step, instead of always merging runs of the same level. (See the sketch below.)
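A sketch of merge planning when W is not a power of the fan-in F: make the first merge smaller than F so that every later merge uses the full fan-in. The formula for the first merge's size and the "merge the smallest runs first" heuristic are my reading of this optimization, not code from the paper.

```python
# Merge planning sketch: each full F-way merge reduces the run count by
# F - 1, so the first merge takes ((W - 2) mod (F - 1)) + 2 runs, leaving
# a count that full merges reduce exactly to 1.
import heapq

def plan_merges(run_lengths, fan_in):
    """Return merge steps; merges the smallest runs first (cheapest I/O)."""
    heapq.heapify(run_lengths)
    w = len(run_lengths)
    first = (w - 2) % (fan_in - 1) + 2 if w > fan_in else w
    steps = []
    while len(run_lengths) > 1:
        k = min(first, len(run_lengths))
        first = fan_in                        # all later merges are full
        merged = [heapq.heappop(run_lengths) for _ in range(k)]
        steps.append(merged)
        heapq.heappush(run_lengths, sum(merged))
    return steps

# 6 runs, fan-in 4: first merge only 3 runs, then one full 4-way merge.
print(plan_merges([10, 10, 10, 10, 10, 10], fan_in=4))
# -> [[10, 10, 10], [10, 10, 10, 30]]
```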
Quicksort vs. Replacement Selection
- Run files in RS are typically larger than memory, as opposed to QS, where they are the size of memory.
- QS results in bursts of reads and writes of entire memory loads from the input file to the initial run files, while RS alternates between individual reads and writes.
- In RS, memory management is more complex.
- The advantage of having fewer runs must be balanced against the different I/O pattern and the disadvantage of more complex memory management.
2.2 Hashing
- An alternative to sorting.
- The expected complexity of hashing algorithms is O(N), rather than O(N log N) as for sorting.
- Hash-based query processing algorithms use an in-memory hash table of database objects to perform their matching task. (A sketch follows.)
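A minimal in-memory hash join sketch of the matching task described above: build a hash table on the smaller input, then probe it with the other one. The function name and the sample data are illustrative assumptions.

```python
# In-memory hash join: O(N) build plus O(M) probe, versus O(N log N)
# for sort-based matching.
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Match rows whose keys are 'alike' using an in-memory hash table."""
    table = defaultdict(list)
    for row in build_rows:                       # build phase
        table[build_key(row)].append(row)
    for row in probe_rows:                       # probe phase
        for match in table.get(probe_key(row), []):
            yield match + row                    # concatenated match

employees = [(1, "Ann"), (2, "Bob")]
depts = [(1, "Sales"), (1, "R&D"), (3, "HR")]
print(list(hash_join(employees, depts, lambda r: r[0], lambda r: r[0])))
# -> [(1, 'Ann', 1, 'Sales'), (1, 'Ann', 1, 'R&D')]
```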
Hash Table Overflow
- When the hash table is larger than memory, hash table overflow occurs and must be dealt with.
- The input is divided into multiple partition files such that the partitions can be processed independently of one another.
- The concatenation of the results of all partitions is the result of the entire operation. (A partitioning sketch follows.)
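A sketch of overflow handling by partitioning: both inputs are split with the same hash function, and each partition pair is joined independently; the concatenation of the partition results is the full result. In-memory lists stand in for the partition files of a real system, and this reuses the `hash_join` sketch from the previous slide.

```python
# Partitioned (overflow) hash join sketch: matching rows always hash to
# the same partition, so partitions can be processed independently.

def partition(rows, key, num_partitions):
    """Route each row to a partition by hashing its key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return parts

def partitioned_join(build_rows, probe_rows, key, num_partitions=4):
    build_parts = partition(build_rows, key, num_partitions)
    probe_parts = partition(probe_rows, key, num_partitions)
    results = []
    for b, p in zip(build_parts, probe_parts):
        # Each partition is now small enough for the in-memory join.
        results.extend(hash_join(b, p, key, key))
    return results               # concatenation of all partition results
```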
Discussion 2 (Question 2)
- Fact 1: Nowadays, data is much larger than before.
- Fact 2: Memory overflow was a serious problem for both sorting and hashing at the time the paper was written.
- Fact 3: Machines are more powerful than at the time of this paper, e.g., they have larger memory.
The question is…
- For query execution in large database systems:
- Do you think that in the time since then the issues have gotten
  - better, because memories have gotten larger?
  - worse, because there is a bigger gap between the time it takes to access memory and the time it takes to access things on disk?
- Do you think the increasing power of machines can handle large data processing, or is it still worth making a great effort in this field, as this paper did? Are there any other new issues worth considering?
3. Disk Access
- File Scans
- Associative Access Using Indices
- Buffer Management
3.1 File Scans
- Make it fast -> read-ahead -> requires contiguous file allocation. (A read-ahead sketch follows.)
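A minimal read-ahead sketch, assuming a background thread that prefetches fixed-size blocks into a small queue while the scan consumes them; the function name and block size are illustrative. Contiguous allocation is what makes the sequential `read` calls fast.

```python
# Read-ahead sketch: a prefetch thread keeps `depth` blocks ready so the
# consumer never waits for the disk on a sequential scan.
import queue
import threading

def read_ahead_scan(path, block_size=64 * 1024, depth=2):
    """Yield file blocks; a prefetch thread stays `depth` blocks ahead."""
    blocks = queue.Queue(maxsize=depth)

    def prefetch():
        with open(path, "rb") as f:
            while chunk := f.read(block_size):    # sequential reads
                blocks.put(chunk)
        blocks.put(None)                          # end-of-file sentinel

    threading.Thread(target=prefetch, daemon=True).start()
    while (block := blocks.get()) is not None:
        yield block

# Usage (hypothetical file):
# for block in read_ahead_scan("data.bin"):
#     process(block)
```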
Associative Access Using Indices
- Goal:
  - To reduce the number of accesses to secondary storage.
- How?
  - By employing associative search techniques in the form of indices.
  - Indices map key or attribute values to locator information with which database objects can be retrieved. (A sketch follows.)
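A minimal sketch of an index as a key-to-locator mapping, assuming a sorted array of (key, record id) pairs; the record id stands in for the locator (e.g., page number and slot) with which the object itself would be fetched. The class and data are illustrative, not a real index structure like a B-tree.

```python
# Index sketch: associative lookup returns locators, so only the matching
# records need to be fetched from secondary storage.
import bisect

class Index:
    def __init__(self, entries):
        self.entries = sorted(entries)            # (key, record_id) pairs
        self.keys = [k for k, _ in self.entries]

    def lookup(self, key):
        """Return all locators for an exact key, without a full scan."""
        i = bisect.bisect_left(self.keys, key)
        out = []
        while i < len(self.keys) and self.keys[i] == key:
            out.append(self.entries[i][1])
            i += 1
        return out

idx = Index([("smith", 17), ("jones", 4), ("smith", 42)])
print(idx.lookup("smith"))    # -> [17, 42]: fetch only these records
```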