Performance Analysis: lookup/insert a search key Assumption: The grid index file can be store in memory Insert performance In addition to the lookup cost 1 more block write operation to update the bucket block If overflow , need to update the overflow link in the bucket and write an overflow block ) 49 / 149
Using a grid index in multi-dimensional queries Performance of Grid index for the commonly used multi-dimensional queries Assumption: The grid index file can be stored entirely in memory 50 / 149
1) Partial Match queries The query specifies conditions on some dimensions but not on all dimensions Find all jewelry purchases by people with age = 50 You will access m disk blocks ( m is some dimension of the grid) 51 / 149
2) Range queries Find objects that are located either partial or wholly within a certain range Find all jewelry purchases by people whose 35 ≤ age ≤ 50 , 50 K ≤ salary ≤ 100 K In this example, we must access 4 disk blocks 52 / 149
Announcement Coding assignment 2 due date: Sunday, March 11, 2018 by midnight (Chicago time:) Quiz 1: Post: Friday February 23. Due on Blackboard: Tuesday February 27 by midnight (Chicago time) Midterm: Close notes/book/friends: March 5 in class time 53 / 149
3) Nearest neighbor queries Find the nearest neighbor of a data point 54 / 149
3) Nearest neighbor queries Start by finding the nearest neighbor in the bucket that contains the data point 55 / 149
3) Nearest neighbor queries This distance will limit the block where you need to search to all blocks that intersect with this circle: 56 / 149
3) Nearest neighbor queries Expand the search region in an adjacent bucket that contained within the circle: 57 / 149
3) Nearest neighbor queries And so forth 58 / 149
3) Nearest neighbor queries And so forth 59 / 149
3) Nearest neighbor queries Note: You may need to expand the search range beyond the adjacent regions The nearest neighbor is outside the adjacent regions You must use the current nearest neighbor and the grid lines to decide whether you need to expend the range of the search 60 / 149
3) Nearest neighbor queries: Performance The expanding range search will access on average 9 data blocks (in a 2-dimensional grid index) 61 / 149
4) Where-am-I queries Given a location (i.e., coordinate) Find the object(s ) that contains the location Grid index cannot represent objects (can only present points) ⇒ Grid Index cannot handle Where-am-I type of queries The only kind of index that can handle Where-am-I queries is the R-tree (Region-tree) (Discussed later) 62 / 149
Grid Index: Summary + Good for multiple-key search - Space, management overhead (nothing is free) - Need partitioning ranges that evenly split keys 63 / 149
Grid Index A major problem with Grid Index files is Poor occupancy rate at many grid buckets Especially when you have 3 or more dimensions. You will have many buckets that are empty. 64 / 149
Multi-dimensional index structures Hash like structures Grid files Partitioned Hash functions Tree like structures Multiple key indexes kd-trees Quad trees R-trees 65 / 149
Partitioned Hashing Traditional hashing Problem with traditional hashing If the key is composite and some component of the key is not known we cannot compute a meaningful hash value at all 66 / 149
Partitioned Hashing Partitioned Hashing The key is a composite: Use n hash functions , one function on one component 67 / 149
Partitioned Hashing Partitioned Hashing The hash value is the concatenation of the individual hash function values 68 / 149
Partitioned Hashing: Example 69 / 149
Advantage of Partitioned Hashing Partitioned Hashing can generate a meaningful hash value for incomplete keys 70 / 149
Partitioned Hashing: A complete example Data on people who buy jewelry Given hash functions Some Hash Function values 71 / 149
Partitioned Hashing: A complete example The Partitioned Hash index 72 / 149
Using a Partitioned Hashing The Partitioned Hash index 73 / 149
1) Partial Match queries Find people with age = 50 Age = 50 will hash to the hash value Hash(age) = 0 × × . Start at bucket 000 and scan to bucket 011 74 / 149
2) Range queries Find objects that are located either partial or wholly within a certain range Find people such that: 35 ≤ age ≤ 50 , 50 K ≤ salary ≤ 100 K 75 / 149
2) Range queries a) Hash all values inside the range Note: the block pointers can have duplicates b) Collect all the buckets (eliminate duplicate block pointers) c) Access all (unique) buckets (disk blocks) ⇒ Hashing is not appropriate for range type queries 76 / 149
3) Nearest neighbor queries Hashing is completely useless for nearest neighbor type queries Because: There is no notion of distance in the hash function Example: find records that with distance ≤ 1 to search key = 1 We hash the search key 1 77 / 149
3) Nearest neighbor queries However, we cannot use the distance in the hash table to locate “nearby” objects (records) The value 2 is near the value 1, but may get hash very far away 78 / 149
Property of hashing: Closeness of bucket indexes has nothing to do with real distance between data points (because hashing computes a random number) 79 / 149
4) Where-am-I queries Hashing is also not useful here either Because hashing provide no information on distance 80 / 149
Advantage of Partitioned Hashing Good hash functions will randomize the records ⇒ Partitioned hashing will achieve good occupancy rate per bucket 81 / 149
Multi-dimensional index structures Hash like structures Grid files Partitioned Hash functions Tree like structures Multiple key indexes kd-trees Quad trees R-trees 82 / 149
Multiple-key index special case of a multilevel index using different types of search keys in each level 83 / 149
Multiple-key index: Example Data on people who buy jewelry A multiple-key index on keys (age, salary) 84 / 149
Using a Multiple-key index: 1) Partial Match queries Find all people with age = 25 Use the index on age to find the index block(s) for age = 25 Then, scan all entries in the salary index file (list of blocks) indexed by age= 25 to find the records 85 / 149
1) Partial Match queries Multiple-key index for partial match query will only be useful when the first dimension is given We cannot use multiple-key index to process the following query efficiently 86 / 149
1) Partial Match queries Find all people who earn $60,000 who buy jewelry. We will need to scan the first index Result: many disk accesses 87 / 149
2) Range queries Find objects that are located either partial or wholly within a certain range Find people such that: 35 ≤ age ≤ 50 , 50 K ≤ salary ≤ 100 K 88 / 149
2) Range queries Use the range of age to find all of the subindexes that might contain answer Only need to search a limited number of lower level index files 89 / 149
3) Nearest neighbor queries The multiple key index can help in the processing of Nearest neighbor queries BUT: It involves a complicated expanding range search algorithm in “nearby branches” of the index tree 90 / 149
4) Where-am-I queries Multiple-key index are not used in Where-am-I queries 91 / 149
Multi-dimensional index structures Hash like structures Grid files Partitioned Hash functions Tree like structures Multiple key indexes kd-trees Quad trees R-trees 92 / 149
kd (k-dimensional) tree: The kd-tree as a main memory data structure Adaptation of the kd-tree for disk storage 93 / 149
Review: Binary Search Tree Binary Search Tree (BST) is a binary tree where The values in the nodes in the left subtree of the node x in the tree has a smaller value than x The values in the nodes in the right subtree of the node x in the tree has a greater value than x Notice the above property holds for every node in the binary tree 94 / 149
Review: Binary Search Tree: Example 95 / 149
Review: Binary Search Tree: Example 96 / 149
The kd-tree The kd-tree is a generalization of the classic Binary Search Tree (BST) The search key used at different levels belongs to a different dimension (domain) The dimensions at different levels will wrap around (i.e., circulate) 97 / 149
Example: a 2-dimensional kd-tree 2 dimentions: x and y 98 / 149
Properties Subtrees of x 1 must satisfy this property 99 / 149
Properties Subtrees of y 1 and y 2 must satisfy this property And so on (for every level of the kd-tree ) 100 / 149
Recommend
More recommend