
Indexing CS6320 1/29/2018 Shachi Deshpande, Yunhe Liu



  1. Indexing CS6320 1/29/2018 Shachi Deshpande, Yunhe Liu

  2. Content ● Motivation for Indexing ● B-tree ○ B-tree basics ○ The cost of B-tree operations ○ B-tree variants ○ B-tree in multi-user Environments ● Learned Index

  3. Motivation for Indexing Activity Question: Why do we need indexing?

  4. Motivation for Indexing Activity Question: Why do we need indexing? ● Items must be retrieved from secondary storage into memory before they are processed. ● Organizing files intelligently makes the retrieval process efficient. ● A large, randomly accessed file in a computer system usually has an associated index ○ which acts like the labels on file drawers ○ directing the searcher to the small part of the file that contains the desired item.

  5. Content ● Motivation for Indexing ● B-tree ○ B-tree basics ○ The cost of B-tree operations ○ B-tree variants ○ B-tree in multi-user Environments ● Learned Index

  6. Operations on a file ● File: a set of records (k_0, α_0), (k_1, α_1), (k_2, α_2), ... ● Each record: r_i = (k_i, α_i), where k_i is the key and α_i is the associated information ● Operations ○ Insert: add a new record (k_i, α_i), checking that k_i is unique. ○ Delete: remove the record (k_i, α_i), given k_i. ○ Find: retrieve α_i, given k_i. ○ Next: retrieve α_{i+1}, given that α_i was just retrieved.

  7. B-tree: Generalization of Binary Search Tree ● More than 2 paths may leave a given node. ● Compare the query key with the keys stored at the node to decide which path to take. ● Exact match: success. No exact match and a leaf is reached: failure.
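
The search rule on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not an on-disk implementation: the `Node` class, the `find` function and the sample tree are hypothetical names and data introduced here, and each node visit stands in for one access to secondary storage.

```python
from bisect import bisect_left

class Node:
    """Hypothetical in-memory B-tree node: sorted keys plus child pointers."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # an empty list marks a leaf

def find(node, key):
    """Return (found, nodes_visited) for a lookup that starts at `node`."""
    visited = 0
    while True:
        visited += 1
        i = bisect_left(node.keys, key)  # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited         # exact match: success
        if not node.children:
            return False, visited        # leaf reached without a match: failure
        node = node.children[i]          # follow the pointer between keys[i-1] and keys[i]

# Sample tree (assumed data): a root with two keys and three leaf children.
root = Node([20, 40], [Node([5, 10]), Node([25, 30]), Node([45, 50])])
print(find(root, 30))
print(find(root, 33))
```

Binary search inside the node is one choice among several; a linear scan of the keys would follow the same path through the tree.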

  8. B-tree of Order d ● Each node contains at most 2d keys and 2d + 1 pointers. ● Each node other than the root contains at least d keys and d + 1 pointers (at least ½ full).

  9. Balancing B-Tree: ● A find never visits more than 1 + log_d(n) nodes. ● Accessing each node is a separate access to secondary storage.

  10. Insertion 1. Find: proceed from the root to locate the proper leaf for insertion. 2. Insert: add the key; balance is restored by a procedure that moves from the leaf back towards the root.

  11. Insertion: Split Of the 2d + 1 keys, the smallest d are placed in one node, the largest d are placed in another node, and the remaining (middle) key is promoted to the parent node as a separator. Splitting can propagate all the way to the root, in which case the height of the tree increases by 1.
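
The split rule can be written down directly. The sketch below assumes the overfull node's keys are already available as a sorted Python list; `split_overfull` is a hypothetical helper name, and updating the parent (which may itself overflow) is left out.

```python
def split_overfull(keys, d):
    """Split a sorted list of 2d + 1 keys into (left, separator, right)."""
    if len(keys) != 2 * d + 1:
        raise ValueError("expected exactly 2d + 1 keys")
    left = keys[:d]           # the smallest d keys stay in one node
    separator = keys[d]       # the middle key is promoted to the parent
    right = keys[d + 1:]      # the largest d keys go to a new node
    return left, separator, right

left, separator, right = split_overfull([10, 20, 30, 40, 50], d=2)
print(left, separator, right)
```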

  12. Deletion Find the proper node. There are two possibilities: 1. The key to be deleted resides in a leaf. 2. The key resides in a nonleaf node. a. An adjacent key must be found and swapped into the vacated position. b. Use the leftmost key in the leftmost leaf of the right subtree.

  13. Deletion: Underflow After the removal, check that at least d keys remain in each node. If a node has fewer than d keys, an underflow is said to occur and redistribution of the keys becomes necessary.

  14. Deletion: Concatenation ● Redistribution of keys between two neighbors is possible only if they hold at least 2d keys between them. ● When fewer than 2d keys remain, a concatenation must occur. ○ The keys are simply combined into one of the nodes and the other is discarded. ○ Since only one node remains, the key separating the two nodes in the ancestor is no longer necessary and is added to the single remaining leaf. ○ If the descendants of the root are concatenated, they form a new root, decreasing the B-tree height by 1.
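
The choice between redistribution and concatenation can be sketched as one decision on two adjacent siblings and the key that separates them in the parent. The function name `rebalance` and its return format are assumptions made for this illustration; splitting the combined keys evenly is one reasonable redistribution policy, not the only one.

```python
def rebalance(left, separator, right, d):
    """Repair an underflow between two adjacent sibling nodes.

    Returns ("redistribute", new_left, new_separator, new_right) when the
    siblings hold at least 2d keys between them, else ("concatenate", merged).
    """
    combined = left + [separator] + right      # still sorted
    if len(left) + len(right) >= 2 * d:
        mid = len(combined) // 2               # new separator goes back to the parent
        return ("redistribute", combined[:mid], combined[mid], combined[mid + 1:])
    # Too few keys: merge everything, including the old separator, into one node.
    return ("concatenate", combined)

print(rebalance([10], 20, [30, 40, 50], d=2))
print(rebalance([10], 20, [30, 40], d=2))
```

In the concatenation case the parent loses one key, so the same check has to be repeated one level up.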

  15. Content ● Motivation for Indexing ● B-tree ○ B-tree basics ○ The cost of B-tree operations ○ B-tree variants ○ B-tree in multi-user Environments ● Learned Index

  16. The cost of operations ● Retrieval costs ● Insertion and Deletion costs ● Sequential Processing

  17. Retrieval costs ● The cost of a find operation grows as the logarithm of the file size. ● With d being the order of the B-tree, n being the number of keys in the file, and h being the height of the tree:
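
One standard way to relate h, d and n is to count the fewest keys a tree with h levels can hold. The sketch below assumes the root holds at least 1 key and every other node at least d keys, and counts the root as level 1; under those assumptions n >= 2(d + 1)^(h - 1) - 1, so h <= 1 + log_{d+1}((n + 1) / 2). The function names are hypothetical.

```python
import math

def min_keys(d, h):
    """Fewest keys in a B-tree of order d with h >= 1 levels.

    Level 1 is the root (1 key, 2 children); level j >= 2 has at least
    2 * (d + 1) ** (j - 2) nodes with d keys each.
    """
    return 2 * (d + 1) ** (h - 1) - 1

def max_height(d, n):
    """Largest number of levels a tree holding n keys can have (direct search)."""
    h = 0
    while min_keys(d, h + 1) <= n:
        h += 1
    return h

def height_bound(d, n):
    """Closed form of the same bound: 1 + log_{d+1}((n + 1) / 2)."""
    return 1 + math.log((n + 1) / 2, d + 1)

print(max_height(50, 1_000_000), height_bound(50, 1_000_000))
```

For d = 50 and a million keys the tree has at most 4 levels, which is consistent with the looser 1 + log_d(n) bound quoted earlier in the deck.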

  18. Insertion and Deletion costs - Tree Height ● Insertion and deletion may require additional secondary storage accesses beyond the cost of a find operation as they progress back up the tree. ● Overall, the costs are at most doubled, so the height of the tree still dominates the cost. ● In a B-tree of order d for a file of n records, insertion and deletion take time proportional to log_d(n) in the worst case.

  19. Insertion and Deletion costs - Tree Order ● As the branching factor d increases, the base of the logarithm increases and the cost of find, insert and delete operations decreases. ● There are practical limits on the size of a node. ○ Most hardware systems bound the amount of data that can be transferred with one access to secondary storage. ○ The cost estimate hides a constant factor that grows as the size of the data transferred increases.
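
To see how quickly the bound shrinks with the order, the 1 + log_d(n) figure from the balancing slide can be tabulated for a few values of d. The file size and the list of orders below are arbitrary choices for illustration, and the per-access transfer cost that grows with node size is deliberately not modeled.

```python
import math

def max_nodes_visited(n, d):
    """Upper bound 1 + log_d(n) on nodes visited by one find (requires d >= 2)."""
    return 1 + math.log(n, d)

n = 1_000_000  # assumed file size
bounds = {d: max_nodes_visited(n, d) for d in (2, 10, 50, 200)}
for d, bound in bounds.items():
    print(f"d={d:>3}: at most {bound:.2f} node accesses")
```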

  20. Sequential Processing ● Use the next operation to process all records in key-sequence order. ● A B-tree may not do well in sequential processing ○ A preorder tree walk requires space for at least h = log_d(n+1) nodes in main memory, since it stacks the nodes along a path from the root to avoid reading them twice. ○ Processing a next operation may require tracing a path through several nodes before reaching the desired key. ● The B+-tree improves sequential processing performance.

  21. Content ● Motivation for Indexing ● B-tree ○ B-tree basics ○ The cost of B-tree operations ○ B-tree variants ○ B-tree in multi-user Environments ● Learned Index

  22. B-Tree variants ● Different variations ○ On overflow, splitting vs. redistributing keys to a neighbor ○ Searching within a node once it has been retrieved from secondary storage, using different search methods (e.g. linear search, binary search) ○ Varying the “order” at each depth ● B*-Trees ● B+-Trees

  23. B*-Trees ● Each node is at least ⅔ full (instead of just ½ full). ● Splitting is delayed until 2 sibling nodes are full; their keys are then divided among 3 nodes, each ⅔ full. ● Increases storage utilization. ● Speeds up search, since the height of the tree is reduced.
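
The two-into-three split can be illustrated on a flat key list. This sketch assumes the input is the sorted run made of both full siblings, their separator from the parent, and the newly inserted key; `two_to_three_split` is a hypothetical name, and real implementations differ in how they round the node sizes.

```python
def two_to_three_split(keys):
    """Divide a sorted key run into three nodes and two separators."""
    n = len(keys) - 2                  # keys left for the nodes after 2 separators
    a = (n + 2) // 3                   # size of the first node
    b = (n + 1) // 3                   # size of the second node; the third gets n // 3
    first = keys[:a]
    sep1 = keys[a]                     # first separator, promoted to the parent
    second = keys[a + 1:a + 1 + b]
    sep2 = keys[a + 1 + b]             # second separator, promoted to the parent
    third = keys[a + 2 + b:]
    return first, sep1, second, sep2, third

# d = 3: nodes hold up to 6 keys; two full siblings + separator + new key = 14 keys.
print(two_to_three_split(list(range(1, 15))))
```

With d = 3 each resulting node holds 4 of a possible 6 keys, which is the ⅔ occupancy the slide describes.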

  24. B+-Trees structure ● All keys reside in the leaves. ● Nonleaf levels are organized as a B-tree and consist only of an index. ● Leaf nodes are usually linked together left-to-right.

  25. B+-Tree Operations ● Insertion: ○ Proceeds almost identically to B-tree insertion. ○ During a split, a copy of the middle key is promoted instead of the key itself. ● Deletion: ○ The key to be deleted always resides in a leaf node, which makes deletion simple. ○ As long as the leaf remains at least half full, the upper index levels do not need to change. ● Find: ○ Search does not stop on an exact match in the index; instead the right pointer is followed. ○ Always proceeds all the way to a leaf.
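
The find rule and the benefit of linked leaves can be sketched together. The `Leaf` and `Index` classes, the sample tree and the convention that a separator is a copy of the smallest key in the leaf to its right are assumptions made for this example.

```python
from bisect import bisect_right

class Leaf:
    """Hypothetical B+-tree leaf: keys, their records, and a link to the next leaf."""
    def __init__(self, keys, values):
        self.keys, self.values, self.next = keys, values, None

class Index:
    """Hypothetical index node: separator keys and child pointers only."""
    def __init__(self, keys, children):
        self.keys, self.children = keys, children

def find_leaf(node, key):
    """Descend to the leaf that may hold `key`; never stops in the index."""
    while isinstance(node, Index):
        # bisect_right sends an exact separator match to the right pointer.
        node = node.children[bisect_right(node.keys, key)]
    return node

def scan(root, lo, hi):
    """Yield (key, value) pairs with lo <= key <= hi by walking the leaf chain."""
    leaf = find_leaf(root, lo)
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > hi:
                return
            if k >= lo:
                yield k, v
        leaf = leaf.next

l1, l2, l3 = Leaf([1, 3], ["a", "b"]), Leaf([5, 7], ["c", "d"]), Leaf([9, 11], ["e", "f"])
l1.next, l2.next = l2, l3
root = Index([5, 9], [l1, l2, l3])
print(list(scan(root, 3, 9)))
```

After the single descent for the lower bound, the scan touches only leaves, so no path from the root has to be retraced for each next operation.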

  26. Content ● Motivation for Indexing ● B-tree ○ B-tree basics ○ The cost of B-tree operations ○ B-tree variants ○ B-tree in multi-user Environments ● Learned Index

  27. B-tree in Multiuser Environment ● Should permit several user requests to be processed simultaneously. ● Problem: one process may read a node and follow one of its links while another process is changing it. ● Find operations go top-down, while insertion and deletion require bottom-up access.

  28. B-tree in Multiuser Environment: Locking ● Find operation ○ Locks a node once it has been read ○ Releases the lock when the search proceeds to the next level ○ A reader locks at most two nodes at any time. ● Update operation ○ Places a reservation on each node it accesses ○ A reservation is converted to an absolute lock if the update's changes will propagate to the reserved node; otherwise the reservation is cancelled ○ A reserved node may be read but may not be reserved a second time
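
The reader side of this protocol can be sketched with one lock per node. This is an interpretation of the slide, not a complete concurrency scheme: the child is locked before the parent is released, so at most two locks are held at once, and the reservation mechanism for updates is not modeled. `Node` and `locked_find` are hypothetical names.

```python
import threading
from bisect import bisect_right

class Node:
    """Hypothetical B-tree node carrying its own lock."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []
        self.lock = threading.Lock()

def locked_find(root, key):
    """Return (found, max_locks_held) for a top-down lookup."""
    node = root
    node.lock.acquire()
    max_held = 1
    while True:
        if key in node.keys:
            node.lock.release()
            return True, max_held
        if not node.children:
            node.lock.release()
            return False, max_held
        child = node.children[bisect_right(node.keys, key)]
        child.lock.acquire()      # parent and child are both locked here
        max_held = 2
        node.lock.release()       # release the parent once the child is held
        node = child

root = Node([20, 40], [Node([5, 10]), Node([25, 30]), Node([45, 50])])
print(locked_find(root, 30))
```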

  29. B-tree in Multiuser Environment: Security ● Protection of information in a multiuser environment. ● Memory protection mechanism of paging. ● Encryption techniques can be used to protect contents of a file outside of the underlying system.

  30. Summary of B-tree ● Efficient, simple and easily maintained. ● Find, insert and delete operations have logarithmic cost. ● Guarantees at least 50% storage utilization. ● B+-trees allow efficient sequential processing. ● There are many variants of the B-tree. ● Can be used in a multiuser environment.

  31. Content ● Motivation for Indexing ● B-tree ○ B-tree basics ○ The cost of B-tree operations ○ B-tree variants ○ B-tree in multi-user Environments ● Learned Index

  32. Indexes as Models ● B-Tree Index : Maps key to position of record in sorted array ● Hash Index : Maps key to position of record in unsorted array ● BitMap Index : Checks if a data-record exists

  33. Indexes as Models ● B-Tree Index : Maps key to position of record in sorted array ● Hash Index : Maps key to position of record in unsorted array ● BitMap Index : Checks if a data-record exists Can we replace these traditional models with other kinds of models?

  34. Activity If we have fixed-length records with continuous integer keys from 1 to 1 million, can we find a better way to access the record corresponding to any given key? What if the length of each record was one unit greater than that of its immediate predecessor?
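
One possible answer to the activity is to compute the record's position directly from the key, with no tree at all. The sketch assumes records are stored contiguously starting at offset 0 and that keys start at 1; `record_size` and `first_len` are hypothetical parameters.

```python
def offset_fixed(key, record_size):
    """Byte offset of record `key` when every record has the same length."""
    return (key - 1) * record_size

def offset_growing(key, first_len):
    """Offset when record j has length first_len + (j - 1).

    The m = key - 1 earlier records contribute m * first_len plus
    0 + 1 + ... + (m - 1) = m * (m - 1) / 2 extra units.
    """
    m = key - 1
    return m * first_len + m * (m - 1) // 2

print(offset_fixed(1_000_000, 100))
print(offset_growing(4, 10))
```

Both lookups cost O(1) arithmetic instead of O(log n) node visits, because the key-to-position mapping is known in closed form.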

  35. Knowing the data distribution helps! ● ML, especially neural nets, can learn a variety of data distributions, mixtures and other patterns ● Balancing the complexity of the model against its accuracy is important
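
As a rough illustration of an index acting as a model, the sketch below fits a straight line from key to position in a sorted array, records the worst prediction error, and then searches only within that error window. A least-squares line is an assumption chosen for brevity (the slides mention neural nets), and `fit_linear` and `lookup` are hypothetical names.

```python
from bisect import bisect_left

def fit_linear(keys):
    """Fit position ~ slope * key + intercept over sorted, distinct keys."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys)) / var
    intercept = mean_p - slope * mean_k
    # Worst-case distance between predicted and true position.
    max_err = max(abs(p - round(slope * k + intercept)) for p, k in enumerate(keys))
    return slope, intercept, max_err

def lookup(keys, model, key):
    """Return the position of `key`, or None, searching only near the prediction."""
    slope, intercept, max_err = model
    guess = round(slope * key + intercept)
    lo = min(max(0, guess - max_err), len(keys))
    hi = max(lo, min(len(keys), guess + max_err + 1))
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None

keys = [i * i for i in range(200)]      # assumed, deliberately non-linear data
model = fit_linear(keys)
print(lookup(keys, model, 144), model[2])
```

The recorded error bound is where the complexity-versus-accuracy trade-off shows up: a simpler model gives a wider window to search, a more accurate one a narrower window.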

  36. What should the model learn?
