indexing
play

Indexing Ramakrishnan/Gehrke Ch. 8 How index -learning turns no - PowerPoint PPT Presentation

Indexing Ramakrishnan/Gehrke Ch. 8 How index -learning turns no student pale Yet holds the eel of science by the tail. -- Alexander Pope (1688-1744) 340151 Big Databases & Cloud Services (P. Baumann) 1 Range Searches `` Find


  1. Indexing Ramakrishnan/Gehrke Ch. 8 “How index -learning turns no student pale Yet holds the eel of science by the tail.” -- Alexander Pope (1688-1744) 340151 Big Databases & Cloud Services (P. Baumann) 1

  2. Range Searches  `` Find all students with gpa > 3.0 ’’ • sorted file (by gpa!), fixed-length records: binary search to find first student, then scan to find rest • Cost of binary search can be quite high  Simple idea: Create an `index’ file containing only key values + search values • Can do binary search on (smaller) index file! Index File kN k1 k2 Data File tuple 1 tuple 2 tuple 3 tuple N 340151 Big Databases & Cloud Services (P. Baumann) 2

  3. Indexes  speeds up selections on predefined search key field(s) • one relation (~file) • Any attribute (except BLOB) can be search key for an index on the relation  collection of data entries • For efficient retrieval of all data entries k* for given key value k  Index vs sorted files • Both: search faster than just heap • Updates: index much faster 340151 Big Databases & Cloud Services (P. Baumann) 3

  4. B+ Tree Indexes Index pages Leaf pages  Ordered, balanced tree of degree m  Non-leaf pages: index entries = keys & pointers fill factor P 0 K 1 P 1 K 2 P 2 K m P m  Leaf pages: keys + data pointers; prev/next page chain 340151 Big Databases & Cloud Services (P. Baumann) 4

  5. B+-Tree Definition  B+-Tree of Order m has the following properties... • Property #1 - All leaf nodes must be at same level. • Property #2 - All nodes except root must have at least [m/2]-1 keys and maximum of m-1 keys. • Property #3 - All non leaf nodes except root (i.e. all internal nodes) must have at least m/2 children. • Property #5 - A non leaf node with n-1 keys must have n number of children. • key values in a node sorted in ascending order [http://btechsmartclass.com/data_structures/b-trees.html] 340151 Big Databases & Cloud Services (P. Baumann) 5

  6. B+ Tree: Operations Root Note how data entries in leaf level are sorted 17 Entries < 17 Entries => 17 27 5 13 30 2* 3* 5* 7* 8* 22* 24* 27* 29* 33* 34* 38* 39* 14* 16*  Find 28*? 29*? All > 15* and < 30*?  Insert/delete: Find data entry in leaf, change it; adjust parent if needed • change sometimes bubbles up the tree O( log F N ) where F = fan-out, N = # leaf pages • https://www.cs.usfca.edu/~galles/visualization/BTree.html 340151 Big Databases & Cloud Services (P. Baumann) 6

  7. More Exercises  Consider • # node reads from disk – determines speed • # comparisons – not performance relevant, but for understanding mechanics  Find 15, 20, 0  Find all 11 – 15; 20 – 32 [https://condor.depaul.edu/ntomuro/courses/417/notes/lecture3.html] 340151 Big Databases & Cloud Services (P. Baumann) 7

  8. B+ Trees in Practice  Typical fill-factor: 67% (outdated; today ~90%)  Average fan-out: 133  Typical capacities: Height 3: 133 3 = 2,352,637 records • Height 4: 133 4 = 312,900,700 records •  Can often hold top levels in buffer pool: Level 1 = 1 page = 8 Kbytes • Level 2 = 133 pages = 1 Mbyte • Level 3 = 17,689 pages = 133 MBytes • 340151 Big Databases & Cloud Services (P. Baumann) 8

  9. Hash-Based Indexes  Goal: compute address without disk access • get data in O(1)  Idea: distribute data evenly into fixed number of “ buckets ” • Compute location from key via Hashing function • Ex: h(int r) = r* a mod b , b prime relative to a • overflow pages  Hash index = bucket set + hashing function • Bucket = primary page + 0..n overflow pages  only equality, no range queries [Shankai Yan] 340151 Big Databases & Cloud Services (P. Baumann) 9

  10. Index Classification  Primary index: index contains primary key, otherwise: secondary index • Unique index  Clustered vs. unclustered: order of data records same as / `close to’ order of data entries  clustered index • Cost varies greatly 340151 Big Databases & Cloud Services (P. Baumann) 10

  11. Clustered vs. Unclustered Index  To build clustered index, first sort Heap file  Overflow pages may be needed for inserts • order of data recs `close to’, but not identical to, sort order Index entries UNCLUSTERED CLUSTERED direct search for data entries Index File Data entries Data file Data Records Data Records 340151 Big Databases & Cloud Services (P. Baumann) 11

  12. Composite Search Keys  Composite Search Keys = Examples of composite key indexes using lexicographic order: Search on combination of fields Equality query: • 11,80 11 Every field value equal to a constant value 12 12,10 Ex: for <sal,age> index: age=20 and sal=75 name age sal 12,20 12 Range query: • 20,75 bob 12 10 20 Some field value within interval <age, sal> <age> cal 11 80 Ex: age>20; or age=20 and sal>10 joe 12 20  Data entries in index sorted by search 10,12 sue 20 75 10 key to support range queries 20 20,12 Data records sorted by name 75,20 75 80 80,11 <sal, age> <sal> Data entries in index Data entries sorted by <sal,age> sorted by <sal> 340151 Big Databases & Cloud Services (P. Baumann) 12

  13. Composite Search Keys: How To Use  To retrieve Emp records with age =30 AND sal =4000: • index on < age,sal > better than index on age, or index on sal • Choice of index key orthogonal to clustering etc.  If condition is: 20< age <30 AND 3000< sal <5000: Clustered tree index on < age,sal > or < sal,age > is best •  If condition is: age =30 AND 3000< sal <5000: Clustered < age,sal > index much better • than < sal,age > index!  Composite indexes are larger, updated more often 340151 Big Databases & Cloud Services (P. Baumann) 13

  14. Index-Only Plans  A number of queries can be answered without retrieving any tuples from one or more of the relations involved …if a suitable index is available SELECT E.dno, COUNT (*) < E.dno > FROM Emp E GROUP BY E.dno SELECT E.dno, MIN (E.sal) < E.dno,E.sal > FROM Emp E Tree index! GROUP BY E.dno < E. age,E.sal > SELECT AVG (E.sal) or FROM Emp E < E.sal, E.age > WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000 Tree index! 340151 Big Databases & Cloud Services (P. Baumann) 14

  15. Index-Only Plans (contd.)  Index-only plans possible SELECT E.dno, COUNT (*) if key is <dno,age> or <age,dno> FROM Emp E • Which is better? WHERE E.age=30 GROUP BY E.dno 340151 Big Databases & Cloud Services (P. Baumann) 15

  16. Index-Only Plans (contd.)  Index-only plans possible SELECT E.dno, COUNT (*) if key is <dno,age> or <age,dno> FROM Emp E • Which is better? WHERE E.age=30 GROUP BY E.dno SELECT E.dno, COUNT (*) • What if we consider the second query? FROM Emp E WHERE E.age>30 GROUP BY E.dno 340151 Big Databases & Cloud Services (P. Baumann) 16

  17. Index-Only Plans (Contd.)  Index-only plans also for queries involving >1 table < E.dno > < E.dno,E.eid > SELECT D.dno FROM Dept D, Emp E SELECT D.dno, E.eid WHERE D.dno=E.dno FROM Dept D, Emp E WHERE D.dno=E.dno 340151 Big Databases & Cloud Services (P. Baumann) 17

  18. Understanding the Workload  For each query in workload: • Which relations accessed? • Which attributes retrieved? • Which attributes involved in selection/join conditions? How selective likely?  For each update in workload: • Which attributes involved in selection/join conditions? How selective likely? • Type of update (INSERT/DELETE/UPDATE) + attributes affected  Trade-off: Indexes can make queries faster, updates slower • …and require disk space 340151 Big Databases & Cloud Services (P. Baumann) 18

  19. Index Selection Guidelines  Attributes in WHERE clause candidates for index keys • Exact match condition hash index • Range query tree index • Clustering especially useful for range queries • can also help on equality queries if many duplicates  Multi-attribute search keys should be considered when a WHERE clause contains several conditions • Order of attributes is important for range queries • Such indexes can sometimes enable index-only strategies for important queries • For index-only strategies, clustering is not important! 340151 Big Databases & Cloud Services (P. Baumann) 19

  20. Index Selection Guidelines (contd.)  Choose indexes that benefit as many queries as possible • impact on updates: Indexes make queries faster, updates slower • require disk space!  only one index can be clustered per relation choose based on important queries that benefit most from clustering  understand how DBMS evaluates queries & creates query evaluation plans  …a practitioner's approach: • Consider most important queries in turn: • Consider best plan using current indexes • See if a better plan is possible with an additional index • If so, create it 340151 Big Databases & Cloud Services (P. Baumann) 20

Recommend


More recommend