Indexing Ramakrishnan/Gehrke Ch. 8 “How index -learning turns no student pale Yet holds the eel of science by the tail.” -- Alexander Pope (1688-1744) 340151 Big Databases & Cloud Services (P. Baumann) 1
Range Searches `` Find all students with gpa > 3.0 ’’ • sorted file (by gpa!), fixed-length records: binary search to find first student, then scan to find rest • Cost of binary search can be quite high Simple idea: Create an `index’ file containing only key values + search values • Can do binary search on (smaller) index file! Index File kN k1 k2 Data File tuple 1 tuple 2 tuple 3 tuple N 340151 Big Databases & Cloud Services (P. Baumann) 2
Indexes speeds up selections on predefined search key field(s) • one relation (~file) • Any attribute (except BLOB) can be search key for an index on the relation collection of data entries • For efficient retrieval of all data entries k* for given key value k Index vs sorted files • Both: search faster than just heap • Updates: index much faster 340151 Big Databases & Cloud Services (P. Baumann) 3
B+ Tree Indexes Index pages Leaf pages Ordered, balanced tree of degree m Non-leaf pages: index entries = keys & pointers fill factor P 0 K 1 P 1 K 2 P 2 K m P m Leaf pages: keys + data pointers; prev/next page chain 340151 Big Databases & Cloud Services (P. Baumann) 4
B+-Tree Definition B+-Tree of Order m has the following properties... • Property #1 - All leaf nodes must be at same level. • Property #2 - All nodes except root must have at least [m/2]-1 keys and maximum of m-1 keys. • Property #3 - All non leaf nodes except root (i.e. all internal nodes) must have at least m/2 children. • Property #5 - A non leaf node with n-1 keys must have n number of children. • key values in a node sorted in ascending order [http://btechsmartclass.com/data_structures/b-trees.html] 340151 Big Databases & Cloud Services (P. Baumann) 5
B+ Tree: Operations Root Note how data entries in leaf level are sorted 17 Entries < 17 Entries => 17 27 5 13 30 2* 3* 5* 7* 8* 22* 24* 27* 29* 33* 34* 38* 39* 14* 16* Find 28*? 29*? All > 15* and < 30*? Insert/delete: Find data entry in leaf, change it; adjust parent if needed • change sometimes bubbles up the tree O( log F N ) where F = fan-out, N = # leaf pages • https://www.cs.usfca.edu/~galles/visualization/BTree.html 340151 Big Databases & Cloud Services (P. Baumann) 6
More Exercises Consider • # node reads from disk – determines speed • # comparisons – not performance relevant, but for understanding mechanics Find 15, 20, 0 Find all 11 – 15; 20 – 32 [https://condor.depaul.edu/ntomuro/courses/417/notes/lecture3.html] 340151 Big Databases & Cloud Services (P. Baumann) 7
B+ Trees in Practice Typical fill-factor: 67% (outdated; today ~90%) Average fan-out: 133 Typical capacities: Height 3: 133 3 = 2,352,637 records • Height 4: 133 4 = 312,900,700 records • Can often hold top levels in buffer pool: Level 1 = 1 page = 8 Kbytes • Level 2 = 133 pages = 1 Mbyte • Level 3 = 17,689 pages = 133 MBytes • 340151 Big Databases & Cloud Services (P. Baumann) 8
Hash-Based Indexes Goal: compute address without disk access • get data in O(1) Idea: distribute data evenly into fixed number of “ buckets ” • Compute location from key via Hashing function • Ex: h(int r) = r* a mod b , b prime relative to a • overflow pages Hash index = bucket set + hashing function • Bucket = primary page + 0..n overflow pages only equality, no range queries [Shankai Yan] 340151 Big Databases & Cloud Services (P. Baumann) 9
Index Classification Primary index: index contains primary key, otherwise: secondary index • Unique index Clustered vs. unclustered: order of data records same as / `close to’ order of data entries clustered index • Cost varies greatly 340151 Big Databases & Cloud Services (P. Baumann) 10
Clustered vs. Unclustered Index To build clustered index, first sort Heap file Overflow pages may be needed for inserts • order of data recs `close to’, but not identical to, sort order Index entries UNCLUSTERED CLUSTERED direct search for data entries Index File Data entries Data file Data Records Data Records 340151 Big Databases & Cloud Services (P. Baumann) 11
Composite Search Keys Composite Search Keys = Examples of composite key indexes using lexicographic order: Search on combination of fields Equality query: • 11,80 11 Every field value equal to a constant value 12 12,10 Ex: for <sal,age> index: age=20 and sal=75 name age sal 12,20 12 Range query: • 20,75 bob 12 10 20 Some field value within interval <age, sal> <age> cal 11 80 Ex: age>20; or age=20 and sal>10 joe 12 20 Data entries in index sorted by search 10,12 sue 20 75 10 key to support range queries 20 20,12 Data records sorted by name 75,20 75 80 80,11 <sal, age> <sal> Data entries in index Data entries sorted by <sal,age> sorted by <sal> 340151 Big Databases & Cloud Services (P. Baumann) 12
Composite Search Keys: How To Use To retrieve Emp records with age =30 AND sal =4000: • index on < age,sal > better than index on age, or index on sal • Choice of index key orthogonal to clustering etc. If condition is: 20< age <30 AND 3000< sal <5000: Clustered tree index on < age,sal > or < sal,age > is best • If condition is: age =30 AND 3000< sal <5000: Clustered < age,sal > index much better • than < sal,age > index! Composite indexes are larger, updated more often 340151 Big Databases & Cloud Services (P. Baumann) 13
Index-Only Plans A number of queries can be answered without retrieving any tuples from one or more of the relations involved …if a suitable index is available SELECT E.dno, COUNT (*) < E.dno > FROM Emp E GROUP BY E.dno SELECT E.dno, MIN (E.sal) < E.dno,E.sal > FROM Emp E Tree index! GROUP BY E.dno < E. age,E.sal > SELECT AVG (E.sal) or FROM Emp E < E.sal, E.age > WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000 Tree index! 340151 Big Databases & Cloud Services (P. Baumann) 14
Index-Only Plans (contd.) Index-only plans possible SELECT E.dno, COUNT (*) if key is <dno,age> or <age,dno> FROM Emp E • Which is better? WHERE E.age=30 GROUP BY E.dno 340151 Big Databases & Cloud Services (P. Baumann) 15
Index-Only Plans (contd.) Index-only plans possible SELECT E.dno, COUNT (*) if key is <dno,age> or <age,dno> FROM Emp E • Which is better? WHERE E.age=30 GROUP BY E.dno SELECT E.dno, COUNT (*) • What if we consider the second query? FROM Emp E WHERE E.age>30 GROUP BY E.dno 340151 Big Databases & Cloud Services (P. Baumann) 16
Index-Only Plans (Contd.) Index-only plans also for queries involving >1 table < E.dno > < E.dno,E.eid > SELECT D.dno FROM Dept D, Emp E SELECT D.dno, E.eid WHERE D.dno=E.dno FROM Dept D, Emp E WHERE D.dno=E.dno 340151 Big Databases & Cloud Services (P. Baumann) 17
Understanding the Workload For each query in workload: • Which relations accessed? • Which attributes retrieved? • Which attributes involved in selection/join conditions? How selective likely? For each update in workload: • Which attributes involved in selection/join conditions? How selective likely? • Type of update (INSERT/DELETE/UPDATE) + attributes affected Trade-off: Indexes can make queries faster, updates slower • …and require disk space 340151 Big Databases & Cloud Services (P. Baumann) 18
Index Selection Guidelines Attributes in WHERE clause candidates for index keys • Exact match condition hash index • Range query tree index • Clustering especially useful for range queries • can also help on equality queries if many duplicates Multi-attribute search keys should be considered when a WHERE clause contains several conditions • Order of attributes is important for range queries • Such indexes can sometimes enable index-only strategies for important queries • For index-only strategies, clustering is not important! 340151 Big Databases & Cloud Services (P. Baumann) 19
Index Selection Guidelines (contd.) Choose indexes that benefit as many queries as possible • impact on updates: Indexes make queries faster, updates slower • require disk space! only one index can be clustered per relation choose based on important queries that benefit most from clustering understand how DBMS evaluates queries & creates query evaluation plans …a practitioner's approach: • Consider most important queries in turn: • Consider best plan using current indexes • See if a better plan is possible with an additional index • If so, create it 340151 Big Databases & Cloud Services (P. Baumann) 20
Recommend
More recommend