Database System Architecture Index Structures Hector Garcia-Molina Stijn Vansummeren
Index structure • Any data structure that takes as input a search key and efficiently returns the collection of matching records
Sequential File 10 20 30 40 50 60 70 80 90 100
Sequential File Dense Index 10 10 20 20 30 30 40 40 50 50 60 60 70 80 70 80 90 100 90 110 100 120
Sequential File Sparse Index 10 10 20 30 50 30 70 40 90 50 110 60 130 150 70 80 170 190 90 210 100 230
Sequential File Sparse 2nd level 10 10 10 20 90 30 170 50 30 250 70 40 90 330 50 110 410 60 130 490 150 70 570 80 170 190 90 210 100 230
Question: • Can we build a dense, 2nd level index for a dense index?
Sparse vs. Dense Tradeoff • Sparse: Less index space per record can keep more of index in memory • Dense: Can tell if any record exists without accessing file (Later: – sparse better for insertions – dense needed for secondary indexes)
Next: • Duplicate keys • Deletion/Insertion • Secondary indexes
Duplicate keys 10 10 10 20 20 30 30 30 40 45
Duplicate keys Dense index, one way to implement? 10 10 10 10 10 10 10 10 10 10 10 10 20 20 20 20 20 20 20 20 30 30 30 30 30 30 30 30 30 30 30 30 40 40 45 45
Duplicate keys Dense index, better way? 10 10 10 20 10 30 20 40 20 30 30 30 40 45
Duplicate keys Sparse index, one way? careful if looking 10 10 10 for 20 or 30! 10 10 20 20 30 20 30 30 30 40 45
Duplicate keys Sparse index, another way? place first new key from block 10 10 10 20 10 30 20 30 20 30 30 30 40 45
Deletion from sparse index 10 10 20 30 30 50 40 70 50 90 60 110 130 70 150 80
Deletion from sparse index ● delete record 40 10 10 20 30 30 50 40 70 50 90 60 110 130 70 150 80
Deletion from sparse index ● delete record 30 10 10 20 40 30 40 30 50 40 70 50 90 60 110 130 70 150 80
Deletion from sparse index ● delete records 30 & 40 10 10 20 50 30 30 50 70 40 70 50 90 60 110 130 70 150 80
Deletion from dense index 10 10 20 20 30 30 40 40 50 50 60 60 70 70 80 80
Deletion from dense index ● delete record 30 10 10 20 20 30 40 40 30 40 40 50 50 60 60 70 70 80 80
Insertion, sparse index case 10 10 20 30 30 40 60 40 50 60
Insertion, sparse index case ● insert record 34 10 10 20 30 30 40 34 60 40 50 60 our lucky day! we have free space where we need it!
Insertion, sparse index case ● insert record 15 10 10 20 15 20 30 30 20 40 30 60 40 50 • Illustrated: Immediate 60 reorganization • Variation: ● insert new block (chained file) ● update index
Insertion, sparse index case ● insert record 25 10 25 10 20 30 30 overflow blocks 40 60 (reorganize later...) 40 50 60
Insertion, dense index case • Similar • Often more expensive . . .
Secondary indexes Sequence field 30 50 20 70 80 40 100 10 90 60
Secondary indexes Sequence • Sparse index field 30 30 50 20 80 20 100 70 80 90 40 ... 100 10 does not make sense! 90 60
Secondary indexes Sequence • Dense index field 10 30 20 50 30 10 20 40 50 70 90 50 ... 80 60 40 sparse 70 ... high 100 10 level 90 60
With secondary indexes: • Lowest level is dense • Other levels are sparse Also: Pointers are record pointers (not block pointers; not computed)
Duplicate values & secondary indexes 20 10 20 40 10 40 10 40 30 40
Duplicate values & secondary indexes one option... 20 10 10 10 10 Problem: 20 20 40 excess overhead! 20 ● disk space 10 30 40 40 ● search time 40 10 40 40 40 30 ... 40
Duplicate values & secondary indexes another option... 20 10 10 20 Problem: 20 40 variable size 10 records in 40 30 40 index! 10 40 30 40
Duplicate values & secondary indexes 20 10 10 20 20 30 40 40 10 50 40 60 ... 10 40 30 40 buckets
Why “bucket” idea is useful Indexes Records Name: primary EMP (name,dept,floor,...) Dept: secondary Floor: secondary
Query: Get employees in (Toy Dept) ^ (2nd floor) Dept. index EMP Floor index Toy 2nd → Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s
This idea used in text information retrieval Document cat s ...the cat is fat ... dog ...was raining cats and dogs... ...Fido the dog ... Inverted lists
IR QUERIES • Find articles with “cat” and “dog” • Find articles with “cat” or “dog” • Find articles with “cat” and not “dog” • Find articles with “cat” in title • Find articles with “cat” and “dog” within 5 words
Summary so far • Conventional index – Basic Ideas: sparse, dense, multi- level… – Duplicate Keys – Deletion/Insertion – Secondary indexes – Buckets of Postings List
Outline/summary • Conventional Indexes • Sparse vs. dense • Primary vs. secondary • B trees --> Next • B+trees vs. indexed sequential • Hashing schemes
Conventional indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Inserts expensive, and/or - Lose sequentiality & balance
Example Index (sequential) 10 39 20 31 30 35 33 36 continuous 40 50 60 32 38 34 free space 70 80 90 overflow area (not sequential)
• NEXT: Another type of index – Give up on sequentiality of index – Try to get “balance”
B+Tree Example 3 5 11 30 30 35 100 101 110 Root 100 120 130 150 120 156 150 179 180 n=3 180 200
Sample non-leaf 57 81 95 to keys to keys to keys to keys 81 ≤ k<95 ≥ 95 57 ≤ k<81 < 57
Sample leaf node: 57 To record with key 57 To record 81 with key 81 From non-leaf node To record 95 with key 85 in sequence to next leaf
In textbook’s notation n=3 Leaf: 30 35 30 35 Non-leaf: 30 30
3 5 Lookup record(s) with key = 35 11 30 30 35 100 101 110 Root 100 120 130 150 120 156 150 179 180 180 200 n=3
3 5 Lookup record(s) with key = 40 11 30 30 35 100 101 110 Root 100 120 130 150 120 156 150 179 180 180 200 n=3
Range query: lookup record(s) with 35 <= key <= 150 n=3 Root 100 120 150 180 30 100 101 110 120 130 150 156 179 180 200 11 30 35 3 5
• The I/O cost of a lookup in a BTree is equal to longest path of the root to a leaf • Hence, the goal is to keep this longest path as short as possible • In particular: we want all leafs to be at the same depth in the tree (and hence want a balanced tree)
Size of nodes: n+1 pointers (fixed) n keys
Don’t want nodes to be too empty • Use at least (n+1)/2 Non-leaf: pointers Leaf : (n+1)/2 pointers to data
Leaf Non-leaf n=3 node Full node 3 120 5 150 11 180 30 30 35 min. counts even if null
B+tree rules tree of order d (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”
(3) Number of pointers/keys for B+tree Max Max Min Min ptrs keys ptrs → data keys Non-leaf (n+1)/ 2 (n+1)/ 2 - 1 (non-root) n+1 n Leaf (n+ 1) / 2 (n+ 1) / 2 (non-root) n+1 n Root n+1 n 2 1
Insert into B+tree (a) simple case – space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root
(a) Insert key = 32 n=3 100 30 11 30 31 32 3 5
(a) Insert key = 7 n=3 100 30 7 7 11 30 31 3 5 3 5
(c) Insert key = 160 100 160 150 120 156 150 179 180 160 179 180 n=3 180 200
(d) New root, insert n=3 45 new root 30 10 20 30 40 10 12 20 25 30 32 40 40 45 1 2 3
Deletion from B+tree (a) Simple case - no example (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf
(b) Coalesce with n=4 sibling – Delete 50 100 10 40 40 10 20 30 40 50
(c) Redistribute keys n=4 – Delete 50 35 100 10 40 35 10 20 30 35 40 50
(d) Non-leaf coalese n=4 – Delete 37 25 new root 40 25 10 20 30 40 30 30 37 10 14 20 22 25 26 40 45 1 3
Outline/summary • Conventional Indexes • Sparse vs. dense • Primary vs. secondary • B trees • B+trees vs. indexed sequential • Hashing schemes --> Next
Recommend
More recommend