CSE 232A Graduate Database Systems Arun Kumar Topic 2: Indexing - PowerPoint PPT Presentation

CSE 232A   Graduate Database Systems Arun Kumar Topic 2: Indexing and Sorting Chapters 10, 11, and 13 of Cow Book Slide ACKs: Jignesh Patel, Paris Koutris 1

Motivation for Indexing Consider the following SQL query: ❖ Movies (M) MovieID Name Year Director SELECT * FROM Movies WHERE Year=2017 Q: How to obtain the matching records from the file? Heap file? Need to do a linear scan! O(N) I/O and CPU ❖ “Sorted” file? Binary search! O(log 2 (N)) I/O and CPU ❖ Indexing helps retrieve records faster for selective predicates! SELECT * FROM Movies WHERE Year>=2000 AND Year<2010 2

Another View of Storage Manager Access Methods Recovery Manager Control Manager Concurrency Sorted Hash File Index B+-tree Heap Index File Buffer Manager I/O Manager I/O Accesses 3

Indexing: Outline Overview and Terminology ❖ B+ Tree Index ❖ Hash Index ❖ Learned Index ❖ 4

Indexing Index : A data structure to speed up record retrieval ❖ Search Key : Attribute(s) on which file is indexed; also ❖ called Index Key (used interchangeably) Any permutation of any subset of a relation’s attributes ❖ can be index key for an index Index key need not be a primary/candidate key ❖ Two main types of indexes: ❖ B+ Tree index: good for both range and equality search ❖ Hash index: good for equality search ❖ 5

Overview of Indexes Need to consider efficiency of search, insert, and delete ❖ Primarily optimized to reduce (disk) I/O cost ❖ B+ Tree index: ❖ O(log F (N)) I/O and CPU cost for equality search (N: ❖ number of “data entries”; F: “fanout” of non-leaf node) Range search, Insert, and Delete all start with an ❖ equality search Hash index: ❖ O(1) I/O and CPU cost for equality search ❖ Insert and delete start with equality search ❖ Not “good” for range search! ❖ 6

What is stored in the Index? 2 things: Search/index key values and data entries ❖ Alternatives for data entries for a given key value k : ❖ AltRecord : Actual data records of file that match k ❖ AltRID : < k , RID of a record that matches k > ❖ AltRIDlist : < k , list of RIDs of records that match k> ❖ API for operations on records: ❖ Search (IndexKey); could be a predicate for B+Tree ❖ Insert (IndexKey, data entry) ❖ Delete (IndexKey); could be a predicate for B+Tree ❖ 7

Overview of B+ Tree Index Index Entries Entries of the form: (Non-leaf pages) (IndexKey value, PageID) Entries of the form: AltRID: (IndexKey value, RID) Data Entries (Leaf pages) Non-leaf pages do not contain data values; they ❖ contain [d, 2d] index keys; d is order parameter Height-balanced tree; only root can have [1,d) keys ❖ Leaf pages in sorted order of IndexKey; connected as a ❖ doubly linked list Q: What is the difference between “B+ Tree” and “B Tree”? 8

Overview of Hash Index Bucket pages 0 Hash function 1 SearchKey h N-1 Overflow Primary pages bucket pages Bucket pages have data entries (same 3 Alternatives) ❖ Hash function helps obtain O(1) search time ❖ 9

Trade-offs of Data Entry Alternatives Pros and cons of alternatives for data entries: ❖ AltRecord : Entire file is stored as an index! If ❖ records are long, data entries of index are large and search time could be high AltRID and AltRIDlist : Data entries typically smaller ❖ than records; often faster for equality search AltRIDlist has more compact data entries than ❖ AltRID but entries are variable-length Q: A file can have at most one AltRecord index. Why? 10

More Indexing-related Terminology Composite Index: IndexKey has > 1 attributes ❖ Primary Index: IndexKey contains the primary key ❖ Secondary Index: Any index that not a primary index ❖ Unique Index: IndexKey contains a candidate key ❖ All primary indexes are unique indexes! ❖ MovieID Name Year Director IMDB_URL IMDB_URL is a Index on MovieID? candidate key Index on Year? Index on Director? Index on IMDB_URL? Index on (Year,Name)? 11

More Indexing-related Terminology Clustered index: order in which records are laid out is ❖ same as (or “very close to”) order of IndexKey domain Matters for (range) search performance! ❖ AltRecord implies index is clustered. Why? ❖ In practice, clustered almost always implies AltRecord ❖ In practice, a file is clustered on at most 1 IndexKey ❖ Unclustered index: an index that is not clustered ❖ MovieID Name Year Director IMDB_URL Index on Year? Index on (Year, Name)? 12

Indexing: Outline Overview and Terminology ❖ B+ Tree Index ❖ Hash Index ❖ Learned Index ❖ 13

B+ Tree Index: Search Root Height = 1 30 13 17 24 Order = 2 33* 34* 38* 39* 3* 5* 19* 20* 22* 24* 27* 29* 2* 7* 14* 16* Given SearchKey k , start from root; compare k with ❖ IndexKeys in non-leaf/index entries; descend to correct child; keep descending like so till a leaf node is reached Comparison within non-leaf nodes: binary/linear search ❖ Examples : search 7*; 8*; 24*; range [19*,33*] 14

B+ Tree Index: Page Format Order = m/2 index entries Non-leaf P1 K 1 P2 K 2 Pm P m+1 P3 K m Page Pointer to a   Pointer to a   Pointer to a page Pointer to a page page with   page with with values s.t. with values s.t., values ≥ K m Values < K 1 K 1 ≤ Values < K 2 K 2 ≤ Values < K 3 data entries Leaf Page P0 R1 K 1 R 2 K2 P n+1 Rn K n Next Prev Page Page Pointer Pointer record 1 record 2 record n 15

B+ Trees in Practice Typical order value: 100 (so, non-leaf node can have up ❖ to 200 index keys) Typical occupancy: 67%; so, typical “ fanout ” = 133 ❖ Computing the tree’s capacity using fanout: ❖ Height 1 stores 133 leaf pages ❖ Height 4 store 133 4 = 312,900,700 leaf pages ❖ Typically, higher levels of B+Tree cached in buffer pool ❖ Level 0 (root) = 1 page = 8 KB ❖ Level 1 = 133 pages ~ 1 MB ❖ Level 2 = 17,689 pages ~ 138 MB and so on ❖ 16

B+ Tree Index: Insert Search for correct leaf L ❖ I nsert data entry into L ; if L has enough space, done! ❖ Otherwise , must split L (into new L and a new leaf L’) Redistribute entries evenly, copy up middle key ❖ Insert index entry pointing to L’ into parent of L ❖ A split might have to propagate upwards recursively : ❖ To split non-leaf node, redistribute entries evenly, but   ❖ push up the middle key (not copy up, as in leaf splits!) Splits “grow” the tree; root split increases height. ❖ Tree growth: gets wider or one level taller at top. ❖ 17

B+ Tree Index: Insert Example : Insert 8* Split! Height++ Root 30 13 17 24 33* 34* 38* 39* 3* 5* 19* 20* 22* 24* 27* 29* 2* 7* 14* 16* Split! Entry to be inserted in parent node   5 Copied up (and continues to appear in the leaf) 3* 5* 2* 7* 8* 18

B+ Tree Index: Insert Example : Insert 8* Insert in parent node. Pushed up (and only appears once in 17 the index) 5 13 24 30 Minimum occupancy is guaranteed in both leaf and non-leaf page splits 19

B+ Tree Index: Insert New Root Example : Insert 8* 17 24 5 13 30 33*34*38*39* 2* 3* 5* 7* 8* 19*20*22* 24* 27*29* 14*16* Recursive splitting went up to root; height went up by 1 ❖ Splitting is somewhat expensive; is it avoidable? ❖ Can redistribute data entries with left or right sibling, if ❖ there is space! 20

Insert: Leaf Node Redistribution Root Example : Insert 8* 30 8 13 17 24 33* 34* 38* 39* 3* 5* 19* 20* 22* 24* 27* 29* 2* 7* 14* 16* 8* 14* 16* Redistributing data entries with a sibling improves page ❖ occupancy at leaf level and avoids too many splits; but usually not used for non-leaf node splits Could increase I/O cost (checking siblings) ❖ Propagating internal splits is better amortization ❖ Pointer management headaches ❖ 21

B+ Tree Index: Delete Start at root, find leaf L where entry belongs ❖ Remove the entry; if L is at least half-full, done! Else, if ❖ L has only d-1 entries: Try to re-distribute , borrowing from sibling L’ ❖ If re-distribution fails, merge L and L’ into single leaf ❖ If merge occurred, must delete entry (pointing to L or ❖ sibling) from parent of L . A merge might have to propagate upwards recursively to ❖ root, which decreases height by 1 22

B+ Tree Index: Delete Example : Delete 22* Example : Delete 20* Root Example : Delete 24*? 17 24 5 13 27 30 2* 3* 33*34*38*39* 5* 7* 8* 19* 20*22* 24* 27*29* 24* 14*16* 27* 29* Deleting 22* is easy ❖ Deleting 20* is followed by redistribution at leaf level. ❖ Note how middle key is copied up . 23

B+ Tree Index: Delete Example : Delete 24* Need to merge recursively Must merge leaf ❖ 30 upwards! nodes! In non-leaf node, ❖ 33* 34* 38* 39* 19* 27* 29* remove index entry with key value = 27 Pull down of   ❖ New Root the index entry 5 13 17 30 3* 33* 34* 38* 39* 2* 5* 7* 8* 19* 27* 29* 14* 16* 24

Delete: Non-leaf Node Redistribution Suppose this is the state of the tree when deleting 24* ❖ Instead of merge of root’s children, we can also ❖ redistribute entry from left child of root to right child Root 22 30 17 20 5 13 2* 3* 5* 7* 8* 33*34*38* 39* 17*18* 20* 21* 22* 27*29* 14* 16* 25

CSE 232A Graduate Database Systems Arun Kumar Topic 2: Indexing - PowerPoint PPT Presentation

CSE 232A Graduate Database Systems Arun Kumar Topic 2: Indexing and Sorting Chapters 10, 11, and 13 of Cow Book Slide ACKs: Jignesh Patel, Paris Koutris 1 Motivation for Indexing Consider the following SQL query: Movies (M)

CSE 232A Graduate Database Systems Fall 2019 Arun Kumar 1 About Myself 2009: Bachelors in

CSE 232A Graduate Database Systems Arun Kumar Topic 1: Data Storage Chapters 8 and 9 of Cow

CSE 232A Graduate Database Systems Arun Kumar Topic 4: Query Optimization Chapters 12 and

CSE 232A Graduate Database Systems Arun Kumar About Paper Reviews 1 Goal of Peer Review in

CSE 232A Graduate Database Systems Arun Kumar Review Discussion 1 Review Question Which

CSE 232A Graduate Database Systems Arun Kumar Topic 5: Data Integration and Cleaning Slide

CSE 232A Database System Implementation Arun Kumar Topic 8: Data Systems for ML Workloads

CSE 232A Database System Implementation Arun Kumar Topic 9: ML for RDBMSs Optional; this

CSE 132B CSE 132B Database Systems Applications Database Systems Applications Alin Deutsch

Database Utilities 10/17/2007 DC/Win Database Utilities Opening Database Utilities From File on

Post Graduate Fellowships Types of Fellowships Graduate Study Post Graduate Travel Post

Welcome to CSE 506 Introduc/on & Review Don Porter 1 2 CSE 506: Opera.ng Systems CSE 506:

CSE 132B CSE 132B Database Systems Applications Database Systems Applications SQL as Query

The NSF Graduate The NSF Graduate The NSF Graduate The NSF Graduate Research Fellowship

CSE 3401 Functional and Logic Programming York University CSE 3401 Vida Movahedi 1 York University

NEBC Database Course 2008 Database Servers Database Interfaces Tim Booth : tbooth@ceh.ac.uk

CS573 Data Privacy and Security Anonymization methods Anonymization methods Li Xiong Today

Measured Impact Webinar Follow-up Case Discussion Mrs. K. September 20, 2017 National Capacity

Expanding Query Answers on Medical Knowledge Bases Chuan Lei Vasilis Efthymiou Rebecca Geis

My ranting about mobile measurement Lin Zhong http://www.recg.org Managing participants was

Second Generation BTK Inhibitors Acalabrutinib (ACP-196) and Zanubrutinib (BGB-3111)

Clinician burnout: a hot topic and getting hotter. Are electronic medical records fuelling the

Bayesian Networks Representation Machine Learning 10701/15781 Carlos Guestrin Carnegie

Bayesian Networks George Konidaris gdk@cs.duke.edu Spring 2016 Recall Joint distributions:

Sambuz

Useful Links

Newsletter

Mail Us

CSE 232A Graduate Database Systems Arun Kumar Topic 2: Indexing - PowerPoint PPT Presentation

CSE 232A Graduate Database Systems Arun Kumar Topic 2: Indexing and Sorting Chapters 10, 11, and 13 of Cow Book Slide ACKs: Jignesh Patel, Paris Koutris 1 Motivation for Indexing Consider the following SQL query: Movies (M)

CSE 232A Graduate Database Systems Fall 2019 Arun Kumar 1 About Myself 2009: Bachelors in

CSE 232A Graduate Database Systems Arun Kumar Topic 1: Data Storage Chapters 8 and 9 of Cow

CSE 232A Graduate Database Systems Arun Kumar Topic 4: Query Optimization Chapters 12 and

CSE 232A Graduate Database Systems Arun Kumar About Paper Reviews 1 Goal of Peer Review in

CSE 232A Graduate Database Systems Arun Kumar Review Discussion 1 Review Question Which

CSE 232A Graduate Database Systems Arun Kumar Topic 5: Data Integration and Cleaning Slide

CSE 232A Database System Implementation Arun Kumar Topic 8: Data Systems for ML Workloads

CSE 232A Database System Implementation Arun Kumar Topic 9: ML for RDBMSs Optional; this

CSE 132B CSE 132B Database Systems Applications Database Systems Applications Alin Deutsch

Database Utilities 10/17/2007 DC/Win Database Utilities Opening Database Utilities From File on

Post Graduate Fellowships Types of Fellowships Graduate Study Post Graduate Travel Post

Welcome to CSE 506 Introduc/on &amp; Review Don Porter 1 2 CSE 506: Opera.ng Systems CSE 506:

CSE 132B CSE 132B Database Systems Applications Database Systems Applications SQL as Query

The NSF Graduate The NSF Graduate The NSF Graduate The NSF Graduate Research Fellowship

CSE 3401 Functional and Logic Programming York University CSE 3401 Vida Movahedi 1 York University

NEBC Database Course 2008 Database Servers Database Interfaces Tim Booth : tbooth@ceh.ac.uk

CS573 Data Privacy and Security Anonymization methods Anonymization methods Li Xiong Today

Measured Impact Webinar Follow-up Case Discussion Mrs. K. September 20, 2017 National Capacity

Expanding Query Answers on Medical Knowledge Bases Chuan Lei Vasilis Efthymiou Rebecca Geis

My ranting about mobile measurement Lin Zhong http://www.recg.org Managing participants was

Second Generation BTK Inhibitors Acalabrutinib (ACP-196) and Zanubrutinib (BGB-3111)

Clinician burnout: a hot topic and getting hotter. Are electronic medical records fuelling the

Bayesian Networks Representation Machine Learning 10701/15781 Carlos Guestrin Carnegie

Bayesian Networks George Konidaris gdk@cs.duke.edu Spring 2016 Recall Joint distributions:

Sambuz

Useful Links

Newsletter

Mail Us

Welcome to CSE 506 Introduc/on & Review Don Porter 1 2 CSE 506: Opera.ng Systems CSE 506: