Introduction to Data Indexing: Classifications and Properties Walid G. Aref Walid G. Aref
Introduction • The target of an index is to provide speedy retrieval of data from an underlying table • Input: One or more values (also, termed search key) or range of values • Output: Tuples from the index with these matching values (or within the range of values) • Search Key: One or more attributes from the schema of the underlying table Walid G. Aref
Primary vs. Secondary Index • Primary-key index : • Search key of the index is the primary key for the underlying table • Secondary-key index : • This is not the case, i.e., Search key of the index is not the primary key for the underlying table • Can have multiple tuples in the table with the same search key value • Example of a secondary-key index : • Index on GPA over the students table • Can have multiple students with the same GPA (for a given GPA value, multiple tids of students with the same GPA) • This is not the case for a primary-key index • Example of a primary-key index : • Index on Student-id over the students table • Has only one tuple for a given sid (for a given sid value, only one tid has that value) Walid G. Aref
How are the Index and the Underlying Table Related? ? Table • What to store in the index? Walid G. Aref
What Gets Stored in the Index? • Three choices on what to store in the index 1. The entire table (Index-based Tables) 2. (Key value, Tuple-identifier) pairs 3. Key value, Set of Tuple-identifiers ? Table Walid G. Aref
Storing the Entire Table in the Index: Index-based Tables Table • The entire table is stored inside an index • Common: Hash and B-tree tables Search Path • Is used typically when the Search Key for the index is the same as the primary key for table Tree-based Index • Or at least is a unique key • Why? So that tuples are stored only once inside the index • B-tree-based table: • Search key of the b-tree is a unique key for the table. Table • Need log time to access a tuple Stored Hash Table • Hash-based table: Inside Hash-based Index the • Buckets collectively contain the entire table Hash • Hash function takes the key of the tuple as input and produces the bucket (disk page) that contains the entire tuple Buckets Walid G. Aref
Index-based Tables Table • While table can conceptually have multiple indexes, the table can only be stored in one Search Path index. Tree-based Index • Thus, can have only one index-based table and multiple other indexes from the other choices Table Stored Hash Table Inside Hash-based Index the Hash Buckets Walid G. Aref
Choice 2: The Index Contains (Key Value, Tuple-Identifier) Pairs • Tuple identifiers point to the tuples • Search key does not have to be the key for the table • Can be a secondary index • e.g., index on salary Table Search Path Tree-based Index Leaf Level contains (value, Tid) pairs Walid G. Aref
Choice 2: The Index Contains (Key Value, Tuple-Identifier) Pairs • What if the table is sorted by the same attribute as the Search Key? • Comb-like connections between the index and the table • Why is the comb-like shape important? Table • Because it is an indication of a clustered index Search Path Tree-based Index Leaf Level contains Table is sorted based on the (value, Tid) pairs same attribute as that of Walid G. Aref the index
Clustered vs. Un-clustered Index Table • Clustered index : Table is sorted based on the same attribute as that of the index Search Range • Un-clustered index : Table is sorted based on a different attribute than that of the index (or Tree-based Index not sorted at all) • In a Range Search operation, tuples in the range will end up being contiguous in the case of a clustered index • Reduces the I/Os significantly • Example: • Assume Disk page capacity DC, Number of tuples in the range is Nr Table • Number of I/Os • Clustered index: ⌈ !" #$ ⌉ , Un-clustered index: ~ Nr Search Range Tree-based Index Walid G. Aref
Choice 3: The Index Contains a Key Value and Set of Tuple-identifiers • Leaf level contains • (key value, tid1, tid2, ..) • Suitable for a secondary key index (e.g., index on GPA) where values are repeated. • Can save in storage Table • However, need support for variable- length records Search Path • In contrast to (key-value, tid) fixed- length records (in most cases, e.g., an Tree-based Index exception in when the key-value is a string) Leaf Level contains (value, Set of tids) For tuples with that Walid G. Aref value
Dense vs. Sparse Indexes • When table is sorted on the same key as the index, we have Table this comb-like connection between index and table • Can habe the index point to only the first tuple in the page. (the rest are sorted. • Gives rise to sparse vs. dense indexes Tree-based Dense Index • Dense index : Entry in the index per tuple in the table • Sparse index : Entry in the index per page in the table • Result in a smaller-size index. Table Table Tree-based Tree-based Sparse Index assuming Sparse Index assuming Walid G. Aref Page stores 3 tuples Page stores 3 tuples
Un-clustered Indexes Table • Un-clustered index : Table is sorted based on a different attribute than that of the index (or Search Range not sorted at all) • In a Range Search operation, tuples in the An Unclustered range will end up being contiguous in the Tree-based Index case of a clustered index • Reduces the I/Os significantly • In-evitable problem: Why? • Because for a given table, we can have only one clustered index. All the other indexes will have to be un-clustered Table • Question: • How to deal with the cost of query processing for un-clustered index? Search Range A Clustered Walid G. Aref Tree-based Index
Efficient Range Search using Un-clustered Indexes Table • In a Range Search operation, tuples in the range will end up being in separate pages Search Range • Cost formula: • Clustered index: ⌈ !" #$ ⌉ I/O An Unclustered • Un-clustered index: ~ Nr Tree-based Index • We can have only one clustered index per table. All the other indexes will have to be un-clustered • How to deal with the cost of query processing for un-clustered index? • If the number of tuples qualifying a range is small , e.g., 1, do nothing • For a one-tuple result, Clustered and un-clustered costs are the same (1 I/O) • Otherwise, collect the set of tids that qualify the range query without retrieving the tuples • Tid = (Page-id, slot-id) • Sort the tids based on Page-id • Retrieve a page only once and retrieve all the qualifying tuples from their corresponding slot-ids • Can result in some saving Walid G. Aref
Multi-Dimensional Indexes (or Indexes with Composite Attributes) • Question: Can we realize an index when the search key is more than one attribute? • Give rise to Composite index or Multi-dimensional Index sid cid grade • Example: For Table Enrolled (sid, cid, grade) 0111 CS541 A • Index on (sid,cid): 0111 CS580 B • Can answer queries on sid,cid: 0333 CS448 A- Find grade of sid=0111 in cid=CS580 0444 CS348 B • May also answer queries on sid only: Find courses taken by sid=0111 • Can it answer queries on cid? E.g., Find students registered in cid=CS580? 0333 CS580 A • Depends on how the composite index is constructed Enrolled Table • How to realize a composite index? Walid G. Aref
How to Realize Composite or Multi- Dimensional Indexes? • Want to build an index on Attributes A and B • Many ways to do that. Each with different costs and query capabilities sid cid grade • Simple solution (for now - More on this subject later): 0111 CS541 A • Concatenate the two attribute values together and build a hash or tree- based index 0111 CS580 B • For example, concatenate sid,cid 0333 CS448 A- • Predicates that can be answered “efficiently” when 0444 CS348 B concatenating AB 0333 CS580 A • Tree-based index on AB : Equality on AB, Equality on A only, Range Enrolled Table predicate on A and Range predicate on B, Range on only A • Hash-based index on AB : Only equality on AB unless hash supports order-preserving property (can support ranges on A and B or on A only Walid G. Aref
Composite Indexes vs. One-attribute Indexes • Question: What are the advantages/disadvantages of Composite vs. One-attribute indexes • Assume we have a table with two attributes A and B sid cid grade • Scenario 1: Build a composite index on AB (the concatenation of A and B 0111 CS541 A 0111 CS580 B • Scenario 2: Build two single-attribute indexes; one on A and one on B 0333 CS448 A- 0444 CS348 B • Consider the query: 0333 CS580 A • Find grade of sid=0111 in cid=CS580 • For Scenario 1, use the composite index on sid,cid Enrolled Table • For Scenario 2, we have multiple strategies to answer the query Walid G. Aref
Recommend
More recommend